Module 11: "Synchronization"
  Lecture 22: "Scalable Locking Primitives"
 

Goals of a lock algorithm

  • Low latency: in the absence of contention, the lock should be acquired quickly
  • Low traffic: worst case lock acquire traffic should be low; otherwise it may affect unrelated transactions
  • Scalability: Traffic and latency should scale slowly with the number of processors
  • Low storage cost: Maintaining lock states should not impose unrealistic memory overhead
  • Fairness: Ideally, processors should enter the critical section in the order of their lock requests (TS or TTS, i.e. test & set or test & test & set, does not guarantee this)

Ticket lock

  • Similar to Bakery algorithm but simpler
  • A nice application of fetch & inc
  • Basic idea: on arrival, grab a unique ticket and wait until your turn comes
    • The Bakery algorithm cannot guarantee this uniqueness, which increases its complexity

      Shared:   ticket = 0, release_count = 0;

      Lock:     fetch & inc  reg1, ticket_addr         /* reg1 = my ticket */
      Wait:     lw           reg2, release_count_addr  /* while (release_count != my ticket); */
                sub          reg3, reg2, reg1
                bnez         reg3, Wait

      Unlock:   addi         reg2, reg2, 0x1           /* release_count++ (reg2 holds release_count from the last load in Wait) */
                sw           reg2, release_count_addr

  • Initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM depending on implementation of fetch & inc)
  • But the waiting algorithm still suffers from 0.5P² messages asymptotically
    • Researchers have proposed proportional backoff, i.e. in the wait loop insert a delay proportional to the difference between the ticket value and the last read release_count (a C sketch appears after this list)
  • Latency and storage-wise better than Bakery
  • Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery)
  • Guaranteed fairness: the ticket value induces a FIFO queue
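
  For concreteness, here is a minimal C11-atomics sketch of the ticket lock, including a proportional-backoff variant of the acquire routine; the type and function names, the BACKOFF_BASE constant, and the delay loop are illustrative assumptions, not part of the lecture.

      #include <stdatomic.h>

      #define BACKOFF_BASE 10   /* illustrative delay unit per waiter ahead of us */

      typedef struct {
          atomic_uint ticket;         /* next ticket to hand out           */
          atomic_uint release_count;  /* ticket currently allowed to enter */
      } ticket_lock_t;

      static void ticket_lock_init(ticket_lock_t *l) {
          atomic_init(&l->ticket, 0);
          atomic_init(&l->release_count, 0);
      }

      static void ticket_lock_acquire(ticket_lock_t *l) {
          unsigned my_ticket = atomic_fetch_add(&l->ticket, 1);   /* fetch & inc */
          while (atomic_load(&l->release_count) != my_ticket)
              ;   /* spin until it is our turn */
      }

      /* Acquire with proportional backoff: pause in proportion to the number of
         waiters ahead of us before re-reading release_count. */
      static void ticket_lock_acquire_backoff(ticket_lock_t *l) {
          unsigned my_ticket = atomic_fetch_add(&l->ticket, 1);
          for (;;) {
              unsigned now_serving = atomic_load(&l->release_count);
              if (now_serving == my_ticket)
                  return;
              for (volatile unsigned i = 0;
                   i < (my_ticket - now_serving) * BACKOFF_BASE; i++)
                  ;   /* crude delay loop; real code would use a pause hint */
          }
      }

      static void ticket_lock_release(ticket_lock_t *l) {
          /* release_count++ : only the lock holder writes it, so an atomic
             increment suffices */
          atomic_fetch_add(&l->release_count, 1);
      }

  The backoff variant simply re-reads release_count less often when many waiters are ahead, which is what reduces the waiting traffic.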

Array-based lock

  • Solves the O(P²) traffic problem
  • The idea is to have a bit vector (essentially a character array if boolean type is not supported)
  • Each processor comes and takes the next free index into the array via fetch & inc
  • Then each processor loops on its index location until it becomes set
  • On unlock, a processor is responsible for setting the next index location if someone is waiting
  • Initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic
  • Disadvantage: storage overhead is O(P)
  • Performance concerns
    • Avoid false sharing: allocate each array location on a different cache line
    • Assume a cache line size of 128 bytes and a character array: allocate an array of 128P bytes and use every 128th position in the array (see the padded-array sketch after this list)
    • For distributed shared memory, the location a processor spins on may not be in its local memory, so waiting involves remote misses; allocating P pages and letting each processor spin on one bit in its own local page wastes far too much memory; a better solution is the MCS lock (Mellor-Crummey & Scott)
  • Correctness concerns
    • While unlocking, make sure to handle corner cases such as determining whether someone is waiting on the next location (this must be an atomic operation)
    • Remember to reset your index location to zero while unlocking
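
  A minimal C11 sketch of the array-based lock with per-slot cache-line padding follows; MAX_PROCS, LINE_SIZE, and the slot layout are illustrative assumptions, and the sketch assumes no more than MAX_PROCS processors contend at once.

      #include <stdatomic.h>

      #define MAX_PROCS  64     /* illustrative upper bound on concurrent waiters */
      #define LINE_SIZE  128    /* assumed cache line size in bytes               */

      typedef struct {
          /* one flag per slot, padded to a full cache line to avoid false sharing */
          struct { atomic_char go; char pad[LINE_SIZE - 1]; } slot[MAX_PROCS];
          atomic_uint next_slot;   /* next free index, handed out via fetch & inc */
      } array_lock_t;

      static void array_lock_init(array_lock_t *l) {
          atomic_init(&l->next_slot, 0);
          for (int i = 0; i < MAX_PROCS; i++)
              atomic_init(&l->slot[i].go, 0);
          atomic_store(&l->slot[0].go, 1);   /* first arrival enters immediately */
      }

      /* Returns the slot index; the caller passes it back to release(). */
      static unsigned array_lock_acquire(array_lock_t *l) {
          unsigned my_slot = atomic_fetch_add(&l->next_slot, 1) % MAX_PROCS;
          while (!atomic_load(&l->slot[my_slot].go))
              ;   /* spin only on our own cache line: O(1) traffic while waiting */
          return my_slot;
      }

      static void array_lock_release(array_lock_t *l, unsigned my_slot) {
          atomic_store(&l->slot[my_slot].go, 0);                    /* reset our slot   */
          atomic_store(&l->slot[(my_slot + 1) % MAX_PROCS].go, 1);  /* hand off to next */
      }

  Note that this formulation always sets the next slot on release, whether or not anyone is waiting there yet; a later arrival then finds its slot already set and enters without spinning, which sidesteps the "is someone waiting?" check mentioned above.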