Module 11: "Synchronization"
  Lecture 22: "Scalable Locking Primitives"
 

Traffic of test & set

  • In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported
    • every such instruction will generate a transaction (may be good or bad depending on the support in the memory controller; will discuss later)
  • Let us assume that the lock location is cacheable and is kept coherent
    • Every invocation of test & set must generate a bus transaction; Why? What is the transaction? What are the possible states of the cache line holding lock_addr?
    • Therefore all lock contenders repeatedly generate bus transactions even if someone is still in the critical section and is holding the lock
  • Can we improve this?
    • Test & set with backoff
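
The cacheable test & set lock described above can be sketched with C11 atomics; `atomic_exchange` plays the role of the ts instruction (variable and function names here are illustrative, not from the lecture):

```c
#include <stdatomic.h>

/* A minimal test & set spinlock sketch, assuming C11 atomics.
   atomic_exchange atomically stores 1 and returns the old value,
   which is exactly what ts does. */
static atomic_int lock_addr = 0;

void ts_lock(void) {
    /* Each failed exchange is a read-modify-write: on a coherent bus it
       must acquire the cache line exclusively, so every contender keeps
       generating bus transactions while the lock is held. */
    while (atomic_exchange(&lock_addr, 1) != 0)
        ; /* spin */
}

void ts_unlock(void) {
    atomic_store(&lock_addr, 0); /* unlock is a simple store */
}
```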

Backoff test & set

  • Instead of retrying immediately, wait for a while
    • How long to wait?
    • Waiting too long adds latency and may miss the window in which the lock is free
    • Constant and variable backoff
    • Special kind of variable backoff: exponential backoff (after the i-th attempt the delay is k*c^i where k and c are constants)
    • Test & set with exponential backoff works pretty well

             delay = k
      Lock:  ts register, lock_addr
             bez register, Enter_CS
             pause (delay)                /* Can be simulated as a timed loop */
             delay = delay*c
             j Lock
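
The same loop can be written with C11 atomics, making the exponential backoff explicit (K, C, and the delay cap are illustrative constants, not prescribed by the lecture):

```c
#include <stdatomic.h>

static atomic_int lock_addr = 0;

enum { K = 4, C = 2, MAX_DELAY = 1 << 16 }; /* illustrative constants */

/* "pause (delay)" simulated as a timed loop, as the slide suggests */
static void pause_for(unsigned n) {
    for (volatile unsigned i = 0; i < n; ++i)
        ;
}

void ts_backoff_lock(void) {
    unsigned delay = K;
    while (atomic_exchange(&lock_addr, 1) != 0) {
        pause_for(delay);
        if (delay < MAX_DELAY)
            delay *= C; /* after the i-th attempt the delay is K*C^i */
    }
}

void ts_backoff_unlock(void) {
    atomic_store(&lock_addr, 0);
}
```

Capping the delay prevents an unlucky late arriver from backing off so far that the lock sits free while it sleeps.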

Test & test & set

  • Reduce traffic further
    • Before trying test & set make sure that the lock is free

      Lock:  ts register, lock_addr
             bez register, Enter_CS
      Test:  lw register, lock_addr
             bnez register, Test
             j Lock

  • How good is it?
    • In a cacheable lock environment the Test loop executes from the cache until the line is invalidated (by the store in unlock); at that point the load re-fetches the cache line and may return a zero value
    • Only when the location reads zero does everyone retry test & set
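
A C11 sketch of the same idea: spin on a plain load, which hits in the local cache, and attempt the expensive exchange only when the lock looks free (names are illustrative):

```c
#include <stdatomic.h>

static atomic_int lock_addr = 0;

void tts_lock(void) {
    for (;;) {
        /* Test loop: plain loads execute from the local cache until the
           holder's unlock store invalidates the line. */
        while (atomic_load(&lock_addr) != 0)
            ;
        /* Lock looks free: now try the real test & set. */
        if (atomic_exchange(&lock_addr, 1) == 0)
            return; /* Enter_CS */
    }
}

void tts_unlock(void) {
    atomic_store(&lock_addr, 0); /* unlock is a simple store */
}
```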

TTS traffic analysis

  • Recall that unlock is always a simple store
  • In the worst case everyone will try to enter the CS at the same time
    • The first time there are P bus transactions for ts, of which one succeeds; every other processor suffers a miss on the load in the Test loop and then spins from its cache
    • When unlocking, the lock-holder's store generates an upgrade (why?) and invalidates all the other copies
    • All other processors then suffer a read miss and get the value zero; so they break out of the Test loop and retry ts, and the process continues until everyone has visited the CS

      (P + (P-1) + 1 + (P-1)) + ((P-1) + (P-2) + 1 + (P-2)) + … = (3P-1) + (3P-4) + (3P-7) + … ~ 1.5P² asymptotically

    • For distributed shared memory the situation is worse because each invalidation becomes a separate message (more later)
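
The per-round counts above can be checked numerically: round j (1-based) costs 3(P-j+1)-1 transactions, and summing over P rounds gives exactly 1.5P² + 0.5P, which matches the ~1.5P² asymptotic figure. A small sketch (function name is illustrative):

```c
/* Worst-case TTS bus-transaction count from the analysis above:
   round j (1-based) costs 3(P - j + 1) - 1 transactions,
   i.e. 3P-1, 3P-4, 3P-7, ... down to 2. */
long tts_traffic(long P) {
    long total = 0;
    for (long j = 1; j <= P; ++j)
        total += 3 * (P - j + 1) - 1;
    return total;
}
```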