Module 11: "Synchronization"
  Lecture 22: "Scalable Locking Primitives"
 

Traffic of test & set

  • In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported
    • every such instruction will generate a transaction (may be good or bad depending on the support in the memory controller; will discuss later)
  • Let us assume that the lock location is cacheable and is kept coherent
    • Every invocation of test & set must generate a bus transaction; Why? What is the transaction? What are the possible states of the cache line holding lock_addr?
    • Therefore all lock contenders repeatedly generate bus transactions even if someone is still in the critical section and is holding the lock
  • Can we improve this?
    • Test & set with backoff
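
The cacheable test & set lock described above can be sketched with C11 atomics; `atomic_exchange` plays the role of the ts instruction (variable and function names here are illustrative, not from the lecture):

```c
#include <stdatomic.h>

/* A minimal test & set spinlock sketch, assuming C11 atomics.
   atomic_exchange atomically stores 1 and returns the old value,
   which is exactly what ts does. */
static atomic_int lock_addr = 0;

void ts_lock(void) {
    /* Each failed exchange is a read-modify-write: on a coherent bus it
       must acquire the cache line exclusively, so every contender keeps
       generating bus transactions while the lock is held. */
    while (atomic_exchange(&lock_addr, 1) != 0)
        ; /* spin */
}

void ts_unlock(void) {
    atomic_store(&lock_addr, 0); /* unlock is a simple store */
}
```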

Backoff test & set

  • Instead of retrying immediately, wait for a while
    • How long to wait?
    • Waiting too long adds latency and may miss the window in which the lock is free
    • Constant and variable backoff
    • Special kind of variable backoff: exponential backoff (after the i-th attempt the delay is k*c^i where k and c are constants)
    • Test & set with exponential backoff works pretty well

             delay = k
      Lock:  ts register, lock_addr
             bez register, Enter_CS
             pause (delay)                /* Can be simulated as a timed loop */
             delay = delay*c
             j Lock
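
The same loop can be written with C11 atomics, making the exponential backoff explicit (K, C, and the delay cap are illustrative constants, not prescribed by the lecture):

```c
#include <stdatomic.h>

static atomic_int lock_addr = 0;

enum { K = 4, C = 2, MAX_DELAY = 1 << 16 }; /* illustrative constants */

/* "pause (delay)" simulated as a timed loop, as the slide suggests */
static void pause_for(unsigned n) {
    for (volatile unsigned i = 0; i < n; ++i)
        ;
}

void ts_backoff_lock(void) {
    unsigned delay = K;
    while (atomic_exchange(&lock_addr, 1) != 0) {
        pause_for(delay);
        if (delay < MAX_DELAY)
            delay *= C; /* after the i-th attempt the delay is K*C^i */
    }
}

void ts_backoff_unlock(void) {
    atomic_store(&lock_addr, 0);
}
```

Capping the delay prevents an unlucky late arriver from backing off so far that the lock sits free while it sleeps.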

Test & test & set

  • Reduce traffic further
    • Before trying test & set make sure that the lock is free

      Lock:  ts register, lock_addr
             bez register, Enter_CS
      Test:  lw register, lock_addr
             bnez register, Test
             j Lock

  • How good is it?
    • In a cacheable lock environment the Test loop executes from the cache until the line is invalidated (by the store in unlock); at that point the load re-fetches the cache line and may return a zero value
    • Only when the location reads zero does everyone retry test & set
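
A C11 sketch of the same idea: spin on a plain load, which hits in the local cache, and attempt the expensive exchange only when the lock looks free (names are illustrative):

```c
#include <stdatomic.h>

static atomic_int lock_addr = 0;

void tts_lock(void) {
    for (;;) {
        /* Test loop: plain loads execute from the local cache until the
           holder's unlock store invalidates the line. */
        while (atomic_load(&lock_addr) != 0)
            ;
        /* Lock looks free: now try the real test & set. */
        if (atomic_exchange(&lock_addr, 1) == 0)
            return; /* Enter_CS */
    }
}

void tts_unlock(void) {
    atomic_store(&lock_addr, 0); /* unlock is a simple store */
}
```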

TTS traffic analysis

  • Recall that unlock is always a simple store
  • In the worst case everyone will try to enter the CS at the same time
    • The first time there are P bus transactions for ts, of which one succeeds; every other processor suffers a miss on the load in the Test loop and then spins from its cache
    • When unlocking, the lock-holder's store generates an upgrade (why?) and invalidates all the other copies
    • All other processors then suffer a read miss and get the value zero; so they break out of the Test loop and retry ts, and the process continues until everyone has visited the CS

      (P + (P-1) + 1 + (P-1)) + ((P-1) + (P-2) + 1 + (P-2)) + … = (3P-1) + (3P-4) + (3P-7) + … ~ 1.5P² asymptotically

    • For distributed shared memory the situation is worse because each invalidation becomes a separate message (more later)
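
The per-round counts above can be checked numerically: round j (1-based) costs 3(P-j+1)-1 transactions, and summing over P rounds gives exactly 1.5P² + 0.5P, which matches the ~1.5P² asymptotic figure. A small sketch (function name is illustrative):

```c
/* Worst-case TTS bus-transaction count from the analysis above:
   round j (1-based) costs 3(P - j + 1) - 1 transactions,
   i.e. 3P-1, 3P-4, 3P-7, ... down to 2. */
long tts_traffic(long P) {
    long total = 0;
    for (long j = 1; j <= P; ++j)
        total += 3 * (P - j + 1) - 1;
    return total;
}
```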