Course:CICS525/Notes/Consistency-2
Outline
- Lazy release consistency
- TreadMarks
Review: what makes a good consistency model?
- Model is a contract between memory system and programmer
- Programmer follows some rules about reads and writes
- Model provides guarantees
- Model embodies a tradeoff
- intuitive for programmer vs. can be implemented efficiently
TreadMarks high-level goals
- Better DSM performance
- Run existing parallel code
What specific problems with previous DSM are they trying to fix?
- false sharing of different variables on same page
- only one writer of a page at a time
- write of irrelevant data may invalidate my page
- send less data over the network
- Big idea: write diffs
- don't send whole page, just send modified locations
- benefit: less network traffic
- can be made to work by themselves
- on first write:
- tell other hosts to invalidate but keep hidden copy
- owner makes hidden copy as well
- when other CPU reads:
- owner "diffs" current page against hidden copy
- send diffs over network
- other CPU applies diffs to its hidden copy
- but most useful when combined with...
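The twin-and-diff mechanics can be sketched as follows. This is a minimal model with invented helper names; real TreadMarks operates on VM pages in C, but the idea is the same: keep a hidden "twin", diff against it, ship only changed locations.

```python
# Sketch of TreadMarks-style write diffs (helper names invented).
# On the first write to a page, the owner saves a hidden "twin";
# later it diffs the current page against the twin and ships only
# the changed byte offsets, not the whole page.

def make_twin(page):
    """Save a hidden copy of the page before modifying it."""
    return bytes(page)

def compute_diff(twin, page):
    """Return {offset: new_byte} for every location that changed."""
    return {i: page[i] for i in range(len(page)) if page[i] != twin[i]}

def apply_diff(page, diff):
    """Merge a received diff into a local copy of the page."""
    for i, b in diff.items():
        page[i] = b

# Two hosts write different parts of the same page concurrently
# (false sharing); the diffs merge cleanly because they don't overlap.
page0 = bytearray(8)
page1 = bytearray(8)
twin = make_twin(page0)

page0[1] = 7          # host 0 writes "x"
page1[6] = 9          # host 1 writes "y"

merged = bytearray(twin)
apply_diff(merged, compute_diff(twin, page0))
apply_diff(merged, compute_diff(twin, page1))
assert merged == bytearray(b"\x00\x07\x00\x00\x00\x00\x09\x00")
```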
Big idea: release consistency
- only send out write diffs on release of a lock, to all readers
- this is a new consistency model!
- CPU0 won't see CPU1's writes until CPU1 does a release
- so machines can temporarily disagree on memory contents
- benefit?
- multiple machines can have writeable copies of a page
- i.e. no bouncing of pages due to false sharing
- i.e. no invalidates
- relies on write diffs
- otherwise can't reconcile concurrent writes to same page
Big idea: lazy release consistency
- only fetch write diffs on acquire of a lock
- only fetch from previous holder of that lock
- i.e., nothing happens at time of write or release
- LRC is yet another new consistency model!
- LRC hides some writes that RC reveals
- benefit?
- less network traffic
- some machines don't see updates at all, even to pages they cache
- if you don't acquire lock on object, you don't see updates to it
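A toy model of the acquire/release flow above (all class and method names invented; the real system moves page diffs via VM faults, not variable values). Updates propagate only along a lock's chain of holders, so a host that never acquires the lock never sees the writes made under it:

```python
# Toy model of lazy release consistency (names invented).
# Nothing is sent at write or release time beyond recording state;
# an acquire pulls updates only from that lock's previous holders.

class Lock:
    def __init__(self):
        self.state = {}                 # writes visible to next acquirer

class Host:
    def __init__(self):
        self.view = {}                  # this host's view of shared memory

    def acquire(self, lock):
        self.view.update(lock.state)    # pull updates from previous holders

    def write(self, var, val):
        self.view[var] = val

    def release(self, lock):
        # the next acquirer sees my writes, and everything I had seen
        lock.state = dict(self.view)

l1, l2 = Lock(), Lock()
cpu0, cpu1, cpu2 = Host(), Host(), Host()

cpu0.acquire(l1); cpu0.write("x", 1); cpu0.release(l1)
cpu1.acquire(l2); cpu1.write("y", 1); cpu1.release(l2)

cpu2.acquire(l1)                        # pulls along lock 1's chain only
assert cpu2.view.get("x") == 1
assert "y" not in cpu2.view             # never acquired lock 2
```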
- Example 1 (false sharing)
- x and y are on the same page.
CPU0: a1 for(...) x++ r1
CPU1: a2 for(...) y++ r2  a1 print x, y r1
- What does TreadMarks do?
- CPU0 and CPU1 both get cached writeable copy of the page
- when they release, each computes diffs against original page
- CPU1's a1 causes it to pull write diffs from last holder of lock 1
- so CPU1 updates x in its page
- How does RC improve performance?
- semantics allow two CPUs to write w/o notifying each other
- thus can write same page at same time
- page does not bounce back and forth
- How do write diffs improve performance?
- allow merge of writes to different part of same page
- less network traffic
- Example 2 (LRC)
- x and y on same page
CPU0: a1 x=1 r1
CPU1: a2 y++ r2
CPU2: a1 print x r1
- What does TreadMarks do?
- CPU2 only asks previous holder of lock 1 for write diffs
- CPU2 does not see CPU1's modification to y, even though on same page
- Example 3 (motivate vector timestamps)
CPU0: a1 x=1 r1
CPU1: a1 y=x r1
CPU2: a1 print x, y r1
- What's the "right" answer?
- we need to define what LRC guarantees!
- answer: when you acquire a lock,
- you see all modifications by previous holder
- and all modifications previous holder saw
- What does TreadMarks do?
- CPU2 and CPU1 need to decide what CPU2 needs and doesn't already have
- uses "vector timestamps"
- each CPU numbers its releases (i.e. write diffs)
- CPU1 tells CPU2:
- at release, had seen CPU0's writes through #20, &c
- 0: 20
- 1: 25
- 2: 19
- 3: 36
- ...
- this is a "vector timestamp"
- CPU2 remembers a vector timestamp of writes it has seen
- CPU2 compares w/ CPU1's VT to see what writes it needs from other CPUs
- Q: this scheme implies that if CPU1 has seen CPU0's 20th write, CPU1 must also have seen all previous CPU0 writes. is that required by LRC model?
- yes, for writes that might have affected computation of 20th write.
- maybe not for writes that didn't.
- VTs order writes to same variable by different CPUs:
CPU0: a1 x=1 r1  a2 y=9 r2
CPU1: a1 x=2 r1
CPU2: a1 a2 z = x + y r2 r1
- CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
- How does CPU2 know what to do?
- Could the VTs for two values of the same variable not be ordered?
CPU0: a1 x=1 r1
CPU1: a2 x=2 r2
CPU2: a1 a2 print x r2 r1
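The VT bookkeeping above can be sketched as follows. The dict representation and function names are assumptions, not TreadMarks's actual encoding, but they show the three operations the notes rely on: deciding which write diffs to fetch, merging VTs, and testing whether two writes are ordered at all.

```python
# Sketch of vector timestamps (representation assumed: a dict mapping
# CPU id -> highest release interval seen from that CPU).

def missing_intervals(mine, theirs):
    """Which (cpu, intervals) the acquirer still needs to fetch."""
    need = {}
    for cpu, n in theirs.items():
        seen = mine.get(cpu, 0)
        if n > seen:
            need[cpu] = range(seen + 1, n + 1)
    return need

def merge(mine, theirs):
    """After applying the fetched diffs, take the element-wise max."""
    out = dict(mine)
    for cpu, n in theirs.items():
        out[cpu] = max(out.get(cpu, 0), n)
    return out

def leq(a, b):
    """VT partial order: a <= b iff every entry of a is <= b's.

    If neither leq(a, b) nor leq(b, a), the two writes are unordered,
    i.e. concurrent writes were made under different locks."""
    return all(v <= b.get(cpu, 0) for cpu, v in a.items())

cpu2_vt = {0: 18, 1: 25, 2: 19}
cpu1_vt = {0: 20, 1: 25, 2: 19, 3: 36}   # shipped at CPU1's release

# CPU2 must fetch CPU0's intervals 19-20 and all of CPU3's through 36
assert missing_intervals(cpu2_vt, cpu1_vt) == {0: range(19, 21), 3: range(1, 37)}
assert merge(cpu2_vt, cpu1_vt) == {0: 20, 1: 25, 2: 19, 3: 36}

# Writes made under different locks can be unordered (a program race):
assert not leq({0: 1}, {1: 1}) and not leq({1: 1}, {0: 1})
```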
- What rules does the programmer have to follow?
- What can the programmer expect from the memory system?
- must lock around all writes (to order writes to same var)
- to read latest value, must lock
- if no lock for read, guaranteed to see values that contributed to the variables you did lock
- Example of when LRC might work too hard:
CPU0: a2 z=99 r2  a1 x=1 r1
CPU1: a1 y=x r1
- TreadMarks will send z to CPU1, because it comes before x=1 in VT order.
- Assuming x and z are on the same page.
- Even if on different pages, CPU1 must invalidate z's page.
- But CPU1 doesn't use z.
- How could a system understand that z isn't needed?
- Compiler figures out what locations the program reads?
- CPU1 only asks about variables it will use.
- Relax the causal part of the LRC model?
- So you must lock all data you read.
- Q: could TreadMarks work without using VM page protection?
- mark all r/o to help find out what pages to generate diffs for
- mark invalid to defer having to fetch diffs
- neither is really crucial
- so TM doesn't depend on VM as much as IVY does
- Performance
- Figure 3 (TreadMarks article) shows mostly good scaling
- is that the same as "good"?
- though apparently Water does lots of locking / sharing
- How close are they to best possible performance?
- maybe Figure 5 implies there is only about 20% fat to be cut
Does LRC beat previous DSM schemes?
- they only compare against their own straw-man ERC
- not against best known prior work
- Figure 9 (TreadMarks article) suggests not much win, even for Water
Has DSM been successful?
- clusters of cooperating machines are hugely successful
- DSM not so much
- main justification is transparency for existing threaded code
- that's not interesting for new apps
- and transparency makes it hard to get high performance
- MapReduce, DryadLINQ, or storage+messages more common than DSM