Course:CICS525/Notes/Consistency-2
Outline
- Lazy release consistency
- TreadMarks
Review: what makes a good consistency model?
- Model is a contract between memory system and programmer
- Programmer follows some rules about reads and writes
- Model provides guarantees
- Model embodies a tradeoff
- intuitive for programmer vs. can be implemented efficiently
TreadMarks high-level goals
- Better DSM performance
- Run existing parallel code
What specific problems with previous DSM are they trying to fix?
- false sharing of different variables on same page
- only one writer of a page at a time
- write of irrelevant data may invalidate my page
- send less data over the network
- Big idea: write diffs
- don't send whole page, just send modified locations
- benefit: less network traffic
- can be made to work by themselves
- on first write:
- tell other hosts to invalidate but keep hidden copy
- owner makes hidden copy as well
- when other CPU reads:
- owner "diffs" current page against hidden copy
- send diffs over network
- other CPU applies diffs to its hidden copy
- but most useful when combined with...
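The twin-and-diff mechanics can be sketched as follows. This is a minimal model with invented helper names; real TreadMarks operates on VM pages in C, but the idea is the same: keep a hidden "twin", diff against it, ship only changed locations.

```python
# Sketch of TreadMarks-style write diffs (helper names invented).
# On the first write to a page, the owner saves a hidden "twin";
# later it diffs the current page against the twin and ships only
# the changed byte offsets, not the whole page.

def make_twin(page):
    """Save a hidden copy of the page before modifying it."""
    return bytes(page)

def compute_diff(twin, page):
    """Return {offset: new_byte} for every location that changed."""
    return {i: page[i] for i in range(len(page)) if page[i] != twin[i]}

def apply_diff(page, diff):
    """Merge a received diff into a local copy of the page."""
    for i, b in diff.items():
        page[i] = b

# Two hosts write different parts of the same page concurrently
# (false sharing); the diffs merge cleanly because they don't overlap.
page0 = bytearray(8)
page1 = bytearray(8)
twin = make_twin(page0)

page0[1] = 7          # host 0 writes "x"
page1[6] = 9          # host 1 writes "y"

merged = bytearray(twin)
apply_diff(merged, compute_diff(twin, page0))
apply_diff(merged, compute_diff(twin, page1))
assert merged == bytearray(b"\x00\x07\x00\x00\x00\x00\x09\x00")
```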
Big idea: release consistency
- only send out write diffs on release of a lock, to all readers
- this is a new consistency model!
- CPU0 won't see CPU1's writes until CPU1 does a release
- so machines can temporarily disagree on memory contents
- benefit?
- multiple machines can have writeable copies of a page
- i.e. no bouncing of pages due to false sharing
- i.e. no invalidates
- relies on write diffs
- otherwise can't reconcile concurrent writes to same page
Big idea: lazy release consistency
- only fetch write diffs on acquire of a lock
- only fetch from previous holder of that lock
- i.e., nothing happens at time of write or release
- LRC is yet another new consistency model!
- LRC hides some writes that RC reveals
- benefit?
- less network traffic
- some machines don't see updates at all, even to pages they cache
- if you don't acquire lock on object, you don't see updates to it
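A toy model of the acquire/release flow above (all class and method names invented; the real system moves page diffs via VM faults, not variable values). Updates propagate only along a lock's chain of holders, so a host that never acquires the lock never sees the writes made under it:

```python
# Toy model of lazy release consistency (names invented).
# Nothing is sent at write or release time beyond recording state;
# an acquire pulls updates only from that lock's previous holders.

class Lock:
    def __init__(self):
        self.state = {}                 # writes visible to next acquirer

class Host:
    def __init__(self):
        self.view = {}                  # this host's view of shared memory

    def acquire(self, lock):
        self.view.update(lock.state)    # pull updates from previous holders

    def write(self, var, val):
        self.view[var] = val

    def release(self, lock):
        # the next acquirer sees my writes, and everything I had seen
        lock.state = dict(self.view)

l1, l2 = Lock(), Lock()
cpu0, cpu1, cpu2 = Host(), Host(), Host()

cpu0.acquire(l1); cpu0.write("x", 1); cpu0.release(l1)
cpu1.acquire(l2); cpu1.write("y", 1); cpu1.release(l2)

cpu2.acquire(l1)                        # pulls along lock 1's chain only
assert cpu2.view.get("x") == 1
assert "y" not in cpu2.view             # never acquired lock 2
```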
- Example 1 (false sharing)
- x and y are on the same page.
CPU0: a1 for(...) x++ r1
CPU1: a2 for(...) y++ r2  a1 print x, y r1
- What does TreadMarks do?
- CPU0 and CPU1 both get cached writeable copy of the page
- when they release, each computes diffs against original page
- CPU1's a1 causes it to pull write diffs from last holder of lock 1
- so CPU1 updates x in its page
- How does RC improve performance?
- semantics allow two CPUs to write w/o notifying each other
- thus can write same page at same time
- page does not bounce back and forth
- How do write diffs improve performance?
- allow merge of writes to different part of same page
- less network traffic
- Example 2 (LRC)
- x and y on same page
CPU0: a1 x=1 r1
CPU1: a2 y++ r2
CPU2: a1 print x r1
- What does TreadMarks do?
- CPU2 only asks previous holder of lock 1 for write diffs
- CPU2 does not see CPU1's modification to y, even though on same page
- Example 3 (motivate vector timestamps)
CPU0: a1 x=1 r1
CPU1: a1 y=x r1
CPU2: a1 print x, y r1
- What's the "right" answer?
- we need to define what LRC guarantees!
- answer: when you acquire a lock,
- you see all modifications by previous holder
- and all modifications previous holder saw
- What does TreadMarks do?
- CPU2 and CPU1 need to decide what CPU2 needs and doesn't already have
- uses "vector timestamps"
- each CPU numbers its releases (i.e. write diffs)
- CPU1 tells CPU2:
- at release, had seen CPU0's writes through #20, &c
- 0: 20
- 1: 25
- 2: 19
- 3: 36
- ...
- this is a "vector timestamp"
- CPU2 remembers a vector timestamp of writes it has seen
- CPU2 compares w/ CPU1's VT to see what writes it needs from other CPUs
- Q: this scheme implies that if CPU1 has seen CPU0's 20th write, CPU1 must also have seen all previous CPU0 writes. is that required by LRC model?
- yes, for writes that might have affected computation of 20th write.
- maybe not for writes that didn't.
- VTs order writes to same variable by different CPUs:
CPU0: a1 x=1 r1  a2 y=9 r2
CPU1: a1 x=2 r1
CPU2: a1 a2 z = x + y r2 r1
- CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
- How does CPU2 know what to do?
- Could the VTs for two values of the same variable not be ordered?
CPU0: a1 x=1 r1
CPU1: a2 x=2 r2
CPU2: a1 a2 print x r2 r1
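The VT bookkeeping above can be sketched as follows. The dict representation and function names are assumptions, not TreadMarks's actual encoding, but they show the three operations the notes rely on: deciding which write diffs to fetch, merging VTs, and testing whether two writes are ordered at all.

```python
# Sketch of vector timestamps (representation assumed: a dict mapping
# CPU id -> highest release interval seen from that CPU).

def missing_intervals(mine, theirs):
    """Which (cpu, intervals) the acquirer still needs to fetch."""
    need = {}
    for cpu, n in theirs.items():
        seen = mine.get(cpu, 0)
        if n > seen:
            need[cpu] = range(seen + 1, n + 1)
    return need

def merge(mine, theirs):
    """After applying the fetched diffs, take the element-wise max."""
    out = dict(mine)
    for cpu, n in theirs.items():
        out[cpu] = max(out.get(cpu, 0), n)
    return out

def leq(a, b):
    """VT partial order: a <= b iff every entry of a is <= b's.

    If neither leq(a, b) nor leq(b, a), the two writes are unordered,
    i.e. concurrent writes were made under different locks."""
    return all(v <= b.get(cpu, 0) for cpu, v in a.items())

cpu2_vt = {0: 18, 1: 25, 2: 19}
cpu1_vt = {0: 20, 1: 25, 2: 19, 3: 36}   # shipped at CPU1's release

# CPU2 must fetch CPU0's intervals 19-20 and all of CPU3's through 36
assert missing_intervals(cpu2_vt, cpu1_vt) == {0: range(19, 21), 3: range(1, 37)}
assert merge(cpu2_vt, cpu1_vt) == {0: 20, 1: 25, 2: 19, 3: 36}

# Writes made under different locks can be unordered (a program race):
assert not leq({0: 1}, {1: 1}) and not leq({1: 1}, {0: 1})
```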
- What rules does the programmer have to follow?
- What can the programmer expect from the memory system?
- must lock around all writes (to order writes to same var)
- to read latest value, must lock
- if no lock for read, guaranteed to see values that contributed to the variables you did lock
- Example of when LRC might work too hard:
CPU0: a2 z=99 r2  a1 x=1 r1
CPU1: a1 y=x r1
- TreadMarks will send z to CPU1, because it comes before x=1 in VT order.
- Assuming x and z are on the same page.
- Even if on different pages, CPU1 must invalidate z's page.
- But CPU1 doesn't use z.
- How could a system understand that z isn't needed?
- Compiler figures out what locations the program reads?
- CPU1 only asks about variables it will use.
- Relax the causal part of the LRC model?
- So you must lock all data you read.
- Q: could TreadMarks work without using VM page protection?
- mark all r/o to help find out what pages to generate diffs for
- mark invalid to defer having to fetch diffs
- neither is really crucial
- so TM doesn't depend on VM as much as IVY does
- Performance
- Figure 3 (TreadMarks article) shows mostly good scaling
- is that the same as "good"?
- though apparently Water does lots of locking / sharing
- How close are they to best possible performance?
- maybe Figure 5 implies there is only about 20% fat to be cut
Does LRC beat previous DSM schemes?
- they only compare against their own straw-man ERC
- not against best known prior work
- Figure 9 (TreadMarks article) suggests not much win, even for Water
Has DSM been successful?
- clusters of cooperating machines are hugely successful
- DSM not so much
- main justification is transparency for existing threaded code
- that's not interesting for new apps
- and transparency makes it hard to get high performance
- MapReduce, DryadLINQ, or storage+messages more common than DSM