Course:CICS525/Notes/Consistency-2

From UBC Wiki

Outline

  • Lazy release consistency
  • TreadMarks

Review: what makes a good consistency model?

  • Model is a contract between memory system and programmer
  • Programmer follows some rules about reads and writes
    • Model provides guarantees
    • Model embodies a tradeoff
  • Intuitive for the programmer vs. efficient to implement

TreadMarks high-level goals

  • Better DSM performance
  • Run existing parallel code

What specific problems with previous DSM are they trying to fix?

  • false sharing of different variables on same page
  • only one writer of a page at a time
  • write of irrelevant data may invalidate my page
  • send less data over the network
  • Big idea: write diffs
    • don't send whole page, just send modified locations
    • benefit: less network traffic
    • diffs can be used on their own, without release consistency
    • on first write:
  1. tell other hosts to invalidate but keep hidden copy
  2. owner makes hidden copy as well
    • when another CPU reads:
  1. owner "diffs" current page against hidden copy
  2. send diffs over network
  3. other CPU applies diffs to its hidden copy
    • but most useful when combined with...
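The diff machinery above can be sketched in a few lines. This is an illustrative Python sketch, not TreadMarks' actual C implementation; the names (`make_twin`, `compute_diff`, `apply_diff`) are assumptions made for clarity. The hidden copy is usually called a "twin", and a diff records only the changed locations, which is why far less than a page crosses the network:

```python
# Illustrative sketch of TreadMarks-style write diffs (names are hypothetical).
PAGE_SIZE = 4096

def make_twin(page):
    """On first write, keep a hidden copy ("twin") of the page."""
    return bytearray(page)

def compute_diff(twin, current):
    """Owner diffs the current page against its twin: (offset, new_byte) pairs."""
    return [(i, current[i]) for i in range(len(current)) if current[i] != twin[i]]

def apply_diff(page, diff):
    """Another CPU applies the diff to its own copy of the page."""
    for offset, value in diff:
        page[offset] = value

# Two writers modify different parts of the same page; applying both diffs
# reconciles the concurrent writes without shipping whole pages.
original = bytearray(PAGE_SIZE)
cpu0_page = make_twin(original); cpu0_page[10] = 1   # CPU0 writes x
cpu1_page = make_twin(original); cpu1_page[20] = 9   # CPU1 writes y

reader = bytearray(original)
apply_diff(reader, compute_diff(original, cpu0_page))
apply_diff(reader, compute_diff(original, cpu1_page))
assert reader[10] == 1 and reader[20] == 9
```

Note how each diff is tiny (one entry here) compared to the 4 KB page, and how diffs from two concurrent writers to the same page merge cleanly as long as they touched different bytes.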

Big idea: release consistency

  • only send out write diffs on release of a lock, to all readers
  • this is a new consistency model!
  • CPU0 won't see CPU1's writes until CPU1 does a release
  • so machines can temporarily disagree on memory contents
  • benefit?
    • multiple machines can have writeable copies of a page
    • i.e. no bouncing of pages due to false sharing
    • i.e. no invalidates
  • relies on write diffs
  • otherwise can't reconcile concurrent writes to same page

Big idea: lazy release consistency

  • only fetch write diffs on acquire of a lock
  • only fetch from previous holder of that lock
  • i.e., nothing happens at time of write or release
  • LRC is yet another new consistency model!
    • LRC hides some writes that RC reveals
    • benefit?
  1. less network traffic
  2. some machines don't see updates at all, even to pages they cache
  3. if you don't acquire lock on object, you don't see updates to it
    • Example 1 (false sharing)
      • x and y are on the same page.
CPU0: a1 for(...) x++ r1
CPU1: a2 for(...) y++ r2  a1 print x, y r1
    • What does Treadmarks do?
      • CPU0 and CPU1 both get cached writeable copy of the page
      • when they release, each computes diffs against original page
      • CPU1's a1 causes it to pull write diffs from last holder of lock 1
      • so CPU1 updates x in its page
  • How does RC improve performance?
    • semantics allow two CPUs to write w/o notifying each other
    • thus can write same page at same time
    • page does not bounce back and forth
  • How do write diffs improve performance?
    • allow merge of writes to different part of same page
    • less network traffic
  • Example 2 (LRC)
    • x and y on same page
CPU0: a1 x=1 r1
CPU1:           a2 y++ r2
CPU2:                     a1 print x r1

    • What does Treadmarks do?
      • CPU2 only asks previous holder of lock 1 for write diffs
      • CPU2 does not see CPU1's modification to y, even though it's on the same page
  • Example 3 (motivate vector timestamps)
CPU0: a1 x=1 r1
CPU1:             a1 y=x r1
CPU2:                       a1 print x, y r1

  • What's the "right" answer?
    • we need to define what LRC guarantees!
    • answer: when you acquire a lock,
    • you see all modifications by previous holder
    • and all modifications previous holder saw
  • What does TreadMarks do?
    • CPU2 and CPU1 need to decide what CPU2 needs and doesn't already have
    • uses "vector timestamps"
    • each CPU numbers its releases (i.e. write diffs)
    • CPU1 tells CPU2:
      • at release, had seen CPU0's writes through #20, &c
      • 0: 20
      • 1: 25
      • 2: 19
      • 3: 36
      • ...
    • this is a "vector timestamp"
      • CPU2 remembers a vector timestamp of writes it has seen
      • CPU2 compares w/ CPU1's VT to see what writes it needs from other CPUs
  • Q: this scheme implies that if CPU1 has seen CPU0's 20th write, CPU1 must also have seen all previous CPU0 writes. is that required by LRC model?
    • yes, for writes that might have affected computation of 20th write.
    • maybe not for writes that didn't.
  • VTs order writes to same variable by different CPUs:
CPU0: a1 x=1 r1  a2 y=9 r2
CPU1:              a1 x=2 r1
CPU2:                           a1 a2 z = x + y r2 r1
  • CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
  • How does CPU2 know what to do?
  • Could the VTs for two values of the same variable not be ordered?
CPU0: a1 x=1 r1
CPU1:              a2 x=2 r2
CPU2:                           a1 a2 print x r2 r1
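Whether two VTs are ordered is a componentwise comparison; this hedged sketch (names are illustrative) shows how ordered VTs let CPU2 pick "x=2" over "x=1" in the earlier example, while the two unlocked-relative-to-each-other writes just above produce incomparable VTs, so the model gives no single answer:

```python
# Illustrative VT comparison; not TreadMarks' actual code.

def leq(a, b):
    """a <= b iff every component of a is <= the matching component of b."""
    return all(a.get(c, 0) <= b.get(c, 0) for c in set(a) | set(b))

def ordered(a, b):
    """Two VTs are ordered iff one dominates the other componentwise."""
    return leq(a, b) or leq(b, a)

# Example with one lock: CPU1's release of x=2 saw CPU0's release of x=1,
# so CPU1's VT dominates and x=2 supersedes x=1.
vt_x1 = {0: 1}         # VT attached to CPU0's release
vt_x2 = {0: 1, 1: 1}   # VT attached to CPU1's release
assert ordered(vt_x1, vt_x2)

# Example with two different locks: neither VT dominates the other --
# the writes are concurrent, which is why the rules below require
# locking around all writes to the same variable.
vt_a = {0: 1}
vt_b = {1: 1}
assert not ordered(vt_a, vt_b)
```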

  • What rules does the programmer have to follow?
  • What can the programmer expect from the memory system?
  1. must lock around all writes (to order writes to same var)
  2. to read latest value, must lock
  3. if no lock for read, guaranteed to see values that contributed to the variables you did lock
  • Example of when LRC might work too hard:
CPU0: a2 z=99 r2  a1 x=1 r1
CPU1:                            a1 y=x r1
    • TreadMarks will send z to CPU1, because it comes before x=1 in VT order.
    • Assuming x and z are on the same page.
    • Even if on different pages, CPU1 must invalidate z's page.
    • But CPU1 doesn't use z.
    • How could a system understand that z isn't needed?
    • Compiler figures out what locations the program reads?
    • CPU1 only asks about variables it will use.
    • Relax the causal part of the LRC model?
    • So you must lock all data you read.
  • Q: could TreadMarks work without using VM page protection?
  1. mark all r/o to help find out what pages to generate diffs for
  2. mark invalid to defer having to fetch diffs
  3. neither is really crucial
    • so TM doesn't depend on VM as much as IVY does
  • Performance
    • Figure 3 (TreadMarks article) shows mostly good scaling
    • is that the same as "good"?
    • though apparently Water does lots of locking / sharing
    • How close are they to best possible performance?
    • maybe Figure 5 implies there is only about 20% fat to be cut

Does LRC beat previous DSM schemes?

  • they only compare against their own straw-man ERC (eager release consistency)
  • not against best known prior work
  • Figure 9 (TreadMarks article) suggests not much win, even for Water

Has DSM been successful?

  • clusters of cooperating machines are hugely successful
  • DSM not so much
    • main justification is transparency for existing threaded code
    • that's not interesting for new apps
    • and transparency makes it hard to get high performance
  • MapReduce, DryadLINQ, or storage+messages more common than DSM
