Course:CICS525/Notes/CaseStudy-1


Dynamo: Overview

  • This is basically a database
    • But not your conventional database
    • Conventional (relational) database:
      • Data organized in tables
      • Primary and secondary keys
      • Tables sorted by primary/secondary keys
      • Designed to answer any imaginable query
      • Does not scale to thousands of nodes
      • Difficult to replicate
  • Amazon’s Dynamo
    • Access by primary key only

ACID properties

  • Atomicity – yes
    • Updates are atomic by definition
    • There are no transactions
  • Consistency – no
    • Data is eventually consistent
    • Loose consistency is tolerated
    • Reconciliation is performed by the client
  • Isolation – no
    • No isolation guarantees – only single-key updates, one at a time
  • Durability – yes
    • Durability is provided via replication

Availability

  • Good service time is key for Amazon
  • Not good when a credit card transaction times out
  • Service-level agreement: a client request must be answered within 300 ms
  • Must provide this service for 99.9% of requests at a load of 500 requests/second

Service-level agreements and their implications

  • Loose consistency
  • Synchronous replica reconciliation cannot be done while handling a request
  • Only a few replicas can be contacted; if some of them do not reply, the request is considered failed
  • When to resolve conflicting updates? During reads or during writes?
    • Usually resolved during writes
    • Dynamo resolves it during reads
    • Motivation: must have an always-writable data store (cannot lose customer shopping cart data)

System interface

get(key)
  • Locate object replicas
  • Return:
    • A single object
    • A list of objects with conflicting versions
    • Context (opaque information about object versioning)
put(key, value, context)
  • Determines where the replicas should be placed
  • Writes them to disk
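  • A minimal sketch of what this interface could look like (Python); the class and type names are illustrative assumptions, not Dynamo's actual API:

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class GetResult:
        values: list    # a single object, or several conflicting versions
        context: bytes  # opaque versioning information (e.g. encoded vector clocks)

    class DynamoClient:
        def get(self, key: str) -> GetResult:
            """Locate the replicas of `key` and return its value(s) plus context."""
            raise NotImplementedError

        def put(self, key: str, value: Any, context: bytes) -> None:
            """Determine where the replicas of `key` go and write the new version."""
            raise NotImplementedError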

Architectural considerations

  • Partitioning
  • Replication
  • Versioning
  • Membership
  • Failure Handling
  • Scaling

Partitioning

Consistent hashing

  • How to partition data among nodes?
  • Use consistent hashing
  • Output of the hash maps to a circular space
  • The largest hash value wraps to the smallest hash value
  • Each node is assigned a random value in the space
  • This represents its position in the ring

Assigning a key to a node

  • Hash the key to obtain its position on the ring
  • Walk the ring clockwise from that position
  • The first node with a position greater than the key's becomes responsible for the key
  • Similar search algorithms are used in distributed hash tables
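  • A minimal consistent-hashing sketch (Python), assuming MD5 as the hash function and a handful of named nodes; both are illustrative choices, not Dynamo specifics:

    import bisect
    import hashlib

    def ring_position(value: str) -> int:
        # map an arbitrary string onto the circular hash space
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes):
            # sorted (position, node) pairs represent the nodes' positions on the ring
            self.ring = sorted((ring_position(n), n) for n in nodes)

        def node_for(self, key: str) -> str:
            # walk clockwise: first node whose position is greater than the key's,
            # wrapping around from the largest position back to the smallest
            positions = [p for p, _ in self.ring]
            i = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("cart:12345"))   # the node responsible for this key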

Problems with consistent hashing

  • May lead to imbalance in load distribution
  • Solution:
    • Place nodes on the ring as virtual nodes
    • Assign multiple virtual nodes to each physical node
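  • Extending the sketch above, each physical node can be placed on the ring several times as virtual nodes; the count of 8 tokens per node is an arbitrary illustrative choice:

    import hashlib

    def ring_position(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def build_virtual_ring(physical_nodes, tokens_per_node=8):
        # each physical node appears at several ring positions (its virtual nodes)
        ring = []
        for node in physical_nodes:
            for i in range(tokens_per_node):
                ring.append((ring_position(f"{node}#{i}"), node))
        return sorted(ring)   # (position, physical node) pairs

    ring = build_virtual_ring(["node-a", "node-b", "node-c"])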

Replication

  • Each key has a coordinator node (the node determined by consistent hashing)
  • The coordinator replicates the key at N-1 other nodes
  • These are the nodes that follow the coordinator in the ring in the clockwise direction, so the data is stored at N nodes in total
  • Virtual nodes are skipped to ensure that replicas are located on different physical nodes
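  • A sketch of building a key's list of N replica nodes on such a ring: walk clockwise from the key's position and collect distinct physical nodes, skipping further virtual nodes of machines already chosen. N = 3 and the small integer positions are illustrative assumptions:

    import bisect

    def preference_list(ring, key_position, n=3):
        # ring: sorted (position, physical node) pairs; the first distinct node
        # found clockwise from key_position acts as the coordinator
        positions = [p for p, _ in ring]
        start = bisect.bisect_right(positions, key_position)
        chosen = []
        for step in range(len(ring)):
            _, node = ring[(start + step) % len(ring)]
            if node not in chosen:        # skip duplicates -> distinct physical nodes
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen   # coordinator first, then its N-1 clockwise successors

    # toy ring: small integers stand in for real hash positions
    ring = [(10, "node-a"), (20, "node-b"), (30, "node-c"),
            (40, "node-a"), (50, "node-b"), (60, "node-c")]
    print(preference_list(ring, key_position=35))   # ['node-a', 'node-b', 'node-c']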

Versioning

  • Dynamo stores multiple versions of each data item
  • Each update creates a new immutable version of the data item
  • Versions are reconciled
    • By the system
    • By the client
  • Versioning is achieved using vector clocks
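  • A minimal vector-clock sketch, assuming a version's clock is a dict of per-node counters (the representation is an assumption, not Dynamo's actual encoding):

    def increment(clock: dict, node: str) -> dict:
        # a write handled by `node` bumps that node's counter
        new = dict(clock)
        new[node] = new.get(node, 0) + 1
        return new

    def descends(a: dict, b: dict) -> bool:
        # a descends from b if a has seen at least everything b has seen
        return all(a.get(node, 0) >= count for node, count in b.items())

    def conflict(a: dict, b: dict) -> bool:
        # neither clock dominates the other -> the versions must be reconciled
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "node-a")   # {'node-a': 1}
    v2 = increment(v1, "node-b")   # descends from v1, so it replaces v1
    v3 = increment(v1, "node-c")   # concurrent sibling of v2
    print(conflict(v2, v3))        # True -> both versions are returned to the client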

Request routing

  • Through a generic load balancer
    • May forward the request to a node that is NOT the coordinator
    • The recipient node then forwards the request to the coordinator
  • Through a partition-aware client library that directly selects a coordinator

Consistency

Quorums

  • Dynamo is configured with two parameters: R and W
  • R is the minimum number of nodes that must participate in a successful read operation
  • W is the minimum number of nodes that must participate in a successful write operation
  • Request handling protocol (for writes):
    • Coordinator receives request
    • Coordinator computes vector clock and writes new version to disk
    • Coordinator sends the new version and its vector clock to the other N-1 replicas
    • If at least W-1 respond, the request is successful
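  • A sketch of the coordinator's write path under these rules; the replica object, its TimeoutError failure mode, and the sequential fan-out are simplifying assumptions (a real coordinator contacts replicas in parallel):

    def coordinate_write(local_store, other_replicas, key, value, clock, w):
        # the coordinator writes the new version locally first
        local_store[key] = (value, clock)
        acks = 0
        for replica in other_replicas:             # the N-1 other replica nodes
            try:
                replica.store(key, value, clock)   # hypothetical replica RPC
                acks += 1
            except TimeoutError:
                continue                           # unavailable replica: try the rest
            if acks >= w - 1:                      # the local write counts toward W
                return True
        return acks >= w - 1                       # too few acks -> the write failed

    class StubReplica:
        def store(self, key, value, clock):
            pass                                   # stand-in for a real replica node

    ok = coordinate_write({}, [StubReplica(), StubReplica()], "cart:1", "v1", {"a": 1}, w=2)
    print(ok)   # True: the local write plus one ack satisfy W = 2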

Sloppy quorum

  • What if some of the N replicas are temporarily unavailable?
    • This could limit the system’s availability
    • Cannot use strict quorum
  • Use sloppy quorum
    • If one of N replicas is unavailable, use another node that is not a replica
    • That node will temporarily store the data
    • Will forward it to the real replica when the replica is back up
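  • A sketch of choosing write targets under a sloppy quorum: when a node in the preference list looks unavailable, the write goes to the next healthy node together with a hint naming the intended replica, to be handed back once it recovers. The is_alive check and the node names are illustrative assumptions:

    def sloppy_targets(ring_nodes, preference, is_alive, n):
        # pick n target nodes, substituting healthy fallbacks for unavailable replicas
        targets = []                       # (target node, hinted-for node or None)
        fallbacks = [node for node in ring_nodes if node not in preference]
        for node in preference[:n]:
            if is_alive(node):
                targets.append((node, None))
            elif fallbacks:
                # hand the write to a substitute, remembering the intended owner
                targets.append((fallbacks.pop(0), node))
        return targets

    targets = sloppy_targets(
        ring_nodes=["a", "b", "c", "d", "e"],
        preference=["a", "b", "c"],
        is_alive=lambda node: node != "b",   # pretend node b is down
        n=3,
    )
    print(targets)   # [('a', None), ('d', 'b'), ('c', None)]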

Synchronizing replicas

  • Uses Merkle trees
    • Leaves are hashes of the values of individual keys
    • Can compare trees incrementally, without transferring the whole tree
  • If a part of the tree is not modified, the parent nodes’ hashes will be identical
    • So parts of the tree can be compared without sending data between two replicas
    • Only keys that are out of sync are transferred
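  • A minimal Merkle-tree sketch, assuming both replicas build the tree over the same sorted key range; SHA-256 and the binary tree shape are illustrative choices:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(items):
        # items: sorted (key, value) pairs -> nested (hash, left, right) tuples
        if len(items) == 1:
            key, value = items[0]
            return (h(f"{key}={value}".encode()), None, None)
        mid = len(items) // 2
        left, right = build_tree(items[:mid]), build_tree(items[mid:])
        return (h(left[0] + right[0]), left, right)

    def diff(a, b, out):
        # collect differing leaves, descending only into subtrees whose hashes differ
        if a[0] == b[0]:
            return                    # identical subtree: nothing to transfer
        if a[1] is None:              # differing leaf: this key is out of sync
            out.append((a[0], b[0]))
            return
        diff(a[1], b[1], out)
        diff(a[2], b[2], out)

    replica1 = build_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
    replica2 = build_tree([("k1", "v1"), ("k2", "XX"), ("k3", "v3"), ("k4", "v4")])
    mismatches = []
    diff(replica1, replica2, mismatches)
    print(len(mismatches))   # 1 -> only the key k2 needs to be transferred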

Group management

Membership

  • Membership is always explicit
  • Nodes are added/removed by the operator
  • So there is no need for “coordinator election”
  • If a node is unavailable, this is considered temporary
  • A node that starts up chooses its set of tokens (virtual nodes) and records the mapping of virtual nodes to physical nodes
  • Membership information is eventually propagated via gossip protocol
  • Mapping is made persistent on disk
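  • A gossip-propagation sketch, assuming each node's membership view is a dict mapping node ids to the version number of the latest change it has heard (the view format and round structure are assumptions):

    import random

    def merge_views(mine: dict, theirs: dict) -> dict:
        # keep the newest entry heard for every node id
        merged = dict(mine)
        for node, version in theirs.items():
            if version > merged.get(node, -1):
                merged[node] = version
        return merged

    def gossip_round(views: dict) -> None:
        # each node exchanges membership views with one randomly chosen peer
        for node in list(views):
            peer = random.choice([p for p in views if p != node])
            views[node] = merge_views(views[node], views[peer])
            views[peer] = merge_views(views[peer], views[node])

    # node "a" has heard that "d" joined (version 1); rounds spread the news
    views = {"a": {"a": 0, "b": 0, "c": 0, "d": 1},
             "b": {"a": 0, "b": 0, "c": 0},
             "c": {"a": 0, "b": 0, "c": 0}}
    for _ in range(5):
        gossip_round(views)
    print(views["b"])   # very likely includes "d" after a few rounds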

Preventing logical partitions

  • A new node may be unaware of other nodes before memberships are propagated
  • If several such nodes are added simultaneously, we may have a logical partition
  • Partitions are prevented using seed nodes
  • Seed nodes are obtained from a static source, and they are known to everyone
  • Memberships are propagated to everyone via seed nodes

Failure detection

  • Failure discovery is local
  • Node A considers Node B failed if Node B does not respond to its messages
  • Failures (like memberships) are propagated via gossip protocol
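  • A local, timeout-based failure-detection sketch; the timeout value and the idea of tracking the last reply per node are illustrative assumptions:

    import time

    class FailureDetector:
        def __init__(self, timeout_seconds=2.0):
            self.timeout = timeout_seconds
            self.last_response = {}          # node id -> time of its last reply

        def record_response(self, node: str) -> None:
            self.last_response[node] = time.monotonic()

        def seems_failed(self, node: str) -> bool:
            # a node is treated as (temporarily) failed if it has not replied recently
            last = self.last_response.get(node)
            return last is None or time.monotonic() - last > self.timeout

    detector = FailureDetector()
    detector.record_response("node-b")
    print(detector.seems_failed("node-b"))   # False right after a reply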
