Course:CICS525/Notes/CaseStudy-1
Dynamo: Overview
- This is basically a database
- But not your conventional database
- Conventional (relational) database:
- Data organized in tables
- Primary and secondary keys
- Tables sorted by primary/secondary keys
- Designed to answer any imaginable query
- Does not scale to thousands of nodes
- Difficult to replicate
- Amazon’s Dynamo
- Access by primary key only
ACID properties
- Atomicity – yes
- Updates are atomic by definition
- There are no transactions
- Consistency – no
- Data is eventually consistent
- Loose consistency is tolerated
- Reconciliation is performed by the client
- Isolation
- No isolation – one update at a time
- Durability – yes
- Durability is provided via replication
Availability
- Good service time is key for Amazon
- Not good when a credit card transaction times out
- Service-level agreement: each client request must be answered within 300 ms
- Must meet this for 99.9% of requests at a load of 500 requests/second
Service-level agreements and their implications
- Loose consistency
- Replicas cannot be reconciled synchronously while a request is being served
- Contact a few replicas; if some do not reply, the request is considered failed
- When to resolve conflicting updates? During reads or during writes?
- Conventionally, conflicts are resolved during writes
- Dynamo resolves them during reads
- Motivation: must have an always-writable data store (cannot lose customer shopping cart data)
System interface
get(key)
- Locate object replicas
- Return:
- A single object
- A list of objects with conflicting versions
- Context (opaque information about object versioning)
put(key, value, context)
- Determines where the replicas should be placed
- Writes them to disk
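- A minimal sketch of this interface in Python (the names DynamoStore and VersionedValue and the dictionary-based context are illustrative assumptions, not Amazon's actual API):
 # Illustrative Dynamo-style key-value interface (single-process stand-in).
 from dataclasses import dataclass

 @dataclass
 class VersionedValue:
     value: bytes
     context: dict                      # opaque versioning info (e.g. a vector clock)

 class DynamoStore:
     def __init__(self):
         self._data = {}                # key -> list of (possibly conflicting) versions

     def get(self, key):
         """Return all versions of the object plus an opaque context."""
         versions = self._data.get(key, [])
         context = {"clocks": [v.context for v in versions]}
         return versions, context

     def put(self, key, value, context):
         """Write a new version; the context ties it to the versions it supersedes."""
         self._data.setdefault(key, []).append(VersionedValue(value, context))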
Architectural considerations
- Partitioning
- Replication
- Versioning
- Membership
- Failure Handling
- Scaling
Partitioning
Consistent hashing
- How to partition data among nodes?
- Use consistent hashing
- Output of the hash maps to a circular space
- The largest hash value wraps to the smallest hash value
- Each node is assigned a random value in the space
- This represents its position in the ring
Assigning a key to a node
- Hash the key
- Find the node with the corresponding ring position
- Walk the ring clockwise to find the first node with a position greater than that of the key
- Similar search algorithms are used in distributed hash tables
Problems with consistent hashing
- May lead to imbalance in load distribution
- Solution:
- Each node is a virtual node
- Assign multiple virtual nodes to one physical node
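- A small sketch of a consistent-hash ring with virtual nodes; the hash function and the number of tokens per physical node are arbitrary choices for illustration:
 # Consistent hashing sketch: virtual nodes on a ring, clockwise lookup.
 import bisect
 import hashlib

 def ring_position(s: str) -> int:
     # Map any string onto the circular hash space.
     return int(hashlib.md5(s.encode()).hexdigest(), 16)

 class Ring:
     def __init__(self, physical_nodes, tokens_per_node=3):
         self._ring = []                        # sorted list of (position, physical node)
         for node in physical_nodes:
             for t in range(tokens_per_node):   # each token is one virtual node
                 self._ring.append((ring_position(f"{node}#vn{t}"), node))
         self._ring.sort()

     def node_for_key(self, key: str) -> str:
         pos = ring_position(key)
         # Walk clockwise: first virtual node at or after the key's position,
         # wrapping around to the smallest position if we fall off the end.
         i = bisect.bisect_left(self._ring, (pos,)) % len(self._ring)
         return self._ring[i][1]

 ring = Ring(["A", "B", "C"])
 print(ring.node_for_key("shopping-cart:42"))   # prints whichever node owns the key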
Replication
- Each key has a coordinator (the node determined by the hash of the key)
- The coordinator replicates the data at N other nodes
- These are the N nodes that follow the coordinator in the ring in the clockwise direction
- Virtual nodes are skipped to ensure that replicas are located on different physical nodes
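- A sketch of how such a list of replicas (a preference list) might be built on the ring; the example ring positions are made up:
 # Preference-list sketch: walk the ring clockwise from the key's position and
 # collect N distinct physical nodes, skipping extra virtual nodes that map to a
 # physical node already chosen.
 import bisect

 def preference_list(ring, key_pos, n):
     """ring: sorted list of (position, physical_node); returns n distinct nodes."""
     start = bisect.bisect_left(ring, (key_pos,))
     nodes, i = [], start
     while len(nodes) < n and i < start + len(ring):
         node = ring[i % len(ring)][1]
         if node not in nodes:              # skip virtual nodes on the same host
             nodes.append(node)
         i += 1
     return nodes                           # nodes[0] is the coordinator

 # Tiny example ring: two virtual nodes each for physical nodes A, B, C.
 ring = [(10, "A"), (40, "B"), (55, "A"), (70, "C"), (90, "B"), (95, "C")]
 print(preference_list(ring, key_pos=50, n=3))   # ['A', 'C', 'B']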
Versioning
- Dynamo stores multiple versions of each data item
- Each update creates a new immutable version of the data item
- Versions are reconciled
- By the system
- By the client
- Versioning is achieved using vector clocks
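- A minimal vector-clock sketch; the node names Sx, Sy, Sz follow the paper's example, and the dictionary representation is just one possible encoding:
 # Vector clocks: each update by a node increments that node's counter.
 # Two versions conflict when neither clock descends from the other.
 def descends(a: dict, b: dict) -> bool:
     """True if clock a is equal to or newer than clock b for every node."""
     return all(a.get(node, 0) >= count for node, count in b.items())

 def increment(clock: dict, node: str) -> dict:
     new = dict(clock)
     new[node] = new.get(node, 0) + 1
     return new

 v1 = increment({}, "Sx")       # {"Sx": 1}
 v2 = increment(v1, "Sy")       # {"Sx": 1, "Sy": 1}  -- descends from v1
 v3 = increment(v1, "Sz")       # {"Sx": 1, "Sz": 1}  -- concurrent with v2
 print(descends(v2, v1))                         # True: v2 supersedes v1
 print(descends(v2, v3) or descends(v3, v2))     # False: conflict, must be reconciled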
Request routing
- Through a generic load balancer
- May forward the request to a node that is NOT the coordinator
- The recipient node will forward the request to the coordinator
- Through a partition-aware client library that directly selects a coordinator
Consistency
Quorums
- Dynamo is configured with two parameters: R and W
- R is the minimum number of nodes that must participate in a successful read operation
- W is the minimum number of nodes that must participate in a successful write operation
- Request handling protocol (for writes):
- Coordinator receives request
- Coordinator computes vector clock and writes new version to disk
- Coordinator sends the new version and vector clock to the N replicas
- If at least W-1 respond, the request is successful
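- A sketch of this write path; send_to_replica is a hypothetical stand-in for the RPC that ships the new version to a replica, and the data layout is illustrative:
 # Write-path sketch with an (N, R, W) configuration.
 def send_to_replica(replica, version):
     return True                            # stand-in: pretend every replica acknowledged

 def handle_put(replicas, new_version, w):
     """replicas: the other nodes holding the key; returns True on success."""
     acks = 1                               # the coordinator's own local write
     for replica in replicas:
         if send_to_replica(replica, new_version):
             acks += 1
         if acks >= w:                      # i.e. at least W-1 replicas responded
             return True
     return acks >= w

 new_version = {"key": "cart:42", "value": b"...", "clock": {"A": 1}}
 print(handle_put(["B", "C"], new_version, w=2))   # True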
Sloppy quorum
- What if some of the N replicas are temporarily unavailable?
- This could limit the system's availability
- Cannot use strict quorum
- Use sloppy quorum
- If one of the N replicas is unavailable, use another node that is not a replica
- That node will temporarily store the data
- Will forward it to the real replica when the replica is back up
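- A sketch of choosing write targets under a sloppy quorum; the node names and the simple up/down map are illustrative:
 # Sloppy-quorum sketch: if a preferred replica is down, write to the next spare
 # node instead, tagged with a "hint" naming the replica the data really belongs
 # to, so it can be handed back when that replica recovers.
 def choose_targets(preference_list, spare_nodes, is_up):
     """Return (node, hint) pairs; hint is None when the node is a real replica."""
     targets, spares = [], list(spare_nodes)
     for node in preference_list:
         if is_up(node):
             targets.append((node, None))
         elif spares:
             targets.append((spares.pop(0), node))   # temporary holder + hint
     return targets

 up = {"A": True, "B": False, "C": True}
 print(choose_targets(["A", "B", "C"], ["D", "E"], lambda n: up.get(n, True)))
 # [('A', None), ('D', 'B'), ('C', None)] -- D holds B's data until B is back up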
Synchronizing replicas
- Uses Merkle trees
- Leaves are hashes of the values of individual keys
- Can compare trees incrementally, without transferring the whole tree
- If a part of the tree is not modified, the corresponding parent nodes' hashes will be identical
- So parts of the tree can be compared without sending data between two replicas
- Only keys that are out of sync are transferred
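- A small Merkle-tree sketch (fixed number of leaves, SHA-256 chosen arbitrarily); a real implementation would walk the two trees top-down, but the idea of locating differing keys by comparing hashes is the same:
 # Merkle-tree sketch: leaves are hashes of the values of individual keys,
 # parents are hashes of their children, and replicas compare trees to find the
 # few keys that actually differ.
 import hashlib

 def h(data: bytes) -> bytes:
     return hashlib.sha256(data).digest()

 def build_tree(leaf_hashes):
     """Return the tree as a list of levels, from the leaves up to the root."""
     levels = [leaf_hashes]
     while len(levels[-1]) > 1:
         prev = levels[-1]
         levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
     return levels

 def differing_leaves(a, b):
     """Leaf indices whose hashes differ (a real system recurses from the root)."""
     return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

 replica1 = build_tree([h(v) for v in [b"v1", b"v2", b"v3", b"v4"]])
 replica2 = build_tree([h(v) for v in [b"v1", b"vX", b"v3", b"v4"]])
 print(replica1[-1] == replica2[-1])          # False: roots differ, some key is out of sync
 print(differing_leaves(replica1, replica2))  # [1]: only key 1 needs to be transferred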
Group management
Membership
- Membership is always explicit
- Nodes are added/removed by the operator
- So there is no need for “coordinator election”
- If a node is unavailable, this is considered temporary
- A node that starts up chooses a set of tokens (virtual nodes) and maps virtual nodes to physical nodes
- Membership information is eventually propagated via gossip protocol
- Mapping is made persistent on disk
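- A sketch of gossip-style propagation of the membership view; the (version, status) entries and the number of rounds are illustrative:
 # Gossip sketch: each node keeps a versioned membership view and periodically
 # reconciles it with a randomly chosen peer, keeping the newer entry for each node.
 import random

 def reconcile(view_a, view_b):
     """Merge two membership views entry by entry, keeping the higher version."""
     merged = dict(view_a)
     for node, (version, status) in view_b.items():
         if node not in merged or merged[node][0] < version:
             merged[node] = (version, status)
     return merged

 def gossip_round(views):
     """Every node exchanges and merges its view with one random peer."""
     for node in list(views):
         peer = random.choice([n for n in views if n != node])
         merged = reconcile(views[node], views[peer])
         views[node] = views[peer] = merged

 views = {
     "A": {"A": (1, "up"), "B": (1, "up")},
     "B": {"A": (1, "up"), "B": (2, "up"), "C": (1, "up")},   # B has learned about C
     "C": {"C": (1, "up")},
 }
 for _ in range(3):
     gossip_round(views)
 print(all(set(v) == {"A", "B", "C"} for v in views.values()))   # True: views converge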
Preventing logical partitions
- A new node may be unaware of other nodes until membership information has propagated
- If several such nodes are added simultaneously, we may have a logical partition
- Partitions are prevented using seed nodes
- Seed nodes are obtained from a static source, and they are known to everyone
- Memberships are propagated to everyone via seed nodes
Failure detection
- Failure discovery is local
- Node A discovers that Node B has failed if Node B does not respond
- Failures (like memberships) are propagated via gossip protocol
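- A tiny sketch of local failure detection; a plain TCP connect with a timeout stands in for Dynamo's actual inter-node messages:
 # Local failure detection sketch: node A considers node B failed if B does not
 # answer within a timeout. Detected failures are then spread by the same gossip
 # mechanism as membership changes.
 import socket

 def peer_responds(host: str, port: int, timeout: float = 0.3) -> bool:
     try:
         with socket.create_connection((host, port), timeout=timeout):
             return True
     except OSError:                 # timed out, refused, unreachable, ...
         return False

 print(peer_responds("127.0.0.1", 65000))   # False unless something is listening there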
References
- Eventual consistency
- Amazon's Dynamo
- Google File System (original article that appeared in SOSP 2003)
- GFS
- GFS evolution
- Programming futures for cloud computing (extracted from a talk by Joe Hellerstein)