FaRM (2015)
Why are distributed transactions important
- distributed transactions are used in distributed databases
- they are an abstraction which simplify programming and reasoning: a single machine that never fails and executes one transaction at a time in an order that is consistent with real time
What is FaRM for
- FaRM is a distributed computing platform which provides distributed transactions
- objects are dsitributed across machines, in a datacenter
- transactions can span any number of machines
How is this paper different from the original FaRM paper
- the authors make use of FaRM to provide distributed transactions that have the ACID properties, strict serializability, high peformance, availability and durability
- distributed systems work best when nodes don’t need to talk, but when they do talk, make it fast
What durability guarantee does FaRM provide
- all committed transactions are durable even if the entire cluster fails
- this is enabled by primary backup replication
How are the above attributes achieved
transaction and recovery protocols are designed to capitalize on datacenter hardware trends
- fast networking with RDMA
low cost non-volatile memory
- Batteries are placed in server racks as part of a distributed USP. In case of a failure, data in DRAM is written to SSDs which can then be subsequently recovered from.
- Not all failures can be recovered from eg. the OS still needs to be functional to write to SSD
plentiful memory
- is used to hold entire application data in memory with no reads/writes to disk needed unless there is a failure
- this improves performance by avoiding synchronous writes to SSD
- additional benefit: it improves lifetime of the SSD by reducing writes
this in turn exposes CPU bottlenecks which are mitigated by
reducing message counts
primary-backup replication
- much more efficient in terms of number of messages as compared to quorum based system
optimistic concurrency control
- as many objects can be read as needed
- writes are buffered in memory until commit time
- access objects only on primaries
- avoids locking of objects at backups
- +ve: improved performance due to reduced number of messages required
- -ve: performance becomes poor as the number of conflicting transactions increase due to the transactions aborting
- using one-sided RDMA reads and writes (instead of messages eg. RPC)
- exploiting parallelism
By achieving those attributes, what application compromises can be eliminated
- not supporting transactions
- weak consistency guarantees
- providing transactions only when all necessary data resides within a single machine. This requires data partitioning and complicates correctness reasoning
How is it used
- An application thread can start a transaction which leads it to become the coordinator
- At the end of execution, FaRM can be invoked in order to commit the transaction
API
// create transaction Tx* txCreate();
// allocate an object void txAlloc(Tx *t, int size, Addr a, Cont *c); void txFree(Tx *t, Addr a, Cont *c); void txRead(Tx *t, Addr a, int size, Cont *c); void txWrite(Tx *t, ObjBuf *old, ObjBuf *new);
// commit transaction void txCommit(Tx *t, Cont *c);
// lock-free read Lf* lockFreeStart(); void lockFreeRead(Lf* op,Addr a,int size,Cont *c); void lockFreeEnd(Lf *op);
Incarnation objGetIncarnation(ObjBuf *o); void objIncrementIncarnation(ObjBuf *o); void msgRegisterHandler(MsgId i, Cont *c); void msgSend(Addr a, MsgId i, Msg *m, Cont *c);
lock-free reads
single-object read only transactions that are optimized
locality hints (by passing address)
these improve performance by allowing programmers to allocate related objects on the same machines
function shipping
the process of moving computational logic onto a different machine for minimizing communication or load balancing
What phases does a transaction consist of
- execution phase
involves operations on mutable state (read, write etc) as well as arbitrary logic
- commit phase
- execution phase
How can transactions fail
- failures in execution of the transaction
- conflicts with concurrent transactions
Architecture
Each machine runs FaRM
What is the role of the transaction coordinator
it communicates directly with primaries and backups to commit a transaction eg. writing records at primaries and backups, aborting a transaction etc
Configuration
What are the roles of the Configuration Manager (CM)
- maintains a mapping of region ids to primary and backup machines
- allocates a new region when contacted to do so by another machine
- detects failures, manages leases and coordinates recovery
- invokes Zookeeper once per configuration change to update it
How is the FaRM configuration managed
Zookeeper is used, only to store the current configuration and ensure machines agree on it. Other functions are done more efficiently by the CM instead (managing leases, detecting failures, coordinate recovery)
How can the configuration change
- failure of a machine
- addition of a machine
How are configuration changes performed
the CM invokes Zookeeper once per configuration change to update it
Memory
All data is stored in memory as non-volatile DRAM is used
- +ve: access time is reduced
- -ve: amount of data is limited to that which can fit in memory (petabytes)
Three-Tiered Allocation
- slabs: used by threads for private allocation of small objects from blocks
- blocks: 1 mb blocks of memory, across a machine
regions
- contiguous 2GB blocks of memory, distributed across the cluster
- used to store objects
- must be registered with Network Interface Controller in order to be available for remote access
How are memory regions allocated
machines contact the CM. The CM selects replicas such that the number of regions stored on each machine is balanced, each replica is in a different failure domain and adheres to any locality constraints specified by the application
circular buffer queues
- persistent (due to non-volatile DRAM used)
- serve as store for transactional logs (write-ahead) as well as write operations
- senders append records at the tail of the receiver's log without involving its CPU (one-sided RDMA). The receiver periodically polls the head of the log to process them and lazily updates senders when truncating the log
How are objects read
- objects are always read from a region's primary copy
- using either local memory access or one-sided RDMA depending on where the object is located
What is guaranteed for object reads (during transaction execution)
- atomicity for individual object reads
- only committed data is read
- successive reads of the same object return the same data
- the latest value is returned
What isn't guaranteed for object reads
- atomicity of reads across different objects
- +ve : consistency checks can be deferred until commit time
- -ve : applications must handle temporary inconsistencies
Communication
- one-sided RDMA operations is used to bypass TCP/IP for efficient messaging
Network Interface Controller (NIC)
- has a message rate bottleneck
this is solved by using two Infiniband (56 Gbps) NICs
- its page table is too small
this is solved by using 2GB regions in memory
- has a message rate bottleneck
read operations
- received by NIC
- 2x improvement as half as many network packets required
- data is fetched from memory without the CPU being involved
write operations
- NIC issues DMA to circular buffer to write the data at its tail
- sender receives ACK from NIC without having to wait for CPU
- CPU polls buffer in order to process messages
Protocols
Commit
four-phased, does not use two-phase commit, unlike the original FaRM paper
four phase commit protocol (this paper)
C: coordinator, P_: primary, B_: backup
gears represent CPU processing
two phase commit protocol
C: coordinator, P_: primary, B_: backup
gears represent CPU processing
How does the four phase protocol improve on the two phase protocol
- avoids the need for 2 roundtrips
- reduces the number of messages required
- avoids cpu processing needed at both primaries and backups during both phases
Phases
What happens if any of the phases fail before the transaction is committed
the transaction is aborted
Lock
- the coordinator writes a LOCK record to primaries containing versions and new values of all written objects
- primaries attempt to lock the objects at a specified version
- primaries send back a message reporting whether locks were acquired
How can locking fail
- an object's version changed since it was read by the transaction
- the object is locked by another transaction
Validate
the coordinator obtains versions of all objects that were read (but not written) during the transaction, from primaries
How can validation fail
- if an object's version has changed
Commit backup
the coordinator writes a commit record to the logs at each backup (it receives an ACK from the backup's NIC without the backup's CPU being involved)
How can commit backup fail
- if any hardware failure occurs
Commit primary
- after all commit backups have been ACKed, the coordinator writes a record to the logs at each primary
- the coordinator reports completion to the application - at least one ACK for such a record was received (including itself)
- primaries process the records by updating the objects in place, incrementing their versions and unlocking them. This exposes the writes committed by the transaction
Truncate
- the coordinator lazily truncates logs at primaries and backups after receiving ACKs from all primaries
- backups apply updates to their copies of the objects at truncation time
Comparison with two phase commit protocol in Spanner
Recovery
What techniques are used to ensure high availability
- data is available quickly by having a backup promoted to primary (after lock recovery). External clients do not have to wait while region replicas are brought up to date
- short failure detection time (10 ms) is enabled by having short leases. This is achieved by a fast network and dedicated network queues, dedicated and highly prioritized thread for managing leases as well as memory pre-allocation
parallelism is used
- new transactions (on unaffected regions) occur in parallel with:
- transaction recovery
- data recovery (occurs after transaction recovery)
- recovery is parallelized across regions, machines and threads
- new transactions (on unaffected regions) occur in parallel with:
- work is minimized by only limiting it to transactions and regions that were affected by reconfiguration
What is a recovering transaction
- one whose commit phase spans configuration changes and for which either one of the following has changed:
- replica of a written object
- primary of a read object
- coordinator
- it must be agreed upon as such by all machines
- one whose commit phase spans configuration changes and for which either one of the following has changed:
How is strict serializability ensured through the course of transaction recovery
- recovery preserves the outcome of transactions that were previously committed or aborted
- if a recovering transaction was committed, its log record is guaranteed to be processed and accepted before the log draining step
Phases
Failure detection
- leases are granted using a 3-way handshake
- the expiry of a lease triggers failure recovery
- leases are prevented from expiring through heartbeat messages
Reconfiguration
- initiated when lease expires
- the new CM remaps regions previously assigned to failed machines
- all external client requests are blocked when applying a new configuration and unblocked after it has been committed
If the CM is suspected of failure, how are simultaneous reconfiguration attempts prevented
the non-CM machine first requests a backup CM to initiate reconfiguration. Only if the configuration remains unchanged after a timeout does the non-CM machine attempt reconfiguration itself
In case of a partition, how is the CM ensured to be in the larger partition
a new CM issues an RDMA read to all machines in the configuration (except the machine that is suspected of failure) and proceeds with reconfiguration only if it obtains a majority of responses
How can reconfiguration fail
- if their exist regions that have lost all their replicas
- there is no space to re-replicate regions
Transaction recovery
- if a primary fails, a backup is promoted to be the new primary
- logs are drained to ensure all relevant records are processed. Following this, messages from the old configuration are no longer processed
- the primary of each region then builds the complete set of recovering transactions that affect the region
- the transactions are sharded across threads in order to be recovered
- locks are acquired for objects modified by recovering transactions. Following this, objects can be read and updates can be committed to this region in parallel with subsequent recovery steps
Data recovery
- replication for new backups in order to be able to tolerate f replica failures in the future
- as it is not critical for normal operation, it is delayed in order to prioritize lock recovery which impacts latency
- each thread uses RDMA to read a block at a time from the primary
Further Questions
Suppose there are two FaRM transactions that both increment the same object. They start at the same time and see the same initial value for the object. One transaction completely finishes committing (see Section 4 and Figure 4). Then the second transaction starts to commit. There are no failures. What is the evidence that FaRM will use to realize that it must abort the second transaction? At what point in the Section 4 / Figure 4 protocol will FaRM realize that it must abort?
- FaRM will use the object's version number as evidence. Since both of them will have read the same value initially and one of the transaction has incremented the object's version (in the Commit primary phase), the other transaction will detect the changed version number in the Lock phase, after it receives the LOCK record from the coordinator. This will cause it to abort
- Why is RPC used instead of one-sided RDMA reads if the primary holds more than a certain number (4) of objects?
- How does the centralized approach to region allocation provide more flexibilty to satisfy failure independence and locality constraints (compared with consistent hashing)?
- Why is Zookeeper used for configuration consensus instead of vertical Paxos (as in the original FaRM paper)?
- In reconfiguration, how does precise membership prevent objects from being mutated until after lease expiry?
Why are objects that are written not validated in the validate phase of a transaction commit?
- they have already been validated in the Lock phase
- Why is coordinator state not replicated?
See Also
https://pdos.csail.mit.edu/6.824/notes/l-farm.txt
https://pdos.csail.mit.edu/6.824/papers/farm-faq.txt
https://blog.acolyer.org/2016/01/14/no-compromises/
Transaction Processing, Concepts and Techniques (Gray, Reuter)
References
https://pdos.csail.mit.edu/6.824/papers/farm-2015.pdf
https://pdos.csail.mit.edu/6.824/questions.html?q=q-farm&lec=14
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf
https://docs.oracle.com/cd/A87860_01/doc/appdev.817/a86030/adx16nt4.htm