FaRM (2015)

20 March, 2020

Background Reading
https://pdos.csail.mit.edu/6.824/notes/l-2pc.txt
https://pdos.csail.mit.edu/6.824/papers/chapter9-faq.txt

Why are distributed transactions important
- distributed transactions are used in distributed databases
- they are an abstraction which simplify programming and reasoning: a single machine that never fails and executes one transaction at a time in an order that is consistent with real time

What is FaRM for
- FaRM is a distributed computing platform which provides distributed transactions
- objects are dsitributed across machines, in a datacenter
- transactions can span any number of machines

How is this paper different from the original FaRM paper
- the authors make use of FaRM to provide distributed transactions that have the ACID properties, strict serializability, high peformance, availability and durability
- distributed systems work best when nodes don’t need to talk, but when they do talk, make it fast

What durability guarantee does FaRM provide
- all committed transactions are durable even if the entire cluster fails
- this is enabled by primary backup replication

How are the above attributes achieved
- transaction and recovery protocols are designed to capitalize on datacenter hardware trends
  - fast networking with RDMA
  - low cost non-volatile memory
    Batteries are placed in server racks as part of a distributed USP. In case of a failure, data in DRAM is written to SSDs which can then be subsequently recovered from.
    Not all failures can be recovered from eg. the OS still needs to be functional to write to SSD
  - plentiful memory
    is used to hold entire application data in memory with no reads/writes to disk needed unless there is a failure
    this improves performance by avoiding synchronous writes to SSD
    additional benefit: it improves lifetime of the SSD by reducing writes
- this in turn exposes CPU bottlenecks which are mitigated by
  - reducing message counts
    primary-backup replication
    much more efficient in terms of number of messages as compared to quorum based system
    optimistic concurrency control
    as many objects can be read as needed
    writes are buffered in memory until commit time
    access objects only on primaries
    avoids locking of objects at backups
    +ve: improved performance due to reduced number of messages required
    -ve: performance becomes poor as the number of conflicting transactions increase due to the transactions aborting
  - using one-sided RDMA reads and writes (instead of messages eg. RPC)
  - exploiting parallelism

By achieving those attributes, what application compromises can be eliminated
- not supporting transactions
- weak consistency guarantees
- providing transactions only when all necessary data resides within a single machine. This requires data partitioning and complicates correctness reasoning

How is it used

An application thread can start a transaction which leads it to become the coordinator

At the end of execution, FaRM can be invoked in order to commit the transaction

API

// create transaction
Tx* txCreate();
// allocate an object
void txAlloc(Tx *t, int size, Addr a, Cont *c);
void txFree(Tx *t, Addr a, Cont *c);
void txRead(Tx *t, Addr a, int size, Cont *c);
void txWrite(Tx *t, ObjBuf *old, ObjBuf *new);
// commit transaction
void txCommit(Tx *t, Cont *c);
// lock-free read
Lf* lockFreeStart();
void lockFreeRead(Lf* op,Addr a,int size,Cont *c);
void lockFreeEnd(Lf *op);
Incarnation objGetIncarnation(ObjBuf *o);
void objIncrementIncarnation(ObjBuf *o);
void msgRegisterHandler(MsgId i, Cont *c);
void msgSend(Addr a, MsgId i, Msg *m, Cont *c);

lock-free reads
single-object read only transactions that are optimized

locality hints (by passing address)
these improve performance by allowing programmers to allocate related objects on the same machines

function shipping
the process of moving computational logic onto a different machine for minimizing communication or load balancing

What phases does a transaction consist of
- execution phase
  involves operations on mutable state (read, write etc) as well as arbitrary logic
- commit phase

How can transactions fail
- failures in execution of the transaction
- conflicts with concurrent transactions

Architecture
Each machine runs FaRM
- What is the role of the transaction coordinator
  it communicates directly with primaries and backups to commit a transaction eg. writing records at primaries and backups, aborting a transaction etc
- Configuration
  - What are the roles of the Configuration Manager (CM)
    maintains a mapping of region ids to primary and backup machines
    allocates a new region when contacted to do so by another machine
    detects failures, manages leases and coordinates recovery
    invokes Zookeeper once per configuration change to update it
  - How is the FaRM configuration managed
    Zookeeper is used, only to store the current configuration and ensure machines agree on it. Other functions are done more efficiently by the CM instead (managing leases, detecting failures, coordinate recovery)
  - How can the configuration change
    failure of a machine
    addition of a machine
  - How are configuration changes performed
    the CM invokes Zookeeper once per configuration change to update it
- Memory
  - All data is stored in memory as non-volatile DRAM is used
    +ve: access time is reduced
    -ve: amount of data is limited to that which can fit in memory (petabytes)
  - Three-Tiered Allocation
    slabs: used by threads for private allocation of small objects from blocks
    blocks: 1 mb blocks of memory, across a machine
    regions
    contiguous 2GB blocks of memory, distributed across the cluster
    used to store objects
    must be registered with Network Interface Controller in order to be available for remote access
  - How are memory regions allocated
    machines contact the CM. The CM selects replicas such that the number of regions stored on each machine is balanced, each replica is in a different failure domain and adheres to any locality constraints specified by the application
  - circular buffer queues
    persistent (due to non-volatile DRAM used)
    serve as store for transactional logs (write-ahead) as well as write operations
    senders append records at the tail of the receiver's log without involving its CPU (one-sided RDMA). The receiver periodically polls the head of the log to process them and lazily updates senders when truncating the log
  - How are objects read
    objects are always read from a region's primary copy
    using either local memory access or one-sided RDMA depending on where the object is located
  - What is guaranteed for object reads (during transaction execution)
    atomicity for individual object reads
    only committed data is read
    successive reads of the same object return the same data
    the latest value is returned
  - What isn't guaranteed for object reads
    atomicity of reads across different objects
    +ve : consistency checks can be deferred until commit time
    -ve : applications must handle temporary inconsistencies
- Communication
  - one-sided RDMA operations is used to bypass TCP/IP for efficient messaging
  - Network Interface Controller (NIC)
    has a message rate bottleneck
    this is solved by using two Infiniband (56 Gbps) NICs
    its page table is too small
    this is solved by using 2GB regions in memory
  - read operations
    received by NIC
    2x improvement as half as many network packets required
    data is fetched from memory without the CPU being involved
  - write operations
    NIC issues DMA to circular buffer to write the data at its tail
    sender receives ACK from NIC without having to wait for CPU
    CPU polls buffer in order to process messages

Protocols
- Commit
  - four-phased, does not use two-phase commit, unlike the original FaRM paper
    four phase commit protocol (this paper)
    C: coordinator, P_: primary, B_: backup
    gears represent CPU processing
    two phase commit protocol
    C: coordinator, P_: primary, B_: backup
    gears represent CPU processing
  - How does the four phase protocol improve on the two phase protocol
    avoids the need for 2 roundtrips
    reduces the number of messages required
    avoids cpu processing needed at both primaries and backups during both phases
  - Phases
    What happens if any of the phases fail before the transaction is committed
    the transaction is aborted
    Lock
    the coordinator writes a LOCK record to primaries containing versions and new values of all written objects
    primaries attempt to lock the objects at a specified version
    primaries send back a message reporting whether locks were acquired
    How can locking fail
    an object's version changed since it was read by the transaction
    the object is locked by another transaction
    Validate
    the coordinator obtains versions of all objects that were read (but not written) during the transaction, from primaries
    How can validation fail
    if an object's version has changed
    Commit backup
    the coordinator writes a commit record to the logs at each backup (it receives an ACK from the backup's NIC without the backup's CPU being involved)
    How can commit backup fail
    if any hardware failure occurs
    Commit primary
    after all commit backups have been ACKed, the coordinator writes a record to the logs at each primary
    the coordinator reports completion to the application - at least one ACK for such a record was received (including itself)
    primaries process the records by updating the objects in place, incrementing their versions and unlocking them. This exposes the writes committed by the transaction
    Truncate
    the coordinator lazily truncates logs at primaries and backups after receiving ACKs from all primaries
    backups apply updates to their copies of the objects at truncation time
  - Comparison with two phase commit protocol in Spanner
- Recovery
  - What techniques are used to ensure high availability
    data is available quickly by having a backup promoted to primary (after lock recovery). External clients do not have to wait while region replicas are brought up to date
    short failure detection time (10 ms) is enabled by having short leases. This is achieved by a fast network and dedicated network queues, dedicated and highly prioritized thread for managing leases as well as memory pre-allocation
    parallelism is used
    new transactions (on unaffected regions) occur in parallel with:
    transaction recovery
    data recovery (occurs after transaction recovery)
    recovery is parallelized across regions, machines and threads
    work is minimized by only limiting it to transactions and regions that were affected by reconfiguration
  - What is a recovering transaction
    one whose commit phase spans configuration changes and for which either one of the following has changed:
    replica of a written object
    primary of a read object
    coordinator
    it must be agreed upon as such by all machines
  - How is strict serializability ensured through the course of transaction recovery
    recovery preserves the outcome of transactions that were previously committed or aborted
    if a recovering transaction was committed, its log record is guaranteed to be processed and accepted before the log draining step
  - Phases
    Failure detection
    leases are granted using a 3-way handshake
    the expiry of a lease triggers failure recovery
    leases are prevented from expiring through heartbeat messages
    Reconfiguration
    initiated when lease expires
    the new CM remaps regions previously assigned to failed machines
    all external client requests are blocked when applying a new configuration and unblocked after it has been committed
    If the CM is suspected of failure, how are simultaneous reconfiguration attempts prevented
    the non-CM machine first requests a backup CM to initiate reconfiguration. Only if the configuration remains unchanged after a timeout does the non-CM machine attempt reconfiguration itself
    In case of a partition, how is the CM ensured to be in the larger partition
    a new CM issues an RDMA read to all machines in the configuration (except the machine that is suspected of failure) and proceeds with reconfiguration only if it obtains a majority of responses
    How can reconfiguration fail
    if their exist regions that have lost all their replicas
    there is no space to re-replicate regions
    Transaction recovery
    if a primary fails, a backup is promoted to be the new primary
    logs are drained to ensure all relevant records are processed. Following this, messages from the old configuration are no longer processed
    the primary of each region then builds the complete set of recovering transactions that affect the region
    the transactions are sharded across threads in order to be recovered
    locks are acquired for objects modified by recovering transactions. Following this, objects can be read and updates can be committed to this region in parallel with subsequent recovery steps
    Data recovery
    replication for new backups in order to be able to tolerate f replica failures in the future
    as it is not critical for normal operation, it is delayed in order to prioritize lock recovery which impacts latency
    each thread uses RDMA to read a block at a time from the primary

Further Questions
- Suppose there are two FaRM transactions that both increment the same object. They start at the same time and see the same initial value for the object. One transaction completely finishes committing (see Section 4 and Figure 4). Then the second transaction starts to commit. There are no failures. What is the evidence that FaRM will use to realize that it must abort the second transaction? At what point in the Section 4 / Figure 4 protocol will FaRM realize that it must abort?
  - FaRM will use the object's version number as evidence. Since both of them will have read the same value initially and one of the transaction has incremented the object's version (in the Commit primary phase), the other transaction will detect the changed version number in the Lock phase, after it receives the LOCK record from the coordinator. This will cause it to abort
- Why is RPC used instead of one-sided RDMA reads if the primary holds more than a certain number (4) of objects?
- How does the centralized approach to region allocation provide more flexibilty to satisfy failure independence and locality constraints (compared with consistent hashing)?
- Why is Zookeeper used for configuration consensus instead of vertical Paxos (as in the original FaRM paper)?
- In reconfiguration, how does precise membership prevent objects from being mutated until after lease expiry?
- Why are objects that are written not validated in the validate phase of a transaction commit?
  - they have already been validated in the Lock phase
- Why is coordinator state not replicated?

See Also
https://pdos.csail.mit.edu/6.824/notes/l-farm.txt
https://pdos.csail.mit.edu/6.824/papers/farm-faq.txt
https://blog.acolyer.org/2016/01/14/no-compromises/
Transaction Processing, Concepts and Techniques (Gray, Reuter)

References
https://pdos.csail.mit.edu/6.824/papers/farm-2015.pdf
https://pdos.csail.mit.edu/6.824/questions.html?q=q-farm&lec=14
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf
https://docs.oracle.com/cd/A87860_01/doc/appdev.817/a86030/adx16nt4.htm
https://www.youtube.com/watch?v=VtUpBtAOizc
https://github.com/ooibc88/gam/tree/farm