
Spanner

Construction:

  • a single universe
  • universe has many zones, each a server farm or part of one; a zone is the unit of physical replication
  • each zone has 100 .. several thousand spanservers
  • each spanserver hosts multiple replicas (one per tablet it serves)

Spanserver

  • implements between 100 and 1000 tablets

  • a tablet is a bag of mappings:

    (key:string, timestamp:int64) --> string

  • runs a Paxos (long-lived leaders) instance on top of each tablet.
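The tablet's "bag of mappings" can be sketched as a multi-version key-value store; this is an illustrative model (the class and method names are mine, not the paper's):

```python
import bisect

class Tablet:
    """Illustrative sketch of a tablet: (key, timestamp) -> value mappings."""

    def __init__(self):
        # key -> sorted list of (timestamp, value) pairs
        self._versions = {}

    def write(self, key: str, ts: int, value: str) -> None:
        bisect.insort(self._versions.setdefault(key, []), (ts, value))

    def read_at(self, key: str, ts: int):
        """Return the newest value with timestamp <= ts, or None."""
        versions = self._versions.get(key, [])
        i = bisect.bisect_right(versions, (ts, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None
```

Reads at a timestamp return the latest version at or before it, which is what makes snapshot reads at past timestamps possible.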

Paxos group

  • leader replica implements lock table
  • two-phase locking
  • leader also implements a transaction manager

Transaction implementation

  • If only a single Paxos group (one tablet) is involved, the transaction manager is bypassed.
  • Otherwise, two-phase commit runs across the Paxos leaders.
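The fast-path/slow-path split can be sketched as below. This is a toy model under my own names (`Group`, `commit`), not the paper's implementation; the key idea is that the 2PC coordinator is itself a Paxos group, so the classic coordinator-failure problem of plain 2PC is avoided:

```python
class Group:
    """Minimal stand-in for a Paxos group leader (illustrative only)."""
    def __init__(self, prepares_ok=True):
        self.prepares_ok = prepares_ok
        self.state = "idle"
    def prepare(self):
        self.state = "prepared" if self.prepares_ok else "aborted"
        return self.prepares_ok
    def paxos_commit(self):
        self.state = "committed"
        return True
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def commit(txn_groups):
    if len(txn_groups) == 1:
        # Fast path: the single group's leader just logs the commit via Paxos.
        return txn_groups[0].paxos_commit()
    # Slow path: 2PC layered on Paxos, one group acting as coordinator.
    coordinator, participants = txn_groups[0], txn_groups[1:]
    if all(p.prepare() for p in participants):   # phase 1: prepare
        coordinator.paxos_commit()
        for p in participants:
            p.commit()                           # phase 2: commit
        return True
    for p in participants:
        p.abort()
    return False
```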

Directories (buckets)

  • contiguous keys in a tablet that share a common prefix
  • smallest unit of placement by apps

Data Model

  • schematized semi-relational tables
  • query language
  • general-purpose read/write transactions

Motivated by:

  • popularity of Megastore, even though it's slow
  • popularity of Dremel (data analysis tool)
  • complaints about Bigtable: only eventual consistency across data centers, no "cross-row" transactions (Percolator was built to address the latter)

TrueTime

Truetime API

Node time:

  • TimeMaster machines per data center (GPS and/or atomic clocks)
  • timeslave daemon per machine
  • Marzullo's algorithm to reject outlying masters
  • slowly increasing uncertainty between time syncs: ε between 1 and 7ms
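The TrueTime API exposed by this machinery can be sketched as follows; the fixed `epsilon_ms` stands in for the 1–7 ms uncertainty above (in reality ε grows between syncs), and the class/method names mirror the paper's `TT.now` / `TT.after` / `TT.before`:

```python
import time

class TrueTime:
    """Sketch of the TrueTime interface: now() returns an interval
    [earliest, latest] guaranteed to contain absolute time."""

    def __init__(self, epsilon_ms: float = 7.0):
        self.eps = epsilon_ms / 1000.0  # assumed instantaneous uncertainty

    def now(self):
        t = time.time()
        return (t - self.eps, t + self.eps)  # (earliest, latest)

    def after(self, t: float) -> bool:
        """True iff t has definitely passed."""
        earliest, _ = self.now()
        return earliest > t

    def before(self, t: float) -> bool:
        """True iff t has definitely not yet arrived."""
        _, latest = self.now()
        return latest < t
```

Everything below (commit wait in particular) is built out of these three calls.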

Operations:

  • read-write transactions
  • read-only transactions
  • snapshot reads
  • transactions are internally re-tried (client not involved)
  • implicit extension of leases by successful writes

Assigning timestamps to RW transactions

  • two-phase locking
  • timestamp can be any time after all locks acquired, before any locks released
  • Spanner assigns the transaction the timestamp that Paxos assigns to the commit write

If start of T2 is after commit of T1, then the commit timestamp of T2 must be greater than that of T1.

Commit wait: Coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(s_i) is true

  • expected wait is at least 2 * ε
  • wait is usually overlapped w/ Paxos execution
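The commit-wait rule can be sketched as below. This is a self-contained toy (a fixed `EPS` in place of TrueTime's varying ε, and `commit`/`commit_wait` are my names): pick s at or above `TT.now().latest`, then block until `TT.after(s)` before acknowledging, so no clock anywhere can still show a time earlier than s:

```python
import time

EPS = 0.007  # assumed clock uncertainty, 7 ms

def tt_now():
    t = time.time()
    return (t - EPS, t + EPS)  # TrueTime-style interval (earliest, latest)

def commit_wait(s: float) -> None:
    """Block until the earliest possible true time exceeds s (TT.after(s))."""
    while tt_now()[0] <= s:
        time.sleep(0.001)

def commit(ack):
    _, latest = tt_now()
    s = latest        # commit timestamp: at or above TT.now().latest
    commit_wait(s)    # expected wait ~ 2 * EPS; overlapped with Paxos in practice
    ack(s)            # only now may clients observe the commit
    return s
```

The wait here is sequential for clarity; per the notes, the real system overlaps it with Paxos execution so it is usually hidden.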

The Talk

Thoughts:

  • ian: synergy between long-lived leaders and two-phase locking.
  • would love to know how active MoveDir is, and more about its policies.

CockroachDB: Life Without Atomic Clocks

Quotes

  • "our challenge was in providing similar guarantees of external consistency without atomic clocks."
  • "A system which provides an absolute ordering of causally related events, regardless of observer, provides for a particularly strong guarantee of external consistency known as “linearizability.”"
  • "in a non-distributed database, serializability implies linearizability"
    • single clock
    • if T2 starts after T1 finishes, T2 must follow T1 in the equivalent serial order, so the serial order agrees with real time
  • "CockroachDB’s external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice."

"Before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms"

Approach:

  • "The possibility of reordering commit timestamps for causally related transactions is likely a marginal problem in practice."

  • "What could happen is that examining the database at a historical timestamp might yield paradoxical situations where transaction A is not yet visible while transaction B is, even though transaction A is known to have preceded B, as they’re causally related. However, this can only happen if there’s no overlap between the keys read or written during the transactions."

Approach

causality token - the max timestamp encountered (anywhere) during a transaction.

  • maintains causal chains by serving as the minimum commit timestamp for the dependent transaction
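A minimal sketch of the token, under assumed names (`CausalityToken`, `observe`, `min_commit_ts` are mine): the client carries forward the max timestamp it has seen, and a causally dependent transaction must commit strictly above it:

```python
class CausalityToken:
    """Illustrative causality token: tracks the max timestamp seen."""

    def __init__(self):
        self.max_ts = 0

    def observe(self, ts: int) -> None:
        """Fold in a timestamp encountered during a transaction."""
        self.max_ts = max(self.max_ts, ts)

    def min_commit_ts(self, node_clock_ts: int) -> int:
        # The commit timestamp must respect both the local node clock and
        # every causally preceding timestamp the client has observed.
        return max(node_clock_ts, self.max_ts + 1)
```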

Problem:

  • Ti reading data from multiple nodes might fail to read already-committed data

Solution:

  • choose transaction timestamp based on node wall time; uncertainty window is [commit timestamp, commit timestamp + max uncertainty]
  • define the upper bound by adding max offset
  • read quickly as long as no version seen s.t.:
    • version after commit timestamp
    • but before upper bound
  • if so, restart the transaction with the version's timestamp
    • upper bound doesn't change
    • lower bound rises to just above the problematic write (window shrinks)
    • never restart twice due to the same version of the same key read from the same node within one transaction
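The shrinking-window restart loop can be sketched as below (function and constant names are mine; timestamps are plain ints for simplicity). A committed version inside `(read_ts, read_ts + MAX_OFFSET]` forces a restart at that version's timestamp, while the upper bound stays fixed, so the loop must terminate:

```python
MAX_OFFSET = 250  # assumed max clock offset, ms

def read_with_uncertainty(read_ts: int, committed_versions: list):
    """Return (final_read_ts, restarts) after resolving uncertain versions."""
    upper = read_ts + MAX_OFFSET  # fixed upper bound; never moves
    restarts = 0
    while True:
        # Versions in the uncertainty window may have committed "before" us
        # in real time despite carrying a later timestamp.
        uncertain = [v for v in committed_versions if read_ts < v <= upper]
        if not uncertain:
            return read_ts, restarts
        # Restart just above the problematic write; the window only shrinks.
        read_ts = min(uncertain)
        restarts += 1
```

Versions above the fixed upper bound are safely ignorable: the maximum offset guarantees they committed after the read began.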

So:

  • Spanner always waits on writes, for a short interval (~7ms)
  • CockroachDB sometimes waits a long time on reads (~250ms)
  • causal reverse still possible (though probably not a problem in practice)

Strict serializability makes all of the guarantees of one-copy serializability that we discussed above. In addition, it guarantees that if a transaction X completed before transaction Y started (in real time), then X will be placed before Y in the serial order.