Skip to content
Snippets Groups Projects
Commit c5e8333c authored by Peter J. Keleher's avatar Peter J. Keleher
Browse files

auto

parent 7a58bc38
No related branches found
No related tags found
No related merge requests found
# Spanner
### Construction:
- a single *universe*
- universe has many *zones*, each either a server farm or a part of
one. The unit of physical replication.
- each zone has 100 .. several thousand *spanservers*
- a spanserver consists of multiple *replicas*
## Spanserver:
![Spanserver](spanserverSoftwareStack.png)
- implements between 100 and 1000 *tablets*
- a tablet is a bag of mappings:
(key:string, timestamp:int64) --> string
- runs a *Paxos* (long-lived leaders) instance on top of each tablet.
### Paxos group
- leader replica implements *lock table*
- two-phase locking
- leader also implements a *transaction manager*
### Transaction implementation
- If only a single Paxos instance (single tablet) involved,
transaction manager skipped.
- Otherwise, two-phase commit across the paxos leaders.
### Directories (buckets)
- tags in a tablet that have common prefixes
- smallest unit of placement by apps
### Data Model
- schematized semi-relational tables
- query language
- general-purpose read/write transactions
Motivated by:
- popularity of megastore, even though it's slow
- popularity of Dremel (data analysis tool)
- complaints about bigtable: even cons across data centers, no
"cross-row" transactions (Percolator built to address this)
## TrueTime
![Truetime API](trutime1.png)
Node time:
- *TimeMaster* machines per data center (GPS and/or atomic clocks)
- *timeslave daemon* per machine
- Marzullo's algorithm to reject outlying masters
- slowly increasing uncertainty between time syncs: ε between
1 and 7ms
Operations:
- *read-write transactions*
- *read-only transactions*
- *snapshot reads*
- transactions are `internally` re-tried (client not involved)
- implicit extension of leases by successfull writes
### Assigning timestamps to RW transactions
- two-phase locking
- timestamp can be any time after all locks acquired, before any locks
released
- spanners assigns trans the timestamp paxos gives the trans commit
write
> If start of T2 is after commit of T1, then the commit timestamp of
> T2 must be greater than that of T1.
**Commit wait**:
Coordinator leader ensure that clients cannot see any data committed
by Ti until TT.after(s\_i) is true
- expected wait is at least 2 * Ε
- wait is usually overlapped w/ Paxos execution
[The Talk](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett)
Thoughts:
- ian: synergy between long-lived leaders and two-phase locking.
- would love to know how active `MoveDir` is, and more about it's policies.
# CockroachDB: Life Without Atomic Clocks
## Quotes
- "our challenge was in providing similar guarantees of external consistency
without atomic clocks."
- "A system which provides an absolute ordering of causally related events,
regardless of observer, provides for a particularly strong guarantee of
external consistency known as “linearizability.”
- "in a non-distributed database, serializability implies linearizability"
- "CockroachDB’s external consistency guarantee is by default only
serializability, though with some features that can help bridge the gap in
practice."
"Before a node is allowed to report that a transaction has committed, it must
wait 7ms. Because all clocks in the system are within 7ms of each other,
waiting 7ms means that no subsequent transaction may commit at an earlier
timestamp, even if the earlier transaction was committed on a node with a
clock which was fast by the maximum 7ms"
Approach:
- "The possibility of reordering commit timestamps for causally related
transactions is likely a marginal problem in practice."
- "What *could* happen is that examining the database at a historical timestamp
might yield paradoxical situations where transaction A is not yet visible
while transaction B is, even though transaction A is known to have preceded
B, as they’re causally related. However, this can only happen if there’s no
overlap between the keys read or written during the transactions."
## Approach
**causality token** - max timestamp encountered (anywhere) in a transaction.
- used to ensure that causal chains maintained by serving as minimum commit
timestamp for that transaction.
**Problem:**
- *Ti* reading data from mult nodes might fail to read already-committed data
**Solution:**
- choose xtion timestamp based on node wall time: [*commit timestamp , commit timestamp + max uncertainty*]
- define upper bound by adding maxoffset
- read quickly as long as no version seen s.t.:
- version after commit timestamp
- but before upper bound
- if so, re-start transaction w/ version's timestamp
- upper bound doesn't change.
- lower bound gets *higher*, just above the problematic node write (window shrinks)
- never restart based on same-key version read from same node in one
transaction.
So:
- spanner always short-waits on writes for short interval (~7ms)
- cockroachdb sometimes long waits on reads (~250ms)
- causal reverse still possible (though probably not a problem in practice)
Strict serializability makes all of the guarantees of one-copy serializability
that we discussed above. In addition, it guarantees that if a transaction X
completed before transaction Y started (in real time) then X will be placed
before Y in the serial order
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment