Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
C
cmsc818ePublicF2021b
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Requirements
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Snippets
Locked files
Build
Pipelines
Jobs
Pipeline schedules
Test cases
Artifacts
Deploy
Releases
Package Registry
Container Registry
Model registry
Operate
Environments
Terraform modules
Monitor
Incidents
Service Desk
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Code review analytics
Issue analytics
Insights
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Peter Keleher
cmsc818ePublicF2021b
Commits
c5e8333c
Commit
c5e8333c
authored
3 years ago
by
Peter J. Keleher
Browse files
Options
Downloads
Patches
Plain Diff
auto
parent
7a58bc38
No related branches found
No related tags found
No related merge requests found
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
notes/spannerCockroach.md
+164
-0
164 additions, 0 deletions
notes/spannerCockroach.md
with
164 additions
and
0 deletions
notes/spannerCockroach.md
0 → 100644
+
164
−
0
View file @
c5e8333c
# Spanner
### Construction:
-
a single
*universe*
-
universe has many
*zones*
, each either a server farm or a part of
one. The unit of physical replication.
-
each zone has 100 .. several thousand
*spanservers*
-
a spanserver consists of multiple
*replicas*
## Spanserver:

-
implements between 100 and 1000
*tablets*
-
a tablet is a bag of mappings:
(key:string, timestamp:int64) --> string
-
runs a
*Paxos*
(long-lived leaders) instance on top of each tablet.
### Paxos group
-
leader replica implements
*lock table*
-
two-phase locking
-
leader also implements a
*transaction manager*
### Transaction implementation
-
If only a single Paxos instance (single tablet) involved,
transaction manager skipped.
-
Otherwise, two-phase commit across the paxos leaders.
### Directories (buckets)
-
tags in a tablet that have common prefixes
-
smallest unit of placement by apps
### Data Model
-
schematized semi-relational tables
-
query language
-
general-purpose read/write transactions
Motivated by:
-
popularity of megastore, even though it's slow
-
popularity of Dremel (data analysis tool)
-
complaints about bigtable: even cons across data centers, no
"cross-row" transactions (Percolator built to address this)
## TrueTime

Node time:
-
*TimeMaster*
machines per data center (GPS and/or atomic clocks)
-
*timeslave daemon*
per machine
-
Marzullo's algorithm to reject outlying masters
-
slowly increasing uncertainty between time syncs:
ε
between
1 and 7ms
Operations:
-
*read-write transactions*
-
*read-only transactions*
-
*snapshot reads*
-
transactions are
`internally`
re-tried (client not involved)
-
implicit extension of leases by successfull writes
### Assigning timestamps to RW transactions
-
two-phase locking
-
timestamp can be any time after all locks acquired, before any locks
released
-
spanners assigns trans the timestamp paxos gives the trans commit
write
> If start of T2 is after commit of T1, then the commit timestamp of
> T2 must be greater than that of T1.
**Commit wait**
:
Coordinator leader ensure that clients cannot see any data committed
by Ti until TT.after(s
\_
i) is true
-
expected wait is at least 2
*
Ε
-
wait is usually overlapped w/ Paxos execution
[
The Talk
](
https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett
)
Thoughts:
-
ian: synergy between long-lived leaders and two-phase locking.
-
would love to know how active
`MoveDir`
is, and more about it's policies.
# CockroachDB: Life Without Atomic Clocks
## Quotes
-
"our challenge was in providing similar guarantees of external consistency
without atomic clocks."
-
"A system which provides an absolute ordering of causally related events,
regardless of observer, provides for a particularly strong guarantee of
external consistency known as “linearizability.”
-
"in a non-distributed database, serializability implies linearizability"
-
"CockroachDB’s external consistency guarantee is by default only
serializability, though with some features that can help bridge the gap in
practice."
"Before a node is allowed to report that a transaction has committed, it must
wait 7ms. Because all clocks in the system are within 7ms of each other,
waiting 7ms means that no subsequent transaction may commit at an earlier
timestamp, even if the earlier transaction was committed on a node with a
clock which was fast by the maximum 7ms"
Approach:
-
"The possibility of reordering commit timestamps for causally related
transactions is likely a marginal problem in practice."
-
"What
*could*
happen is that examining the database at a historical timestamp
might yield paradoxical situations where transaction A is not yet visible
while transaction B is, even though transaction A is known to have preceded
B, as they’re causally related. However, this can only happen if there’s no
overlap between the keys read or written during the transactions."
## Approach
**causality token**
- max timestamp encountered (anywhere) in a transaction.
-
used to ensure that causal chains maintained by serving as minimum commit
timestamp for that transaction.
**Problem:**
-
*Ti*
reading data from mult nodes might fail to read already-committed data
**Solution:**
-
choose xtion timestamp based on node wall time: [
*commit timestamp , commit timestamp + max uncertainty*
]
-
define upper bound by adding maxoffset
-
read quickly as long as no version seen s.t.:
-
version after commit timestamp
-
but before upper bound
-
if so, re-start transaction w/ version's timestamp
-
upper bound doesn't change.
-
lower bound gets
*higher*
, just above the problematic node write (window shrinks)
-
never restart based on same-key version read from same node in one
transaction.
So:
-
spanner always short-waits on writes for short interval (~7ms)
-
cockroachdb sometimes long waits on reads (~250ms)
-
causal reverse still possible (though probably not a problem in practice)
Strict serializability makes all of the guarantees of one-copy serializability
that we discussed above. In addition, it guarantees that if a transaction X
completed before transaction Y started (in real time) then X will be placed
before Y in the serial order
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment