Author: Peter J. Keleher
# Tango: Distributed Data Structures over a Shared Log

## Why?
Existing systems build abstractions for computing over massive data sets:

- Hadoop
- Spark

But applications also need "application metadata" with persistence and high availability:

- maps
- counters
- queues
- graphs
- job assignments
- network topologies, ...
## How?
- A client modifies an object by appending an update to the shared log.
- It reads the object by syncing its local view with the log.
- Elasticity: linearizable read throughput scales by adding new views, without slowing write throughput ("until saturation").
- Transaction atomicity and isolation come from the log.
- Streams filter the portion of the log each client sees.
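The append-then-sync pattern above can be sketched as follows. This is a toy illustration, not Tango's API: `SharedLog` and `TangoCounter` are hypothetical names, and the real system persists the log and runs over CORFU.

```python
class SharedLog:
    """Toy stand-in for the shared log: an append-only list of entries."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1   # log position (offset)

class TangoCounter:
    """A counter whose state is a local view materialized from the log."""
    def __init__(self, log):
        self.log = log
        self.value = 0
        self.synced_to = 0   # next log offset to apply

    def increment(self, delta=1):
        # Writes go to the log, never directly to local state.
        self.log.append(("incr", delta))

    def sync(self):
        # Play forward every update appended since the last sync.
        while self.synced_to < len(self.log.entries):
            op, delta = self.log.entries[self.synced_to]
            if op == "incr":
                self.value += delta
            self.synced_to += 1

    def get(self):
        self.sync()   # sync before read => reads are linearizable
        return self.value

log = SharedLog()
a, b = TangoCounter(log), TangoCounter(log)   # two client views of one object
a.increment(2)
b.increment(3)
print(a.get())   # both views converge to 5
print(b.get())
```

Adding a third view (`TangoCounter(log)`) adds read capacity without touching the write path, which is the elasticity point above.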
## Transactions
- Optimistic concurrency control.
- Writes are entered in the log as speculative.
- The commit record contains the read set, with versions.
- The transaction succeeds if every object it read is still current at the commit record's position.
- Each reader evaluates the commit record deterministically, so all clients reach the same commit/abort decision.
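A minimal sketch of the deterministic commit check (the names `CommitRecord`, `evaluate`, and `versions` are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class CommitRecord:
    read_set: dict    # object -> version observed when the transaction read it
    write_set: dict   # object -> value written speculatively

def evaluate(commit, versions):
    """Deterministic OCC check: commit iff every read version is still current.

    `versions` maps each object to its version at the commit record's log
    position; every client replaying the log computes the same answer,
    so no coordination is needed to decide commit vs. abort."""
    return all(versions.get(obj) == ver for obj, ver in commit.read_set.items())

versions = {"x": 3, "y": 7}
ok = CommitRecord(read_set={"x": 3}, write_set={"x": "new"})
stale = CommitRecord(read_set={"y": 6}, write_set={"y": "new"})
print(evaluate(ok, versions))     # True  -> commit
print(evaluate(stale, versions))  # False -> abort: y changed since it was read
```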
Read-only transactions:

- Nothing is inserted in the log.
- The client locally tracks the log offset of its first read (start of the transaction) and of its last read (end of the transaction).
- Commit/abort is decided as in ordinary read/write transactions.

Other notes:

- Write-only transactions always commit.
- Can use fine-grained, per-application versions.
- Opaque key parameters in helper functions.
- A crashed client's transaction is aborted by other clients appending a crash record.
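The local check for a read-only transaction can be sketched as below. This is a hedged simplification: the helper name and the flat `log` of per-object writes are assumptions, and real versioning in Tango is per-object within the log.

```python
def read_only_commits(read_set, start_offset, end_offset, log):
    """A read-only transaction commits iff no object it read was
    overwritten between its first read (start_offset) and its last
    read (end_offset). Nothing is appended to the log; the decision
    is made entirely locally from the tracked offsets."""
    for pos in range(start_offset, end_offset):
        if log[pos]["obj"] in read_set:
            return False   # an object in the read set changed mid-transaction
    return True

log = [
    {"obj": "x", "val": 1},   # offset 0
    {"obj": "y", "val": 2},   # offset 1
    {"obj": "z", "val": 3},   # offset 2
]
# Read x at offset 1, last read at offset 3: y and z changed, but not x.
print(read_only_commits({"x"}, 1, 3, log))   # True
# Read y at offset 0, last read at offset 2: y was overwritten at offset 1.
print(read_only_commits({"y"}, 0, 2, log))   # False
```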
## Streams
- One stream per object.
- Transaction commits are multi-appended to all relevant streams.
- Remote-write transactions require decision records:
  - "A client executing a transaction must insert a decision record for a transaction if there's some other client in the system that hosts an object in its write set but not all the objects in its read set."
  - Generating clients cannot do a remote read inside a transaction (it would require RPCs...).
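The quoted decision-record rule can be sketched as a predicate. The `hosts` map (which objects each client materializes) and the function name are illustrative assumptions:

```python
def needs_decision_record(read_set, write_set, hosts, me):
    """Per the quoted rule: a decision record is needed if some other
    client hosts an object in the write set but does not host every
    object in the read set (so it sees the speculative write on its
    stream but cannot evaluate the commit record by itself)."""
    for client, hosted in hosts.items():
        if client == me:
            continue
        if write_set & hosted and not read_set <= hosted:
            return True
    return False

hosts = {"c1": {"x", "y"}, "c2": {"y"}}
# c1 writes y (hosted by c2), but c2 does not host x, which c1 read.
print(needs_decision_record({"x"}, {"y"}, hosts, me="c1"))   # True
# Here c2 hosts everything in both the read set and the write set.
print(needs_decision_record({"y"}, {"y"}, hosts, me="c1"))   # False
```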
## Comments/Questions
- (claude) why metadata?