Edelweiss: Automatic Storage Reclamation for Distributed Programming
Neil Conway, Peter Alvaro, Emily Andrews, Joseph M. Hellerstein
University of California, Berkeley
Mutable shared state
• Frequent source of bugs
• Hard to scale
Event Logging
Accumulate & exchange sets of immutable events
• No mutation/deletion
• To delete: add new event
  • “Event X should be ignored”
• Current state: query over event log
Example: Key-Value Store

Mutable State
    tbl = Hash.new
    Insert(k, v): tbl[k] = v          (update-in-place)
    Delete(k):    tbl.delete(k)       (deletion)
    View():       tbl

Event Logging
    i_log = Set.new
    d_log = Set.new
    Insert(k, v): i_log << [k,v]      (set union)
    Delete(k):    d_log << k          (set union)
    View():       i_log.notin(d_log, :k => :k)   (compute “live” keys)
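The two designs on this slide can be sketched as plain Ruby. This is only an illustration: the class names `MutableKVS` and `LoggedKVS` are made up here, and `view` uses an ordinary `reject` as a stand-in for the `notin` operator shown on the slide.

```ruby
require 'set'

# Mutable-state version: inserts and deletes overwrite the table in place.
class MutableKVS
  def initialize
    @tbl = {}
  end

  def insert(k, v)
    @tbl[k] = v
  end

  def delete(k)
    @tbl.delete(k)
  end

  def view
    @tbl
  end
end

# Event-logging version: both logs only grow; the view is a query over them.
class LoggedKVS
  attr_reader :i_log, :d_log

  def initialize
    @i_log = Set.new   # insertion events [k, v]
    @d_log = Set.new   # deletion events k
  end

  def insert(k, v)
    @i_log << [k, v]
  end

  def delete(k)
    @d_log << k
  end

  # Plain-Ruby stand-in for i_log.notin(d_log, :k => :k):
  # keep the insertions whose key has no match in d_log.
  def view
    @i_log.reject { |(k, _v)| @d_log.include?(k) }.to_set
  end
end
```

Note that in the logged version a deletion hides every insertion with the same key, including later ones, because the deletion event never leaves `d_log`.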
Benefits of Event Logging
• Concurrency
• Replication
• Undo/redo
• Point-in-time query, audit trails
(Sometimes: performance!)
Example Applications
• Multi-version concurrency control (MVCC)
• Write-ahead logging (WAL)
• Stream processing
• Log-structured file systems
Also: CRDTs, tombstones, purely functional data structures, accounting ledgers.
Observation: Logs consume unbounded storage
Solution (and challenge): Discard log entries that are “no longer useful” (garbage collection)
Traditional Approach
“No longer useful” defined by application semantics
• No framework support
• Every system requires custom GC logic
• Reinvented many times
  • >25 papers propose ~same scheme!
Engineering Challenges
• Difficult to implement correctly
  • Too aggressive: destroy live data
  • Too conservative: storage leak
• Ongoing maintenance burden
  • GC scheme and application code must be updated together
Our Approach
• New language: Edelweiss
  • Based on Datalog
  • No constructs for deletion or mutation!
• Automatically generate safe, application-specific distributed GC protocols
• Present several in-depth case studies
  • Reliable unicast/broadcast, key-value store, causal consistency, atomic registers
Base Data (“Event Logs”) → Query → Derived Data (“Live View”)
A log entry is useful iff it might contribute to the view.
The queries define how log entries contribute to the view.
Goal: Find log entries that will never contribute to the view in the future.
Semantics of Base Data
• Accumulate and broadcast to other nodes
• Datalog: monotonic
• Set union: grows over time
• CALM Theorem [CIDR’11]: event log guaranteed to be eventually consistent
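A small sketch of why accumulating events by set union is insensitive to delivery order. The helper name `replay` and the sample events are invented for illustration; two replicas that see the same events in different orders, even with a duplicate, end up with the same log.

```ruby
require 'set'

# Hypothetical helper: build a replica's log by unioning in events one at a time.
def replay(events)
  events.each_with_object(Set.new) { |e, log| log << e }
end

events = [[:a, 1], [:b, 2], [:c, 3]]

# Replica 1 receives the events in order.
replica1 = replay(events)

# Replica 2 receives them reversed, plus a duplicate delivery of the first event.
replica2 = replay(events.reverse + [events.first])

# Set union is commutative and idempotent, so both logs hold the same events.
same_log = (replica1 == replica2)
```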
Semantics of Derived Data
Grows and shrinks over time
• e.g., KVS keys added and removed
Hence, not monotonic
Common Pattern
Live View = set difference between growing sets
Semantics of Set Difference
X = Y - Z
• Z grows: X shrinks
• If t appears in Z, t will never again appear in X
• “Anti-monotone with respect to Z”

    i_log = Set.new
    d_log = Set.new
    Insert(k, v): i_log << [k,v]
    Delete(k):    d_log << k
    View():       i_log.notin(d_log, :k => :k)

Can reclaim from i_log upon match in d_log
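The reclamation rule on this slide can be illustrated in plain Ruby. The helper names `live_view` and `reclaim!` are hypothetical, and this is a single-node sketch rather than the protocol Edelweiss generates. The point is that dropping matched `i_log` entries leaves the view unchanged, now and after any further growth of `d_log`.

```ruby
require 'set'

# Stand-in for i_log.notin(d_log, :k => :k).
def live_view(i_log, d_log)
  i_log.reject { |(k, _v)| d_log.include?(k) }.to_set
end

# Drop every i_log entry whose key already has a match in d_log.
# Because d_log only grows, such an entry can never re-enter the view.
def reclaim!(i_log, d_log)
  i_log.delete_if { |(k, _v)| d_log.include?(k) }
end

i_log = Set[[:a, 1], [:b, 2]]
d_log = Set[:a]

before = live_view(i_log, d_log)
reclaim!(i_log, d_log)
after = live_view(i_log, d_log)
# before and after are equal; i_log now holds only [:b, 2]
```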
Other Analysis Techniques
• Reclaim from negative notin input
  • Often called “tombstones”
  • E.g., how to reclaim from d_log in the KVS
• Reclaim from join input tables
• Disseminate GC metadata automatically
• Exploit user knowledge for better GC
  • Punctuations [Tucker & Maier ’03]
Whole Program Analysis
• For each query q, find condition when input t will never contribute to q’s output
  • “Reclamation condition” (RC)
• For each tuple t, find the conjunction of the RCs for t over all queries
  • When all consumers no longer need t: safe to reclaim
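A rough sketch of the conjunction step, with each reclamation condition modeled as a Ruby lambda. The helper names and the second consumer (a replication query that waits for an acknowledgment, tracked in `acked`) are assumptions made up for this example; they are not taken from the talk.

```ruby
require 'set'

# A tuple is reclaimable only if every consumer's RC holds for it (conjunction).
def reclaimable?(tuple, rcs)
  rcs.all? { |rc| rc.call(tuple) }
end

# Remove all reclaimable tuples from the log and return the log.
def reclaim!(log, rcs)
  log.delete_if { |t| reclaimable?(t, rcs) }
end

i_log = Set[[:a, 1], [:b, 2], [:c, 3]]
d_log = Set[:a, :b]
acked = Set[:a]   # hypothetical: keys a remote replica has acknowledged

rcs = [
  ->(t) { d_log.include?(t[0]) },   # RC for the live-view query
  ->(t) { acked.include?(t[0]) }    # RC for the hypothetical replication query
]

reclaim!(i_log, rcs)
# i_log keeps [:b, 2] (deleted but not yet acknowledged) and [:c, 3] (still live)
```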
Edelweiss Input Program (“positive” program: no deletion or state mutation)
→ Source-to-Source Rewriter (compute RCs, add deletion rules)
→ Datalog Output Program (input program + deletion rules)
→ Datalog Evaluator
Comparison of Program Size
Only 19 rules!
Takeaways
• No storage management code!
  • Similar to malloc/free vs. GC
• Programs are concise and declarative
  • Developer: just compute current view
  • Log entries removed automatically
• Reclamation logic and application code always in sync
Conclusions
• Event logging: powerful design pattern
  • Problem: need for hand-written distributed storage reclamation code
• Datalog: natural fit for event logging
  • Storage reclamation as a compiler rewrite?
Results:
• Automatic, safe GC synthesis!
• High-level, declarative programs
  • No storage management code
  • Focus on solving domain problem
Future Work: Checkpoints
• Closely related to simple event logging
• Summarize many log entries with a single “checkpoint” record
  • View = last checkpoint + Query(ΔLogs)
• General goal: reclaim space by structural transformation, not just discarding data
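One possible reading of the checkpoint idea, sketched for the key-value store. This is not the authors' design: it assumes each key is inserted at most once and that a deletion never arrives before its insertion, and it recomputes the view as the query over the checkpoint plus the newer log entries. The helper name `query` is invented here.

```ruby
require 'set'

# Same query as the KVS view: insertions whose key has no matching deletion.
def query(inserts, deletes)
  inserts.reject { |(k, _v)| deletes.include?(k) }.to_set
end

# Full history so far.
i_log = Set[[:a, 1], [:b, 2], [:c, 3]]
d_log = Set[:a]

# Summarize the history as a checkpoint of the current view.
# Under the stated assumptions the old logs could now be discarded.
checkpoint = query(i_log, d_log)

# Events that arrive after the checkpoint go into delta logs.
delta_i = Set[[:d, 4]]
delta_d = Set[:b]

# View computed from the checkpoint plus the deltas only.
view_from_checkpoint = query(checkpoint | delta_i, delta_d)

# View computed from the complete logs, for comparison.
view_from_full_logs = query(i_log | delta_i, d_log | delta_d)
```

Under these assumptions both computations give the same view, while the checkpointed version only needs to keep the live entries and the events logged since the checkpoint.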
Future Work: Theory
• Current analysis is somewhat ad hoc
• If program does not reclaim storage, two possibilities:
  • Program is “not reclaimable” in principle (possible program bug!)
  • Our analysis is not complete (possible analysis bug!)
How to characterize the class of “not reclaimable” programs?
Reclaiming KVS Deletions
• Good question
• X.notin(Y): how to reclaim from Y?
  • Y is a dense ordered set; compress it.
  • Prove that each Y tuple matches exactly one X tuple

    i_log = Set.new
    d_log = Set.new
    Insert(k, v): i_log << [k,v]
    Delete(k):    d_log << k
    View():       i_log.notin(d_log, :k => :k)

k is a key of i_log
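The “dense ordered set; compress it” option can be illustrated with a small range-compressed set. This is only a sketch under the assumption that the deleted keys are integers (for example, sequence numbers); `RangeSet` is a hypothetical class written for this example, not part of Edelweiss.

```ruby
# Hypothetical RangeSet: stores integers as sorted, disjoint, non-adjacent
# [lo, hi] ranges, so a dense run of values costs one entry instead of many.
class RangeSet
  attr_reader :ranges

  def initialize
    @ranges = []
  end

  # Add n, then merge any ranges that now overlap or touch.
  def <<(n)
    return self if include?(n)
    @ranges << [n, n]
    @ranges.sort!
    merged = [@ranges.first]
    @ranges.drop(1).each do |lo, hi|
      if lo <= merged.last[1] + 1
        merged.last[1] = [merged.last[1], hi].max
      else
        merged << [lo, hi]
      end
    end
    @ranges = merged
    self
  end

  def include?(n)
    @ranges.any? { |lo, hi| lo <= n && n <= hi }
  end
end

d_log = RangeSet.new
[3, 1, 2, 5, 4, 9].each { |k| d_log << k }
# Six deletions are represented by two ranges: 1..5 and 9..9.
```

Membership tests still work as with an ordinary set, so a `notin`-style view can keep consulting the compressed deletion log.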