Nectar: Automatic Management of Data and Computation in Datacenters. Presented by VARSHA PARTHASARATHY, Nov 29th 2010. A paper by Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, Li Zhuang.
Contents • Introduction • Nectar : A Definition • Advantages of Nectar • System Design • Client side Library • Datacenter-Wide Service • Implementation • Evaluation • Conclusion
Introduction • Managing data and computation is at the heart of data center computing. • Manual management leads to: • Data loss. • Wasteful consumption of storage. • Laborious bookkeeping. • Tedious sharing of data and computations. • Nectar is a system designed to address these problems.
Nectar: A Definition • Nectar unifies and automates the management of data and computation in the data center. • Two kinds of datasets: primary and derived (derived datasets are the results of computations). • A data-center-wide caching service manages computations and storage.
Advantages of Nectar • Efficient space utilization. • Reuse of shared sub-computations. • Ease of content management. • Incremental computations.
Client side Library • Cache key calculation: the fingerprint of the program and its input datasets serves as the cache key. • Cost estimator: chooses the most cost-efficient of the candidate programs. • Rewriter: handles three rewriting scenarios: • Common sub-expressions: the program is represented as a LINQ expression tree. • Incremental query plans: find an operator C such that P(D + D') = C(P(D), D'). • Incremental query plans for sliding windows: D_1 = d_1 + d_2 + ... + d_n; D_2 = d_2 + d_3 + ... + d_{n+1}; D_3 = d_3 + d_4 + ... + d_{n+2}.
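The incremental identity P(D + D') = C(P(D), D') and the sliding-window case can be illustrated with a word count. This is a minimal Python sketch and not Nectar's LINQ rewriter: the functions `word_count`, `combine`, and `window_count`, and the plain dict used as a cache, are hypothetical names chosen for illustration.

```python
from collections import Counter

def word_count(lines):
    """P: the full computation, run over a whole dataset."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def combine(cached, delta):
    """C: merge the cached result P(D) with the newly appended data D'."""
    return cached + word_count(delta)

def window_count(parts, cache):
    """Sliding window: reuse per-part results shared by overlapping windows.

    parts is a list of (name, lines) pairs; cache maps a part name to its
    word count (a stand-in for a fingerprint-keyed cache, by assumption).
    """
    total = Counter()
    for name, lines in parts:
        if name not in cache:
            cache[name] = word_count(lines)  # computed once, reused later
        total += cache[name]
    return total

D = ["a b a", "b c"]
D_new = ["c c d"]

cached = word_count(D)                # P(D), assumed to be in the cache
incremental = combine(cached, D_new)  # C(P(D), D')
full = word_count(D + D_new)          # P(D + D'), recomputed from scratch
assert incremental == full
```

For the sliding window, calling `window_count` on parts d_1..d_n and then on d_2..d_{n+1} only computes d_{n+1} the second time, since the overlapping parts are already cached.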
Datacenter-Wide Service • Cache service: two basic functions: • Serving cache lookup requests from the Nectar rewriter. • Managing derived datasets by deleting the cache entries of least value. • Garbage collector: identifies derived datasets that are unreachable from any cache entry and deletes them.
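The garbage collector's reachability rule can be sketched as a set difference. This is an illustrative assumption about the data model, not the deployed service: cache entries are modeled as dicts with a hypothetical `result` field naming the derived dataset they point to, and `collect_garbage` is a made-up helper name.

```python
def collect_garbage(stored_datasets, cache_entries):
    """Split stored derived datasets into (kept, deleted).

    A dataset is kept only if some cache entry still references it;
    everything else is unreachable and can be deleted.
    """
    stored = set(stored_datasets)
    reachable = {entry["result"] for entry in cache_entries}
    return stored & reachable, stored - reachable

entries = [{"result": "/derived/a"}, {"result": "/derived/b"}]
kept, deleted = collect_garbage(
    ["/derived/a", "/derived/b", "/derived/c"], entries
)
```

A real collector would also have to protect datasets that a running job has produced but not yet registered in the cache; that is left out of this sketch.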
Implementation • Caching computations, in four steps: • Identify all sub-expressions of the expression. • Probe the cache server for cache hits on those sub-expressions. • Use the cache hits to rewrite the expression into a set of equivalent expressions. • Choose the one that gives the maximum benefit according to a cost estimate. • Managing derived data: • Derived datasets take up a significant amount of storage space in a data center. • Nectar deletes those of least value: the ones that are unused or seldom used.
Cache and Program • A cache server supports the following functions: • Lookup(fp) • Inquire(fp) • AddEntry • A cache entry has the form <FP_PD, FP_P, Result, Statistics, FPList>. • FP_PD: primary key of the entry. • FP_P: fingerprint of the program. • Statistics: used by the rewriter to find an optimal execution plan; contains the cumulative execution time, the number of hits on this entry, and the last access time.
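A toy in-memory version of the cache server interface might look as follows. Several points are assumptions made for illustration: SHA-256 stands in for whatever fingerprinting scheme the system uses, `Lookup` is taken to search by primary key (FP_PD), `Inquire` is taken to return all entries sharing a program fingerprint (FP_P), and the field names of `CacheEntry` are hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass, field

def fingerprint(*parts: bytes) -> str:
    """Hash the given byte strings into one key (SHA-256 as a stand-in)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(len(p).to_bytes(8, "big"))  # length prefix keeps parts distinct
        h.update(p)
    return h.hexdigest()

@dataclass
class CacheEntry:
    fp_pd: str                   # primary key: program + input data
    fp_p: str                    # fingerprint of the program alone
    result: str                  # location of the cached result
    cumulative_exec_time: float  # Statistics: cost of producing the result
    hits: int = 0                # Statistics: number of hits on this entry
    last_access: float = field(default_factory=time.time)
    fp_list: list = field(default_factory=list)

class CacheServer:
    def __init__(self):
        self._by_key = {}

    def add_entry(self, entry: CacheEntry) -> None:
        self._by_key[entry.fp_pd] = entry

    def lookup(self, fp: str):
        """Return the entry whose primary key is fp, updating its statistics."""
        entry = self._by_key.get(fp)
        if entry is not None:
            entry.hits += 1
            entry.last_access = time.time()
        return entry

    def inquire(self, fp: str):
        """Return all entries whose program fingerprint is fp (assumed semantics)."""
        return [e for e in self._by_key.values() if e.fp_p == fp]
```

A client would compute `fingerprint(program_bytes, input_bytes)` as the key, call `lookup`, and fall back to running the program and calling `add_entry` on a miss.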
The Rewriting Algorithm • Step 1: At each sub-expression, probe the cache server to obtain all possible hits on it. Denote the set of hits by H. • Step 2: If there is a hit on the entire input D, use that hit and stop exploring its sub-expressions. Otherwise, compute the best hit for the current expression using smaller prefixes of D, and then choose the best among it and H. • Step 3: Choose a subset of the hits whose members operate on disjoint subsequences of D and together give the largest saving in cumulative execution time.
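Step 3 resembles weighted interval scheduling, and a sketch under that assumption is shown here. It models each hit as a half-open range `[start, end)` over the parts of D with an associated saving; the function name `best_disjoint_hits` and this tuple layout are hypothetical, and the slides do not say that the rewriter uses this exact procedure.

```python
from bisect import bisect_right

def best_disjoint_hits(hits):
    """Pick non-overlapping hits that maximize the total saving.

    hits: list of (start, end, saving) tuples, ranges are [start, end).
    Returns (total_saving, chosen_hits).
    """
    hits = sorted(hits, key=lambda h: h[1])  # order by range end
    ends = [h[1] for h in hits]
    # best[i] = (total saving, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, (start, end, saving) in enumerate(hits):
        # Number of earlier hits that end at or before this one starts.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + saving, best[j][1] + [(start, end, saving)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

total, chosen = best_disjoint_hits(
    [(0, 3, 10.0), (0, 2, 6.0), (2, 4, 7.0), (3, 4, 2.0)]
)
```

In this example the two smaller hits on `[0, 2)` and `[2, 4)` together save more than the single larger hit on `[0, 3)` plus the hit on `[3, 4)`, so they are the ones chosen.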
Cache Insertion Policy • Adding a cache entry incurs extra cost: the space overhead is wasted if the entry turns out not to be useful. • However, determining the potential usefulness of a cache entry is generally difficult. • The caching decision is made in two phases: • When the rewriter rewrites the expression, it decides where in the expression to insert AddEntry calls. • The final insertion decision is made from runtime information about the execution of the sub-expression. • The benefit of caching a result is proportional to its execution time and inversely proportional to its storage overhead.
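The second phase of the insertion decision can be sketched as a simple ratio test. The function name `should_cache`, the units, and the threshold value are all hypothetical; the slides only state that the benefit grows with execution time and shrinks with storage overhead.

```python
def should_cache(exec_time_s, output_bytes, threshold=1e-6):
    """Decide at runtime whether a computed result is worth caching.

    benefit = execution time / storage overhead; the result is cached
    when the benefit reaches the (assumed) threshold in seconds per byte.
    """
    if output_bytes == 0:
        # Nothing to store, so any non-trivial computation is worth keeping.
        return exec_time_s > 0
    return exec_time_s / output_bytes >= threshold

# An expensive job with a small output is a good candidate.
expensive_small = should_cache(600.0, 10 * 2**20)
# A cheap job with a huge output is not.
cheap_large = should_cache(0.5, 10 * 2**30)
```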
Example of Rewriting • var groups = source.GroupBy(KeySelect); • var reduced = groups.Select(Reduce);
Garbage Collection • When the available disk space falls below a threshold, the system automatically deletes the derived datasets considered least useful in the future. • The eviction policy is based on a cost-to-benefit ratio: • Ratio = (S × ΔT) / (N × M), where S is the size of the derived dataset, ΔT the time elapsed since it was last used, N the number of times it has been used, and M the cumulative machine time of the computation that created it. • Nectar scans the entire cache, computes the cost-to-benefit ratio for each cache entry, sorts the entries by that ratio, and deletes the n entries with the highest ratios so that the pre-defined threshold is reached.
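The eviction pass can be sketched as a sort on the cost-to-benefit ratio. The `Entry` fields follow the symbols S, ΔT, N, and M; the class, the function names, and the guard that treats a never-used entry as used once (to avoid dividing by zero) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    size: int            # S: bytes occupied by the derived dataset
    idle_time: float     # ΔT: time since the entry was last used
    uses: int            # N: number of times the entry was used
    machine_time: float  # M: cumulative machine time to produce it (> 0)

def cost_benefit_ratio(e: Entry) -> float:
    # Large, long-idle entries score high; popular, expensive ones score low.
    return (e.size * e.idle_time) / (max(e.uses, 1) * e.machine_time)

def evict(entries, bytes_to_free):
    """Return names of entries to delete, highest ratio first."""
    victims, freed = [], 0
    for e in sorted(entries, key=cost_benefit_ratio, reverse=True):
        if freed >= bytes_to_free:
            break
        victims.append(e.name)
        freed += e.size
    return victims

entries = [
    Entry("A", size=100, idle_time=10.0, uses=1, machine_time=100.0),
    Entry("B", size=50, idle_time=100.0, uses=1, machine_time=10.0),
    Entry("C", size=200, idle_time=1.0, uses=10, machine_time=1000.0),
]
victims = evict(entries, bytes_to_free=120)
```

Here B has the highest ratio and is evicted first, followed by A; C is large but recently used, popular, and expensive to recompute, so it survives.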
Evaluation • The system was evaluated both by analyzing actual logs from a number of production clusters and through an actual deployment on a 240-node cluster. • The results indicate substantial potential value in using Nectar to manage computation and data in a large data center.
Conclusion • Feedback has been quite positive. The most popular comment is that the system makes program debugging much more interactive and fun. • The authors report using Nectar on a daily basis and finding a big increase in their productivity.