
Nectar: Automatic Management of Data and Computation in Data centers

Presented by VARSHA PARTHASARATHY, Nov 29th, 2010. A paper by Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang.




Presentation Transcript


  1. Nectar: Automatic Management of Data and Computation in Data centers • Presented by VARSHA PARTHASARATHY, Nov 29th, 2010. • A paper by Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang.

  2. Contents • Introduction • Nectar : A Definition • Advantages of Nectar • System Design • Client side Library • Datacenter-Wide Service • Implementation • Evaluation • Conclusion

  3. Introduction • Managing data and computation is at the heart of data center computing. • Manual management leads to: • Data loss. • Wasteful consumption of storage. • Laborious bookkeeping. • Sharing becomes a tedious process. • Nectar is a system designed to address these problems.

  4. Nectar : A Definition • Nectar unifies and automates the management of data and computation in the data center. • It distinguishes two kinds of datasets: primary and derived. • A datacenter-wide caching service manages computations and storage.

  5. Advantages of Nectar • Efficient space utilization. • Reuse of shared sub-computations. • Ease of content management. • Incremental computations.

  6. System Design

  7. Client side Library • Cache Key Calculation: the fingerprint of the program and its input datasets serves as the cache key. • Cost Estimator: chooses the most cost-efficient program. • Rewriter: the three rewriting scenarios are: • Common sub-expressions: the program is represented as a LINQ expression tree. • Incremental query plans: find $C$ such that $P(D + D') = C(P(D), D')$. • Incremental query plans for sliding windows: $D_1 = d_1 + d_2 + \cdots + d_n$; $D_2 = d_2 + d_3 + \cdots + d_{n+1}$; $D_3 = d_3 + d_4 + \cdots + d_{n+2}$.
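
The incremental query plan on this slide can be illustrated with a small sketch. It assumes a hypothetical program P (a word count) and a combining function C that merges a cached result with the result on newly appended data only; neither function comes from Nectar itself.

```python
from collections import Counter

def P(records):
    """Hypothetical 'program': count word occurrences over a dataset."""
    return Counter(word for line in records for word in line.split())

def C(cached_result, new_records):
    """Combine a cached P(D) with the result on the appended data D' only."""
    return cached_result + P(new_records)

D = ["a b a", "b c"]       # original input
D_new = ["a c c"]          # data appended later (D')

cached = P(D)                    # computed once, then kept in the cache
incremental = C(cached, D_new)   # only D' is scanned this time
full = P(D + D_new)              # recomputation from scratch, for comparison
```

Here `incremental` and `full` are equal, which is the property P(D + D') = C(P(D), D') requires; the saving comes from not rescanning D.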

  8. Datacenter-Wide Service • Cache Service: its two basic functions are: • Serving the cache lookup requests issued by the Nectar rewriter. • Managing derived datasets by deleting the cache entries of least value. • Garbage Collector: identifies datasets unreachable from any cache entry and deletes them.
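
The garbage collector's reachability check can be sketched roughly as follows. The dictionary layout and the `datasets` field are hypothetical, chosen only to illustrate the idea of keeping whatever some cache entry still references.

```python
def collect_garbage(cache_entries, stored_datasets):
    """Return the stored datasets that no cache entry references."""
    reachable = set()
    for entry in cache_entries.values():
        reachable.update(entry["datasets"])   # hypothetical field name
    return set(stored_datasets) - reachable

cache_entries = {
    "fp1": {"datasets": ["out/a", "out/b"]},
    "fp2": {"datasets": ["out/c"]},
}
stored = ["out/a", "out/b", "out/c", "out/orphan"]

garbage = collect_garbage(cache_entries, stored)   # candidates for deletion
```

In this example only `out/orphan` is unreachable, so it is the only dataset the sketch would delete.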

  9. Implementation • Caching Computations: • Identify all sub-expressions of the expression. • Probe the cache server for all cache hits on those sub-expressions. • Use the cache hits to rewrite the expression into a set of equivalent expressions. • Choose the one that gives the maximum benefit according to a cost estimate. • Managing Derived Data: • Derived datasets take up a significant amount of storage space in a data center. • Nectar deletes those of least value, i.e. the ones that are unused or seldom used.
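
The first two caching steps above (enumerating sub-expressions and probing the cache) might look like the sketch below. It makes several simplifying assumptions: expressions are linear chains of `(operator, argument, child)` tuples, the fingerprint is a SHA-256 hash of the expression text rather than of the program plus its input data, and the cache is a plain dictionary.

```python
import hashlib

def fingerprint(expr):
    """Stand-in fingerprint: SHA-256 of the expression's textual form."""
    return hashlib.sha256(repr(expr).encode()).hexdigest()

def subexpressions(expr):
    """Yield expr and every nested sub-expression, outermost first."""
    yield expr
    _op, _arg, child = expr
    if child is not None:
        yield from subexpressions(child)

def probe(expr, cache):
    """Return (sub-expression, cached result) pairs for every cache hit."""
    return [(sub, cache[fingerprint(sub)])
            for sub in subexpressions(expr)
            if fingerprint(sub) in cache]

# source.GroupBy(KeySelect).Select(Reduce), written as nested tuples
source = ("Input", "logs", None)
groups = ("GroupBy", "KeySelect", source)
reduced = ("Select", "Reduce", groups)

cache = {fingerprint(groups): "cached/groups"}   # an earlier run cached GroupBy
hits = probe(reduced, cache)
```

With this cache, the only hit is the `GroupBy` sub-expression, so a rewriter could reuse its stored result and run only the final `Select`.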

  10. Cache and Program • A cache server supports the following functions: • Lookup(fp) • Inquire(fp) • AddEntry • A cache entry is of the form <FP_PD, FP_P, Result, Statistics, FPList>, where: • FP_PD: primary key of the entry. • FP_P: fingerprint of the program. • Statistics: used by the rewriter to find an optimal execution plan; contains the cumulative execution time, the number of hits on this entry, and the last access time.
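
A minimal in-memory sketch of such a cache server is given below. The slide does not spell out what each call returns, so the semantics here are assumptions: `lookup` is taken to fetch the entry whose primary key matches, and `inquire` to return every entry produced by a given program fingerprint. The `result` field is assumed to hold the location of the stored output, and `FPList` is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    fp_pd: str     # primary key of the entry
    fp_p: str      # fingerprint of the program
    result: str    # assumed: location of the cached result
    stats: dict = field(default_factory=lambda: {"hits": 0})

class CacheServer:
    def __init__(self):
        self._by_key = {}

    def add_entry(self, entry):
        """Insert (or overwrite) an entry under its primary key."""
        self._by_key[entry.fp_pd] = entry

    def lookup(self, fp):
        """Return the entry whose primary key is fp, or None; count the hit."""
        entry = self._by_key.get(fp)
        if entry is not None:
            entry.stats["hits"] += 1
        return entry

    def inquire(self, fp):
        """Assumed semantics: all entries whose program fingerprint is fp."""
        return [e for e in self._by_key.values() if e.fp_p == fp]

server = CacheServer()
server.add_entry(CacheEntry("pd1", "p1", "store/r1"))
server.add_entry(CacheEntry("pd2", "p1", "store/r2"))
```

Keeping the hit count on the entry mirrors the Statistics field, which the rewriter and the eviction policy both rely on.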

  11. The Rewriting Algorithm • Step 1: At each sub-expression, probe the cache server to obtain all the possible hits on it. Denote the set of hits by H. • Step 2: If there is a hit on the entire input D, use that hit and stop exploring its sub-expressions; otherwise, compute the best hit for the current expression using smaller prefixes, and then choose the best among it and H. • Step 3: Choose a subset of the hits such that they operate on disjoint subsequences of D and give the most savings in terms of cumulative execution time.
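
Step 3 can be read as a weighted interval selection problem. The sketch below is one way to solve it under that reading, not necessarily how Nectar does it: each hit is modeled as a hypothetical `(start, end, saving)` tuple covering the half-open range `[start, end)` of D, and dynamic programming picks the non-overlapping hits with the largest total saving.

```python
from bisect import bisect_right

def best_disjoint_hits(hits):
    """Pick non-overlapping hits that maximize total saved execution time."""
    hits = sorted(hits, key=lambda h: h[1])      # order by end position
    ends = [h[1] for h in hits]
    best = [(0, [])]                             # best[i]: optimum over first i hits
    for i, (start, _end, saving) in enumerate(hits):
        # j = number of earlier hits that end at or before this hit starts
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + saving, best[j][1] + [hits[i]])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

total, chosen = best_disjoint_hits([
    (0, 3, 5.0),   # covers d0..d2, saves 5 time units
    (2, 5, 6.0),   # overlaps the first hit
    (3, 6, 4.0),
    (5, 8, 2.0),
])
```

The second hit has the largest individual saving, but combining the first and third covers disjoint ranges and saves more overall, which is why a greedy choice is not enough here.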

  12. Cache Insertion Policy • Adding a cache entry incurs additional cost and space overhead, which is wasted if the entry is not useful. • However, determining the potential usefulness of a cache entry is generally difficult. • The caching decision is made in two phases: • When the rewriter rewrites the expression, it decides where in the expression to insert AddEntry calls. • The final insertion decision is made at run time, based on information from the execution of the sub-expression. • The estimated benefit of caching is proportional to the execution time and inversely proportional to the storage overhead.
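
The second phase of the decision could be approximated as below. The exact formula and the threshold are assumptions; the slide only states that the benefit grows with execution time and shrinks with storage overhead.

```python
def caching_benefit(exec_seconds, result_bytes):
    """Benefit grows with execution time and shrinks with storage overhead."""
    return exec_seconds / max(result_bytes, 1)   # guard against empty results

def should_add_entry(exec_seconds, result_bytes, threshold):
    """Run-time check deciding whether a planned AddEntry call goes through."""
    return caching_benefit(exec_seconds, result_bytes) >= threshold

THRESHOLD = 1e-6   # hypothetical: seconds saved per byte stored

# A 10-minute job with a 10 MB result is worth caching under this threshold;
# a 2-second job with a 5 GB result is not.
long_small = should_add_entry(600, 10_000_000, THRESHOLD)
short_huge = should_add_entry(2, 5_000_000_000, THRESHOLD)
```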

  13. Example of Rewriting • var groups = source.GroupBy(KeySelect); • var reduced = groups.Select(Reduce);

  14. Creation of Derived Dataset

  15. Garbage Collection • When the available disk space falls below a threshold, the system automatically deletes the derived datasets that are considered least useful in the future. • The eviction policy is based on the cost-to-benefit ratio: • $\mathrm{Ratio} = (S \times \Delta T) / (N \times M)$ • Nectar scans the entire cache, computes the cost-to-benefit ratio for each cache entry, then sorts the cache and deletes the top 'n' entries such that the pre-defined threshold is reached.
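
The eviction pass can be sketched as follows. The slide does not define the symbols, so this assumes S is the size of the derived dataset, ΔT the time since it was last accessed, N the number of times it has been used, and M the cumulative execution time of the computation that produced it. The dictionary layout and the guard against N = 0 are additions made for the example.

```python
def cost_benefit_ratio(size, idle_time, uses, machine_time):
    """Ratio = (S * dT) / (N * M); a higher ratio means a better eviction candidate."""
    return (size * idle_time) / (max(uses, 1) * machine_time)

def evict(entries, free_space, threshold):
    """Delete highest-ratio entries until free space reaches the threshold."""
    ranked = sorted(
        entries,
        key=lambda e: cost_benefit_ratio(e["S"], e["dT"], e["N"], e["M"]),
        reverse=True,
    )
    evicted = []
    for e in ranked:
        if free_space >= threshold:
            break
        evicted.append(e["name"])
        free_space += e["S"]          # deleting the dataset frees its space
    return evicted, free_space

entries = [
    {"name": "a", "S": 100, "dT": 10, "N": 5, "M": 50},    # ratio 4
    {"name": "b", "S": 400, "dT": 30, "N": 1, "M": 20},    # ratio 600
    {"name": "c", "S": 200, "dT": 5, "N": 10, "M": 100},   # ratio 1
]
evicted, free_after = evict(entries, free_space=50, threshold=400)
```

Entry `b` is large, stale, rarely used, and cheap to recompute, so it goes first; removing it alone already lifts free space above the threshold.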

  16. Evaluation • The system was evaluated using both an analysis of actual logs from a number of production clusters and an actual deployment on a 240-node cluster. • The results suggest that there is substantial potential value in using Nectar to manage the computation and data in a large data center.

  17. Conclusion • Feedback has been quite positive. The most popular comment is that the system makes program debugging much more interactive and fun. • The authors report using Nectar on a daily basis and finding a big increase in their productivity.

  18. THANK YOU
