Nectar: Automatic Management of Data and Computation in Data centers

Nectar: Automatic Management of Data and Computation in Data Centers. Presented by Varsha Parthasarathy, Nov 29th 2010. A paper by Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, Li Zhuang.

Contents • Introduction • Nectar : A Definition • Advantages of Nectar • System Design • Client side Library • Datacenter-Wide Service • Implementation • Evaluation • Conclusion

Introduction • Managing data and computation is at the heart of data center computing. • Manual management leads to: • Data loss. • Wasteful consumption of storage. • Laborious bookkeeping. • Tedious sharing of data and results. • Nectar is a system designed to address these problems.

Nectar: A Definition • Nectar unifies and automates the management of data and computation in the data center. • Two kinds of datasets: primary and derived. • A data-center-wide caching service manages computations and storage.

Advantages of Nectar • Efficient space utilization. • Reuse of shared sub-computations. • Ease of content management. • Incremental computations.

System Design

Client-Side Library • Cache Key Calculation: the fingerprint of the program and its input datasets is used as the cache key. • Cost Estimator: chooses the optimal, most cost-efficient program. • Rewriter: handles three rewriting scenarios: • Common sub-expressions: the program is represented as a LINQ expression tree. • Incremental query plans: find C such that P(D + D') = C(P(D), D'). • Incremental query plans for sliding windows: D1 = d1 + d2 + ... + dn; D2 = d2 + d3 + ... + d(n+1); D3 = d3 + d4 + ... + d(n+2).
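The sliding-window case can be illustrated with a small sketch. It assumes a program P that decomposes per partition (here a word count) and a combiner C that merges partial results, so this is the special case P(D + D') = C(P(D), P(D')). The names `P`, `C`, `cache`, and `window_result` are hypothetical and chosen for illustration; the dictionary merely stands in for the Nectar cache.

```python
from collections import Counter

def P(partition):
    """Hypothetical per-partition program: count words in a list of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

def C(partials):
    """Hypothetical combiner: merge per-partition results into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

cache = {}      # partition id -> cached P(d_i); stand-in for the cache service
computed = []   # records which partitions were actually executed

def window_result(partitions, ids):
    """Compute the result for one window, reusing cached partition results."""
    partials = []
    for i in ids:
        if i not in cache:
            cache[i] = P(partitions[i])
            computed.append(i)
        partials.append(cache[i])
    return C(partials)

partitions = {1: ["a b"], 2: ["b c"], 3: ["c c"], 4: ["a"]}
d1 = window_result(partitions, [1, 2, 3])   # executes P on d1, d2, d3
d2 = window_result(partitions, [2, 3, 4])   # executes P only on d4
```

When the window slides from D1 to D2, only the new partition is executed; the results for d2 and d3 come from the cache.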

Datacenter-Wide Service • Cache Service: two basic functionalities are: • Serving the cache lookup requests made by the Nectar rewriter. • Managing derived datasets by deleting the cache entries of least value. • Garbage Collector: identifies derived datasets unreachable from any cache entry and deletes them.
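The garbage collector's reachability rule can be sketched as a simple set difference. This is a minimal sketch under an assumed layout: each cache entry is a dictionary with a hypothetical `result` field naming the derived dataset it points to.

```python
def collect_garbage(derived_datasets, cache_entries):
    """Split derived datasets into (kept, deleted) by reachability.

    A dataset is kept only if some cache entry still refers to it
    through its (hypothetical) "result" field.
    """
    reachable = {entry["result"] for entry in cache_entries}
    kept = sorted(d for d in derived_datasets if d in reachable)
    deleted = sorted(d for d in derived_datasets if d not in reachable)
    return kept, deleted

datasets = ["derived/a", "derived/b", "derived/c"]
entries = [{"result": "derived/a"}, {"result": "derived/c"}]
kept, deleted = collect_garbage(datasets, entries)
```

Here `derived/b` is no longer referenced by any cache entry, so it is the only dataset selected for deletion.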

Implementation • Caching Computations: • Identify all sub-expressions of the expression. • Probe the cache server for all cache hits on the sub-expressions. • Use the cache hits to rewrite the expression into a set of equivalent expressions. • Choose the one that gives the maximum benefit according to a cost estimate. • Managing Derived Data: • Derived datasets take up a significant amount of storage space in a data center. • Nectar deletes those of least value, i.e. the ones that are unused or seldom used.

Cache and Program • A cache server supports the following functions: • Lookup(fp) • Inquire(fp) • AddEntry • A cache entry is of the form: <FP_PD, FP_P, Result, Statistics, FPList>. FP_PD: primary key of the entry. FP_P: fingerprint of the program. Statistics: used by the rewriter to find an optimal execution plan; contains the cumulative execution time, the number of hits on this entry, and the last access time.
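A minimal in-memory sketch of such a cache server follows. Several details are assumptions rather than facts from the slides: SHA-256 stands in for the fingerprint function, `Lookup(fp)` is read as a search by the primary key FP_PD, `Inquire(fp)` is read as returning every entry whose program fingerprint FP_P equals `fp`, and the FPList field is omitted. All class and field names are hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass, field

def fingerprint(*parts):
    """Stand-in fingerprint: SHA-256 over the parts (an assumption)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()

@dataclass
class CacheEntry:
    fp_pd: str                          # primary key: program + input data
    fp_p: str                           # fingerprint of the program only
    result: str                         # where the derived dataset lives
    cumulative_exec_time: float = 0.0   # statistics used by the rewriter
    hits: int = 0
    last_access: float = field(default_factory=time.time)

class CacheServer:
    def __init__(self):
        self._entries = {}  # fp_pd -> CacheEntry

    def add_entry(self, entry):
        self._entries[entry.fp_pd] = entry

    def lookup(self, fp):
        """Return the entry whose primary key is fp, updating its statistics."""
        entry = self._entries.get(fp)
        if entry is not None:
            entry.hits += 1
            entry.last_access = time.time()
        return entry

    def inquire(self, fp):
        """Return all entries produced by the program with fingerprint fp."""
        return [e for e in self._entries.values() if e.fp_p == fp]

server = CacheServer()
program = "source.GroupBy(KeySelect)"
fp_p = fingerprint(program)
fp_pd = fingerprint(program, "input-dataset-v1")
server.add_entry(CacheEntry(fp_pd, fp_p, result="derived/groups-v1",
                            cumulative_exec_time=120.0))
```

Running the same program on a different input yields a different FP_PD, so `lookup` misses, while `inquire(fp_p)` still finds the earlier entry for that program.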

The Rewriting Algorithm • Step 1: At each sub-expression, probe the cache server to obtain all possible hits on it. Denote the set of hits by H. • Step 2: If there is a hit on the entire input D, use that hit and stop exploring its sub-expressions; otherwise compute the best hit for the current expression using smaller prefixes, and then choose the best among it and H. • Step 3: From H, choose a subset of hits that operate on disjoint subsequences of D and give the most saving in terms of cumulative execution time.
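The selection in the last step resembles weighted interval scheduling, so it can be sketched with the standard dynamic program for that problem. This is a swapped-in textbook technique, not necessarily the procedure Nectar uses. Each hit is modelled as a hypothetical `(start, end, saving)` triple covering input partitions `[start, end)`.

```python
from bisect import bisect_right

def best_disjoint_hits(hits):
    """Pick non-overlapping hits with the largest total saving.

    hits: list of (start, end, saving) covering partitions [start, end).
    Returns (total_saving, chosen_hits).
    """
    hits = sorted(hits, key=lambda h: h[1])   # order by end position
    ends = [h[1] for h in hits]
    best = [(0, [])]                          # best[i]: optimum over first i hits
    for i, (start, end, saving) in enumerate(hits):
        # j = number of earlier hits that end at or before this hit starts
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + saving, best[j][1] + [(start, end, saving)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

hits = [(0, 3, 5), (2, 5, 6), (3, 6, 5), (5, 8, 4)]
total, chosen = best_disjoint_hits(hits)
```

In this example the single largest hit `(2, 5, 6)` is not part of the answer, because two smaller disjoint hits together save more.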

Cache Insertion Policy • Adding a cache entry incurs additional cost, notably space overhead, which is wasted if the entry is not useful. • However, determining the potential usefulness of a cache entry is generally difficult. • The caching decision is made in two phases: • When the rewriter rewrites the expression, it decides on the places in the expression at which to insert AddEntry calls. • The final insertion decision is made based on runtime information from the execution of the sub-expression. • The estimated benefit of an entry is proportional to its execution time and inversely proportional to its storage overhead.
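The second phase can be sketched as a threshold test on a benefit score. The score below (execution time divided by output size) and the threshold value are illustrative assumptions; the slides state only the proportionality, not a concrete formula or cutoff.

```python
def should_cache(exec_time_s, output_bytes, threshold=1e-6):
    """Decide at run time whether a sub-expression result is worth caching.

    benefit grows with execution time and shrinks with storage overhead;
    the threshold is a hypothetical tuning knob.
    """
    benefit = exec_time_s / max(output_bytes, 1)  # guard against empty outputs
    return benefit >= threshold

expensive_small = should_cache(exec_time_s=3600, output_bytes=10**6)
cheap_large = should_cache(exec_time_s=1, output_bytes=10**9)
```

An hour-long computation with a 1 MB result passes the test, while a one-second computation producing 1 GB does not.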

Example of Rewriting • var groups = source.GroupBy(KeySelect); • var reduced = groups.Select(Reduce);

Creation of Derived Dataset

Garbage Collection • When the available disk space falls below a threshold, the system automatically deletes the derived datasets that are considered least useful in the future. • The eviction policy is based on a cost-to-benefit ratio: • Ratio = (S × ΔT) / (N × M), where S is the size of the derived dataset, ΔT is the time elapsed since it was last used, N is the number of times it has been used, and M is the cumulative machine time of the computation that created it. • Nectar scans the entire cache, computes the cost-to-benefit ratio for each cache entry, then sorts the cache and deletes the top n entries such that the pre-defined threshold is reached.
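The scan, sort, and delete procedure can be sketched as follows. The entry layout and field names are hypothetical, and treating an entry with zero recorded uses as having one use (to avoid division by zero) is an assumption of this sketch.

```python
def cb_ratio(entry, now):
    """Cost-to-benefit ratio (S * dT) / (N * M); higher means evict sooner."""
    s = entry["size"]                  # S: storage used by the derived dataset
    dt = now - entry["last_used"]      # dT: time since the entry was last used
    n = max(entry["uses"], 1)          # N: number of uses (assumed floor of 1)
    m = entry["machine_time"]          # M: machine time needed to recreate it
    return (s * dt) / (n * m)

def select_evictions(entries, bytes_to_free, now):
    """Return names of entries to delete, worst ratio first, until enough is freed."""
    ranked = sorted(entries, key=lambda e: cb_ratio(e, now), reverse=True)
    victims, freed = [], 0
    for entry in ranked:
        if freed >= bytes_to_free:
            break
        victims.append(entry["name"])
        freed += entry["size"]
    return victims

entries = [
    {"name": "A", "size": 100, "last_used": 0, "uses": 1, "machine_time": 10},
    {"name": "B", "size": 50, "last_used": 9, "uses": 5, "machine_time": 100},
    {"name": "C", "size": 200, "last_used": 5, "uses": 2, "machine_time": 50},
]
victims = select_evictions(entries, bytes_to_free=250, now=10)
```

Entry B is small, recently used, and expensive to recompute, so it survives, while A and C are selected to free the requested space.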

Evaluation • The system was evaluated using both an analysis of actual logs from a number of production clusters and an actual deployment on a 240-node cluster. • The results suggest that there is considerable potential value in using Nectar to manage computation and data in a large data center.

Conclusion • Feedback has been quite positive. The most popular comment is that the system makes program debugging much more interactive and fun. • The authors report using Nectar on a daily basis and finding a big increase in their productivity.

THANK YOU
