Incremental Recomputations in MapReduce. Thomas Jörg, University of Kaiserslautern.
Incremental Recomputations in MapReduce Thomas Jörg, University of Kaiserslautern
Motivation • MapReduce Program • Base data • Result data • Bigtable / HBase
Motivation • View Definition • Base data • Materialized view
Motivation • Incremental MapReduce Program • MapReduce Program • Base data • Result data • Bigtable / HBase
Agenda • Related Work • Case study • Incremental view maintenance • Summary Delta Algorithm • Conclusion and future work
Related Work • Caching intermediate results • DryadInc • Incoop • Incremental programming models • Google Percolator • Continuous bulk processing (CBP) • References: L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009; P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011; D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010; D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010
Challenges • Programming model • SQL / relational algebra vs. MapReduce • Efficient access paths • No secondary indexes in HBase • Support for transactions • Only single-row transactions in HBase
Case Study • Word histograms • Reverse web-link graphs • Term-vectors per host • Count of URL access frequency • Inverted Indexes J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Computing Reverse Web-Link Graphs [Figure: a collection of HTML documents]
Sample Web-Link Graph • a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html> • b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Computing Reverse Web-Link Graphs • Map: a.htm emits (b.htm, a.htm) and (b.htm, a.htm); b.htm emits (a.htm, b.htm) and (b.htm, b.htm) • Shuffle: pairs are grouped by link target • Reduce: (b.htm, {a.htm, b.htm}) and (a.htm, {b.htm})
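The map, shuffle, and reduce phases of this slide can be imitated in a few lines of single-process Python. This is only a sketch under stated assumptions: the `PAGES` dictionary, the regular expression, and all function names are hypothetical stand-ins, not part of the talk or of any MapReduce framework.

```python
import re
from collections import defaultdict

# Hypothetical in-memory stand-in for the base data (page URL -> HTML).
PAGES = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
}

# Simplistic link extraction; assumed to be good enough for the sample pages.
HREF = re.compile(r'href="([^"]+)"')


def map_links(source, html):
    """Map: emit one (link target, link source) pair per hyperlink."""
    for target in HREF.findall(html):
        yield target, source


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sources(target, sources):
    """Reduce: collapse the link sources of one target into a set."""
    return target, set(sources)


def reverse_link_graph(pages):
    pairs = [kv for src, html in pages.items() for kv in map_links(src, html)]
    return dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())


RESULT = reverse_link_graph(PAGES)
```

Because the reduce step builds a set, the two links from a.htm to b.htm collapse into a single entry, so the result alone no longer tells how many links back each entry.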
Summary Delta Algorithm
View definition:
CREATE VIEW Parts AS
  SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
  FROM Orders
  GROUP BY partID
Summary delta:
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID)
)
GROUP BY partID
References: I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997; W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
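To see the summary delta query at work, the sketch below runs it in an in-memory SQLite database and then applies the result to the view. Several things here are assumptions rather than content of the slide: the sample rows, the choice to store `Parts` as a plain table, the `apply_summary_delta` helper, and the rule that a group is dropped once its `tplcnt` reaches zero. The parentheses around the two inner SELECTs are omitted because SQLite does not accept them inside a compound SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders_Insertions (partID INTEGER, qty INTEGER, price REAL);
    CREATE TABLE Orders_Deletions  (partID INTEGER, qty INTEGER, price REAL);
    -- The materialized view is kept as a plain table; all rows are made up.
    CREATE TABLE Parts (partID INTEGER PRIMARY KEY, revenue REAL, tplcnt INTEGER);
    INSERT INTO Parts VALUES (1, 100.0, 2), (2, 30.0, 1);
    INSERT INTO Orders_Insertions VALUES (1, 2, 10.0), (3, 1, 5.0);
    INSERT INTO Orders_Deletions  VALUES (2, 3, 10.0);
""")

# The summary delta query from the slide, minus the inner parentheses.
SUMMARY_DELTA = """
    SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
    FROM (
        SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
        FROM Orders_Insertions GROUP BY partID
        UNION ALL
        SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
        FROM Orders_Deletions GROUP BY partID
    )
    GROUP BY partID
"""


def apply_summary_delta(conn):
    """Hypothetical installation step: add each delta row onto the view."""
    for part_id, revenue, tplcnt in conn.execute(SUMMARY_DELTA).fetchall():
        row = conn.execute(
            "SELECT revenue, tplcnt FROM Parts WHERE partID = ?", (part_id,)
        ).fetchone()
        old_revenue, old_count = row if row else (0.0, 0)
        new_count = old_count + tplcnt
        if new_count == 0:
            # No base tuples are left in this group, so the view row goes away.
            conn.execute("DELETE FROM Parts WHERE partID = ?", (part_id,))
        else:
            conn.execute(
                "INSERT OR REPLACE INTO Parts VALUES (?, ?, ?)",
                (part_id, old_revenue + revenue, new_count),
            )


apply_summary_delta(conn)
VIEW = conn.execute(
    "SELECT partID, revenue, tplcnt FROM Parts ORDER BY partID"
).fetchall()
```

In this sketch the view is refreshed from the insertion and deletion tables alone; the `Orders` base table is never read.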
Achieving Self-Maintainability • Map: a.htm emits (b.htm, [a.htm, 1]) and (b.htm, [a.htm, 1]); b.htm emits (a.htm, [b.htm, 1]) and (b.htm, [b.htm, 1]) • Shuffle: pairs are grouped by link target • Reduce: (b.htm, {[a.htm, 2], [b.htm, 1]}) and (a.htm, {[b.htm, 1]})
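The counted variant on this slide can be sketched in Python as follows. The sketch assumes in-memory sample pages and a `Counter` per link target; the function names are hypothetical and the shuffle is folded into the reduce step for brevity.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory stand-in for the base data (page URL -> HTML).
PAGES = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
}

HREF = re.compile(r'href="([^"]+)"')


def map_counted_links(source, html):
    """Map: emit (link target, [link source, 1]) for every hyperlink."""
    for target in HREF.findall(html):
        yield target, (source, 1)


def reduce_counts(pairs):
    """Group by target and sum the counts per source (a multiset of sources)."""
    view = defaultdict(Counter)
    for target, (source, count) in pairs:
        view[target][source] += count
    return {target: dict(counts) for target, counts in view.items()}


COUNTED_VIEW = reduce_counts(
    kv for src, html in PAGES.items() for kv in map_counted_links(src, html)
)
```

Keeping the multiplicity per source is what lets a later deletion of one link be subtracted without rescanning the other pages.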
Sample Web-Link Graph • a.htm (before update): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html> • a.htm (after update): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html> • b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Summary Delta Algorithm in MapReduce • Map: a.htm (deleted) emits (b.htm, [a.htm, -1]) and (b.htm, [a.htm, -1]); a.htm (inserted) emits (b.htm, [a.htm, +1]) and (a.htm, [a.htm, +1]) • Shuffle: pairs are grouped by link target • Reduce: (b.htm, {[a.htm, -1]}) and (a.htm, {[a.htm, +1]})
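The delta computation on this slide can be sketched in Python along the same lines. The sketch assumes that the old and the new version of a.htm are both available as strings and that the sign is passed to the map function as a parameter; these names and that interface are hypothetical.

```python
import re
from collections import Counter, defaultdict

HREF = re.compile(r'href="([^"]+)"')

# Old and new version of the updated page (sample data from the slides).
A_DELETED = '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>'
A_INSERTED = '<html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>'


def map_delta(source, html, sign):
    """Map: emit (target, [source, sign]); sign is -1 for deleted pages."""
    for target in HREF.findall(html):
        yield target, (source, sign)


def reduce_delta(pairs):
    """Sum the signed counts per (target, source) and drop entries that cancel."""
    delta = defaultdict(Counter)
    for target, (source, count) in pairs:
        delta[target][source] += count
    cleaned = {
        target: {s: n for s, n in counts.items() if n != 0}
        for target, counts in delta.items()
    }
    return {target: counts for target, counts in cleaned.items() if counts}


pairs = list(map_delta("a.htm", A_DELETED, -1))
pairs += list(map_delta("a.htm", A_INSERTED, +1))
DELTA = reduce_delta(pairs)
```

Only the changed page is read; b.htm contributes nothing to the delta because it was not updated.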
Delta Installation Approaches • Increment Installation: Base deltas → MapReduce → Materialized view • Overwrite Installation: Base deltas + Materialized view → MapReduce → Materialized view
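One possible reading of the two installation approaches is sketched below in plain Python. The slide only names the approaches, so the details are assumptions: increment installation is taken to mean adding each delta value onto the stored value in place, and overwrite installation is taken to mean reading the old view together with the deltas and writing complete new values. The dictionaries and function names are hypothetical.

```python
import copy

# Counted reverse link graph and delta from the running example.
VIEW = {"b.htm": {"a.htm": 2, "b.htm": 1}, "a.htm": {"b.htm": 1}}
DELTA = {"b.htm": {"a.htm": -1}, "a.htm": {"a.htm": 1}}


def increment_install(view, delta):
    """Assumed reading: add every delta count onto the stored count in place."""
    for target, sources in delta.items():
        row = view.setdefault(target, {})
        for source, change in sources.items():
            row[source] = row.get(source, 0) + change
            if row[source] == 0:
                del row[source]
        if not row:
            del view[target]
    return view


def overwrite_install(view, delta):
    """Assumed reading: merge old rows with deltas and write whole new rows."""
    new_view = {}
    for target in view.keys() | delta.keys():
        merged = dict(view.get(target, {}))
        for source, change in delta.get(target, {}).items():
            merged[source] = merged.get(source, 0) + change
        merged = {s: n for s, n in merged.items() if n != 0}
        if merged:
            new_view[target] = merged
    return new_view


INCREMENTED = increment_install(copy.deepcopy(VIEW), DELTA)
OVERWRITTEN = overwrite_install(VIEW, DELTA)
```

Both functions end in the same view here; under this reading they differ in whether the old view has to be read as input to the job.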
Case Study – Lessons Learned • Numerical aggregation • Word histogram • URL access frequency • Set aggregation • Reverse web-link graph • Inverted index • Multiset aggregation • Term-vector per host
General Solution • Self-maintainable aggregates • Computed in three steps • Translation • Grouping • Aggregation • commutative and associative binary function • inverse elements • Abelian group
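The three steps can be sketched generically in Python, using the word histogram as the instance. Everything named here (`AbelianGroup`, `translate_words`, `aggregate`, `maintain`) is hypothetical, and the sketch uses the integers under addition so that every count has an inverse.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class AbelianGroup:
    """Identity, commutative and associative combine, and inverse elements."""
    identity: Any
    combine: Callable[[Any, Any], Any]
    inverse: Callable[[Any], Any]


# Integers under addition (assumed so that deletions have negative counts).
INT_ADD = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)


def translate_words(page_text):
    """Translation: turn a document into (word, 1) pairs."""
    for word in page_text.split():
        yield word, 1


def aggregate(pairs, group):
    """Grouping and aggregation: fold all values of a key with the group op."""
    result = {}
    for key, value in pairs:
        result[key] = group.combine(result.get(key, group.identity), value)
    return result


def maintain(view, inserted, deleted, translate, group):
    """Update the view from inserted and deleted documents only."""
    pairs = [kv for doc in inserted for kv in translate(doc)]
    pairs += [(k, group.inverse(v)) for doc in deleted for k, v in translate(doc)]
    new_view = dict(view)
    for key, value in aggregate(pairs, group).items():
        combined = group.combine(new_view.get(key, group.identity), value)
        if combined == group.identity:
            new_view.pop(key, None)
        else:
            new_view[key] = combined
    return new_view


OLD_DOCS = ["a b a", "b c"]
OLD_VIEW = aggregate((kv for d in OLD_DOCS for kv in translate_words(d)), INT_ADD)
NEW_VIEW = maintain(OLD_VIEW, ["c c d"], ["b c"], translate_words, INT_ADD)
RECOMPUTED = aggregate(
    (kv for d in ["a b a", "c c d"] for kv in translate_words(d)), INT_ADD
)
```

In this example the maintained view equals a full recomputation over the updated documents, which is the property the inverse elements are meant to provide.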
Case Study – Lessons Learned • Numerical aggregation • Word histogram • URL access frequency • Set aggregation • Reverse web-link graph • Inverted index • Multiset aggregation • Term-vector per host
Word histogram • Translation function: Translate web pages into (word, 1) • Aggregation function: Abelian group (Integers, +)
Reverse web-link graph • Translation function: Translate web pages into (link target, link source) • Aggregation function: Abelian group (Power-multiset of URLs, multiset union), with signed multiplicities so that deletions have inverse elements
Evaluation [Figure: elapsed time [min] (y-axis) vs. updates in base documents [%] (x-axis)]
Conclusion & Future Work • View Maintenance in MapReduce • Case study • Summary delta algorithm • Self-maintainable aggregations • Future Work • Broader class of MapReduce programs • High-level MapReduce languages, e.g. Jaql or Pig Latin