In-situ MapReduce for Log Processing College of Engineering, Department of Computer Science, Database Lab, 김 윤호
Index • Introduction • Design overview • Lossy MapReduce processing • Prototype • Evaluation • Conclusion
1. Introduction • Logs • Click logs • System and network logs • Application logs • Produced by e-commerce sites, credit card companies, and infrastructure providers
1. Introduction • Store-first-query-later [diagram: log servers ship raw logs to a centralized compute cluster]
1. Introduction • Two drawbacks • Scale and timeliness • Sacrifice availability or return incomplete results
1. Introduction • Strict consistency: results must reflect all of the input data
1. Introduction • Systematic method [diagram: log servers and a centralized compute cluster]
1. Introduction • "In-situ" MapReduce (iMR) architecture • Move analysis to the log servers • MapReduce for continuous data • Ability to trade fidelity for latency [diagram: analysis runs on the log servers instead of shipping logs to a centralized compute cluster]
1. Introduction • Differs from a dedicated Hadoop cluster [diagram: dedicated cluster nodes sharing a distributed file system]
1. Introduction • Continuous MapReduce model • Lossy MapReduce processing • Architectural lessons • Best-effort distributed stream processor, Mortar • Sub-windows or panes • Impact of failures on result fidelity and latency • Load cancellation and shedding policies
2. Design overview • iMR is a complement to dedicated clusters, not a replacement • Scalable • Responsive • Available • Efficient • Compatible
2. Design overview • The MapReduce job interface is identical in iMR • Map • Reduce • iMR jobs emit a stream of results computed over continuous input.
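As a rough illustration of the programming model (my own sketch, not from the slides or the iMR code), an iMR-style job could look like the following map/combine/reduce over log lines; the record format and function names are hypothetical.

```python
# Hypothetical sketch of an iMR-style job: count error events per server.
# The map/combine/reduce signatures mirror classic MapReduce, but the job
# runs continuously over log records instead of a static input file.

from collections import Counter

def map_fn(log_line):
    """Emit (server, 1) for every ERROR line; drop everything else."""
    server, level, _msg = log_line.split(" ", 2)
    if level == "ERROR":
        yield (server, 1)

def combine_fn(counts_a, counts_b):
    """Merge two partial results; runs on every node of the aggregation tree."""
    merged = Counter(counts_a)
    merged.update(counts_b)
    return merged

def reduce_fn(counts):
    """Final reduce at the root: here just report the merged counts."""
    return dict(counts)

# Tiny local driver standing in for the continuous iMR pipeline.
logs = ["web1 ERROR timeout", "web2 INFO ok", "web1 ERROR disk full"]
partial = Counter()
for line in logs:
    for key, value in map_fn(line):
        partial[key] += value
print(reduce_fn(partial))   # {'web1': 2}
```

The combine step matters because it can run at every level of the aggregation tree, not just at the root.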
2. Design overview • Aggregation trees for efficiency • Distribute processing load • Reduce network traffic
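A minimal sketch of why the tree helps (my own illustration; pairwise, binary merging is an assumption): each level merges partial results before they travel further toward the root, so data volume and network traffic shrink as results move up the tree.

```python
# Hypothetical sketch: merging per-node partial counts up a binary
# aggregation tree, so the root receives one combined result instead of
# one result per log server.

from collections import Counter

def merge_level(partials):
    """Combine neighbouring partial results pairwise (one tree level)."""
    merged = []
    for i in range(0, len(partials), 2):
        out = Counter()
        for partial in partials[i:i + 2]:
            out.update(partial)
        merged.append(out)
    return merged

# Four leaf servers, each with a local partial count.
leaves = [Counter(a=1), Counter(a=2, b=1), Counter(b=3), Counter(c=1)]
level = leaves
while len(level) > 1:           # walk up the tree until only the root remains
    level = merge_level(level)
print(level[0])                  # Counter({'b': 4, 'a': 3, 'c': 1})
```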
2. Design overview • Sliding windows • Range of data [diagram: log entries over time flow through map/combine into reduce]
2. Design overview • Problem • Overlapping data between successive windows • Wastes CPU and network [diagram: overlapping data re-processed by map/combine]
2. Design overview • Eliminate redundant work • Panes (sub-windows) • Root combines panes to produce the window • Saves CPU and network resources [diagram: panes P1–P4 are map/combined once, then reduced into the window]
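The sketch below is my own illustration of pane-based processing, assuming a four-pane window that slides by one pane; each pane is reduced once, and successive windows reuse the stored pane results instead of reprocessing the overlapping log records.

```python
# Hypothetical sketch of pane-based sliding windows: a 4-pane window that
# slides by one pane.  Each pane is reduced once; successive windows reuse
# the pane results instead of re-running map/combine on overlapping data.

from collections import Counter, deque

PANES_PER_WINDOW = 4

def reduce_pane(records):
    """Aggregate the records that fall into one pane (done once per pane)."""
    counts = Counter()
    for server in records:
        counts[server] += 1
    return counts

def combine_panes(panes):
    """Root-side combine: merge pane results into one window result."""
    window = Counter()
    for pane in panes:
        window.update(pane)
    return window

recent_panes = deque(maxlen=PANES_PER_WINDOW)
pane_stream = [["web1", "web2"], ["web1"], ["web3"], ["web1"], ["web2"]]

for records in pane_stream:
    recent_panes.append(reduce_pane(records))
    if len(recent_panes) == PANES_PER_WINDOW:
        print(combine_panes(recent_panes))   # one result per window slide
```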
3. Lossy MapReduce processing • Data loss may occur • Node or network failures • Consequence of result latency requirements • Data loss is unavoidable to ensure timeliness • How to represent and calculate result quality to allow users to interpret partial results? • How to use this metric to trade result fidelity for improved result latency?
3. Lossy MapReduce processing • Completeness metric C2 • Describes the distribution of log data across • Space (log server nodes) • Time (the window range) • The root maintains C2 like a scoreboard [diagram: space × time grid]
3. Lossy MapReduce processing • Area (A) with earliest results • The most freedom to decrease latency • Appropriate for uniformly distributed events • Area (A) with random sampling • Less freedom to decrease latency • Appropriate even for non-uniform data • Spatial completeness (X, 100%) • Useful when events are local to a node • Temporal completeness (100%, Y) • Useful for correlating events across servers
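To make these specifications concrete, here is a small sketch that models C2 as a node-by-pane scoreboard; the grid representation and the predicates are my own assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the C2 "scoreboard": for each log server (space)
# and each pane in the window (time), record whether its pane arrived.
# Different fidelity policies are then simple predicates over this grid.

received = {
    "web1": [True, True, False, True],   # panes 0..3 from server web1
    "web2": [True, True, True, True],
    "web3": [False, False, True, True],
}
total_cells = sum(len(row) for row in received.values())

def area(grid):
    """Fraction of all (node, pane) cells present -- the 'area' metric."""
    return sum(sum(row) for row in grid.values()) / total_cells

def spatial_completeness(grid):
    """Nodes for which *every* pane in the window arrived (X, 100%)."""
    return [node for node, row in grid.items() if all(row)]

def temporal_completeness(grid):
    """Panes present on *every* node (100%, Y)."""
    num_panes = len(next(iter(grid.values())))
    return [p for p in range(num_panes)
            if all(row[p] for row in grid.values())]

print(area(received))                    # 0.75
print(spatial_completeness(received))    # ['web2']
print(temporal_completeness(received))   # [3]
```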
3. Lossy MapReduce processing • Result eviction: trading fidelity for availability • Latency eviction • Return incomplete results to meet the deadline • Fidelity eviction • Evict results as soon as they meet the quality requirement
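A minimal sketch of the two eviction triggers, assuming a user-specified deadline and minimum completeness (both values hypothetical):

```python
# Hypothetical sketch: the root evicts (emits) a window result either when
# the deadline expires (latency eviction, possibly incomplete) or as soon as
# the completeness requirement is met (fidelity eviction, possibly early).

import time

DEADLINE_SECONDS = 60        # maximum result latency the user will accept
MIN_COMPLETENESS = 0.9       # minimum fraction of (node, pane) cells required

def should_evict(window_start, completeness, now=None):
    now = now if now is not None else time.time()
    if completeness >= MIN_COMPLETENESS:
        return "fidelity eviction"            # quality target reached
    if now - window_start >= DEADLINE_SECONDS:
        return "latency eviction"             # out of time, emit what we have
    return None                               # keep waiting for more panes

# Example: 45 s into the window with 92% of the cells present.
print(should_evict(window_start=0, completeness=0.92, now=45))  # fidelity eviction
```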
3. Lossy MapReduce processing • Load cancellation and shedding • Load cancellation • Internal nodes don't waste cycles creating or merging panes that will never be used • Load shedding • Prevents wasted effort when individual nodes are heavily loaded
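A rough sketch of both decisions at a worker node, with hypothetical thresholds: a pane is cancelled when the root can no longer use it, and shed when the local CPU load is too high.

```python
# Hypothetical sketch of load cancellation and load shedding at a worker.

CPU_LOAD_LIMIT = 0.8          # above this, shed work to protect the host service

def process_pane(pane_deadline, now, cpu_load):
    """Decide whether producing this pane is still worth the cycles."""
    if now > pane_deadline:
        return "cancel"        # root already evicted the window; result is useless
    if cpu_load > CPU_LOAD_LIMIT:
        return "shed"          # node is overloaded; drop the pane, lower fidelity
    return "process"

print(process_pane(pane_deadline=100, now=120, cpu_load=0.30))  # cancel
print(process_pane(pane_deadline=100, now=50,  cpu_load=0.95))  # shed
print(process_pane(pane_deadline=100, now=50,  cpu_load=0.40))  # process
```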
4. Prototype • Builds upon Mortar • A distributed stream processing system • Extended to support • MapReduce API • Pane-based processing • Fault tolerance mechanisms
5. Evaluation • HDFS log analysis • The user must decide whether the fidelity/latency tradeoff is acceptable
5. Evaluation • In-situ performance • Hadoop can improve job throughput, and iMR can deliver useful results
6. Conclusion • Log analysis shifts from dedicated clusters to the data sources themselves • Continuous in-situ processing • The C2 framework • Trading fidelity for availability