Data Indexing for Stateful, Large-scale Data Processing
Dionysios Logothetis, Kenneth Yocum
University of California, San Diego
Processing large-scale data today
MapReduce (Dryad/Hadoop)
• Scalable, fault-tolerant bulk-data processing
• Groupwise processing: embarrassingly parallel workloads
• General: supports relational queries too, e.g. joining two datasets
Parallel DBs
• 20 years of work
• Fast and efficient for joins [Pavlo et al., SIGMOD ’09]
Really two philosophies
• DBs: structured data that is preloaded, which allows indexing
• MapReduce: grab the data, process it, use the result; indexing may be wasteful
The case for stateful bulk processing
Incremental processing
• On bulk data that arrives continuously, e.g. web crawls
• The state of the art is to recompute from scratch when the data changes
• Grossly inefficient
Key idea: incorporate state into bulk processing
Challenge: efficient stateful groupwise processing
• Incorporate state into the programming model
• Efficient architecture, with fast access to state
Bulk-incremental Processing Systems (BIPS)
• Supports stateful computations
• User-defined function G(∙)
• Multiple input and output flows
• Access to persistent state, modeled as a loopback flow
• Input and state grouped by key; G(∙) called for every key
[Figure: a single processing stage G(key, F_state, ΔF_1, ΔF_2) with input flows F_in1 and F_in2, a loopback state flow F_state, and output flows F_out1 and F_out2; input and state records are keyed (k, v) pairs.]
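To make the model concrete, here is a minimal sketch of a single stage in Python, assuming flows are lists of (key, value) records and state is a dict; the name `bips_stage` and this representation are illustrative assumptions, not the system's actual API.

```python
# Minimal sketch of a BIPS-style stateful groupwise stage (illustrative only).
from collections import defaultdict

def bips_stage(G, state, *input_flows):
    """Group input records and state by key, then call G once per key.

    G(key, state_value, *deltas) -> (new_state_value, output_records).
    Outer-grouping: G runs for every key in the state or in any input.
    """
    grouped = [defaultdict(list) for _ in input_flows]
    for flow, groups in zip(input_flows, grouped):
        for k, v in flow:
            groups[k].append(v)

    keys = set(state) | {k for g in grouped for k in g}
    new_state, outputs = {}, []
    for k in keys:
        deltas = [g.get(k, []) for g in grouped]
        s, out = G(k, state.get(k), *deltas)
        new_state[k] = s
        outputs.extend(out)
    return new_state, outputs
```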
Stateful groupwise processing
[Figure: incremental count example. Round 1: the input flow carries keyed records and the state is empty, so G(green, Ø, Δ), G(blue, Ø, Δ), and G(red, Ø, Δ) emit counts of 2, 2, and 1, which flow back as state. Round 2: a new batch arrives with records only for blue; G runs again with the saved counts as state, and blue's count is updated from 2 to 3.]
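The counting example, run through the `bips_stage` sketch above (the keys and counts mirror the slide; the record values are irrelevant to a count):

```python
def count(key, prev, delta):
    # New count = previous count for this key (if any) plus new records.
    c = (prev or 0) + len(delta)
    return c, [(key, c)]

state, _ = bips_stage(count, {}, [("green", 1), ("green", 1),
                                  ("blue", 1), ("blue", 1), ("red", 1)])
# state == {"green": 2, "blue": 2, "red": 1}

state, _ = bips_stage(count, state, [("blue", 1)])
# Outer-grouping still calls G("green", 2, []) and G("red", 1, []).
# state == {"green": 2, "blue": 3, "red": 1}
```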
Inner-grouping with state
• Current models support only outer-grouping
• Read the whole state, call G(∙) for every key
• The BIPS model also allows inner-grouping
• Call G(∙) only if there is a matching input key
• Use the input to select which state to update
[Figure: with input only for blue, inner-grouping calls G(blue, 2, Δ) alone, updating blue's count from 2 to 3; the green and red state entries are untouched.]
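A sketch of the inner-grouping variant, again with assumed names: the stage iterates over the input keys only and fetches just the matching state entries, so with an indexed store the untouched state is never read.

```python
from collections import defaultdict

def bips_stage_inner(G, state, input_flow):
    """Inner-grouping: invoke G only for keys present in the input."""
    grouped = defaultdict(list)
    for k, v in input_flow:
        grouped[k].append(v)

    outputs = []
    for k, delta in grouped.items():
        s, out = G(k, state.get(k), delta)  # point lookup, not a full scan
        state[k] = s
        outputs.extend(out)
    return state, outputs

# With the second batch above, only G("blue", 2, [1]) runs;
# the green and red entries are neither read nor rewritten.
```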
Storing state in tables
• Storing state in a file (HDFS/GFS) forces the stage to read the whole state
• Maintaining indexed state allows selective access based on the input, avoiding unnecessary data transfers
• Bigtable [Chang et al., OSDI ’06]: stages store state in a table indexed by the state key
Randomly reading part of the state must be faster than sequentially reading the whole state.
[Figure: the same counts stored as a flat file, which must be scanned in full, versus an indexed table, which supports per-key reads.]
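The two access patterns, sketched side by side. The tab-separated file layout and the `table.get(key)` interface are hypothetical stand-ins (for an HDFS file and a Bigtable/Hypertable client, respectively); the point is what each option forces the stage to read.

```python
def load_state_file(path):
    """File-backed state (HDFS/GFS style): the stage must scan all N records."""
    state = {}
    with open(path) as f:
        for line in f:                        # one sequential pass over everything
            k, v = line.rstrip("\n").split("\t")
            state[k] = int(v)
    return state

def load_state_table(table, input_keys):
    """Table-backed state: read only the h*N entries the input touches."""
    return {k: table.get(k) for k in input_keys}  # random point lookups
```

This pays off only when randomly reading the h·N touched records finishes before a sequential scan of all N records would, which is what the cost model below quantifies.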
BIPS prototype
• Leverages Hadoop, modified to support
• Stateful groupwise processing
• Inner-grouping
• … and others
• Hypertable for storing state: an open-source “Bigtable”
Using table-based storage
• Which workloads benefit from the index?
• Experiment: incremental count over 1M state records, with state stored on HDFS or Hypertable
• Break-even at 17% of state accessed
The index helps only a small range of workloads.
Predicting the benefit of an index
• What types of workloads benefit more? What random read rate is required?
• Simple cost model: the running time T of an operator depends on
• The random-to-sequential read rate ratio R_ran/R_seq
• The fraction of state accessed, h
• T = T_read,I + T_sort,I + T_read,S + T_write,S (read input, sort input, read state, write state)
• T_read,S is either the time to sequentially read all N state records (N/R_seq) or the time to randomly read the h·N records touched (h·N/R_ran)
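A sketch of the cost model in code. The parameter names and the assumption that the non-indexed stage also rewrites the full state file are placeholders; only the four-term structure of T comes from the slide.

```python
def running_time(n_in, n_state, h, r_seq, r_ran, r_sort, indexed):
    """T = T_read,I + T_sort,I + T_read,S + T_write,S (records / records-per-sec)."""
    t_read_in = n_in / r_seq
    t_sort_in = n_in / r_sort
    if indexed:
        t_read_state = h * n_state / r_ran    # random reads of the h*N touched records
        t_write_state = h * n_state / r_ran   # assume writes go at the random rate
    else:
        t_read_state = n_state / r_seq        # sequential scan of all N records
        t_write_state = n_state / r_seq       # assume the whole state file is rewritten
    return t_read_in + t_sort_in + t_read_state + t_write_state

# Considering only the read-state term, the index wins while
# h * N / R_ran < N / R_seq, i.e. while h < R_ran / R_seq; the input and
# write terms shift the actual break-even point around this simple ratio.
```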
Predicting the benefit of an index
• Fix the random-to-sequential throughput: what is the maximum % of state accessed for which there is a gain?
• The table store (BT) helps only if less than 20% of the state is accessed
• More workloads benefit as random reads approach sequential speed
[Figure: break-even and 50%-gain curves; the BT point sits at 20% of state accessed.]
Leveraging Solid State Disks
• Table stores are built on top of magnetic disks
• Random read rate is an order of magnitude lower than sequential
• SSDs improve random read performance
• 200x higher than magnetic disks
• A good candidate for serving indexes
Random state access on an SSD
• Developed a proof-of-concept indexed storage system on an SSD
• Increases the random-to-sequential read ratio to 37%
• Break-even at 65% of state accessed (up from 20% with BT)
• Raw SSD performance leaves room for an even wider range of workloads
[Figure: break-even and 50%-gain curves for BT, the SSD prototype, and raw SSD performance.]
SSD cost efficiency
• SSDs are good for implementing indexes… but they are expensive
• Cost per capacity ($/GB) is high: 30 times higher than magnetic disks
• Cost per bandwidth ($/MB/s) is low: 50 times lower than magnetic disks
• Cost efficiency: cost per performance
• Cost: price per capacity, C
• Performance: job throughput, 1/T_read,S
• System I runs on an HD and sequentially accesses all N records; System II runs on an SSD and randomly accesses h·N records
• The two systems break even when C_HD / C_SSD = R_seq,HD / (R_ran,SSD / h)
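A sketch of the comparison, with all concrete prices and rates left as parameters; only the break-even relation itself comes from the slide.

```python
def cost_per_performance_hd(c_hd, r_seq_hd):
    """System I: HD, sequential scan of all N records -> C_HD * N / R_seq,HD.
    N cancels when comparing the two systems, so it is dropped here."""
    return c_hd / r_seq_hd

def cost_per_performance_ssd(c_ssd, r_ran_ssd, h):
    """System II: SSD, random reads of h*N records -> C_SSD * h*N / R_ran,SSD."""
    return c_ssd * h / r_ran_ssd

def max_h_for_ssd(cost_ratio_ssd_to_hd, r_ran_ssd, r_seq_hd):
    """Largest fraction of state accessed for which the SSD is still the
    more cost-efficient option: h < (C_HD/C_SSD) * (R_ran,SSD / R_seq,HD)."""
    return (1.0 / cost_ratio_ssd_to_hd) * (r_ran_ssd / r_seq_hd)
```

With the slide's numbers, a 30x price premium keeps the cutoff below 5% of state accessed, while a projected 2x premium raises it to roughly 70%.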
SSD cost efficiency
• Cost efficiency depends on the cost ratio, the % of state accessed, and the random read rate: C_HD / C_SSD = R_seq,HD / (R_ran,SSD / h)
[Figure: required random read rate for the SSD to be more cost-efficient. At today's relative cost (30x), the SSD wins when less than 5% of the state is accessed; at the relative cost projected in 5 years (2x, e.g. FusionIO), it wins when less than 70% is accessed.]
Summary
• DBs use indexes to speed up operations; bulk-incremental processing can benefit too
• A model for stateful bulk processing that allows the use of indexes
• Table stores on magnetic disks do not perform well
• Leverage SSDs for better random reads