Scalable Data-intensive Analytics
Meichun Hsu, Intelligent Information Management Lab, HP Labs
August 24, 2008
Joint work with Qiming Chen, Bin Zhang, Ren Wu
Outline
• Introduction
• Illustrated Computation Patterns
  • GroupBy with User-defined Aggregate
  • Table Function over Graph-structured Data
• Summary and other on-going work
Challenges in BI Analytics
[Diagram: OLTP systems, sensors, external feeds, and web content flow through files/tables and data transformation analytics into a massively parallel data warehouse, which delivers business operational analytics and BI services to a broad base of users and applications]
• Scaling of data-intensive analytics components has not kept pace
• Plus new challenges:
  • Bigger and bigger data sets
  • More and more complex transformation and analysis
  • Demand for near real-time responses to enable Operational BI (OpBI)
• Data-intensive transformation and operational analytics are increasingly recognized as the bottleneck: “In fact, it’s THE bottleneck in most VLDW/VLDB and very large data integration systems.”
Challenges in BI Analytics
(Regarding a media mix problem) “The result of a non-linear model of promotional lift as a function of media spend by channel, some coupon-related variables for each store is outstanding in terms of fit. The bad news is that generating the coefficients using our application server and SPSS takes about two weeks of CPU time…. Is this the type of problem we can throw at a parallel DB on…?“ - Director, Research and Analysis, BonTon, April 2008

“With the vast amounts of data growing, we have realized the fact that we often have to move data across networks for analysis. It's actually going to be better if we can stay inside the database and move some of our computations down to the individual nodes on a (parallel data warehouse) box.” - Jim Goodnight, founder and CEO of SAS, October 2007
Available Parallelism Grows Exponentially
[Chart: number of HW threads per processor vs. year, 1990 to 2005, from the 80486 through Power4 to UltraSparc T2]
• How will trends in multicores ease or exacerbate bottlenecks in current transformation/analytics components?
• Will 100’s of cores in a server, and 10,000s of cores in a scale-out parallel data warehouse, present an opportunity?
Courtesy: Anastasia Ailamaki, 2008
Implications
Opportunity to design a massively data-parallel analytics layer that dramatically improves end-to-end BI performance with:
• enhanced software parallelism, to take better advantage of the explosion of hardware threads
• enhanced data locality, to make better use of limited memory and data bandwidth
Parallel Query Engine vs. Google’s MapReduce
• Both are elegant and successful parallel processing models
• Parallel query engine
  • Rich patterns of execution (pipelining, composition, multiple source sets, integration with schema management, to name a few)
  • Focused on built-in query operators; UDFs as an exception
• Google’s MapReduce
  • Limited patterns of execution
  • Focused more on supporting user-supplied programs
Approach to Scalable Analytics for BI
• Integrate high-performance parallel computation with parallel query processing
• Leverage SQL’s schema management and declarative query language
• Fuse declarative data access with computation functions in a scale-out, shared-nothing parallel processing infrastructure
• Create a highly parallel, data flow-oriented infrastructure for data-intensive analytics
Research Issues
• Richer dependency semantics for UDFs and flexibility for UDF optimization, e.g.
  • GroupBy with user-defined aggregate
  • Structuring the computation, taking into account derivation and side effects
• High-performance implementation of parallel processing primitives
  • Efficient management of the memory hierarchy in new architectures, e.g. multicore, for high-performance analytics
  • UDF execution environment: process boundaries, data flow considerations in new hybrid cluster environments
• Enhance composability of user-defined functions (UDFs)
  • Express a “process flow” using UDFs for ETL, information extraction, and information derivation
[Diagram: a query plan/spec (root, ESP exchange, UDF group-by, file scan) mapped to a parallel execution exhibiting independent and partitioned parallelism across ESP processes, UDF instances, and data partitions DP21/DP22]
Outline
• Introduction
• Illustrated Computation Patterns
  • GroupBy with User-defined Aggregate
  • Table Function over Graph-structured Data
• Summary and other on-going work
UDFs with GroupBy: Example – K-Means Clustering
Given a set of data points and k cluster centers, find the positions of the centers that minimize the total squared distance from each data point to its closest center.
K-means algorithm:
• Init: start from an initial position of the centers
• Assign_Center: assign each data point to the closest center
• Recalculate_Centers: recompute each center as the geometric center (mean) of its cluster
• Iterate until no change happens
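As a reference point, here is a minimal sequential sketch of this loop in Python/NumPy; it is illustrative only, and the function and variable names are assumptions rather than the presenters' code.

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means on an (n, d) array: assign each point to its closest
    center, then recompute each center as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign_Center: distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        cluster_id = dists.argmin(axis=1)
        # Recalculate_Centers: geometric center (mean) of each cluster
        new_centers = np.array([
            points[cluster_id == j].mean(axis=0) if np.any(cluster_id == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # iterate until no change
            break
        centers = new_centers
    return centers, cluster_id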
Map-Reduce K-Means
Push down to a parallel database?
Outer loop: init centroids; assign clusters; calculate new centroids; repeat until done.
Execution of one iteration:
(1) Map: assign a center to each data point; the intermediate key-value pair is [cluster_id, x]
    Map: C(x) = argmin_k (x - C_k)^2
(2) Hash-distribute by cluster_id
(3) Reduce: aggregate per cluster_id (per cluster k, partition p): S_p[k] = Σ x, Q_p[k] = Σ x^2, N_p[k] = Σ 1
(4) Recalculate centroids from the aggregates, yielding the answer: C[k] = Σ_p S_p[k] / Σ_p N_p[k]
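A hedged sketch of one such iteration in plain Python, with an in-memory group-by standing in for the hash-distribution step; the function names and data layout are illustrative, not the actual MapReduce or database API.

import numpy as np
from collections import defaultdict

def map_assign(points, centers):
    """Map: emit (cluster_id, x) for each data point."""
    for x in points:
        cid = int(np.argmin(((x - centers) ** 2).sum(axis=1)))
        yield cid, x

def reduce_stats(pairs):
    """Reduce: per cluster_id, accumulate the sufficient statistics S, Q, N."""
    stats = defaultdict(lambda: [0.0, 0.0, 0])   # cluster_id -> [S, Q, N]
    for cid, x in pairs:
        s = stats[cid]
        s[0] = s[0] + x        # S[k] = sum of x
        s[1] = s[1] + x @ x    # Q[k] = sum of squares
        s[2] += 1              # N[k] = count
    return stats

def new_centers(stats, old_centers):
    """Compute C[k] = S[k] / N[k]; keep the old center for empty clusters."""
    centers = old_centers.copy()
    for cid, (S, Q, N) in stats.items():
        centers[cid] = S / N
    return centers

One iteration is then new_centers(reduce_stats(map_assign(points, centers)), centers); Q is not needed for the centroid itself but is carried along as part of the sufficient statistics.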
K-Means by UDFs
• UDF assign_cluster(datapoint, k_centroids): for each data point, compute its distances to all centroids in k_centroids and assign it to the cluster with the closest centroid
• UDFs ssl(datapoint), ssq(datapoint): aggregate each data point to produce the sufficient statistics ssl and ssq

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id

Execution: (1) UDF assign_center assigns a center per data point; (2) hash-distribute by cluster_id; (3) UDFs ssl, ssq, count aggregate per cluster_id; (4) recalculate the centroids, yielding the answer.
Parallelism in Aggregate UDFs
ssl( ): aggregate function { init(); iterate(); final(); merge(); }

SELECT oid, ssl(x) FROM r

[Diagram: the data is partitioned; ssl.iterate() is applied tuple-wise as the local aggregate on each partition; ssl.merge() assembles the partial results into the final answer]
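A minimal sketch of that aggregate-UDF contract in Python; the init/iterate/final/merge method names follow the slide, while the class name, the driver loop, and the sum-of-values semantics of ssl are assumptions.

import numpy as np

class Ssl:
    """Sum-of-values aggregate written against the init/iterate/final/merge contract."""
    def init(self, dim):
        self.s = np.zeros(dim)
    def iterate(self, x):       # called tuple-wise on one partition
        self.s += x
    def merge(self, other):     # combines partial aggregates across partitions
        self.s += other.s
    def final(self):            # produces the final answer
        return self.s

def parallel_aggregate(partitions, dim):
    """Each partition runs iterate() over its local tuples (conceptually in
    parallel); one instance then merge()s the partials and calls final()."""
    partials = []
    for part in partitions:
        agg = Ssl()
        agg.init(dim)
        for x in part:
            agg.iterate(x)
        partials.append(agg)
    result = partials[0]
    for other in partials[1:]:
        result.merge(other)
    return result.final()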
Parallel Aggregate UDF Plan in K-Means
[Diagram: data points are hash-distributed by cluster_id; on each partition ssl.iterate() is applied tuple-wise and ssl.merge() assembles the per-cluster results, which feed the centroid recalculation and the answer]
But: this plan is very high in communication overhead
Efficient Parallel Computation Plan?
[Diagram, side by side: the plan above (1: assign a center per data point, 2: hash-distribute by cluster_id, 3: ssl.iterate()/ssl.merge() aggregate per cluster_id, 4: recalculate centroids and produce the answer) versus a MapReduce-style plan in which the “iterate” stage of Reduce already computes {assign, sums} locally, so only the small partials reach the “merge” stage of Reduce]
Pushing UDF with GroupBy Down to the Partition Level

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id

• The outer loop (init centroids, assign clusters, calculate new centroids, check if done) is unchanged
• For each partition, aggregate locally per cluster: each local aggregate returns a table with one row per group-by value, and these sufficient statistics are much smaller than the data set itself
• The local rows are then combined globally per cluster (combine-global)
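A sketch of this two-phase plan in Python, under the assumption that each partition emits one small (ssl, ssq, count) row per cluster which is then combined globally; all names are illustrative.

import numpy as np

def aggregate_local(partition, centers):
    """Per partition: assign each point, then aggregate per cluster.
    Returns one small row per cluster: (ssl, ssq, count)."""
    local = {}
    for x in partition:
        cid = int(np.argmin(((x - centers) ** 2).sum(axis=1)))
        s, q, n = local.setdefault(cid, (np.zeros_like(x, dtype=float),
                                         np.zeros_like(x, dtype=float), 0))
        local[cid] = (s + x, q + x * x, n + 1)
    return local

def combine_global(local_results):
    """Combine the per-partition rows per cluster and derive new centroids."""
    total = {}
    for local in local_results:
        for cid, (s, q, n) in local.items():
            S, Q, N = total.get(cid, (0.0, 0.0, 0))
            total[cid] = (S + s, Q + q, N + n)
    return {cid: S / N for cid, (S, Q, N) in total.items()}  # C[k] = ssl / count

Only the tiny per-cluster rows cross partition boundaries, which is what removes the communication overhead of hash-distributing every data point.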
Outline
• Introduction
• Illustrated Computation Patterns
  • GroupBy with User-defined Aggregate
  • Table Function over Graph-structured Data
• Summary and other on-going work
Analytics over a Structured Data Set
An example: a prediction system for water resources
[Diagram: a river network, with a model of one river segment, its upstream segments, and its downstream segment]
The condition at a segment at time t depends on the segment's own properties at time t and on the conditions at its upstream segments at time t-1, calculated based on hydraulic dynamics.
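Stated as a recurrence (the notation here is assumed, not taken from the slides), the condition c(s, t) of segment s at time t is

c(s, t) = f( p(s, t), { c(u, t-1) : u ∈ upstream(s) } )

where p(s, t) collects the segment's own geometric/environmental parameters and sensor inputs at time t, and f is the hydraulic-dynamics model. Because each segment's time series depends on its upstream segments' time series, a segment can only be evaluated after all of its upstream segments, which is the ordering constraint addressed in the following slides.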
Computation involves multi-dimensional dependencies (spatial, temporal)
[Diagram: a river-segment tree a–f unrolled over time steps t1 and t2; segment a at t2 depends on its upstream segments at t1]
• Input: a table of all river segments (a river segment tree with millions of segments)
  • Geometric and environmental parameters
  • Topology tree
  • Rainfall and weather sensor data: precipitation, evaporation, runoff yield, soil erosion
• Output: predicted properties of all river segments, as time series over tens of thousands of time intervals
  • Water level, volume, flow velocity, flow and sand discharge
Parallelization of hydro()
hydro( ): table function

SELECT * FROM river CROSS APPLY hydro(*)

• UDFs generally cannot be applied to tuples structured as a graph: each tuple is processed independently of the other tuples in the set
• For a tree-structured data set, we need to allow UDFs to be applied in a specific order
[Diagram: the river-segment tree P0 with children P1, P2 and leaves C11–C13, C21, C22; hydro() instances applied per subtree, contrasted with the order-free Bar.iterate()/Bar.merge() aggregate pattern]
Graph traversal in SQL
Pre-order traversal:

SELECT * FROM river
CONNECT BY PRIOR sid = parent_sid
START WITH sid = ‘P0’

Post-order traversal:

SELECT name, sid, parent_sid FROM river
CONNECT BY sid = PRIOR parent_sid
START WITH sid = ‘C12’

[Tree in both cases: P0 with children P1, P2; P1 with leaves C11–C13; P2 with leaves C21, C22]
Extend UDF with Traversal-Control Forms for graph-structured computation
• Apply a UDF f() to tree-structured data objects in post order
hydro( ): table function

SELECT * FROM river CROSS APPLY hydro(*)
CONNECT BY sid = PRIOR ALL parent_sid
START WITH is_leaf = “yes”

Semantics: apply hydro() starting with the leaf river segments; then apply it to a non-leaf segment only when all of its upstream segments have been applied.
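A sketch of those semantics in Python, assuming each river-segment row carries sid and parent_sid and that hydro() takes a segment row plus the already-computed results of its upstream segments; all names here are illustrative, not the actual system's API.

from collections import defaultdict

def apply_postorder(river_rows, hydro):
    """Apply hydro() leaves-first; fire a segment only when all of its
    upstream (child) segments have been processed."""
    children = defaultdict(list)
    rows = {}
    for row in river_rows:                      # row: dict with sid, parent_sid, ...
        rows[row["sid"]] = row
        if row["parent_sid"] is not None:
            children[row["parent_sid"]].append(row["sid"])

    results = {}
    pending = {sid: len(children[sid]) for sid in rows}
    ready = [sid for sid in rows if pending[sid] == 0]    # leaves fire first

    while ready:
        sid = ready.pop()
        upstream = [results[c] for c in children[sid]]    # all already computed
        results[sid] = hydro(rows[sid], upstream)
        parent = rows[sid]["parent_sid"]
        if parent is not None:
            pending[parent] -= 1
            if pending[parent] == 0:                      # all upstream done: fire parent
                ready.append(parent)
    return results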
Parallel processing strategy for graph-structured computation
[Diagram: a river-segment tree annotated with P-Levels 0–8 and cut into connected subgraphs, e.g. partition ‘0’ at level 3, ‘000’ at level 2, ‘001’ at level 0, ‘00000’ at level 1, ‘000000’ at level 0]
1. Leveling: assign each segment a P-Level
2. Partition into connected subgraphs, keeping track of metadata for each partition
3. Distribution: distribute partitions to servers, load-balancing on partition size and partition levels
4. Compute in parallel: each server sorts its tuples appropriately and processes them in sort order, recording metadata for parent firing and for transmitting computed tuples to other servers
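A small sketch of the leveling step, under the assumption that leaves get P-Level 0 and every other segment gets one more than the maximum P-Level of its upstream segments; this interpretation and all names are mine, not from the slides.

from collections import defaultdict

def compute_plevels(rows):
    """P-Level 0 for leaves; otherwise 1 + max P-Level of upstream segments."""
    children = defaultdict(list)
    for row in rows:
        if row["parent_sid"] is not None:
            children[row["parent_sid"]].append(row["sid"])

    level = {}
    def plevel(sid):
        if sid not in level:
            kids = children[sid]
            level[sid] = 0 if not kids else 1 + max(plevel(c) for c in kids)
        return level[sid]

    for row in rows:
        plevel(row["sid"])
    return level

Under this leveling, a segment depends only on segments at strictly lower P-Levels, so each level can be processed as a parallel wave once the levels below it are done.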
Summary and on-going work
• Illustrated how the parallel query processing and map-reduce paradigms can be enriched for advanced scalable analytics
• Primitives that allow explicit declaration of semantics and dependencies in analytic computation have potential
  • Discover important patterns
  • Devise efficient parallelization support in the infrastructure
• Additional on-going investigations
  • Importance of the shared-nothing principle in shared-memory (multicore) architectures
  • Hybrid clusters and paradigms for data flow among heterogeneous clusters
• Goal: combine a general data flow-driven computing paradigm with data management infrastructure to achieve data-intensive analytics
Q&A Thank You!