240 likes | 408 Views
MOLAP on Cloud. Interactive, Cluster Data Warehouse. Hongwei Zhao, Xiaojun Ye. hwzhao73@gmail.com , yexj@mail.tsinghua.edu.cn. Tsinghua University . Motivation. Extend the cube model to support OLAP operations on Big Data: OLAP o perations Interactive queries. Outline.
E N D
MOLAP on Cloud Interactive, Cluster Data Warehouse Hongwei Zhao, Xiaojun Ye hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn Tsinghua University
Motivation • Extend the cube model to support OLAP operations on Big Data: • OLAP operations • Interactive queries
Outline Cube modelling Building and querying Experimenting
Data Transform for Cube TPC-DS tables Star views Cube data User queries
A Simplified Cube Model Result Cube Instance * Cube Metadata C B A Cuboid Instance Dimension Instance Cuboid Instance Dimension Instance Dimension Instance AC Key AB Measure Node BC Key Member Key Measure Node Key Measure Cell Key Member Key Dimension Member ABC Base Cuboids
Example: TPC-DS Query7 selecti_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 fromstore_sales, customer_demographics, date_dim, item, promotion wheress_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = '[GEN]' and cd_marital_status = '[MS]' and cd_education_status = '[ES]' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = [YEAR] group by i_item_id order by i_item_id
Relation Schema Customer Demographics Date Dim Store Sales Promotion Item
Converting to BitKey Fact1 Fact2 Intermediate Result1 Result1 Result2 Intermediate Result1
Cube Storage • Table • Region • ColumnFamily • Row • Column • Version • Value Cell
MDX for query 7 • select{ i_item_id } on rows, { avg(ss_quantity), avg(ss_list_price),avg(ss_coupon_amt), avg(ss_sales_price) } on columns fromstore_sales_cubewhere (cd_gender .[Male],cd_marital_status.[Single],cd_education_status.[College],d_year.[2000])
Cube Implementation Base cuboid building with 4 stages: Dimension constructing Hive query Aggregation Saving Query execution with 4 stages: Loading dimension Other cuboid constructing Mapping Reducing
Cluster Framework OLAP System Dispatcher Node Engine Worker Node Worker Node Dispatcher Node Worker Nodes Worker Node cachedata • Distribute dynamically cubes data onto worker nodes • Parallelize OLAP operations into a concurrent model Columnar Database Master Node Region Node Region Node Region Node Cube data
Actor of Akka State Behavior Mailbox Lifecycle Fault tolerance
Actors for Query • Load dimension members • Build other cuboids • Mapping • Reducing Execute Query require 2 Dimension load 1 Query Dispatcher Cuboid Manager Dimension Manager data ready Cuboid ready 4 Extract Query 3 Hit Cell Hit Cell Mapper Reducer
Compiling & Mapping • Query 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000 • GEN Mask: 000000011 Male 000000010MS Mask: 000011100 Single : 000001100ES Mask: 001100000 College: 001000000YEAR Mask: 110000000 2000: 010000000 Mapper1 Mask: 111111111FilterKey: 011001110 For each cell in mapper { If (key & mask == Filter Key) Send to Reducer } Query Dispatcher Mapper2 Mapper3
Query Execution Cache 1 • Master sends task messages to workers results Worker Region 1 • Each worker caches each region data Master • Sequential tasks reuse the cache data messages Cache 2 First query on 1G consume 48 secs, the following queries with various parameters consume 2.4 secs Cache 3 Worker Region 2 Worker Region 3
Experiments On TPC-DS • 4 nodes: • 2*Intel Xeon CPU E5-2630 • 4*600G 15000r/s SAS • 256G RAM • 10Gb Network Dimensions: "i_item_id", "cd_gender", "cd_marital_status", "cd_education_status", "p_channel_email", "p_channel_event", "d_year“ Measures: ss_quantity_avg, ss_list_price_avg, ss_coupon_amt_avg, ss_sales_price_avg
Build Cube for Query 7 • Partition by the largest Dimension(i_item_id) • In-Memory aggregation • Saving stage can be ignore(cache)
Execute Query 7 • First execution on the cube includes • Dimension loading • other cuboids construction • Caching • Mapping • Reducing • Sequential execution includes: • Mapping • Reducing
Hive Query for Fact Data • selectp_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status, i_item_id,d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_pricefromstore_salesjoindate_dimon (store_sales.ss_sold_date_sk = date_dim.d_date_sk) join item on (store_sales.ss_item_sk = item.i_item_sk) joincustomer_demographicson (store_sales.ss_cdemo_sk = custom-er_demographics.cd_demo_sk) join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
Compare with Hive First query time compare: 2-3X Sequential execution time: 30-50X
Future work • Cube Model: • Demand-driven & Data-driven • Cube Data: • Model-driven & Requirement-driven • More experiments on TPC-DS queries • Report, ad hoc, iterative, data mining, • MDX/XMLA compliance
Storage for Example Table: Dimension Table: Cuboid_ABC