1 / 24

MOLAP on Cloud

MOLAP on Cloud. Interactive, Cluster Data Warehouse. Hongwei Zhao, Xiaojun Ye. hwzhao73@gmail.com , yexj@mail.tsinghua.edu.cn. Tsinghua University . Motivation. Extend the cube model to support OLAP operations on Big Data: OLAP o perations Interactive queries. Outline.

kimo
Download Presentation

MOLAP on Cloud

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MOLAP on Cloud Interactive, Cluster Data Warehouse Hongwei Zhao, Xiaojun Ye hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn Tsinghua University

  2. Motivation • Extend the cube model to support OLAP operations on Big Data: • OLAP operations • Interactive queries

  3. Outline Cube modelling Building and querying Experimenting

  4. Data Transform for Cube TPC-DS tables Star views Cube data User queries

  5. A Simplified Cube Model Result Cube Instance * Cube Metadata C B A Cuboid Instance Dimension Instance Cuboid Instance Dimension Instance Dimension Instance AC Key AB Measure Node BC Key Member Key Measure Node Key Measure Cell Key Member Key Dimension Member ABC Base Cuboids

  6. Example: TPC-DS Query7 selecti_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 fromstore_sales, customer_demographics, date_dim, item, promotion wheress_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = '[GEN]' and cd_marital_status = '[MS]' and cd_education_status = '[ES]' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = [YEAR] group by i_item_id order by i_item_id

  7. Relation Schema Customer Demographics Date Dim Store Sales Promotion Item

  8. Converting to BitKey Fact1 Fact2 Intermediate Result1 Result1 Result2 Intermediate Result1

  9. Cube Storage • Table • Region • ColumnFamily • Row • Column • Version • Value Cell

  10. MDX for query 7 • select{ i_item_id } on rows, { avg(ss_quantity), avg(ss_list_price),avg(ss_coupon_amt), avg(ss_sales_price) } on columns fromstore_sales_cubewhere (cd_gender .[Male],cd_marital_status.[Single],cd_education_status.[College],d_year.[2000])

  11. Cube Implementation Base cuboid building with 4 stages: Dimension constructing Hive query Aggregation Saving Query execution with 4 stages: Loading dimension Other cuboid constructing Mapping Reducing

  12. Cluster Framework OLAP System Dispatcher Node Engine Worker Node Worker Node Dispatcher Node Worker Nodes Worker Node cachedata • Distribute dynamically cubes data onto worker nodes • Parallelize OLAP operations into a concurrent model Columnar Database Master Node Region Node Region Node Region Node Cube data

  13. Actor of Akka State Behavior Mailbox Lifecycle Fault tolerance

  14. Actors for Query • Load dimension members • Build other cuboids • Mapping • Reducing Execute Query require 2 Dimension load 1 Query Dispatcher Cuboid Manager Dimension Manager data ready Cuboid ready 4 Extract Query 3 Hit Cell Hit Cell Mapper Reducer

  15. Compiling & Mapping • Query 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000 • GEN Mask: 000000011 Male 000000010MS Mask: 000011100 Single : 000001100ES Mask: 001100000 College: 001000000YEAR Mask: 110000000 2000: 010000000 Mapper1 Mask: 111111111FilterKey: 011001110 For each cell in mapper { If (key & mask == Filter Key) Send to Reducer } Query Dispatcher Mapper2 Mapper3

  16. Query Execution Cache 1 • Master sends task messages to workers results Worker Region 1 • Each worker caches each region data Master • Sequential tasks reuse the cache data messages Cache 2 First query on 1G consume 48 secs, the following queries with various parameters consume 2.4 secs Cache 3 Worker Region 2 Worker Region 3

  17. Experiments On TPC-DS • 4 nodes: • 2*Intel Xeon CPU E5-2630 • 4*600G 15000r/s SAS • 256G RAM • 10Gb Network Dimensions: "i_item_id", "cd_gender", "cd_marital_status", "cd_education_status", "p_channel_email", "p_channel_event", "d_year“ Measures: ss_quantity_avg, ss_list_price_avg, ss_coupon_amt_avg, ss_sales_price_avg

  18. Build Cube for Query 7 • Partition by the largest Dimension(i_item_id) • In-Memory aggregation • Saving stage can be ignore(cache)

  19. Execute Query 7 • First execution on the cube includes • Dimension loading • other cuboids construction • Caching • Mapping • Reducing • Sequential execution includes: • Mapping • Reducing

  20. Hive Query for Fact Data • selectp_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status, i_item_id,d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_pricefromstore_salesjoindate_dimon (store_sales.ss_sold_date_sk = date_dim.d_date_sk) join item on (store_sales.ss_item_sk = item.i_item_sk) joincustomer_demographicson (store_sales.ss_cdemo_sk = custom-er_demographics.cd_demo_sk) join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)

  21. Compare with Hive First query time compare: 2-3X Sequential execution time: 30-50X

  22. Future work • Cube Model: • Demand-driven & Data-driven • Cube Data: • Model-driven & Requirement-driven • More experiments on TPC-DS queries • Report, ad hoc, iterative, data mining, • MDX/XMLA compliance

  23. Thanks.

  24. Storage for Example Table: Dimension Table: Cuboid_ABC

More Related