MOLAP on Cloud

MOLAP on Cloud Interactive, Cluster Data Warehouse Hongwei Zhao, Xiaojun Ye hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn Tsinghua University

Motivation • Extend the cube model to support OLAP operations on Big Data: • OLAP operations • Interactive queries

Outline Cube modelling Building and querying Experimenting

Data Transform for Cube TPC-DS tables Star views Cube data User queries

A Simplified Cube Model Result Cube Instance * Cube Metadata C B A Cuboid Instance Dimension Instance Cuboid Instance Dimension Instance Dimension Instance AC Key AB Measure Node BC Key Member Key Measure Node Key Measure Cell Key Member Key Dimension Member ABC Base Cuboids

Example: TPC-DS Query7 selecti_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 fromstore_sales, customer_demographics, date_dim, item, promotion wheress_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = '[GEN]' and cd_marital_status = '[MS]' and cd_education_status = '[ES]' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = [YEAR] group by i_item_id order by i_item_id

Relation Schema Customer Demographics Date Dim Store Sales Promotion Item

Converting to BitKey Fact1 Fact2 Intermediate Result1 Result1 Result2 Intermediate Result1

Cube Storage • Table • Region • ColumnFamily • Row • Column • Version • Value Cell

MDX for query 7 • select{ i_item_id } on rows, { avg(ss_quantity), avg(ss_list_price),avg(ss_coupon_amt), avg(ss_sales_price) } on columns fromstore_sales_cubewhere (cd_gender .[Male],cd_marital_status.[Single],cd_education_status.[College],d_year.[2000])

Cube Implementation Base cuboid building with 4 stages: Dimension constructing Hive query Aggregation Saving Query execution with 4 stages: Loading dimension Other cuboid constructing Mapping Reducing

Cluster Framework OLAP System Dispatcher Node Engine Worker Node Worker Node Dispatcher Node Worker Nodes Worker Node cachedata • Distribute dynamically cubes data onto worker nodes • Parallelize OLAP operations into a concurrent model Columnar Database Master Node Region Node Region Node Region Node Cube data

Actor of Akka State Behavior Mailbox Lifecycle Fault tolerance

Actors for Query • Load dimension members • Build other cuboids • Mapping • Reducing Execute Query require 2 Dimension load 1 Query Dispatcher Cuboid Manager Dimension Manager data ready Cuboid ready 4 Extract Query 3 Hit Cell Hit Cell Mapper Reducer

Compiling & Mapping • Query 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000 • GEN Mask: 000000011 Male 000000010MS Mask: 000011100 Single : 000001100ES Mask: 001100000 College: 001000000YEAR Mask: 110000000 2000: 010000000 Mapper1 Mask: 111111111FilterKey: 011001110 For each cell in mapper { If (key & mask == Filter Key) Send to Reducer } Query Dispatcher Mapper2 Mapper3

Query Execution Cache 1 • Master sends task messages to workers results Worker Region 1 • Each worker caches each region data Master • Sequential tasks reuse the cache data messages Cache 2 First query on 1G consume 48 secs, the following queries with various parameters consume 2.4 secs Cache 3 Worker Region 2 Worker Region 3

Experiments On TPC-DS • 4 nodes: • 2*Intel Xeon CPU E5-2630 • 4*600G 15000r/s SAS • 256G RAM • 10Gb Network Dimensions: "i_item_id", "cd_gender", "cd_marital_status", "cd_education_status", "p_channel_email", "p_channel_event", "d_year“ Measures: ss_quantity_avg, ss_list_price_avg, ss_coupon_amt_avg, ss_sales_price_avg

Build Cube for Query 7 • Partition by the largest Dimension(i_item_id) • In-Memory aggregation • Saving stage can be ignore(cache)

Execute Query 7 • First execution on the cube includes • Dimension loading • other cuboids construction • Caching • Mapping • Reducing • Sequential execution includes: • Mapping • Reducing

Hive Query for Fact Data • selectp_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status, i_item_id,d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_pricefromstore_salesjoindate_dimon (store_sales.ss_sold_date_sk = date_dim.d_date_sk) join item on (store_sales.ss_item_sk = item.i_item_sk) joincustomer_demographicson (store_sales.ss_cdemo_sk = custom-er_demographics.cd_demo_sk) join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)

Compare with Hive First query time compare: 2-3X Sequential execution time: 30-50X

Future work • Cube Model: • Demand-driven & Data-driven • Cube Data: • Model-driven & Requirement-driven • More experiments on TPC-DS queries • Report, ad hoc, iterative, data mining, • MDX/XMLA compliance

Thanks.

Storage for Example Table: Dimension Table: Cuboid_ABC

MOLAP on Cloud

MOLAP on Cloud

Presentation Transcript

Computing on the Cloud

Standardization on Cloud Computing

MySQL on the cloud

Car on the Cloud

ColdFusion On The Cloud

GINI on a Cloud

“Castle on a cloud”

Computing on the Cloud

CernVM on the Cloud

Banking on the Cloud

Bioinformatics on Cloud Cyberinfrastructure

HP on cloud today

IEEE International Conference on Cloud Computing (CLOUD)

Running on a Cloud

“Cloud bursting” on SZTAKI Cloud

Get on Cloud Nine with Cloud-Based eLearning

Business On A Cloud

Update on Computing/Cloud

Tally on Cloud Offers

An Overview on Cloud Hosting

Tally on cloud ppt

Bsoft Busy on Cloud: Best Busy cloud provider