A Fully Distributed, Fault-Tolerant Data Warehousing System Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris Computing Systems Laboratory, National Technical University of Athens
Motivation • Large volumes of data • Everyday life (Web 2.0) • Science (LHC, NASA) • Business domain (automation, digitization, globalization) • New regulations – log/digitize/store everything • Sensors • Immense production rates • Distributed by nature
Motivation (contd.) • Demand for always-on analytics • Store huge datasets • Both structured and semi-structured bulk data • Detection of real-time changes in trends • Fast retrieval – point, range, aggregate queries • Intrusion or DoS detection, effects of a product's promotion • Online, near real-time updates • From various locations, at high rates
(Up till) now • Traditional Data Warehouses • Vast amounts of historical data – data cubes • Centralized, off-line approaches • Querying vs. Updating • Distributed warehousing systems • Functionality remains centralized • Cloud Infrastructures • Resources as a service • Elasticity, commodity hardware • Pay-as-you-go pricing model
Our Goal • Distributed Data Warehousing-like system • Store, query, update • Multi-dimensional, hierarchical • Scalable, always-on • Shared-nothing architecture • Commodity nodes • No proprietary tools needed • Java libraries, socket APIs
Brown Dwarf in a nutshell • Complete system for data cubes • Distributed storage • Online updates • Efficient query resolution • Point, aggregate • Various levels of granularity • Elastic resources according to • Workload skew • Node churn
Dwarf • Dwarf computes, stores, indexes and updates materialized data cubes • Eliminates prefix and suffix redundancies • Centralized structure with d levels • Root contains all distinct values of the first dimension • Each cell points to a node of the next level
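A minimal sketch of the structure just described, assuming illustrative names (DwarfNode, Cell, allCell) rather than the authors' actual code: each node holds one cell per distinct attribute value, inner cells point to a node of the next dimension, and leaf cells carry the aggregate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: one Dwarf node per level, one cell per distinct
// attribute value. Inner cells point to a node of the next dimension (prefix
// sharing); leaf cells keep the aggregate. Names are assumptions.
class DwarfNode {
    // distinct attribute value -> cell, in insertion order
    final Map<String, Cell> cells = new LinkedHashMap<>();
    Cell allCell;            // the special "ALL" cell aggregating this level

    static class Cell {
        DwarfNode child;     // non-null for inner levels
        double aggregate;    // used at the leaf level (e.g. SUM of the measure)
    }
}
```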
Why distribute it? • Store larger amounts of data • Dwarf may reduce, but may also blow up, the data • High-dimensional, sparse data: blow-up of more than 1,000 times • Update and query the system online • Accelerate creation, query and update speed • Parallelization • What about… • Failures, load-balancing, communication costs? • Performance
Brown Dwarf (BD) Overview • Dwarf nodes mapped to overlay nodes • UID for each dwarf node • Hint tables of the form (currAttr, child) • Queries/updates resolved along the network path • Mirrors on a per-node basis
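Roughly, once every dwarf node gets a UID on the overlay, a hint table can be pictured as below; the class and method names are assumptions for illustration, not the system's API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (not the authors' API): each Brown Dwarf node is identified by a UID
// and stores a hint table of (currAttr -> child UID) entries, so a request can
// be forwarded one overlay hop per dimension.
class BrownDwarfNode {
    final String uid;                                       // overlay identifier of this dwarf node
    final Map<String, String> hintTable = new HashMap<>();  // currAttr -> child UID

    BrownDwarfNode(String uid) { this.uid = uid; }

    // UID of the next node on the resolution path, or null if the
    // attribute value is unknown at this level.
    String nextHop(String currAttr) {
        return hintTable.get(currAttr);
    }
}
```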
BD Operations – Insert + Query • One pass over the fact table • Gradual construction of hint tables • Creation of a cell → insertion of currAttr • Creation of a dwarf node → registration of child • Query: follow the path (d hops) along the structure
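With hint tables in place, resolving a point query is a walk of d hops, one hint-table lookup per dimension. The sketch below uses an in-memory map as a stand-in for the overlay lookup; in the real system each step would be a network hop.

```java
import java.util.List;
import java.util.Map;

// Sketch of point-query resolution: one hint-table lookup per dimension,
// i.e. d hops from the root. The nested map stands in for the overlay
// (uid -> hint table of currAttr -> child uid).
class QueryResolver {
    static String resolve(Map<String, Map<String, String>> hintTables,
                          String rootUid, List<String> dimensionValues) {
        String uid = rootUid;
        for (String value : dimensionValues) {
            Map<String, String> hints = hintTables.get(uid);
            if (hints == null) return null;   // node not found
            uid = hints.get(value);           // follow the (currAttr, child) entry
            if (uid == null) return null;     // value not present in the cube
        }
        return uid;                           // UID of the node holding the answer
    }
}
```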
BD Operations – Update • Find the longest common prefix with the existing structure • Underlying nodes recursively updated • Nodes expanded with new cells • New nodes created • ALL cells affected
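One way to picture the update: descend while the new tuple's leading values already exist in the structure (the longest common prefix), then insert new cells or nodes below that point and refresh the affected aggregates, ALL cells included. The helper below only illustrates the prefix walk and is not the paper's algorithm verbatim.

```java
import java.util.List;
import java.util.Map;

// Illustrative outline of an online update (not the authors' exact algorithm):
// walk down the existing hint tables as long as the new tuple's values are
// already present; insertion of new cells/nodes starts right after that point,
// and the nodes touched on the way are recursively re-aggregated.
class UpdateWalk {
    // Number of leading dimensions of the new tuple already present in the structure.
    static int longestCommonPrefix(Map<String, Map<String, String>> hintTables,
                                   String rootUid, List<String> newTuple) {
        String uid = rootUid;
        int depth = 0;
        for (String value : newTuple) {
            Map<String, String> hints = hintTables.get(uid);
            if (hints == null || !hints.containsKey(value)) break;
            uid = hints.get(value);
            depth++;
        }
        return depth;
    }
}
```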
Elasticity of Brown Dwarf • Static and adaptive replication vs. • Load (min/max load) • Churn (require ≥ k replicas) • Local-only interactions • Ping/exchange hint tables for consistency • Query forwarding to balance load
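The replication decision is purely local; a toy version of such a policy is sketched below, with thresholds and names that are assumptions rather than the system's actual parameters: expand when the observed load exceeds the upper bound, shrink when it drops below the lower bound but never below k replicas.

```java
// Toy sketch of a local elasticity decision (thresholds and names are
// assumptions): each node compares its measured query load against min/max
// bounds and keeps at least k replicas alive to tolerate churn.
class ReplicationPolicy {
    final double minLoad;   // below this, a replica may be dropped
    final double maxLoad;   // above this, an extra replica is created
    final int k;            // minimum number of replicas for fault tolerance

    ReplicationPolicy(double minLoad, double maxLoad, int k) {
        this.minLoad = minLoad;
        this.maxLoad = maxLoad;
        this.k = k;
    }

    enum Action { EXPAND, SHRINK, NONE }

    Action decide(double observedLoad, int currentReplicas) {
        if (observedLoad > maxLoad) return Action.EXPAND;   // overloaded: add a mirror
        if (observedLoad < minLoad && currentReplicas > k) return Action.SHRINK; // idle: drop one, keep >= k
        return Action.NONE;
    }
}
```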
Experimental Evaluation • 16 LAN commodity nodes (dual-core, 2.0 GHz, 4 GB main memory) • Synthetic and real datasets • 5-d to 25-d, various levels of skew (Zipf θ=0.95) • APB-1 Benchmark generator • Forest and Weather datasets • Simulation results with 1000s of nodes
Cube Construction • Acceleration of cube creation by up to 3.5 times compared to Dwarf • Better use of resources through parallelization • More noticeable effect for high-dimensional, skewed datasets • Storage overhead • Mainly attributed to the mapping between dwarf node and network IDs • Shared among network nodes
Updates • 1% updates • Up to 2.3 times faster for the skewed dataset • Dimensionality increases the cost
Queries • 1K query sets, 50% aggregate • Impressive acceleration of up to 60 times • Message cost bounded by d+1
Elasticity • 10-d, 100k datasets, 5k query sets • λ = 10 qu/sec → 100 qu/sec • BD adapts according to demand → elasticity • k=3, Nfail failing nodes every Tfail sec • 5k queries, 10-d uniform dataset • No loss for Nfail < k+1 • Query time increases due to redirections
What have we achieved so far? • BD optimizations – work in progress • Replication units (chunks, …) • Hierarchies – faster updates (MDAC 2010), … • Brown Dwarf focuses on • + Efficient answering of aggregate queries • + Cloud-friendly • - Preprocessing • - Costly updates • HiPPIS project • + Explicit support for hierarchical data • + No preprocessing • + Ease of insertion and updates • - Processing for aggregate queries
Questions