Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr Computing Systems Laboratory, National Technical University of Athens

Motivation • Large volumes of data • Everyday life, science and business domains • Time-series data • Temporally ordered, organized in hierarchies (Day < Month < Year) • E.g., date of a credit card purchase, time of a phone call • Important for monitoring a process of interest • On-line processing • Fast retrieval – Point, range, aggregate queries • Detection of real-time changes in trends • Intrusion or DoS detection, effects of a product promotion • Online, cost-efficient updates

Up till now • Data Warehouses • Centralized, off-line approaches • Distributed warehousing systems • Functionality remains centralized • Distributed Warehouse-like initiative: Brown Dwarf • Distribution of centralized Dwarf • Deployed on shared-nothing, commodity hardware • Scalability, fault tolerance, performance • No special consideration for time-series data • Update procedure costly → unfit for frequent updates

Our Goals • Cloud-based, data-warehousing-like system • Targeted to time-series data • Arriving at a high rate • Store, update, query data at various granularity levels • Multidimensional, hierarchical • Shared-nothing architecture • Commodity nodes • No proprietary tools • Java libraries, socket APIs

Our Contribution • Complete system for multidimensional time-series data • Store with one pass • Update online • Query efficiently • Point, aggregate • Various levels of granularity • Adaptive materialization • According to data recency • Accelerate cube creation/update • Minimize storage consumption

Dwarf • Computes, stores, indexes and updates materialized cubes • Eliminates prefix and suffix redundancies • Any query (point or aggregate) is answered through traversal of the structure
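To make the point/aggregate distinction concrete, here is a minimal sketch of the logical content a materialized cube holds. It is deliberately naive: it stores every combination explicitly and does not perform Dwarf's prefix and suffix elimination. The fact table, the `ALL` marker and the function name `build_cube` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # wildcard cell: aggregates over every value of a dimension

def build_cube(facts):
    """Naive full cube: one entry per combination of (value or ALL) per dimension."""
    cube = defaultdict(int)
    for *dims, measure in facts:
        # Each dimension contributes either its concrete value or ALL.
        for key in product(*[(d, ALL) for d in dims]):
            cube[key] += measure
    return dict(cube)

# Hypothetical 3-dimensional fact table: (store, customer, product, sales).
facts = [
    ("S1", "C2", "P2", 70),
    ("S1", "C3", "P1", 40),
    ("S2", "C1", "P1", 90),
    ("S2", "C1", "P2", 50),
]
cube = build_cube(facts)

point_answer = cube[("S1", "C3", "P1")]   # point query: a single fact
aggregate_answer = cube[("S2", ALL, ALL)]  # aggregate query: all sales of S2
```

Many of these entries share prefixes or carry identical sub-results, which is the redundancy a Dwarf removes.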

Brown Dwarf • Dwarf nodes mapped to overlay nodes • UID for each node • Hint tables of the form (currAttr, child) • Insertion • One-pass over the fact table • Gradual structure of hint tables • Queries • Overlay path of d hops • Incremental Updates • Elasticity through adaptive mirroring
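The hint-table idea can be sketched as follows, under simplifying assumptions: the "overlay" is an in-memory dictionary keyed by UID instead of networked peers, no suffix coalescing is done, and mirroring is left out. The class and method names (`Overlay`, `insert`, `query`) are hypothetical and not taken from the Brown Dwarf implementation.

```python
import itertools

ALL = "ALL"

class Overlay:
    """Toy stand-in for the overlay: each UID owns one hint table."""

    def __init__(self):
        self._uids = itertools.count()
        self.hints = {}                 # uid -> {currAttr: child uid or aggregate}
        self.root = self._new_node()

    def _new_node(self):
        uid = next(self._uids)
        self.hints[uid] = {}
        return uid

    def insert(self, dims, measure):
        """Insert one fact-table tuple; hint tables grow gradually."""
        self._insert(self.root, tuple(dims), measure)

    def _insert(self, uid, dims, measure):
        table = self.hints[uid]
        head, rest = dims[0], dims[1:]
        for attr in (head, ALL):        # concrete value plus the ALL cell
            if not rest:                # last dimension: store the aggregate
                table[attr] = table.get(attr, 0) + measure
            else:
                if attr not in table:
                    table[attr] = self._new_node()
                self._insert(table[attr], rest, measure)

    def query(self, dims):
        """Follow hint tables from the root; returns (answer, hops)."""
        target, hops = self.root, 0
        for attr in dims:
            hops += 1                   # one overlay node contacted per dimension
            target = self.hints[target][attr]
        return target, hops

overlay = Overlay()
for row in [("S1", "C2", "P2", 70), ("S1", "C3", "P1", 40),
            ("S2", "C1", "P1", 90), ("S2", "C1", "P2", 50)]:
    overlay.insert(row[:3], row[3])

answer, hops = overlay.query(("S2", ALL, ALL))
```

For a d-dimensional cube every query, point or aggregate, resolves in d lookups, which corresponds to the "overlay path of d hops" on the slide.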

Advantages and Drawbacks • Store even larger amounts of data! • Dwarf reduces but may also blow up data • High dimensional, sparse >1,000 times • Handle many more requests • Query the system online • Accelerate creation (up to 5 times) and querying (up to 60 times) • Parallelization • Update remains costly

Time Series Dwarf (TSD) • A concept hierarchy characterizes time • and any other dimension • Updates are applied in temporal order • Temporal granularity of queries relative to the time of querying • More detailed queries for recent events • Coarser-grained queries for past events

TSD Operations - Insertion • Time first in order • Lack of ALL cell in Time • Aggregate created after completion of a level

TSD Operations - Querying • Follow path along the structure • Roll-up query for aggregate already created • Within d hops (e.g., <Y1, ALL, P1>) • Roll-up query for recent records • Initial query substituted by multiple lower level queries (e.g., <Y2, S1, P1>)
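The insertion and querying rules above can be sketched with a two-level time hierarchy (month < year) and a single extra dimension. This is an illustrative simplification, not the TSD data structure: cells live in flat dictionaries, and the names `TimeSeriesCube`, `insert` and `query_year` are assumptions.

```python
class TimeSeriesCube:
    """Toy model: month cells are always stored; a year aggregate is
    created only once that year is complete (a later year has started)."""

    def __init__(self):
        self.months = {}          # (year, month, product) -> measure
        self.years = {}           # (year, product) -> measure, completed years only
        self.current_year = None

    def insert(self, year, month, product, measure):
        if self.current_year is not None:
            if year < self.current_year:
                raise ValueError("tuples must arrive in temporal order")
            if year > self.current_year:
                self._close_year(self.current_year)
        self.current_year = year
        key = (year, month, product)
        self.months[key] = self.months.get(key, 0) + measure

    def _close_year(self, year):
        # The level is complete: materialize its roll-up aggregate.
        for (y, _month, product), value in self.months.items():
            if y == year:
                self.years[(y, product)] = self.years.get((y, product), 0) + value

    def query_year(self, year, product):
        """Returns (answer, number of cells read)."""
        if (year, product) in self.years:
            return self.years[(year, product)], 1   # aggregate already created
        # Recent year: substitute the roll-up by several month-level queries.
        parts = [v for (y, _m, p), v in self.months.items()
                 if y == year and p == product]
        return sum(parts), len(parts)

tsd = TimeSeriesCube()
tsd.insert(1, 1, "P1", 10)
tsd.insert(1, 2, "P1", 5)
tsd.insert(2, 1, "P1", 7)   # year 1 is now complete and gets its aggregate
tsd.insert(2, 2, "P1", 3)

past = tsd.query_year(1, "P1")     # served from the materialized aggregate
recent = tsd.query_year(2, "P1")   # assembled from two month-level cells
```

Because tuples arrive in temporal order, an aggregate for a closed year never has to be revisited by later insertions.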

TSD Operations - Updating • Insertion of a new tuple • Longest common prefix with existing structure • Underlying nodes recursively updated • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
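A rough way to see why dropping the ALL cell of the first dimension helps is to count the nodes an update visits in an uncompressed cube trie. The sketch below is an assumption-laden toy (nested dictionaries, no suffix coalescing, a single 3-dimensional tuple), so its counts do not reproduce the 3 vs. 12 figures of the slide's example; the helper names are hypothetical.

```python
ALL = "ALL"

def update(node, dims, measure, skip_all, touched):
    """Recursively apply one tuple, recording every node that is visited."""
    touched.append(id(node))
    head, rest = dims[0], dims[1:]
    # Without an ALL cell only the concrete branch is followed.
    branches = (head,) if skip_all else (head, ALL)
    for attr in branches:
        if not rest:
            node[attr] = node.get(attr, 0) + measure
        else:
            # Deeper dimensions always keep their ALL cell.
            update(node.setdefault(attr, {}), rest, measure, False, touched)

def nodes_touched(dims, time_has_all):
    touched = []
    update({}, dims, 1, skip_all=not time_has_all, touched=touched)
    return len(touched)

with_all = nodes_touched(("Y2", "C1", "P1"), time_has_all=True)
without_all = nodes_touched(("Y2", "C1", "P1"), time_has_all=False)
```

In this toy the whole ALL subtree under the root is skipped, so the recursive update visits roughly half as many nodes.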

Adaptive Materialization • A daemon process asynchronously • creates roll-up views • deletes corresponding drill-down ones • The period of this process depends on application • Tradeoff: cube size vs. response accuracy
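One pass of such a daemon could look like the sketch below, assuming month-level cells keyed by `(year, month, product)` and a yearly view marked by `month = None`. The key layout and the function name `roll_up_old_months` are illustrative assumptions; scheduling the pass periodically (for example from a background thread) is left out.

```python
def roll_up_old_months(cells, current_year):
    """Create yearly roll-up views for past years and delete the
    month-level (drill-down) cells they were built from."""
    old_keys = [k for k in cells if k[1] is not None and k[0] < current_year]
    for key in old_keys:
        year, _month, product = key
        rolled = (year, None, product)          # month None marks a yearly view
        cells[rolled] = cells.get(rolled, 0) + cells.pop(key)
    return cells

cells = {
    (1, 1, "P1"): 10,
    (1, 2, "P1"): 5,
    (2, 1, "P1"): 7,
}
roll_up_old_months(cells, current_year=2)
```

After the pass the cube is smaller, but month-level questions about year 1 can no longer be answered exactly, which is the size vs. accuracy tradeoff noted on the slide.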

Experimental Evaluation • 25 LAN commodity nodes (dual core, 2.0 GHz, 4GB main memory) • Synthetic and real datasets • APB-1 Benchmark generator • 4-d, 3 levels for Time, various densities • DARPA Intrusion Detection audit data • 1M tuples, 7-d, 3 levels for Time • TSD: static mode • TSDad: adaptive mode

Cube Construction • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset) • Lack of the ALL cell in the first dimension • Acceleration of cube creation up to 89% compared to Dwarf • Better use of resources through parallelization (BD) • Further reduction due to lack of ALL and selective materialization

Updates • 10k updates • TSD up to 3 times faster than Dwarf and 30% faster than BD • Ordered updates – do not affect already created views • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction) • TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%

Queries • DARPA 10k datasets – 3 kinds of querysets, 50% aggregates • Q1: Ideal • Q2: Recent records are queried in more detail (Zipfian) • Q3: Random • As the queryset approaches a uniform distribution • Message cost increases • Accuracy decreases

Questions
