Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr Computing Systems Laboratory National Technical University of Athens
Motivation • Large volumes of data • Everyday life, science and business domains • Time-series data • Temporally ordered, organized in hierarchies (Day<Month<Year) • E.g., date of a credit card purchase, time of a phone call • Important for monitoring a process of interest • On-line processing • Fast retrieval – Point, range, aggregate queries • Detection of real-time changes in trends • Intrusion or DoS detection, effects of a product's promotion • Online, cost-efficient updates
Up till now • Data Warehouses • Centralized, off-line approaches • Distributed warehousing systems • Functionality remains centralized • Distributed Warehouse-like initiative: Brown Dwarf • Distribution of centralized Dwarf • Deployed on shared-nothing, commodity hardware • Scalability, fault tolerance, performance • No special consideration for time-series data • Update procedure costly → unfit for frequent updates
Our Goals • Cloud-based, data-warehousing-like system • Targeted to time-series data • Arriving at a high rate • Store, update, query data at various granularity levels • Multidimensional, hierarchical • Shared-nothing architecture • Commodity nodes • Without use of any proprietary tool • Java libraries, socket APIs
Our Contribution • Complete system for multidimensional time-series data • Store with one pass • Update online • Query efficiently • Point, aggregate • Various levels of granularity • Adaptive materialization • According to data recency • Accelerate cube creation/update • Minimize storage consumption
Dwarf • Dwarf computes, stores, indexes and updates materialized cubes • Eliminates prefix and suffix redundancies • Any query (point or aggregate) is answered through traversal of the structure
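To make the two redundancies concrete, here is a minimal Python sketch and not the actual Dwarf implementation. It assumes a cube stored as nested dictionaries: shared prefixes are stored once by construction, and sub-structures built from the same set of tuples are reused through a `memo` table (suffix coalescing). The names `build`, `count_nodes` and the sample fact table are illustrative assumptions.

```python
ALL = "ALL"

def build(rows, memo=None):
    """Build a nested-dict cube; each row is (dim_1, ..., dim_k, measure)."""
    if len(rows[0]) == 1:                  # no dimensions left: aggregate the measures
        return sum(r[0] for r in rows)
    key = tuple(sorted(rows))
    if memo is not None and key in memo:   # suffix coalescing: reuse an identical sub-structure
        return memo[key]
    groups = {}
    for r in rows:                         # prefix sharing: one entry per distinct value
        groups.setdefault(r[0], []).append(r[1:])
    node = {value: build(group, memo) for value, group in groups.items()}
    node[ALL] = build([r[1:] for r in rows], memo)
    if memo is not None:
        memo[key] = node
    return node

def count_nodes(node, seen=None):
    """Count the distinct dictionary nodes reachable from `node`."""
    seen = {} if seen is None else seen
    if isinstance(node, dict) and id(node) not in seen:
        seen[id(node)] = node
        for child in node.values():
            count_nodes(child, seen)
    return len(seen)

# Hypothetical fact table: (store, customer, product, price)
facts = [("S1", "C2", "P2", 70), ("S1", "C3", "P1", 40),
         ("S2", "C1", "P1", 90), ("S2", "C1", "P2", 50)]

plain = build(facts)            # no suffix coalescing
shared = build(facts, memo={})  # with suffix coalescing

print(shared["S2"][ALL]["P1"])                    # point/aggregate query by traversal
print(count_nodes(plain), count_nodes(shared))    # coalescing yields fewer nodes
```

Any query is a walk of one step per dimension, e.g. `shared[ALL][ALL][ALL]` for the grand total.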
Brown Dwarf • Dwarf nodes mapped to overlay nodes • UID for each node • Hint tables of the form (currAttr, child) • Insertion • One-pass over the fact table • Gradual structure of hint tables • Queries • Overlay path of d hops • Incremental Updates • Elasticity through adaptive mirroring
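The mapping of cube nodes to overlay nodes can be sketched as follows. This is a single-process approximation under stated assumptions: the dictionary `nodes` stands in for the overlay (UID to hint table), integer UIDs replace real node identifiers, and the cube is materialized naively without suffix coalescing. `build_overlay` and `query` are hypothetical names, not part of Brown Dwarf.

```python
from itertools import product

ALL = "ALL"

def build_overlay(facts):
    """One pass over the fact table; returns (nodes, root_uid).

    nodes maps a UID to a hint table {currAttr: child}; in the last
    dimension the hint table holds the aggregate value instead of a child.
    """
    nodes = {0: {}}
    next_uid = 1
    for *dims, measure in facts:
        # each tuple contributes to every (value or ALL) combination
        for key in product(*[(v, ALL) for v in dims]):
            uid = 0
            for attr in key[:-1]:
                if attr not in nodes[uid]:       # hint tables grow gradually
                    nodes[uid][attr] = next_uid
                    nodes[next_uid] = {}
                    next_uid += 1
                uid = nodes[uid][attr]
            nodes[uid][key[-1]] = nodes[uid].get(key[-1], 0) + measure
    return nodes, 0

def query(nodes, root, key):
    """Resolve a point or aggregate query; returns (value, hops)."""
    uid, hops = root, 1                          # the root's hint table is the first hop
    for attr in key[:-1]:
        uid = nodes[uid][attr]
        hops += 1
    return nodes[uid][key[-1]], hops

facts = [("S1", "C2", "P2", 70), ("S1", "C3", "P1", 40),
         ("S2", "C1", "P1", 90), ("S2", "C1", "P2", 50)]
nodes, root = build_overlay(facts)
print(query(nodes, root, ("S2", ALL, "P1")))
```

For a d-dimensional cube each query consults exactly d hint tables, which corresponds to the overlay path of d hops.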
Advantages and Drawbacks • Store even larger amounts of data! • Dwarf reduces but may also blow up data • High-dimensional, sparse >1,000 times • Handle many more requests • Query the system online • Accelerate creation (up to 5 times) and querying (up to 60 times) • Parallelization • Update remains costly
Time Series Dwarf (TSD) • A concept hierarchy characterizes time • and any other dimension • Updates are applied in temporal order • Temporal granularity of queries relative to the time of querying • More detailed queries for recent events • More coarse grained queries for past events
TSD Operations - Insertion • Time first in order • Lack of ALL cell in Time • Aggregate created after completion of a level
TSD Operations - Querying • Follow path along the structure • Roll-up query for aggregate already created • Within d hops (e.g., <Y1, ALL, P1>) • Roll-up query for recent records • Initial query substituted by multiple lower level queries (e.g., <Y2, S1, P1>)
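The insertion and querying rules above can be sketched in a few lines of Python. This is a simplified model and not the TSD structure itself: it assumes a two-level Time hierarchy (Month<Year), a single extra dimension (product), and flat dictionaries instead of a Dwarf. The class `TimeSeriesStore` and its methods are hypothetical.

```python
class TimeSeriesStore:
    def __init__(self):
        self.detail = {}          # (year, month, product) -> summed measure
        self.rollup = {}          # (year, product) -> summed measure, completed years only
        self.current_year = None

    def insert(self, year, month, product, amount):
        """Insert in temporal order; close a year once a later one starts."""
        if self.current_year is not None and year < self.current_year:
            raise ValueError("tuples must arrive in temporal order")
        if self.current_year is not None and year > self.current_year:
            self._close_year(self.current_year)
        self.current_year = year
        key = (year, month, product)
        self.detail[key] = self.detail.get(key, 0) + amount

    def _close_year(self, year):
        # the aggregate is created only after the level is complete
        for (y, _m, p), amount in self.detail.items():
            if y == year:
                self.rollup[(y, p)] = self.rollup.get((y, p), 0) + amount

    def year_total(self, year, product):
        """Return (total, number of lookups needed)."""
        if (year, product) in self.rollup:        # aggregate already created
            return self.rollup[(year, product)], 1
        # recent records: substitute the query with lower-level (monthly) queries
        parts = [v for (y, _m, p), v in self.detail.items()
                 if y == year and p == product]
        return sum(parts), len(parts)

store = TimeSeriesStore()
store.insert(2020, 1, "P1", 5)
store.insert(2020, 2, "P1", 7)
store.insert(2021, 1, "P1", 3)
store.insert(2021, 2, "P1", 4)
print(store.year_total(2020, "P1"))   # answered from the roll-up
print(store.year_total(2021, "P1"))   # answered from monthly records
```

The second query costs more lookups because the current year has no aggregate yet, mirroring the substitution of a roll-up query by multiple lower-level queries.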
TSD Operations - Updating • Insertion of a new tuple • Longest common prefix with existing structure • Underlying nodes recursively updated • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
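A rough way to see why dropping the ALL cell in Time reduces update work is to enumerate the aggregate cells a single new tuple touches in an uncompressed cube. The sketch below is an illustration under that assumption; it counts cells, not Dwarf nodes, so it does not reproduce the 3 vs. 12 figures of the slide's example. `affected_cells` is a hypothetical helper.

```python
from itertools import product

ALL = "ALL"

def affected_cells(dims, time_has_all=True):
    """List the cube cells that one new tuple updates.

    dims[0] is the Time value; every other dimension contributes
    both its own value and the ALL aggregate.
    """
    choices = []
    for i, value in enumerate(dims):
        if i == 0 and not time_has_all:
            choices.append((value,))        # TSD: no ALL cell in Time
        else:
            choices.append((value, ALL))
    return list(product(*choices))

new_tuple = ("Y2", "S1", "P1")
print(len(affected_cells(new_tuple)))                       # with ALL in every dimension
print(len(affected_cells(new_tuple, time_has_all=False)))   # without ALL in Time
```

Without the Time ALL cell, no cell under `ALL` in the first dimension has to be revisited, which halves the cells touched in this simplified model.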
Adaptive Materialization • A daemon process asynchronously • creates roll-up views • deletes corresponding drill-down ones • The period of this process depends on application • Tradeoff: cube size vs. response accuracy
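One pass of such a daemon might look like the sketch below. It assumes monthly detail rows keyed by `(year, month, product)` and yearly roll-up views keyed by `(year, product)`; the function name `materialize_pass` and the `keep_from_year` threshold are assumptions for illustration, and the periodic scheduling itself is left out.

```python
def materialize_pass(detail, rollup, keep_from_year):
    """Roll up every year older than keep_from_year and drop its detail rows.

    Returns the number of drill-down rows that were deleted.
    """
    old_keys = [k for k in detail if k[0] < keep_from_year]
    for key in old_keys:
        year, _month, product = key
        # create (or extend) the roll-up view, then delete the drill-down row
        rollup[(year, product)] = rollup.get((year, product), 0) + detail.pop(key)
    return len(old_keys)

detail = {(2020, 1, "P1"): 5, (2020, 2, "P1"): 7, (2021, 1, "P1"): 3}
rollup = {}
removed = materialize_pass(detail, rollup, keep_from_year=2021)
print(removed, rollup, detail)
```

After the pass, monthly queries on 2020 can no longer be answered exactly, only the yearly total, which is the cube size vs. response accuracy tradeoff noted above.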
Experimental Evaluation • 25 LAN commodity nodes (dual core, 2.0 GHz, 4GB main memory) • Synthetic and real datasets • APB-1 Benchmark generator • 4-d, 3 levels for Time, various densities • DARPA Intrusion Detection audit data • 1M tuples, 7-d, 3 levels for Time • TSD: static mode • TSDad: adaptive mode
Cube Construction • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset) • Lack of the ALL cell in the first dimension • Acceleration of cube creation up to 89% compared to Dwarf • Better use of resources through parallelization (BD) • Further reduction due to lack of ALL and selective materialization
Updates • 10k updates • TSD up to 3 times faster than Dwarf and 30% faster than BD • Ordered updates – do not affect already created views • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction) • TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%
Queries • DARPA 10k datasets – 3 kinds of querysets, 50% aggregates • Q1: Ideal • Q2: Recent records are queried in more detail (Zipfian) • Q3: Random • As the queryset approaches a uniform distribution • Message cost increases • Accuracy decreases