
Efficient Updates for a Shared Nothing Analytics Platform



Presentation Transcript


  1. Efficient Updates for a Shared Nothing Analytics Platform
     Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
     {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr
     Computing Systems Laboratory, National Technical University of Athens

  2. Motivation
     • Large volumes of data
       • Everyday life, science and business domain
     • Time-series data
       • Temporally ordered, organized in hierarchies (Day < Month < Year)
       • E.g., date of a credit card purchase, time of a phone call
       • Important for monitoring a process of interest
     • On-line processing
       • Fast retrieval: point, range, aggregate queries
       • Detection of real-time changes in trends
         • Intrusion or DoS detection, effects of a product's promotion
       • Online, cost-efficient updates

  3. Up till now
     • Data Warehouses
       • Centralized, off-line approaches
     • Distributed warehousing systems
       • Functionality remains centralized
     • Distributed Warehouse-like initiative: Brown Dwarf
       • Distribution of centralized Dwarf
       • Deployed on shared-nothing, commodity hardware
       • Scalability, fault tolerance, performance
       • No special consideration for time-series data
       • Update procedure costly → unfit for frequent updates

  4. Our Goals
     • Cloud-based, data-warehousing-like system
     • Targeted to time-series data
       • Arriving at high rate
     • Store, update, query data at various granularity levels
       • Multidimensional, hierarchical
     • Shared-nothing architecture
       • Commodity nodes
     • Without use of any proprietary tool
       • Java libraries, socket APIs

  5. Our Contribution
     • Complete system for multidimensional time-series data
       • Store with one pass
       • Update online
       • Query efficiently
         • Point, aggregate
         • Various levels of granularity
     • Adaptive materialization
       • According to data recency
       • Accelerate cube creation/update
       • Minimize storage consumption

  6. Dwarf
     • Dwarf computes, stores, indexes and updates materialized cubes
     • Eliminates prefix and suffix redundancies
     • Any query (point or aggregate) is answered through traversal of the structure
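
To make the "query by traversal" idea concrete, here is a minimal sketch of a cube stored as a nested dictionary with one level per dimension and an ALL cell at each level. It is a simplification under stated assumptions: every aggregate is fully materialized, so it does not perform the prefix and suffix redundancy elimination that the real Dwarf applies, and the names `build_cube`, `query` and the sample facts are hypothetical.

```python
from itertools import product

ALL = "ALL"

def build_cube(facts):
    """Build a nested-dict cube; each level holds dimension values plus an ALL cell."""
    root = {}
    for *dims, measure in facts:
        # A fact contributes to every path obtained by replacing
        # any subset of its dimension values with ALL.
        for mask in product((False, True), repeat=len(dims)):
            path = [ALL if use_all else value for value, use_all in zip(dims, mask)]
            node = root
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = node.get(path[-1], 0) + measure
    return root

def query(cube, *keys):
    """Answer a point or aggregate query by following one key per dimension."""
    node = cube
    for key in keys:
        node = node[key]
    return node

# Hypothetical fact table: (store, customer, product, sales)
facts = [("S1", "C1", "P1", 10), ("S1", "C2", "P1", 20), ("S2", "C1", "P2", 30)]
cube = build_cube(facts)
```

A point query such as `query(cube, "S1", "C1", "P1")` and an aggregate query such as `query(cube, "S1", ALL, "P1")` take the same number of steps, one per dimension.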

  7. Brown Dwarf
     • Dwarf nodes mapped to overlay nodes
       • UID for each node
       • Hint tables of the form (currAttr, child)
     • Insertion
       • One pass over the fact table
       • Gradual structure of hint tables
     • Queries
       • Overlay path of d hops
     • Incremental updates
     • Elasticity through adaptive mirroring
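
The hint-table lookup can be sketched as follows. This is an illustrative in-memory model, not the Brown Dwarf implementation: overlay nodes are plain objects in a dictionary keyed by UID instead of networked peers, and `OverlayNode`, `resolve` and the sample data are assumed names. Each hint table maps the current attribute value (currAttr) to a child UID, or to the aggregate value at the last dimension.

```python
from dataclasses import dataclass, field

@dataclass
class OverlayNode:
    uid: int
    # currAttr -> child UID, or the aggregate value at the last level
    hints: dict = field(default_factory=dict)

def resolve(nodes, root_uid, query):
    """Follow one hint per dimension; return (answer, number of hops)."""
    uid, hops = root_uid, 0
    for depth, attr in enumerate(query):
        target = nodes[uid].hints[attr]
        hops += 1  # one overlay node contacted per dimension
        if depth == len(query) - 1:
            return target, hops
        uid = target
    raise ValueError("empty query")

# Hypothetical 2-d cube with a single store: the ALL entry of the root
# points to the same child as S1, since both sub-structures are identical.
nodes = {
    0: OverlayNode(0, {"S1": 1, "ALL": 1}),
    1: OverlayNode(1, {"P1": 10, "P2": 20, "ALL": 30}),
}
```

For a d-dimensional query the loop contacts d nodes, which corresponds to the overlay path of d hops noted on the slide.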

  8. Advantages and Drawbacks
     • Store even larger amounts of data!
       • Dwarf reduces but may also blow up data
       • High dimensional, sparse >1,000 times
     • Handle many more requests
       • Query the system online
     • Accelerate creation (up to 5 times) and querying (up to 60 times)
       • Parallelization
     • Update remains costly

  9. Time Series Dwarf (TSD)
     • A concept hierarchy characterizes time
       • and any other dimension
     • Updates are applied in temporal order
     • Temporal granularity of queries relative to the time of querying
       • More detailed queries for recent events
       • More coarse-grained queries for past events
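
One way to picture "granularity relative to the time of querying" is a function that maps the age of an event to a level of the time hierarchy. The sketch below is only an illustration: the function name `granularity` and the 31-day and 365-day windows are assumptions, not values from the presentation.

```python
from datetime import date

def granularity(event_day, today, day_window=31, month_window=365):
    """Pick the time-hierarchy level at which an event is expected to be queried.

    The window sizes are hypothetical defaults chosen for illustration.
    """
    age_in_days = (today - event_day).days
    if age_in_days < day_window:
        return "day"      # recent events: most detailed level
    if age_in_days < month_window:
        return "month"    # older events: coarser level
    return "year"         # past events: coarsest level
```

For instance, with `today = date(2024, 6, 1)`, an event from 20 May 2024 maps to the day level, while one from January 2020 maps to the year level.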

  10. TSD Operations - Insertion
      • Time first in order
      • Lack of ALL cell in Time
      • Aggregate created after completion of a level
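
The rule "aggregate created after completion of a level" can be sketched with a two-level time hierarchy. This is a simplified model under assumptions: tuples are `(year, month, value)` and arrive in temporal order, there are no non-time dimensions, and `insert_ordered` is a hypothetical name. A year's roll-up is created only once a tuple of a later year shows that the year is complete.

```python
def insert_ordered(tuples):
    """Insert time-ordered (year, month, value) tuples in one pass.

    Returns (detail, rollups): monthly cells, and yearly aggregates that
    exist only for years already completed. Assumes temporal ordering.
    """
    detail = {}    # (year, month) -> summed value
    rollups = {}   # year -> summed value, created when the year is complete
    current_year = None
    for year, month, value in tuples:
        if current_year is not None and year != current_year:
            # The previous year can receive no more tuples: build its aggregate.
            rollups[current_year] = sum(
                v for (y, _), v in detail.items() if y == current_year
            )
        current_year = year
        detail[(year, month)] = detail.get((year, month), 0) + value
    return detail, rollups
```

Note that there is no ALL cell over time here: the year still in progress has monthly cells but no aggregate yet.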

  11. TSD Operations - Querying
      • Follow path along the structure
      • Roll-up query for aggregate already created
        • Within d hops (e.g., <Y1, ALL, P1>)
      • Roll-up query for recent records
        • Initial query substituted by multiple lower-level queries (e.g., <Y2, S1, P1>)
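
The two roll-up cases can be sketched over a two-level time hierarchy. The data layout (monthly cells keyed by `(year, month)` and yearly aggregates keyed by `year`) and the name `query_year` are assumptions made for illustration; non-time dimensions are left out.

```python
def query_year(detail, rollups, year):
    """Answer a yearly roll-up query; return (answer, number of lookups).

    If the aggregate was already created, a single lookup suffices.
    Otherwise the query is substituted by one lower-level query per month.
    """
    if year in rollups:
        return rollups[year], 1
    parts = [v for (y, _), v in detail.items() if y == year]
    return sum(parts), len(parts)

# Hypothetical state: 2023 is complete and aggregated, 2024 is still in progress.
detail = {(2023, 11): 5, (2023, 12): 7, (2024, 1): 3, (2024, 2): 4}
rollups = {2023: 12}
```

The lookup count stands in for message cost: a query on recent records costs more lookups because it fans out into several lower-level queries.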

  12. TSD Operations - Updating
      • Insertion of a new tuple
        • Longest common prefix with existing structure
        • Underlying nodes recursively updated
      • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected
      • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
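
The longest-common-prefix step can be sketched on a plain prefix tree built from nested dictionaries. This is a simplification: it only measures how much of the new tuple's path already exists and creates the missing part, and it does not model ALL cells or the recursive update of underlying nodes. The name `insert_path` is hypothetical.

```python
def insert_path(root, path):
    """Insert a tuple's dimension path; return the length of the common prefix
    it shares with the existing structure."""
    node = root
    shared = 0
    matching = True
    for key in path:
        if matching and key in node:
            shared += 1          # this cell already exists
        else:
            matching = False     # from here on, cells are newly created
        node = node.setdefault(key, {})
    return shared
```

With temporally ordered updates and time as the first dimension, a tuple for a new period shares no prefix with older periods, so their cells are left untouched.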

  13. Adaptive Materialization
      • A daemon process asynchronously
        • creates roll-up views
        • deletes corresponding drill-down ones
      • The period of this process depends on the application
      • Tradeoff: cube size vs. response accuracy
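
A single pass of such a daemon might look like the sketch below. It assumes monthly drill-down cells keyed by `(year, month)`, yearly roll-up views keyed by `year`, and that the daemon is the only component creating roll-ups; the name `materialize` and the rule "compact every year before the current one" are illustrative choices, not the presentation's policy.

```python
def materialize(detail, rollups, current_year):
    """One daemon pass: roll completed years up and drop their monthly cells.

    Mutates both dictionaries in place. Assumes no other component has
    already created roll-ups for the years being compacted.
    """
    for year, month in sorted(detail):       # sorted() copies the keys first
        if year < current_year:
            value = detail.pop((year, month))              # delete drill-down cell
            rollups[year] = rollups.get(year, 0) + value   # extend roll-up view
```

After a pass, monthly queries on compacted years can no longer be answered exactly, which is the cube size versus response accuracy tradeoff. In a running system the function would be scheduled periodically, with a period chosen per application.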

  14. Experimental Evaluation
      • 25 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
      • Synthetic and real datasets
        • APB-1 Benchmark generator
          • 4-d, 3 levels for Time, various densities
        • DARPA Intrusion Detection audit data
          • 1M tuples, 7-d, 3 levels for Time
      • TSD: static mode
      • TSDad: adaptive mode

  15. Cube Construction
      • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)
        • Lack of the ALL cell in the first dimension
      • Acceleration of cube creation up to 89% compared to Dwarf
        • Better use of resources through parallelization (BD)
        • Further reduction due to lack of ALL and selective materialization

  16. Updates
      • 10k updates
      • TSD up to 3 times faster than Dwarf and 30% faster than BD
        • Ordered updates do not affect already created views
        • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)
      • TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%

  17. Queries
      • DARPA 10k datasets; 3 kinds of querysets, 50% aggregates
        • Q1: Ideal
        • Q2: Recent records are queried upon in more detail (Zipfian)
        • Q3: Random
      • As queryset approximates uniform distribution
        • Message cost increases
        • Accuracy decreases
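
A Zipfian queryset that favors recent records, as in Q2, could be generated along these lines. The sketch is an assumption about how such a workload might be built, not the generator used in the evaluation: `zipf_queryset`, the exponent `s`, and the period labels are hypothetical. Periods are ranked from most to least recent and sampled with weight proportional to 1 / rank^s.

```python
import random

def zipf_queryset(periods, n_queries, s=1.0, seed=0):
    """Sample query targets so that recent periods are hit more often.

    `periods` must be ordered from most recent to oldest; rank r gets
    weight 1 / r**s. A seeded generator keeps the workload reproducible.
    """
    rng = random.Random(seed)
    weights = [1.0 / rank ** s for rank in range(1, len(periods) + 1)]
    return rng.choices(periods, weights=weights, k=n_queries)

# Hypothetical period labels, most recent first.
periods = [f"day-{i}" for i in range(10, 0, -1)]
queries = zipf_queryset(periods, 10_000)
```

Setting `s=0` gives equal weights, which corresponds to a uniformly random queryset like Q3.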

  18. Questions
