CS 410/510 Data Streams
Lecture 12: Stream Warehousing with DataDepot
Kristin Tufte, David Maier
Thanks to: Theodore Johnson; Lukasz Golab; Spence Seidel; Vladislav Shkapenyuk
AT&T Labs - Research
DSMS vs. Stream Warehouse
• DSMS
  • Temporal data (append only)
  • Best-effort consistency
  • Real-time response
• Stream Warehouse
  • 10s to 100s of TB of historical data
  • Time windows measured in years or decades
  • Real-time queries on recent data
  • Deep analyses on historical data
Stream Warehousing
• Real-time warehouse sourced from data streams
• All data has a timestamp (perhaps more than one)
• Real-time updates, multi-year window
• Deeply nested materialized views
• Propagate incremental updates
• A DSMS in slow motion
[Figure: raw tables feeding layers of deeply nested derived tables]
DataDepot – Key Features
• ETL – raw tables are non-materialized views
• Timestamp-based horizontal partitioning
• Multiple timestamps → timestamp correlation
• Partition dependencies for updates
• Real-time scheduling (table priorities)
• Consistency issues (leading/trailing edge)
• Warehouse dashboard & data quality tools
DataDepot Architecture
Partitioning
• Temporal partitioning for temporal tables
• Roll in at the leading edge, roll off at the trailing edge
• Set the partition size to be the update increment (see the sketch below)
  • Avoids expensive incremental updates and index rebuilding
• Correlate non-partitioning temporal fields with the partitioning field
[Figure: a window of partitions; an update with StartTime=55, EndTime=55 rolls in at the leading edge while the oldest partition rolls out]
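To make the partition arithmetic concrete, here is a minimal Python sketch of timestamp-based partitioning with roll-in and roll-off. The partition size and window length are illustrative assumptions, not DataDepot's actual settings.

```python
# Hypothetical sketch of timestamp-based partition assignment and roll-off.
# PARTITION_SIZE and WINDOW_PARTITIONS are illustrative, not from DataDepot.

PARTITION_SIZE = 300                    # one partition per 5-minute update increment
WINDOW_PARTITIONS = 12 * 24 * 365       # a one-year window of 5-minute partitions

def partition_id(ts: int) -> int:
    """Map a record timestamp (seconds) to its partition number."""
    return ts // PARTITION_SIZE

def roll(partitions: dict, leading_ts: int) -> dict:
    """Roll in the partition at the leading edge, roll off expired ones."""
    newest = partition_id(leading_ts)
    partitions.setdefault(newest, [])            # roll in at the leading edge
    oldest_kept = newest - WINDOW_PARTITIONS
    for pid in [p for p in partitions if p < oldest_kept]:
        del partitions[pid]                      # roll off at the trailing edge
    return partitions
```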
Raw Tables
• Non-materialized view of source data files
  • Except for indirect/loaded-to tables
• File finding – where are the files?
• File classification – filename metadata
• Data extraction – e.g., gzcat
• Schema (create table)
Direct Raw Tables
• Non-materialized view
• Temporal partitioning attribute
  • Timestamps taken from the file name
• Partition directory
  • Maps partitions to data files
  • New data/files → update to the partition directory
• Records are extracted only when a derived table is updated
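A minimal sketch of the partition-directory bookkeeping for a direct raw table, assuming a hypothetical filename pattern that carries the timestamp; DataDepot's actual file-classification metadata is richer.

```python
import os
import re

# Hypothetical partition directory for a direct raw table: each partition maps
# to the source files whose filename timestamps fall inside it. The filename
# pattern and partition size below are assumptions for illustration only.
FILENAME_TS = re.compile(r"feed_(\d{10})\.gz$")    # e.g., feed_1700000000.gz
PARTITION_SIZE = 300

def refresh_partition_directory(data_dir: str, directory: dict) -> dict:
    """Register newly arrived files; records are extracted later, only when a
    derived table that reads this partition is actually updated."""
    for name in os.listdir(data_dir):
        m = FILENAME_TS.search(name)
        if not m:
            continue                                # file classification failed
        pid = int(m.group(1)) // PARTITION_SIZE     # timestamp from the filename
        directory.setdefault(pid, set()).add(os.path.join(data_dir, name))
    return directory
```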
Indirect Raw Tables
• Timestamps appear in the data inside the file (not in the file name)
• Loaded-to table
  • Extract the data only once
  • Efficient append-only update
• Partition table structure similar to direct raw tables
Derived Tables
• Materialized views (defined by an EMF-SQL query)
• Specification: query; indices; partitioning function; priority
• Derived table partitions are recomputed (not incrementally updated)
  • Partitions should therefore rarely need recomputation
  • Precisely identify the partitions to be recomputed
• Specify the mapping between source and destination partitions (source lower/upper bounds, SLB/SUB) – see the sketch below
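A hedged sketch of how SLB/SUB-style bounds can pinpoint the source partitions a destination partition depends on; the linear offset form is an assumption for illustration, not DataDepot's actual bound language.

```python
# Hypothetical source lower/upper bound (SLB/SUB) mapping: given a destination
# partition of a derived table, compute the range of source partitions it can
# depend on, so only those need to be read when it is recomputed.

def source_range(dest_pid: int, slb_offset: int, sub_offset: int) -> range:
    """Source partitions [dest_pid + slb_offset, dest_pid + sub_offset] feed dest_pid."""
    return range(dest_pid + slb_offset, dest_pid + sub_offset + 1)

# Example: a derived partition that aggregates its own interval plus one
# earlier interval of slack for correlated/late timestamps.
affected_sources = list(source_range(dest_pid=100, slb_offset=-1, sub_offset=0))
print(affected_sources)   # [99, 100]
```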
Source Dependencies
Update Propagation
• Basic algorithm: recompute partitions
  • In general, most of a partition is affected
  • Also works for non-SQL views (outputs of analyses)
• Determine the source partitions of each derived partition; recompute when a source changes
• Eventual consistency
• Partition on timestamp (the temporal partitioning attribute)
• Partitions are marked with a last-update timestamp
[Figure: an update to source table S propagates to derived table D]
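A minimal sketch of the basic propagation loop, assuming each partition carries a last-update timestamp as described above; the function names are illustrative, not DataDepot's API.

```python
# Hypothetical propagation loop: a derived partition is recomputed when any of
# its source partitions carries a newer last-update timestamp than it does.

def propagate(derived, sources, source_range, recompute):
    """derived, sources: {partition_id: last_update_ts};
    source_range(pid) yields the source partitions feeding derived partition pid;
    recompute(pid) rebuilds one derived partition from scratch."""
    for dest_pid, dest_ts in derived.items():
        src_pids = list(source_range(dest_pid))
        if any(sources.get(p, 0) > dest_ts for p in src_pids):
            recompute(dest_pid)                           # recompute, don't patch
            derived[dest_pid] = max(sources.get(p, 0) for p in src_pids)
```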
Issues & Featues • Data loading architecture (files & queries) • Data generated externally • No user updates or inserts • Provenance is documented (queries) • Declarative specification enables optimization • Late data • Most data arrive in order (direct raw tables) • Partitions computed only once • Late data motivates indirect raw tables Data Streams: Lecture 12
Real-time Warehousing
• Provide real-time data as well as a historical perspective
• Important warehouse features
  • Multi-Version Concurrency Control (MVCC)
  • Multiple-granularity partitioning
  • Real-time-aware resource control
• Golab et al., Scheduling Updates in a Real-Time Stream Warehouse, ICDE 2009
MVCC
• Continuous updates should not block queries (or other updates)
• Simple implementation
  • Single updater, multiple readers
  • Swap in a new partition directory (see the sketch below); the old one is garbage collected later
[Figure: a real-time table of 1-day summaries; the new partition directory replaces the old one]
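A small sketch of the single-updater/multiple-reader pattern: readers keep using the old partition directory while the updater builds and installs a new one with a single reference swap. This illustrates the idea, not DataDepot's implementation.

```python
import threading

# Hypothetical MVCC-style table: readers always see a complete, immutable
# partition directory; the single updater builds a new copy and swaps the
# reference in one step. The old directory is reclaimed once no reader holds
# it (here, simply by Python's reference counting).

class Table:
    def __init__(self):
        self._directory = {}                     # partition_id -> data files
        self._update_lock = threading.Lock()     # enforce a single updater

    def read(self):
        return self._directory                   # readers never block

    def apply_update(self, new_partitions: dict):
        with self._update_lock:
            new_dir = dict(self._directory)      # copy the current version
            new_dir.update(new_partitions)       # install the updated partitions
            self._directory = new_dir            # atomic reference swap
```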
Temporal Correlation
• It is typical to have multiple timestamps
  • A record may represent an interval → start/end timestamps
  • Timestamps are assigned by several entities
• Multiple timestamps are often highly correlated
• Temporal correlations can be specified on derived tables
• Enables query rewrites
  • Convert a range predicate on a correlated temporal attribute into a range predicate on the partitioning attribute (see the sketch below)
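A sketch of such a rewrite, assuming the correlation "end_time lies within MAX_DURATION of start_time"; the SQL fragments, column names, and bound are illustrative, not DataDepot's syntax.

```python
# Hypothetical rewrite: if end_time is known to lie within
# [start_time, start_time + MAX_DURATION], a range predicate on end_time
# implies a (wider) range predicate on the partitioning attribute start_time,
# which enables partition pruning.

MAX_DURATION = 3600    # assumed correlation: end_time - start_time <= 1 hour

def rewrite(lo: int, hi: int) -> str:
    """Augment 'end_time BETWEEN lo AND hi' with a derived range on start_time."""
    return (f"end_time BETWEEN {lo} AND {hi} "
            f"AND start_time BETWEEN {lo - MAX_DURATION} AND {hi}")

print(rewrite(1000, 2000))
```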
Multi-granularity Partitioning
• Small partitions for new data, large partitions for old data
  • Well known in the folklore
• Two or more levels of granularity
  • Maintained as separate tables with separate storage and indices
  • Presented as one coherent combined view
• Query execution is granularity-cognizant
[Figure: fine-grained recent partitions rolled up into a coarse table, exposed as one combined table]
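A granularity-cognizant scan might route a time-range query across the fine and coarse tables roughly as follows; the partition sizes, table names, and rollup boundary are assumptions for illustration.

```python
# Hypothetical combined-view scan: recent data lives in small (5-minute)
# partitions, older data in large (1-day) partitions, and one scan covers a
# query range from both tables.

FINE, COARSE = 300, 86_400             # 5-minute vs. 1-day partitions

def combined_scan(lo: int, hi: int, rollup_boundary: int):
    """Yield (table, partition_id) pairs covering [lo, hi).
    rollup_boundary is assumed to be aligned to a 1-day partition boundary."""
    ts = lo
    while ts < hi:
        if ts < rollup_boundary:                     # old data: coarse table
            yield ("daily", ts // COARSE)
            ts = (ts // COARSE + 1) * COARSE
        else:                                        # recent data: fine table
            yield ("five_min", ts // FINE)
            ts = (ts // FINE + 1) * FINE

print(list(combined_scan(0, 2 * COARSE + 900, rollup_boundary=2 * COARSE)))
```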
Real-time Resource Control
• How should we schedule derived-table updates?
• Lazy updates (recompute at query time)? No:
  • May require huge computations across multiple tables
  • Not compatible with a real-time warehouse
• Eager (immediate) updates? No:
  • Dozens of data feeds, hundreds of tables
  • Must avoid over-subscription of server resources (memory, disk-arm thrashing, context-switch thrashing)
  • Must ensure that resources remain available for queries
• Complicating factors: parallel scheduling, overload, priorities, table hierarchies, non-preemptive scheduling
Table Staleness
[Figure: staleness S_i(t) plotted against time t. Staleness grows while the table waits for data and drops when an update is loaded: an update with data up to t=3 arrives and is loaded; an update with data between t=3 and t=7 arrives; an update with data between t=7 and t=9 arrives and is loaded.]
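Reading off the figure, staleness can be taken as the gap between the current time and the newest data actually loaded into the table; a minimal sketch of that measure (an interpretation of the plot, not a formula quoted from the paper):

```python
# Hypothetical staleness measure: how far the current time has run ahead of
# the newest loaded data. It grows linearly between loads and drops when an
# update is loaded.

def staleness(now: int, loaded_up_to: int) -> int:
    return max(0, now - loaded_up_to)

# Example matching the figure: data up to t=3 is loaded at t=5, so staleness
# drops from 5 to 5 - 3 = 2 at that point.
print(staleness(5, 0), staleness(5, 3))   # -> 5 2
```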
Real-time Aware Update Manager
• Abstract processing resources as tracks
  • CPU, memory, disk bandwidth, etc.
• Partition m jobs among n tracks
  • Don't let a long-running task block short-period tasks
• Schedule each track set independently
  • Prioritized Earliest Deadline First
• Update chopping
  • Break up updates of long-deferred tables (overload, feed problems, etc.)
[Figure: jobs J1–J10 assigned to Tracks 1–3; short, frequent jobs are kept on separate tracks from long, infrequent jobs]
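A toy sketch of track assignment and Prioritized EDF selection within a track; the job fields and the 60-second short/long cutoff are assumptions for illustration, not the paper's policy.

```python
# Hypothetical track partitioning and Prioritized Earliest Deadline First:
# short, frequent update jobs go to one track, long, infrequent jobs to
# another, and within a track the pending job with the highest priority and
# then the earliest deadline runs next.

def assign_track(job: dict) -> int:
    return 0 if job["expected_runtime"] <= 60 else 1    # short vs. long jobs

def next_job(pending: list) -> dict:
    """Prioritized EDF: highest priority first, earliest deadline breaks ties."""
    return min(pending, key=lambda j: (-j["priority"], j["deadline"]))

tracks = {0: [], 1: []}
jobs = [{"name": "feed_A", "expected_runtime": 30,  "priority": 2, "deadline": 105},
        {"name": "feed_B", "expected_runtime": 900, "priority": 1, "deadline": 600},
        {"name": "feed_C", "expected_runtime": 20,  "priority": 2, "deadline": 90}]
for job in jobs:
    tracks[assign_track(job)].append(job)

print(next_job(tracks[0])["name"])   # -> feed_C (earlier deadline on the short track)
```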
Consistency
• DataDepot provides eventual consistency
  • All data are derived from raw sources
  • If all updates get propagated, the warehouse will converge
• What happens before "eventually"?
  • Leading-edge consistency: use all the data that is available
  • Trailing-edge consistency: use only stable data
  • But when is the data stable?
[Figure: derived table D fed by sources S1 and S2, with its leading and trailing edges marked]
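One simple way to think about the two consistency points for a derived table with several sources; the min/max formulation below is an interpretation of the figure, not necessarily DataDepot's exact rule.

```python
# Hypothetical edges for a derived table D fed by sources S1, S2, ...: the
# leading edge uses everything that has arrived anywhere, while the trailing
# edge uses only data old enough that every source has reported it.

def leading_edge(source_watermarks: list) -> int:
    return max(source_watermarks)    # newest data available in any source

def trailing_edge(source_watermarks: list) -> int:
    return min(source_watermarks)    # data stable across all sources

print(leading_edge([100, 80]), trailing_edge([100, 80]))   # -> 100 80
```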
Streaming Data Quality
• Dozens of data feeds collected world-wide
• How do we know which data are "correct" and which are "dirty"?
• Integrity constraints define correct data
  • Constraint violations = dirty data
• Exploratory data mining via integrity constraints
  • "Try out" a constraint on the data and see which records satisfy or fail it
• Data Auditor: a system for analyzing the quality of a DataDepot streaming warehouse
Data Quality Analysis
• Want to find patterns in the tuples that satisfy or fail a constraint (rather than listing all violations)
• Example
  • The table ROUTER_CPU_COUNTS stores the number of CPU utilization polls (num_polls) returned by each router in each 5-minute time interval
  • Expected behaviour: num_polls = 1 for each row of ROUTER_CPU_COUNTS
[Figure: a timeline of 5-minute polling intervals from 00:00 to 00:20]
Data Quality Analysis
• Constraint: for all t in ROUTER_CPU_COUNTS, num_polls > 0
• Summary of violating subsets with a polling rate of 25% or under ('-' denotes a wildcard pattern) – see the sketch below
[Tables: sample rows of ROUTER_CPU_COUNTS and the resulting violation summary; not reproduced here]
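A sketch in the spirit of Data Auditor: test the constraint and summarize violations per wildcard pattern instead of listing every bad row. The extra column names and the choice to group only by router are assumptions for illustration.

```python
# Hypothetical violation summary over ROUTER_CPU_COUNTS: check num_polls > 0
# and report (router, '-') patterns (with '-' as a wildcard over intervals)
# whose polling success rate is at or below a threshold.

def violation_summary(rows: list, max_rate: float = 0.25) -> list:
    """Return (router, '-', polling_rate) for routers at or below max_rate."""
    stats = {}                                  # router -> (satisfying, total)
    for r in rows:
        ok, total = stats.get(r["router"], (0, 0))
        stats[r["router"]] = (ok + (r["num_polls"] > 0), total + 1)
    return [(router, "-", ok / total)
            for router, (ok, total) in stats.items() if ok / total <= max_rate]

rows = [{"router": "r1", "interval": "00:00", "num_polls": 0},
        {"router": "r1", "interval": "00:05", "num_polls": 0},
        {"router": "r2", "interval": "00:00", "num_polls": 1}]
print(violation_summary(rows))   # [('r1', '-', 0.0)]
```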
Other Types of Constraints
• Sequential dependencies
  • Golab et al., Sequential Dependencies, VLDB 2009
  • Example: the "time gaps" between successive CPU utilization polls should be between G1 and G2
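A sketch of checking such a gap constraint over one router's poll times; the thresholds G1 = 240 s and G2 = 360 s are made-up values, not from the paper.

```python
# Hypothetical sequential-dependency check: successive CPU utilization polls
# from the same router should be between g1 and g2 seconds apart; return the
# neighbouring pairs whose gap falls outside that range.

def gap_violations(poll_times: list, g1: int = 240, g2: int = 360) -> list:
    ordered = sorted(poll_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if not g1 <= b - a <= g2]

print(gap_violations([0, 300, 600, 1500]))   # [(600, 1500)]
```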
DataDepot Applications
• Darkstar
  • Collection of feeds related to the core network
  • Scripts over feeds → queries over tables
  • Time to get answers: weeks → minutes
  • Internet measurement experiments become repeatable
• Other applications: production
  • Performance management operation support
  • Warehouse backend for GS Tool installations
  • 5+ projects
  • Highest volume: 2 GB/minute and growing
Applications of Darkstar
• The point of the warehouse is to support critical analysis and operations tasks
• PathMiner
  • Analyze performance problems on the path between two endpoints
• NICE
  • Network-wide Information, Correlation & Exploration
  • Mine for the root causes of network problems
PathMiner: Customer Troubleshooting Tool
Kobus Van der Merwe, Seungjoon Lee, Carsten Lund
• Challenge: reduce the Tier 3 workload of investigating customer-impacting events
• PathMiner automatically identifies network events between customer locations (temporal and spatial correlation)
  • Collects interesting events along the path between customer locations
• Rapid analysis using a wide range of data sources
• Accurate (compared with human "guesstimates")
[Figure: events collected along the AT&T network path between customer location A and customer location B]
PathMiner: SYSLOG Events Along the Path
[Figure: SYSLOG events along the path, including "Interface down" and "Bundle down" events]
Network-wide Information Correlation & Exploration (NICE)
Mahimkar et al., Troubleshooting Chronic Conditions in a Large IP Network, CoNEXT '08
• Statistical correlation analysis across multiple data sets
• A recurring condition manifests in many measurements
• "Blind" data mining: the analyst drowns in data
• NICE starts with a symptom and identifies correlated events
[Figure: NICE pipeline – a recurring symptom and other network events pass through a unified data model and spatial proximity model into statistical correlation, yielding statistically correlated events]
Example: Uplink Overruns on dlstx34c3
[Figure: event-series joins over the input time series and event counts produce correlation scores and correlated time series; four interfaces with very high correlations with the uplink overruns all belong to the same customer]
Conclusions
• Stream warehousing
  • Inherently temporal data
  • Gathered from world-wide sources
  • Critical applications
• Temporal management via partitioning
• Consistency management
• Real-time updates in a multi-year warehouse
• Real-time and historical analysis on the same data sets
• Stream quality management