Dan Han, Eleni Stroulia, University of Alberta. A 3-Dimensional Data Model for Large Time-Series Dataset Analysis in HBase. MESOCA 2012
Outline • Background and Motivation • Related Work • A 3-Dimensional Data Model in HBase • Case Study and Experiment Results • Discussion • Conclusions and Future Work
Migrating Applications to the Cloud • The cloud is an attractive computing platform • Elasticity, excellent scalability, high availability, low operating cost • Applications are moving to the cloud • Social networking, online shopping, monitoring systems • Time-series data: grows monotonically over time • Analysis of large-scale time-series data • May lead to new knowledge • May lead to improvements of existing services • Successful adoption of this paradigm requires a new model of storage
Migrating RDBMS Content to NoSQL • From RDBMS to NoSQL storage systems • Enable the storage of big data, in order of row key • Scale horizontally across storage nodes easily • Not much data-organization support • Migration challenges • Little experience and few principles to follow • Steep learning curve for programming • Much experimentation is required before deployment • Much time is spent designing the data schema • The “wrong” schema may lead to inefficient, high-latency queries
We need Design Patterns for HBase Schemas • Our objective is to develop a systematic method for • Guiding data organization in NoSQL databases, given • the types of data stored, • the amount of data • its usage patterns • We start our investigation with HBase • A NoSQL database offering, built on top of Hadoop • Parallel Distributed Computation • MapReduce Framework • Coprocessor Framework
Related Work • Talks at HBaseCon 2012, held in May • Data schema and Coprocessor are the two main topics • Experience from 30 enterprises, such as Facebook, Yapmap, eBay, Adobe • Organizing time-series data into period-specific “buckets” • OpenTSDB: a distributed, scalable time-series database, written on top of HBase • A data model in Cassandra, another NoSQL database offering • Applied in our case study
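The “period-specific buckets” idea can be illustrated with a small sketch. It is not taken from any of the talks or from OpenTSDB's actual key format: the one-hour bucket width, the `series_id-base` key layout, and the names `bucket_key` and `store` are all assumptions made for illustration, and a plain dictionary stands in for an HBase table.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed bucket width: one hour


def bucket_key(series_id, ts):
    """Return (row key, column offset) for a reading taken at Unix time ts.

    All readings of one series within the same bucket share a row;
    the offset inside the bucket plays the role of the column qualifier.
    """
    base = ts - ts % BUCKET_SECONDS          # start of the bucket
    return f"{series_id}-{base:010d}", ts - base


def store(table, series_id, ts, value):
    """Write one reading into the toy table (row -> {offset: value})."""
    row, offset = bucket_key(series_id, ts)
    table[row][offset] = value


# Hypothetical readings: two fall in the same hour, one in the next.
table = defaultdict(dict)
store(table, "station42", 7200, 5)
store(table, "station42", 7260, 6)
store(table, "station42", 10800, 7)
```

Zero-padding the bucket start keeps the row keys in chronological order under lexicographic sorting, which is how a key-ordered store would lay them out.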
Data Organization in HBase • Cell in HBase • (Row, Family:Column, Version) => (X, Y, Z) = value • [Figure: a two-dimensional (X, Y) table versus a three-dimensional (X, Y, Z) cube]
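The mapping from (Row, Family:Column, Version) to a value can be sketched as a toy in-memory structure. This is only an illustration of the three coordinates, not HBase's implementation or client API; the class name `Table3D`, its methods, and the choice of storing a snapshot number in the version dimension are assumptions.

```python
class Table3D:
    """Toy stand-in for an HBase table: (row, column, version) -> value."""

    def __init__(self):
        self._cells = {}  # row -> column -> {version: value}

    def put(self, row, column, version, value):
        """Store a value at the given three coordinates."""
        self._cells.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        """Read one cell; with no version given, return the newest one."""
        versions = self._cells[row][column]
        if version is None:
            version = max(versions)
        return versions[version]

    def scan_versions(self, row, column, lo, hi):
        """Return (version, value) pairs with lo <= version <= hi, ascending."""
        versions = self._cells[row][column]
        return [(v, versions[v]) for v in sorted(versions) if lo <= v <= hi]


# Hypothetical usage: the version axis (Z) holds the snapshot number.
t = Table3D()
for snapshot, mass in [(1, 0.50), (2, 0.55), (3, 0.61)]:
    t.put("2-33446666", "props:mass", snapshot, mass)

latest = t.get("2-33446666", "props:mass")
history = t.scan_versions("2-33446666", "props:mass", 1, 2)
```

Keeping the time axis in the version dimension means the history of one property of one entity is read from a single cell address rather than from many rows.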
Case Study: The Datasets • Cosmology Dataset • Product of an N-body simulation • Three types of particles: dark matter, gas and star • Particles evolve over a series of discrete timestamps • Each snapshot records the properties of all particles at the time of the snapshot • 9 snapshots, consisting of 321,065,547 particles • Bixi Dataset • Data from a bicycle-renting service in the city of Montreal • Every minute, statistics about bike usage at a station are collected by a sensor • 96,842 data points
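Three Schemas for the Cosmology Dataset • Schema1, Schema2, Schema3 • Sample row keys per region (as extracted): Region 24-2-33446666 2-33446666 2-00005533 • Region 64-2-33559999 2-33550000 2-66664433 • Region 84-2-33550000 2-33559999 2-99995533 • [Figure: the three schemas laid out along the X, Y and Z dimensions]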
Three Schemas for the Bixi Dataset • [Figure: Schema1, Schema2 and Schema3, each arranging the Time, metrics and X dimensions differently]
Experiment Results • Experiment Environment • Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support) • A four-node cluster on virtual machines • Queries for each dataset • Three queries on the Cosmology dataset, from related research • One query on the Bixi dataset, from a business requirement • Query processing implementation • Native Java API • User-level Coprocessor implementation
Query1 of Cosmology Dataset • Get all the particles of this type in this snapshot whose property matches the expression
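The logic of Query1 can be sketched over a toy in-memory dataset. This is not the authors' HBase or Coprocessor implementation; the dictionary layout keyed by (particle id, type, snapshot), the function name `query1`, and the sample values are assumptions for illustration.

```python
def query1(particles, ptype, snapshot, prop, predicate):
    """Return sorted ids of particles of one type in one snapshot
    whose property value satisfies the given predicate."""
    return sorted(
        pid
        for (pid, t, s), props in particles.items()
        if t == ptype and s == snapshot
        and prop in props and predicate(props[prop])
    )


# Hypothetical data: (particle id, type, snapshot) -> properties.
particles = {
    (1, "gas", 4): {"mass": 0.8},
    (2, "gas", 4): {"mass": 0.2},
    (3, "star", 4): {"mass": 0.9},
    (1, "gas", 5): {"mass": 0.1},
}

heavy_gas = query1(particles, "gas", 4, "mass", lambda m: m > 0.5)
```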
Query2 of Cosmology Dataset • Get all the particles added/destroyed between S1 and S2
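One reading of Query2 is a pair of set differences between the particle ids present in the two snapshots. The sketch below assumes that reading; the function name `query2` and the per-snapshot id sets are hypothetical and do not reflect how the data is stored in HBase.

```python
def query2(ids_by_snapshot, s1, s2):
    """Return (added, destroyed) particle ids between snapshots s1 and s2.

    added:     present in s2 but not in s1
    destroyed: present in s1 but not in s2
    """
    before = ids_by_snapshot[s1]
    after = ids_by_snapshot[s2]
    return after - before, before - after


# Hypothetical data: snapshot number -> set of particle ids it contains.
ids_by_snapshot = {
    1: {10, 11, 12},
    2: {11, 12, 13},
}

added, destroyed = query2(ids_by_snapshot, 1, 2)
```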
Query3 of Cosmology Dataset • Get the values of the property for the given set of particles across the selected snapshots.
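Query3 amounts to a lookup over two axes: particle and snapshot. A minimal sketch follows, assuming a dictionary keyed by (particle id, snapshot); the name `query3`, the layout, and the decision to skip missing cells are illustrative assumptions rather than the implementation evaluated in the slides.

```python
def query3(cells, particle_ids, snapshots, prop):
    """Return {particle id: {snapshot: value}} for one property.

    Cells that are missing (the particle does not exist in that
    snapshot, or lacks the property) are skipped.
    """
    result = {}
    for pid in particle_ids:
        series = {}
        for snap in snapshots:
            props = cells.get((pid, snap))
            if props is not None and prop in props:
                series[snap] = props[prop]
        result[pid] = series
    return result


# Hypothetical data: (particle id, snapshot) -> properties.
cells = {
    (1, 4): {"mass": 0.8},
    (1, 5): {"mass": 0.7},
    (2, 4): {"mass": 0.2},
}

mass_history = query3(cells, [1, 2], [4, 5], "mass")
```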
Bixi Query • For a given list of stations and a time, get their average bike usage for the last 1, 2, 4, 8 and 16 days
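The Bixi query can be sketched as a windowed average per station. The code below is a plain in-memory illustration, not the Java API or Coprocessor implementation from the experiments; the function name `average_usage`, the (timestamp, usage) reading format, and the half-open window (start, now] are assumptions.

```python
DAY = 86400                 # seconds per day
WINDOWS = (1, 2, 4, 8, 16)  # look-back windows, in days


def average_usage(readings, stations, now):
    """Return {station: {days: average usage or None}}.

    readings maps a station to a list of (unix timestamp, usage) pairs.
    A window covers timestamps in (now - days * DAY, now].
    """
    result = {}
    for station in stations:
        per_window = {}
        for days in WINDOWS:
            start = now - days * DAY
            values = [u for ts, u in readings.get(station, [])
                      if start < ts <= now]
            per_window[days] = sum(values) / len(values) if values else None
        result[station] = per_window
    return result


# Hypothetical readings for one station, relative to the query time.
now = 20 * DAY
readings = {
    "A": [
        (now - DAY // 2, 4),      # half a day ago
        (now - 3 * DAY // 2, 8),  # one and a half days ago
        (now - 10 * DAY, 12),     # ten days ago
    ],
}

averages = average_usage(readings, ["A", "B"], now)
```

Because each window contains the previous one, an implementation over real data could reuse the partial sums of the shorter windows instead of rescanning the readings five times.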
Discussion • “Qualitative” versus “Quantitative” Suggestions • Dynamic Data versus Static Data • Historical Dataset versus Real-Time Datasets • Supported versus Non-Supported Datasets
Conclusion • A 3-dimensional data model • Data schemas that use the version dimension of HBase can yield improved performance • Fits “write-once, read-many” systems • Monitoring systems • Sensor-based systems • Version-based analysis
Future Work • More evaluation of this data model • Scalability, elasticity, and utilization • How to design data models for other datasets • Spatial dataset • Graphic dataset
Questions? Thank you