Support Data Sampling Using Bitmap Indices over Scientific Dataset Yu Su*, Gagan Agrawal*, Jon Woodring† *The Ohio State University †Los Alamos National Lab
Outline • Motivation and Introduction • Background • System Overview • Index Sampling and Optimizations • Experiment Results • Conclusion
Motivation • Science is becoming increasingly data driven; • Strong requirement for efficient data analysis; • Challenges: • Fast data generation speed • Slow disk I/O and network speed • Some numbers from the Roadrunner EC3 simulation: • 4000^3 particles, 36 bytes per particle => 2.3 TB per time step • 10 GB/s • A 230-fold difference, and bigger in the future • Extremely hard to analyze or visualize the entire data
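The figures on this slide can be checked with a few lines of arithmetic. The sketch below reads the particle count as 4000^3 (the exponent appears to have been flattened in extraction) and uses decimal units; the variable names are illustrative only.

```python
# Figures quoted on the slide; decimal units (1 TB = 1e12 bytes) are assumed.
particles = 4000 ** 3                 # 6.4e10 particles
bytes_per_particle = 36
step_bytes = particles * bytes_per_particle

step_tb = step_bytes / 1e12           # data volume per time step, in TB
ratio = step_bytes / 10e9             # compared against the quoted 10 GB figure

print(f"{step_tb:.1f} TB per time step, ratio {ratio:.0f}")
```

Both numbers agree with the slide: about 2.3 TB per time step and a ratio of about 230.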
Existing Data Management Methods • Client-side Subsetting • Server-side Subsetting • Request types: Simple Request, Advanced Request • Challenges: • What if there is no subsetting request? • What if the data subset is still big?
Server-side Data Sampling • Statistical Sampling Techniques: • Sampling selects a subset of individuals from a statistical population in order to estimate characteristics of the whole population. • Examples: • Simple Random Sampling • Stratified Random Sampling • Information Loss is Unavoidable • Error Metrics: • Mean, Variance • Histogram • QQPlot
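To make the two example techniques concrete, here is a minimal sketch in plain Python. It is not the system described in these slides: the function names, the equal-width strata, and the "at least one sample per non-empty stratum" rule are all assumptions chosen for illustration.

```python
import random
import statistics

def simple_random_sample(values, fraction, rng):
    """Draw round(fraction * N) values uniformly, without replacement."""
    k = max(1, round(len(values) * fraction))
    return rng.sample(values, k)

def stratified_random_sample(values, fraction, num_strata, rng):
    """Split the value range into equal-width strata and sample each one."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_strata or 1.0      # guard against a constant input
    strata = [[] for _ in range(num_strata)]
    for v in values:
        idx = min(int((v - lo) / width), num_strata - 1)
        strata[idx].append(v)
    sample = []
    for stratum in strata:
        if stratum:
            # Assumption: keep at least one value from every non-empty stratum.
            k = max(1, round(len(stratum) * fraction))
            sample.extend(rng.sample(stratum, k))
    return sample

rng = random.Random(0)
population = [rng.gauss(20.0, 5.0) for _ in range(10_000)]
true_mean = statistics.mean(population)
for name, s in [
    ("simple", simple_random_sample(population, 0.01, rng)),
    ("stratified", stratified_random_sample(population, 0.01, 10, rng)),
]:
    print(name, len(s), abs(statistics.mean(s) - true_mean))
```

The printed difference in means is one of the error metrics listed above. Note that the "at least one" rule slightly oversamples sparse strata, which is a design choice of this sketch.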
Data Sampling Challenges • Challenges in Scientific Data Management: • Data Accuracy: existing methods fail to consider data features • Data Value Distribution • Data Spatial Locality • Error Calculation is time-consuming • Cannot support sampling over flexible data subsets • Data has to be reorganized • Bitmap indexing has been widely used • Supports efficient subsetting over values • FastBit, FastQuery, our ICPP work
Our Solution • A server-side subsetting and sampling framework • Standard SQL interface • Data Subsetting over Dimensions and Values, e.g. TEMP(longitude, latitude, depth) • Flexible sampling mechanism • Support Data Sampling over Bitmap Indices • No data reorganization is needed • Generate accurate error metrics • Support Error Prediction before sampling the data • Support data sampling over flexible data subsets
Background: Bitmap Indexing • Widely used in Scientific Data Management • Suitable for floating-point values, which are binned into small ranges • Run-Length Compression (WAH, BBC) • Compresses each bitvector based on runs of consecutive 0s or 1s
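The idea can be sketched as follows: one bitvector per value bin, with a 1 wherever an element falls into that bin, followed by run-length compression. The encoder below is a plain (bit, run length) scheme used only to show the principle; it is not WAH or BBC, which pack runs into machine words or bytes. All names and the sample data are hypothetical.

```python
def build_bitmap_index(values, bin_edges):
    """Return one bitvector (list of 0/1) per bin [edges[i], edges[i+1])."""
    num_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(num_bins)]
    for pos, v in enumerate(values):
        for b in range(num_bins):
            is_last = b == num_bins - 1
            in_bin = bin_edges[b] <= v < bin_edges[b + 1]
            # The last bin is closed on the right so the maximum edge is covered.
            if in_bin or (is_last and v == bin_edges[-1]):
                bitvectors[b][pos] = 1
                break
    return bitvectors

def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]

values = [1.1, 1.2, 1.15, 3.4, 3.5, 3.45, 1.3, 1.25]
edges = [1.0, 2.0, 3.0, 4.0]
index = build_bitmap_index(values, edges)
for b, bv in enumerate(index):
    print(f"bin [{edges[b]}, {edges[b + 1]}):", bv, run_length_encode(bv))
```

Because neighbouring elements in this toy dataset have similar values, each bitvector consists of a few long runs, which is the case in which run-length compression pays off.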
Data Sampling Using Bitmap Indices • Features: • Different bitvectors reflect the value distribution; • Each bitvector keeps the data locality of the underlying layout: • Row major, Column major • Hilbert Curve, Z-order Curve • Method: • Perform stratified sampling within each bitvector; • Multi-level indexing generates multi-level samples;
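As an illustration of how a space-filling curve preserves locality, the sketch below orders the cells of a 2D grid along a Z-order (Morton) curve by interleaving the bits of the coordinates. It is a generic textbook construction, not code from the presented system, and the function names are made up for this example.

```python
def interleave_bits(x, y, bits=16):
    """Morton code: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def z_order(width, height):
    """List all (x, y) cells of a grid in Z-order."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return sorted(cells, key=lambda c: interleave_bits(c[0], c[1]))

print(z_order(4, 4))
```

Cells that are close in the grid tend to get nearby positions in the resulting 1D order, so a contiguous stride of a bitvector built over this order corresponds to a compact spatial region.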
Stratified Sampling over Bitvectors • S1: Index Generation • S2: Divide each Bitvector into Equal Strides • S3: Randomly select a certain % of the 1s out of each stride
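Steps S2 and S3 can be sketched on an uncompressed bitvector as shown below. This is a simplified illustration: it assumes the bitvector is a plain Python list of 0/1 rather than a compressed one, and it keeps at least one 1 per non-empty stride, which is an assumption of this sketch rather than something stated on the slide.

```python
import random

def sample_bitvector(bits, stride, fraction, rng):
    """Return a bitvector that keeps roughly `fraction` of the 1s in each stride."""
    sampled = [0] * len(bits)
    for start in range(0, len(bits), stride):            # S2: equal strides
        end = min(start + stride, len(bits))
        ones = [i for i in range(start, end) if bits[i]]
        if not ones:
            continue
        # Assumption: never drop a non-empty stride completely.
        k = max(1, round(len(ones) * fraction))
        for i in rng.sample(ones, k):                     # S3: random selection
            sampled[i] = 1
    return sampled

bits = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
sampled = sample_bitvector(bits, stride=8, fraction=0.5, rng=random.Random(0))
print(sampled)
```

Sampling per stride keeps the sample spread over the whole position range, and running the same routine on every bitvector keeps it spread over the value range as well.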
Error Prediction • Calculate errors based on bins instead of samples • Indices classify the data into bins; • Each bin corresponds to one value or value range; • Find a representative value for each bin: Vi; • Equal probability is forced for each bin; • Compute the number of samples within each bin: Ci; • Predict error metrics based on Vi and Ci; • Representative Value: • Small Bin: mean or median value • Big Bin: lower-bound, upper-bound, mean value
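One way to read the procedure above: since the same fraction is taken from every bin, the sample count Ci of a bin follows from the bin's population count, and a metric can then be estimated from the pairs (Vi, Ci) alone. The sketch below computes a count-weighted mean and population variance in that way. It is an assumed formulation for illustration, not necessarily the authors' formulas, and it ignores the spread of values inside a bin.

```python
def predicted_sample_counts(bin_counts, fraction):
    """Ci: expected number of samples per bin when every bin is sampled equally."""
    return [round(n * fraction) for n in bin_counts]

def predict_mean_variance(rep_values, sample_counts):
    """Estimate mean and population variance from (Vi, Ci) pairs only."""
    total = sum(sample_counts)
    mean = sum(v * c for v, c in zip(rep_values, sample_counts)) / total
    variance = sum(
        c * (v - mean) ** 2 for v, c in zip(rep_values, sample_counts)
    ) / total
    return mean, variance

rep_values = [1.0, 2.0, 3.0]          # Vi, hypothetical representative values
bin_counts = [100, 200, 100]          # number of 1s in each bitvector
counts = predicted_sample_counts(bin_counts, 0.1)
print(counts, predict_mean_variance(rep_values, counts))
```

The cost depends only on the number of bins, not on the number of samples, because no data element is touched.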
Error Prediction Metadata • Mean • Variance • Histogram • QQPlot • Mean, Variance over Strides
Error Prediction Formula (1) Mean, Variance: Histogram:
Error Prediction Formula (2) QQPlot
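A Q-Q plot pairs the quantiles of the original data with the quantiles of the sample. As a rough sketch of how such pairs could be obtained from bins alone, the code below takes the quantile at probability q to be the representative value of the first bin whose cumulative count reaches q of the total. This is an assumed approximation for illustration; the function name and the input numbers are hypothetical.

```python
def predicted_quantiles(rep_values, counts, probs):
    """Approximate quantiles from per-bin representative values and counts.

    Each probability in `probs` is expected to lie in [0, 1].
    """
    pairs = sorted(zip(rep_values, counts))
    total = sum(c for _, c in pairs)
    result = []
    for q in probs:
        cumulative = 0
        for v, c in pairs:
            cumulative += c
            if cumulative >= q * total:
                result.append(v)
                break
    return result

rep_values = [1.0, 2.0, 3.0]
original_counts = [100, 200, 100]     # 1s per bitvector in the full index
sample_counts = [10, 20, 10]          # 1s per bitvector after sampling
probs = [0.1, 0.5, 0.9]
qq_points = list(zip(
    predicted_quantiles(rep_values, original_counts, probs),
    predicted_quantiles(rep_values, sample_counts, probs),
))
print(qq_points)
```

Points on the diagonal mean that the sample reproduces the original distribution at bin resolution; the finer the bins, the finer the comparison.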
Data Subsetting + Sampling • S1: Find value subset (e.g. Val = 1.2) • S2: Find Spatial ID subset (e.g. ID = (11, 21)) • S3: Perform Sampling on Subset
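The three steps can be sketched on uncompressed bitvectors as follows. The sketch assumes that a value subset maps to a set of bitvectors, that a spatial ID subset is an inclusive range of bit positions, and that sampling is uniform within the remaining 1s; these readings, the function name, and the data are assumptions made for illustration.

```python
import random

def subset_and_sample(bitvectors, bin_ids, id_range, fraction, rng):
    """bitvectors: dict bin_id -> list of 0/1; id_range: inclusive (first, last)."""
    first, last = id_range
    result = {}
    for b in bin_ids:                                     # S1: value subset
        bits = bitvectors[b]
        stop = min(last, len(bits) - 1) + 1
        ones = [i for i in range(first, stop) if bits[i]] # S2: spatial ID subset
        k = round(len(ones) * fraction)
        result[b] = sorted(rng.sample(ones, k))           # S3: sample the subset
    return result

# Hypothetical index: the bin for value 1.2 has a 1 at every even position.
bitvectors = {"1.2": [1 if i % 2 == 0 else 0 for i in range(32)]}
picked = subset_and_sample(bitvectors, ["1.2"], (11, 21), 0.4, random.Random(0))
print(picked)
```

Only the selected bitvectors and the selected position range are visited, so the work shrinks with the size of the subset.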
Multi-attribute Subsetting and Sampling Support • S1: Generate Value Intervals for each attribute • S2: Combine the single-attribute Value Intervals into mbins • S3: Generate Bitmap Indices based on the mbins
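A minimal sketch of these steps, assuming that an mbin is one combination of per-attribute value intervals (a cell of their Cartesian product) and that each mbin gets its own bitvector. The helper names, edges, and records are hypothetical.

```python
from itertools import product

def interval_of(value, edges):
    """Index i with edges[i] <= value < edges[i+1]; last interval is right-closed."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    if value == edges[-1]:
        return len(edges) - 2
    raise ValueError(f"{value} is outside {edges[0]}..{edges[-1]}")

def build_mbin_index(records, edges_per_attr):
    """records: one tuple of attribute values per element. Returns {mbin: bitvector}."""
    # S2: every combination of single-attribute intervals is one mbin.
    mbins = list(product(*(range(len(e) - 1) for e in edges_per_attr)))
    index = {m: [0] * len(records) for m in mbins}
    # S3: set the bit of each record in the mbin it falls into.
    for pos, rec in enumerate(records):
        key = tuple(interval_of(v, e) for v, e in zip(rec, edges_per_attr))
        index[key][pos] = 1
    return index

edges_per_attr = [[0.0, 1.0, 2.0], [0.0, 20.0, 40.0]]    # S1: value intervals
records = [(0.5, 10.0), (1.5, 10.0), (0.2, 30.0), (1.9, 35.0)]
mbin_index = build_mbin_index(records, edges_per_attr)
for mbin, bv in mbin_index.items():
    print(mbin, bv)
```

The number of mbins is the product of the interval counts per attribute, so in practice the intervals per attribute have to stay coarse when many attributes are combined.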
Experiment Setup • Environment: • Darwin Cluster: 120 nodes, 48 cores, 64 GB memory • Dataset: • Ocean Data – Regular Multi-dimensional Dataset • Cosmos Data – Discrete Points with 7 attributes • Sampling Method: • Simple Random Method • Simple Stratified Random Method • KDTree Stratified Random Method • Big Bin Index Random Method • Small Bin Index Random Method
Experiment Goals • Two Applications after Sampling: • Data Visualization - Paraview • Data Mining - K-means in MATE • Goals: • Efficiency and Accuracy with and without sampling • Accuracy between different sampling methods • Efficiency between different sampling methods • Compare Predicted Error with Actual Error • Speedup for sampling over data subset
Efficiency and Accuracy of Sampling over Ocean Data • Data size: 11.2 GB TEMP • Network Transfer Speed: 20 MB/s • Speedup compared to original dataset: • 25% - 1.87; 12.5% - 3.72; • 1% - 10.97; 0.1% - 31.62; • Error Metrics: Variances over Strides • Value diffs between original and samples • Information Loss Percent: • 25% - 0.39%; 12.5% - 0.56%; • 1% - 0.91%; 0.1% - 1.18%;
Efficiency and Accuracy of Sampling over Cosmos Data • Data size: 16 GB (VX, VY, VZ) • Network Transfer Speed: 20 MB/s • Speedup compared to original dataset: • 25% - 2.11; 12.5% - 4.30; • 1% - 21.02; 0.1% - 60.14; • Kmeans: • 20 clusters, 3 dims, 50 iterations • MATE: 16 threads • Error Metrics: Means of cluster centers • Much better than other methods
Data Sampling Time • Data size: 1.4 GB • Our method: extra striding cost • Compare: the small bin random method costs 1.19 – 3.98 times more time than the KDTree random method
Error Calculation Time • Error Prediction: O(m), with m the number of bins • Error Calculation, with s the number of samples: • QQPlot: O(s log s) • Others: O(s) • Compare: Error Prediction achieved a more than 28 times speedup compared with error calculation
Total Time based on sampling times • Depends on sampling times • Comparison: Small bin methods achieved a 0.37 – 5.29 times speedup compared with KDTree random method
Subsetting Optimization • Subset over values: • Smaller Index Loading Time • Smaller Sampling Time • Speedup: • 2.28 - 21.54 for small bin • 2.25 - 13.56 for big bin • Subset over Spatial IDs: • Smaller Sampling Time • Speedup: • 1.37 - 2.48 for small bin • 1.67 - 3.02 for big bin
Conclusion • The 'Big Data' issue brings challenges; • Data sampling is necessary for data analysis; • Server-side sampling is performed over bitmap indices; • Error prediction and sampling are supported over data subsets; • Good accuracy and efficiency are achieved.