Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring† *The Ohio State University †Los Alamos National Laboratory
Outline Motivation and Introduction Background System Overview and Optimization Experiment Conclusion
Motivation • Science is becoming increasingly data driven • Strong demand for efficient data visualization • Challenges: • Fast data generation • Slow disk I/O and network speed • Even worse performance during visualization • Many different kinds of subsetting requests • Visualizing all the data is difficult and unnecessary
Data Subsetting in ParaView • A widely used data analysis and visualization application • Problem: Load + Filter mode • Loads the entire data set • Filters data at the visualization level • Threshold Filter: based on values • Extract Subset Filter: based on dimension info • Grid transformation needed during filtering • Regular Structured Grid -> Unstructured Grid
A Faster Solution • Subset at the I/O level • User specifies the subset in one query for both dimension and value ranges • Reduced I/O time and memory footprint • SQL queries in ParaView • Query over Dimensions – API support • Query over Values - Indexing • Bitmap Indices and Parallel Bitmap Indices • Efficient subsetting over values
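To make the idea of one query covering both dimension and value ranges concrete, here is a minimal reference sketch in Python. It is not the authors' ParaView plugin: the `SubsetQuery` class, the `select` function, and the half-open range convention are all hypothetical names and choices made for illustration, and the data is assumed to be a flat row-major array held in memory.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class SubsetQuery:
    """Hypothetical query object: one request carrying both kinds of condition."""
    dim_ranges: list    # one half-open (lo, hi) index range per dimension
    value_range: tuple  # half-open (lo, hi) range on the variable's value


def select(values, shape, query):
    """Return (index tuple, value) pairs that satisfy both conditions.

    `values` is a flat row-major array with the given `shape`.
    Only positions inside the dimension ranges are ever touched.
    """
    vlo, vhi = query.value_range
    hits = []
    for idx in product(*(range(lo, hi) for lo, hi in query.dim_ranges)):
        # Convert the multi-dimensional index to a row-major position.
        pos = 0
        for i, n in zip(idx, shape):
            pos = pos * n + i
        v = values[pos]
        if vlo <= v < vhi:
            hits.append((idx, v))
    return hits


# Toy 2 x 3 grid; select columns 1..2 of every row where 5 <= value < 10.
grid = [5, 1, 7, 8, 2, 9]
query = SubsetQuery(dim_ranges=[(0, 2), (1, 3)], value_range=(5, 10))
result = select(grid, (2, 3), query)
```

This loop defines what the answer should be; the indexing techniques on the following slides aim to produce the same subset without scanning values one by one.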
Background: Bitmap Indexing • FastBit: widely used in scientific data management • Suitable for floating-point values when they are binned into small ranges • Run-length compression (WAH, BBC) • Compresses each bitvector based on consecutive runs of 0s or 1s
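The binning and compression steps above can be sketched in a few lines of Python. This is an illustration only, not FastBit's API: the function names are hypothetical, the bin edges are assumed to be given, and plain run-length encoding stands in for WAH or BBC, which are word-aligned and byte-aligned schemes with a more involved layout.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitvector (list of 0/1) per bin.

    Bin i covers the half-open range [bin_edges[i], bin_edges[i + 1]).
    Values outside all bins are simply left unindexed in this sketch.
    """
    nbins = len(bin_edges) - 1
    bitmaps = [[0] * len(values) for _ in range(nbins)]
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if 0 <= b < nbins:
            bitmaps[b][pos] = 1
    return bitmaps


def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]


values = [0.1, 0.2, 1.5, 1.7, 2.9, 0.3]
bitmaps = build_bitmap_index(values, [0, 1, 2, 3])
compressed = [run_length_encode(b) for b in bitmaps]
# Bin [0, 1) marks positions 0, 1 and 5, so it compresses to
# [(1, 2), (0, 3), (1, 1)].
```

Long runs of identical bits are what make the compression effective, which is why scientific data with spatially coherent values tends to index well.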
Bitmap Index and Dim Subset • Run-length compression (WAH, BBC) • Good: compression rate, fast bitwise operations • Bad: the ability to locate a dim subset is lost • Two traditional methods: • With bitmap indices: post-filter on dim info • Without bitmap indices: post-filter on values • Two-phase optimization: • Index generation: distributed indices over sub-blocks • Index retrieval: transform dim subsetting info into bitvectors, and support fast bitwise operations
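The index retrieval idea, turning the dimension subset into a bitvector so that it can be combined with the value bitvector by a bitwise AND instead of a post-filter, can be sketched as follows. This is a simplified illustration under stated assumptions: Python integers serve as uncompressed bitvectors, the layout is row-major, ranges are half-open, and all function names are hypothetical. The actual system operates on compressed bitvectors.

```python
from itertools import product


def dim_subset_mask(shape, ranges):
    """Bitmask with bit p set when row-major position p lies inside `ranges`."""
    mask = 0
    for idx in product(*(range(lo, hi) for lo, hi in ranges)):
        pos = 0
        for i, n in zip(idx, shape):
            pos = pos * n + i
        mask |= 1 << pos
    return mask


def value_mask(values, lo, hi):
    """Bitmask with bit p set when lo <= values[p] < hi.

    A real bitmap index would obtain this by OR-ing the stored bin bitvectors.
    """
    mask = 0
    for pos, v in enumerate(values):
        if lo <= v < hi:
            mask |= 1 << pos
    return mask


def positions(mask):
    """List the positions of the set bits, lowest first."""
    out = []
    pos = 0
    while mask:
        if mask & 1:
            out.append(pos)
        mask >>= 1
        pos += 1
    return out


grid = [5, 1, 7, 8, 2, 9]                          # 2 x 3, row-major
dims = dim_subset_mask((2, 3), [(0, 2), (1, 3)])   # columns 1..2 of both rows
vals = value_mask(grid, 5, 10)                     # 5 <= value < 10
hits = positions(dims & vals)                      # one bitwise AND, no post-filter
```

Because both conditions are expressed as bitvectors, the combined subset comes out of a single AND, and only the matching positions need to be read from disk.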
System Overview • Parse the SQL expression • Parse the metadata file • Generate the query request • Generate the index if it does not exist yet; retrieve it on later queries
Optimization 1: Distributed Index Generation Study relationship between Queries and Partitions. Partition the data based on Query Preference
Index Partition Strategy • α rate: participation rate of data elements • Number of elements involved in indexing / total data size • Worst: all elements have to be involved • Ideal: the elements involved are exactly the dim subset • Partition Strategies: • Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions. • Strategy 2: In general cases where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim. • Strategy 3: If queries only include a subset of dims, the partition should also be based on these dims.
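A small calculation can illustrate how the partition layout affects the α rate. The sketch below rests on an interpretation that the slides do not spell out: an index block participates in a query whenever it overlaps the queried dimension ranges, and α is the fraction of all elements that lie in participating blocks. The function name `alpha_rate`, the half-open ranges, and the requirement that each dimension divides evenly into its partitions are assumptions made for illustration.

```python
def alpha_rate(shape, parts, ranges):
    """Fraction of elements lying in index blocks that overlap the query box.

    shape:  number of elements along each dimension
    parts:  number of partitions along each dimension (must divide shape)
    ranges: half-open (lo, hi) query range along each dimension
    """
    touched = 1
    total = 1
    for n, p, (lo, hi) in zip(shape, parts, ranges):
        block = n // p                  # block length along this dimension
        first = lo // block             # first overlapping block
        last = (hi - 1) // block        # last overlapping block
        touched *= (last - first + 1) * block
        total *= n
    return touched / total


shape = (8, 8)
query = [(0, 2), (0, 8)]   # subset 25% of the first dim, all of the second

along_queried_dim = alpha_rate(shape, (4, 1), query)   # 0.25, matches the subset
along_other_dim = alpha_rate(shape, (1, 4), query)     # 1.0, every block touched
balanced = alpha_rate(shape, (2, 2), query)            # 0.5
```

With the same number of partitions, splitting along the queried dimension reaches the ideal α, splitting along the other dimension gives the worst case, and a balanced split falls in between. This is consistent with Strategies 2 and 3 above.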
Optimization 2: Index Retrieval Post-filter?
Parallel Index Architecture • L1: data file • L2: variable • L3: data block
Experiment Setup • Goals: • SQL subsetting vs. Load + Filter in ParaView • Scalability of the parallel indexing method • Indexing and partition strategy vs. FastQuery • Dataset: • Parallel Ocean Program • Data size: 33.6 GB • Data format: NetCDF (array based) • Environment: • IBM Xeon cluster, 8 cores, 2.53 GHz • 12 GB memory
Efficiency Comparison with Filtering in ParaView • Data size: 5.6 GB • Input: 400 queries • Depends on subset percentage • General index method is better than filtering when data subset < 60% • Two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method • Index m1: Bitmap indexing, no optimization • Index m2: Use bitwise operations instead of post-filtering • Index m3: Use both bitwise operations and index partitioning • Filter: load all data + filter
Memory Comparison with Filtering in ParaView • Data size: 5.6 GB • Input: 400 queries • Depends on subset percentage • General index method has a much smaller memory cost than the filtering method • Two-phase optimization adds only a small extra memory cost • Index m1: Bitmap indexing, no optimization • Index m2: Use bitwise operations instead of post-filtering • Index m3: Use both bitwise operations and index partitioning • Filter: load all data + filter
Scalability with Different Proc# • Data size: 8.4 GB • Proc#: 6, 24, 48, 96 • Input: 100 queries • X axis: subset percentage • Y axis: time • Each process takes care of one sub-block • Good scalability as the number of processes increases
Alpha Rate with Different Proc# • Data size: 8.4 GB • Proc#: 6, 24, 48, 96 • Input: 100 queries • X axis: subset percentage • Y axis: alpha rate • More processes means more index partitions • Good participation rate when selecting a smaller percentage of the data
Alpha Rate and IO Access Times Comparison with FastQuery • FastQuery: • Build relational table view over scientific dataset • Difference: doesn’t consider multi-dimension data features • Data size: 8.4 GB, 48 processes • Query Type: value + 1st dim, value + 2nd dim, value + 3rd dim, overall • Input: 100 queries for each query type
Efficiency Comparison with FastQuery • Data size: 8.4 GB • Proc#: 48 • Input: 100 queries for each query type • Achieved a 1.41x to 2.12x speedup compared with FastQuery
Conclusion • Big data is an issue in data analysis and visualization • Find the exact data subset at the I/O level with an SQL interface and bitmap indexing • A good speedup compared with the filtering method • Data partition strategy and parallel indexing • A good speedup compared with FastQuery