
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets



Presentation Transcript


  1. Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring† *The Ohio State University †Los Alamos National Laboratory

  2. Outline Motivation and Introduction Background System Overview and Optimization Experiment Conclusion

  3. Motivation • Science is becoming increasingly data driven; • Strong desire for efficient data visualization; • Challenges: • Fast data generation speed • Slow disk I/O and network speed • Poor performance during visualization • Different kinds of subsetting requests • Difficult and unnecessary to visualize all the data

  4. Data Subsetting in ParaView • A widely used data analysis and visualization application • Problems with the Load + Filter mode: • Load the entire dataset • Data filtering at the visualization level • Threshold Filter: based on values • Extract Subset Filter: based on dimension info • Grid transformation needed during filtering • Regular Structured Grid -> Unstructured Grid

  5. A Faster Solution • Subset at the I/O level • The user specifies the subset in one query over both dimension and value ranges (see the sketch below) • Reduced I/O time and memory footprint • SQL queries in ParaView • Query over dimensions – API support • Query over values – indexing • Bitmap indices and parallel bitmap indices • Efficient subsetting over values
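As an illustration of I/O-level subsetting (a minimal sketch, not the actual ParaView plugin code), the Python snippet below reads only the requested dimension ranges from a NetCDF file with netCDF4 and then applies the value predicate; the file name, variable name, and ranges are hypothetical.

```python
# Minimal sketch of I/O-level subsetting (hypothetical file, variable, and ranges).
# Conceptually: SELECT TEMP FROM pop.nc
#               WHERE 0 <= lat < 100 AND 0 <= lon < 200 AND TEMP > 10.0
from netCDF4 import Dataset

with Dataset("pop.nc") as nc:
    temp = nc.variables["TEMP"][0:100, 0:200]   # dimension subset read at the I/O level
    subset = temp[temp > 10.0]                  # value subset (index-accelerated in the paper)

print(subset.shape)
```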

  6. Background: Bitmap Indexing • FastBit: widely used in scientific data management • Suitable for floating-point values by binning them into small ranges • Run-Length Compression (WAH, BBC) • Compresses bitvectors based on runs of consecutive 0s or 1s (see the sketch below)
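A minimal sketch of a binned bitmap index, using made-up values and bin edges; it illustrates the idea behind FastBit-style indexing but not its actual implementation (no WAH/BBC compression is performed here).

```python
# Minimal sketch of a binned bitmap index (illustrative data).
import numpy as np

values = np.array([3.2, 7.9, 1.5, 8.8, 4.1, 9.6, 2.7, 6.3])
bin_edges = np.array([0.0, 2.5, 5.0, 7.5, 10.0])      # small value ranges (bins)

# One bitvector per bin: bit i is set if values[i] falls into that bin.
bitvectors = [(values >= lo) & (values < hi)
              for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

# The value query "values >= 5.0" is answered by OR-ing the bitvectors of all
# bins whose range lies in [5.0, inf); a real system would store each bitvector
# with run-length compression (WAH/BBC) before use.
hits = bitvectors[2] | bitvectors[3]
print(np.nonzero(hits)[0])   # positions of the selected elements
```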

  7. Bitmap Index and Dim Subset • Run-Length Compression (WAH, BBC) • Good: compression rate, fast bitwise operations; • Bad: the ability to locate a dimension subset is lost; • Two traditional methods: • With bitmap indices: post-filter on dimension info; • Without bitmap indices: post-filter on values; • Two-phase optimization: • Index Generation: distribute indices over sub-blocks; • Index Retrieval: transform the dimension subsetting info into bitvectors and use fast bitwise operations (see the sketch below);
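A minimal sketch of the index-retrieval idea, assuming a tiny made-up 2-D variable: the dimension subset is turned into a bitvector over the same flattened layout as the value bitvectors, so it can be combined with a bitwise AND instead of post-filtering each element.

```python
# Bitwise operations instead of post-filtering (illustrative sizes and ranges).
import numpy as np

ny, nx = 4, 6                            # a tiny 2-D variable, stored row-major
values = np.random.rand(ny, nx).ravel()

# Value bitvector from the bitmap index: values > 0.5
value_bits = values > 0.5

# Dimension subset "rows 1..2, cols 2..4" expressed as a bitvector over the
# flattened layout, so it can be AND-ed directly with value bitvectors.
dim_bits = np.zeros(ny * nx, dtype=bool)
rows, cols = np.meshgrid(np.arange(1, 3), np.arange(2, 5), indexing="ij")
dim_bits[(rows * nx + cols).ravel()] = True

hits = value_bits & dim_bits             # fast bitwise AND replaces per-element post-filtering
print(np.nonzero(hits)[0])
```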

  8. System Overview • Parse the SQL expression • Parse the metadata file • Generate the query request • Index generation, if indices have not yet been generated • Index retrieval afterwards

  9. Optimization 1: Distributed Index Generation • Study the relationship between queries and partitions • Partition the data based on query preference

  10. Index Partition Strategy • α rate: participation rate of data elements (a small worked example follows below) • Number of elements involved in indexing / total data size • Worst: all elements have to be involved • Ideal: the involved elements exactly match the dimension subset • Partition Strategies: • Strategy 1: α is proportional to the dimension subsetting percentage and inversely proportional to the number of partitions. • Strategy 2: In the general case where subsetting over each dimension has a similar probability, the partition should have equal preference over each dimension. • Strategy 3: If queries only involve a subset of the dimensions, the partition should also be based on those dimensions.
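A small worked example of the α rate, assuming a 1-D layout with a dimension subset covering 25% of the elements: only partitions that overlap the subset contribute elements to indexing, so α falls from 1.0 toward the subset percentage as the number of partitions grows.

```python
# Worked example of the alpha (participation) rate under different partition counts
# (sizes and ranges are made up for illustration).
import numpy as np

n = 1000                          # total elements along one dimension
subset = range(0, 250)            # dimension subset: first 25% of the elements

def alpha_rate(num_partitions):
    """Fraction of all elements whose partitions overlap the dimension subset."""
    edges = np.linspace(0, n, num_partitions + 1, dtype=int)
    touched = sum(hi - lo for lo, hi in zip(edges[:-1], edges[1:])
                  if lo < subset.stop and hi > subset.start)
    return touched / n

for p in (1, 4, 16, 64):
    print(p, alpha_rate(p))       # alpha shrinks from 1.0 toward 0.25 as partitions get finer
```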

  11. Optimization 2: Index Retrieval • Avoid post-filtering: transform the dimension subset into bitvectors and resolve the query with fast bitwise operations (cf. Slide 7)

  12. Parallel Index Architecture • L1: data file • L2: variable • L3: data block (see the sketch below)
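A minimal sketch of how such a three-level layout might be organized, with hypothetical file, variable, and block names; each L3 entry stands for a per-block bitmap index that one process can build and query independently.

```python
# Assumed three-level index layout: L1 data file -> L2 variable -> L3 data block.
index = {
    "pop_ocean.nc": {                       # L1: data file
        "TEMP": {                           # L2: variable
            (0, 0): "TEMP_block_0_0.idx",   # L3: data block -> per-block bitmap index
            (0, 1): "TEMP_block_0_1.idx",
        },
        "SALT": {
            (0, 0): "SALT_block_0_0.idx",
        },
    },
}

# A query over TEMP touches only the L3 indices of blocks that overlap its dimension subset.
blocks_for_query = [idx for blk, idx in index["pop_ocean.nc"]["TEMP"].items() if blk[0] == 0]
print(blocks_for_query)
```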

  13. Experiment Setup • Goals: • SQL subsetting vs. Load + Filter in ParaView • Scalability of the parallel indexing method • Indexing and partition strategy vs. FastQuery • Dataset: • Parallel Ocean Program • Data size: 33.6 GB • Data format: NetCDF (array based) • Environment: • IBM Xeon cluster: 8 cores, 2.53 GHz • 12 GB memory

  14. Efficiency Comparison with Filtering in ParaView • Data size: 5.6 GB • Input: 400 queries • Performance depends on the subset percentage • The general index method is better than filtering when the data subset is < 60% • The two-phase optimization achieved a 0.71 – 11.17 speedup compared with the filtering method • Index m1: bitmap indexing, no optimization • Index m2: uses bitwise operations instead of post-filtering • Index m3: uses both bitwise operations and index partitioning • Filter: load all data + filter

  15. Memory Comparison with Filtering in ParaView • Data size: 5.6 GB • Input: 400 queries • Memory cost depends on the subset percentage • The general index method has a much smaller memory cost than the filtering method • The two-phase optimization adds only a small extra memory cost • Index m1: bitmap indexing, no optimization • Index m2: uses bitwise operations instead of post-filtering • Index m3: uses both bitwise operations and index partitioning • Filter: load all data + filter

  16. Scalability with Different Proc# • Data size: 8.4 GB • Proc#: 6, 24, 48, 96 • Input: 100 queries • X axis: subset percentage • Y axis: time • Each process handles one sub-block • Good scalability as the number of processes increases

  17. Alpha Rate with Different Proc# • Data size: 8.4 GB • Proc#: 6, 24, 48, 96 • Input: 100 queries • X axis: subset percentage • Y axis: alpha rate • More processes mean more index partitions • Good participation rate when a smaller percentage of the data is selected

  18. Alpha Rate and I/O Access Times Comparison with FastQuery • FastQuery: • Builds a relational table view over the scientific dataset • Difference: it does not consider multi-dimensional data features • Data size: 8.4 GB, 48 processes • Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall • Input: 100 queries for each query type

  19. Efficiency Comparison with FastQuery • Data size: 8.4 GB • Proc#: 48 • Input: 100 queries for each query type • Achieved a 1.41 to 2.12 speedup compared with FastQuery

  20. Conclusion • Big data issues in data analysis and visualization • Find the exact data subset at the I/O level with a SQL interface and bitmap indexing • A good speedup compared with the filtering method • Data partition strategy and parallel indexing • A good speedup compared with FastQuery

  21. Thanks
