
Terascale Data Organization for Discovering Multivariate Climatic Trends



  1. Terascale Data Organization for Discovering Multivariate Climatic Trends Wesley Kendall, Markus Glatter, and Jian Huang The University of Tennessee, Knoxville Tom Peterka, Robert Latham, and Robert Ross Argonne National Laboratory

  2. Drought Analysis • Over the past ten years, drought damage has averaged about $2 billion per year, with $10 billion in damage occurring in 2002 alone [ncdc.noaa.gov] Dried-up Xiliu Lake. Courtesy of theepochtimes.com

  3. Drought Analysis • Many parameters and uncertainty • Low vegetation and low rainfall. What is low? • High drought index. What is high? • Extended period of time. How long is extended? • Abnormal for a given region. What is abnormal? • A system that can turn these knobs of uncertainty is highly useful for advancing scientific discovery

  4. Mountain Pine Beetle Infestation • Mountain pine beetles have destroyed millions of trees on the west coast, causing significant ecological and economic damage • Early warning system • Need supercomputing • Need on-the-fly analysis [Hargrove et al., PERS 2009] 2009 mountain pine beetle damage (red) to forests in Colorado. Courtesy of Bill Hargrove

  5. A System for Full-Range Analysis of Scientific Data On-the-Fly • To thoroughly examine a problem, we must examine it at full scale, not in bits and pieces • Full-range analysis is crucial to climate science • High spatial resolution is needed for global ecosystem dynamics • High temporal resolution is needed for inter-annual climate variability • On-the-fly analysis is also highly important • It is an integral aspect of early warning systems

  6. Our Driving Application • NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) project • Data is large and complex • High spatial resolution, up to 250 meters • High temporal resolution, up to 1 day • Many products, many variables • Query for qualitative events like drought in 1.1 TB of MODIS data • 8-day intervals from Feb. 2000 – Feb. 2009 • 31,200 x 21,600 grid • 2 variables: vegetation and water index

  7. Our Focus • I/O performance • A good I/O implementation can yield orders-of-magnitude improvement • ADIOS – 1,400 vs. 1.4 seconds to write a 7 GB file [Lofstead et al., IPDPS 2009] • S3D tweak – 1,443 vs. 6 seconds to write a 6.5 MB file [Ross et al., SC Tutorial 2009] • No extended time to prepare data • Handle application-native formats • Scalability and load balancing

  8. Jaguar Cray XT4 at ORNL • 7,832 quad-core 2.1 GHz AMD Opteron processors • 8 GB of memory per processor • Lustre file system, 144 Object Storage Targets (OST) Jaguar Cray XT4. Courtesy of ornl.gov

  9. How? Three Main Components: I/O, Querying, and Analysis

  10. I/O Component Time-Varying Output Data

  11. I/O Component Parallel File System Striping

  12. I/O Component I/O across Time-Varying Files

  13. Query Component Data Distribution for Load Balanced Queries

  14. Query Component Parallel Queries

  15. Analysis Component Parallel Sort

  16. Analysis Component Analysis / Write Results

  17. Repeat Repeat Query / Analysis Process

  18. I/O Component

  19. I/O Component • Ability to work with common formats • Use the Parallel netCDF library for netCDF-3 files • Maximize use of collective I/O in MPI-IO • Yields larger, more contiguous reads • Work with time-varying data across multiple files • Requires a file assignment process • A greedy assignment process is flexible and robust

  20. Greedy File Assignment Four Processes, Three Files

  21. Greedy File Assignment Each Process has Quota to Fill

  22. Greedy File Assignment First Stage

  23. Greedy File Assignment Second Stage
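The two-stage walkthrough above can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: it assumes each process's quota is the total bytes divided by the process count, and that files are carved into contiguous pieces in order until each quota fills.

```python
def greedy_assign(file_sizes, nprocs):
    """Return, per process, a list of (file_index, offset, length) pieces.

    Each process receives a quota of total_bytes / nprocs; files are
    split greedily into contiguous pieces that fill one quota at a time.
    """
    total = sum(file_sizes)
    quota = total / nprocs
    assignments = [[] for _ in range(nprocs)]
    proc, filled = 0, 0.0
    for f, size in enumerate(file_sizes):
        offset, remaining = 0, size
        while remaining > 0:
            take = min(remaining, quota - filled)
            assignments[proc].append((f, offset, take))
            offset += take
            remaining -= take
            filled += take
            if filled >= quota and proc < nprocs - 1:
                proc, filled = proc + 1, 0.0
    return assignments

# Example matching slides 20-23: three files, four processes
pieces = greedy_assign([100, 100, 100], 4)
```

Each process ends up with an equal share of bytes, even though the number of files does not divide evenly by the number of processes.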

  24. I/O Component • No netCDF benchmarks exist for Jaguar • IOR benchmark results for raw data on Jaguar [Yu et al., IPDPS 2008] • 42 GB/s bandwidth for one file on 1K processes • 36 GB/s for one file per process on 1K processes

  25. I/O Bandwidth Results • Achieved 28 GB/s, 75% of IOR benchmark

  26. Query Component

  27. Query Component • Compound Boolean range queries • Vegetation < 0.2 and water < 0.3 • Conceptual queries [Glatter et al., VIS 2008] • Regular expressions • Beginning of Spring, [-.4-.4]*T[.4-max]?* Conceptual query for beginning of Spring
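A compound Boolean range query like the one above can be illustrated with a minimal sketch (not the authors' implementation): each predicate is a half-open range on one field, and a record matches only when every predicate holds.

```python
def range_query(records, predicates):
    """predicates: list of (field_index, low, high) ranges ANDed together."""
    out = []
    for rec in records:
        if all(low <= rec[i] < high for i, low, high in predicates):
            out.append(rec)
    return out

# Records are (vegetation, water) pairs; query: vegetation < 0.2 AND water < 0.3
records = [(0.1, 0.2), (0.5, 0.1), (0.15, 0.25)]
hits = range_query(records, [(0, 0.0, 0.2), (1, 0.0, 0.3)])
```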

  28. Query Component • Common query-driven visualization methods • Bitmap-indexed [Stockinger et al., VIS 2005] • Optimal search, but lengthy serial index building (≈1 minute for 1.25 GB without I/O) • Large storage overhead (≈90% of dataset size) • Tree-based [Glatter et al., VIS 2006] • Search time depends on the query, but highly scalable • Costly load balancing (≈8 hours for 105 GB with I/O) • The search phase isn’t the bottleneck; scalability is what we need • We use the tree-based method and reduce the load balancing time
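For contrast with the tree-based approach chosen above, here is a toy illustration of the bitmap-indexing idea: one bit vector per value bin, so a range query ORs the bins it covers and then ANDs across variables. The bin edges and values are made up for illustration, not taken from Stockinger et al.

```python
def build_bitmaps(values, edges):
    """One Python int used as a bit vector per bin; bit i set if row i falls in the bin."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= v < edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps

veg = [0.1, 0.45, 0.8, 0.05]
bm = build_bitmaps(veg, [0.0, 0.25, 0.5, 1.0])
# Query "vegetation < 0.5": OR the two bins that cover [0, 0.5)
mask = bm[0] | bm[1]
rows = [i for i in range(len(veg)) if mask >> i & 1]
```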

  29. Query Component • Want to establish the trade-off between the time to load balance and the time to query • We test five load balancing schemes • Hilbert-order sort, round-robin distribution • Z-order sort, round-robin distribution • Round-robin distribution • Random distribution • No distribution
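Two of the five schemes above can be sketched directly; this is a hedged illustration, not the authors' implementation, and the seed is only there to make the example deterministic.

```python
import random

def round_robin(items, nprocs):
    """Deal items to processes cyclically, preserving input order."""
    return [items[p::nprocs] for p in range(nprocs)]

def random_distribution(items, nprocs, seed=0):
    """Shuffle first, then deal; destroys any spatial ordering."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[p::nprocs] for p in range(nprocs)]

parts = round_robin(list(range(10)), 3)
rparts = random_distribution(list(range(10)), 3)
```

The Hilbert-order and Z-order variants would sort the items by a space-filling-curve key before the round-robin deal, which costs extra load-balancing time up front.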

  30. Time Comparison • Random distribution, although simple, achieves the best trade-off between load balancing time and query time

  31. Analysis Component

  32. Analysis Component • Sort items for data coherence, then perform the appropriate analysis • Use a parallel sample sort algorithm • Shown to work best on large data [Blelloch et al., TCS 1998]
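The sample sort idea can be shown in a serial sketch: pick splitters from a sample of the data, partition every item into the bucket its value falls in, and sort each bucket. In the parallel algorithm each bucket maps to one process; the oversampling factor and names here are illustrative, not from Blelloch et al.

```python
import bisect

def sample_sort(items, nbuckets, oversample=8):
    # Take a strided sample and sort it to estimate the value distribution
    sample = sorted(items[::max(1, len(items) // (nbuckets * oversample))])
    # Choose nbuckets-1 splitters evenly from the sample
    splitters = [sample[(i + 1) * len(sample) // nbuckets]
                 for i in range(nbuckets - 1)]
    # Partition: bucket index = number of splitters strictly less than x
    buckets = [[] for _ in range(nbuckets)]
    for x in items:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    # Each bucket covers a disjoint value range, so sorted buckets concatenate
    for b in buckets:
        b.sort()
    return [x for b in buckets for x in b]
```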

  33. Discovering Multivariate Climatic Trends • We used this system to examine two important problems in climate research: drought and time-lag analysis Wes, could I use this to look at global warming? Al Gore

  34. Drought Analysis • Multivariate, complex problem space • Low vegetation index, low water index • High drought index • Prolonged period • Abnormal occurrence • Our test case • Query for vegetation index < 0.5 and water index < 0.3 • Compute the drought index, keep values > 0.5 • Must be consistent for at least a month (4 timesteps), with a maximum of two separate occurrences
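One reading of the temporal criterion above can be sketched for a single grid cell: given per-timestep drought flags (true when the query and drought-index thresholds hold), require at least 4 flagged timesteps spread across at most two separate runs. This interpretation and the function are hypothetical, not the authors' code.

```python
def drought_event(flags, min_total=4, max_runs=2):
    """True if flags contain >= min_total True values in <= max_runs runs."""
    runs, length = [], 0
    for f in flags:
        if f:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return 0 < len(runs) <= max_runs and sum(runs) >= min_total
```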

  35. Drought Analysis

  36. 2006 Mexico Drought

  37. 2001 - 2002 Canadian Prairie Drought

  38. Time-Lag Analysis • Query for • First snow • First occurrence of 0.7 < water index < 0.9 • Vegetation green-up • First occurrence of 0.4 < vegetation index < 0.6 that happens after the first snow • Compute the time between the two events, an indicator of the length of winter and the severity of the snow season
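For a single grid cell, the query above can be sketched as two sequential first-occurrence searches; the function below is an illustrative reconstruction, assuming "after" means strictly later timesteps and returning None when either event is absent.

```python
def time_lag(water, vegetation):
    """Timesteps from first snow to first green-up after it, or None."""
    # First snow: first timestep with 0.7 < water index < 0.9
    snow = next((t for t, w in enumerate(water) if 0.7 < w < 0.9), None)
    if snow is None:
        return None
    # Green-up: first later timestep with 0.4 < vegetation index < 0.6
    green = next((t for t in range(snow + 1, len(vegetation))
                  if 0.4 < vegetation[t] < 0.6), None)
    return None if green is None else green - snow
```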

  39. 2006 Time-Lag Analysis

  40. Rocky Mountains In Canada

  41. Colorado Ski Resorts

  42. Canadian Boreal Forests

  43. Application Timing Results Drought application timing results in seconds (I/O, preprocessing, trend extraction) Time-lag application timing results in seconds (I/O, preprocessing, trend extraction)

  44. Conclusions • The system provides a robust approach for settings where many parameters need to be adjusted with immediate feedback • Greedy assignment of I/O is a practical solution with good performance • Random distribution of data, while simplistic, demonstrates a performance advantage for on-the-fly analysis

  45. Future Work • Query datasets out of core • Handle more complex problems that will require orders of magnitude more queries

  46. Acknowledgements • Funding for this work is primarily through the DOE SciDAC Institute of Ultra-Scale Visualization (http://www.ultravis.org) • Important components of the overall system were developed while supported in part by a DOE Early Career PI grant awarded to Jian Huang (No. DE-FG02-04ER25610) and by NSF grants CNS-0437508 and ACI-0329323 • The MODIS dataset was provided by NASA (http://modis.gsfc.nasa.gov) • This research used resources of the National Center for Computational Science at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC, for DOE under Contract No. DE-AC05-00OR22725 • We would also like to thank Forrest Hoffman and David Erickson from Oak Ridge National Laboratory, and Bill Hargrove from the Eastern Forest Environmental Threat Assessment Center

  47. Questions? • Wesley Kendall – kendall@eecs.utk.edu

  48. References • W. Yu, J. S. Vetter, and S. Oral, “Performance Characterization and Optimization of Parallel I/O on the Cray XT,” in IPDPS `08: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, 2008. • M. Glatter, C. Mollenhour, J. Huang, and J. Gao, “Scalable Data Servers for Large Multivariate Volume Visualization,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 1291–1298, 2006. • J. Lofstead, F. Zheng, S. Klasky, and K. Schwan, “Adaptable, Metadata Rich IO Methods for Portable High Performance IO,” in IPDPS `09: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, 2009. • W. W. Hargrove, J. P. Spruce, G. E. Gasser, and F. M. Hoffman, “Toward a National Early Warning System for Forest Disturbances Using Remotely Sensed Canopy Phenology,” Photogrammetric Engineering and Remote Sensing, vol. 75, no. 10, pp. 1150–1156, 2009. • G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. Smith, and M. Zagha, “An Experimental Analysis of Parallel Sorting Algorithms,” Theory of Computing Systems, vol. 31, no. 2, pp. 135–167, 1998. • M. Glatter, J. Huang, S. Ahern, J. Daniel, and A. Lu, “Visualizing Temporal Patterns in Large Multivariate Data using Textual Pattern Matching,” IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 6, pp. 1467–1474, 2008. • K. Stockinger, J. Shalf, K. Wu, and E. Bethel, “Query-Driven Visualization of Large Data Sets,” in VIS `05: Proceedings of the IEEE Visualization Conference, October 2005, pp. 167–174. • R. Ross, R. Latham, M. Unangst, and B. Welch, “Parallel I/O in Practice,” Tutorial at SC `09: ACM/IEEE Supercomputing Conference, November 2009.
