
Data Rods:

Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods. David Gallaher (1) , Qin Lv (2) , Glenn Grant (1) , Garrett Campbell (1). National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA



Presentation Transcript


  1. Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods. David Gallaher (1), Qin Lv (2), Glenn Grant (1), Garrett Campbell (1). (1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA. (2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA

  2. The National Snow and Ice Data Center Mission: To monitor the climate data in Earth’s icy regions, and to analyze and distribute it worldwide 24x7. Focus is mainly NASA satellite data. • Manages and distributes scientific data • Supports data users • Performs scientific research • Educates the public about the cryosphere • Creates tools for data access Affiliations and Sponsorship: • University of Colorado at Boulder • Cooperative Institute for Research in Environmental Sciences • World Data Center for Glaciology (since 1976)

  3. Data Rods - Project Basis The “Data Rods” project proposes to prototype a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.

  4. Objective: Remote Sensing Data Analysis The Problem: • Data sets are becoming too large to move over the internet • Need for basic Boolean logic for time-series anomaly detection • Data downloads for long time-series analysis are especially cumbersome

  5. Analysis Challenges • A wide variety of data formats • Ever-increasing data set sizes • Myriad analysis and visualization requirements • There will be uses and analyses of the data that cannot be anticipated (data discovery is not enough) • Lack of direct access to the data values (e.g., querying for albedo > 15%) • Our current directory trees impede data access (we really need to consider a database)

  6. “Big Data” Considerations: • The era of search, order and transmission of data is ending • We must develop systems where the data stay fixed and analyses are rendered against them • Rapid, scalable data access across time and space • Direct query of the data, not just the metadata (we need more than what, where, when) • Web-based spatio-temporal analysis and visualization

  7. Database Choice • Fast and efficient storage, query and retrieval of entire data sets – not just the metadata • Ability to store colossal amounts of small files • Relational databases can't handle it. The tables grow too big. (Object-relational is no better) • Hadoop excels at unstructured data but, due to its batch-oriented nature, it is inefficient for real-time analytics as well as intra-data analysis • A “pure-object” database is seen as the best choice

  8. The Data Rods Project • The “Data Rods” project has created a high speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets. • We’ll cover the following: • Database design • Status on development • Basic analysis examples and performance • Planned analysis and potential applications

  9. Database design Gridded data is key. For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used. Common resolutions between data sets (1km, 5km, etc) and point data

  10. The nesting relationship of differing resolutions in EASE-Grid

  11. Data Rods Concept [Figure: axes labeled X coordinate, Y coordinate, and Time]

  12. Database Systems Development [Diagram. Labels: Data Input (Passive Microwave, Visual, Infrared, Active Microwave, Radar, Other); Object Database Design; EASE-Grid Processing; Pixel Grid Sampling; Object Database Loading; Data Rod Updating; Data Rod Objects (axes: X coordinate, Y coordinate, Time); Object Interface; Basic Data Management (query & index); User Interface; User Input; Cryospheric Change Analysis; Pattern Search (input pattern or trend); Automated Pattern Discovery • Anomaly Detection • Trend Detection • Cycle Detection]

  13. Pure-Object Database • Object persistence/instantiation is directly to/from the database – no Java Spring or Hibernate needed • Not object-relational (examples include Versant, ObjectDB, db4o, Objectivity) • Not as limited by size • Fast interactions across databases • Simple, efficient schema • Next: schema design

  14. Object Database Schema • Each image pixel is an object • Data rods are time-series collections of pixels • Each data rod can be analyzed independently • Adjacency analysis by row/col or lat/lon
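The schema on slide 14 can be pictured with a small in-memory sketch. This is only an illustration in plain Python, not the project's object database: the class names `Pixel`, `DataRod`, and `RodStore`, and all of their fields and methods, are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Pixel:
    """One observation of one grid cell on one day (hypothetical fields)."""
    day: int          # day index since the start of the record
    albedo: float
    cloudy: bool


@dataclass
class DataRod:
    """Time-ordered collection of pixels for a single grid cell."""
    row: int
    col: int
    pixels: List[Pixel] = field(default_factory=list)

    def add(self, pixel: Pixel) -> None:
        # Keep the rod sorted by time so range queries stay simple.
        self.pixels.append(pixel)
        self.pixels.sort(key=lambda p: p.day)

    def between(self, start_day: int, end_day: int) -> List[Pixel]:
        """Pixels with start_day <= day <= end_day."""
        return [p for p in self.pixels if start_day <= p.day <= end_day]


class RodStore:
    """In-memory stand-in for the object database, keyed by (row, col)."""

    def __init__(self) -> None:
        self._rods: Dict[Tuple[int, int], DataRod] = {}

    def get_datarod(self, row: int, col: int) -> DataRod:
        # Create the rod on first access, then return the same object.
        return self._rods.setdefault((row, col), DataRod(row, col))

    def neighbors(self, row: int, col: int) -> List[DataRod]:
        """Existing rods in the 8 adjacent cells (adjacency by row/col)."""
        found = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                key = (row + dr, col + dc)
                if (dr, dc) != (0, 0) and key in self._rods:
                    found.append(self._rods[key])
        return found
```

Because each rod is its own object, a query touches only the cells it needs, which is the property the slide describes as each data rod being analyzable independently.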

  15. Database Creation • Gridded data sets • Standardized grid dimensions • Visualize as layers of imagery through time (days to decades) • Lends itself well to time-series analysis [Figure: axes labeled Longitude, Latitude, and Time]

  16. Status – Database Administration • 5 AVHRR databases, each with 5 years of imagery (<100 GB each, administratively easier) • Surface mask databases for the northern hemisphere at 5 km and 25 km • SSM/I database, 25 years of daily 25 km data at all frequencies and polarizations • Selected MODIS database at 250 m resolution • ~600 GB total • No upper limit to database size except disk space

  17. AVHRR Database Creation • Initial demonstration region is Greenland • 25 years of daily multi-spectral AVHRR data at 5 km resolution • 9000+ images • 2 billion+ pixel objects total • Each pixel object is independently accessible for query

  18. Database Flexibility • Data can be spread across many databases • Transparent queries across databases • Methods (routines) can be attached to the data rods to add functionality such as statistical analysis • Data fusion: analyses may span multiple data types, resolutions, time spans • Data Rods supports NetCDF output

  19. Simple AVHRR Object Database Time Test • Built a database using AVHRR 5 km data from 1995-1999 • 2 visible channels, 3 IR channels, 3 references plus albedo, skin temperature and cloud mask • Database includes location class, time stamp class and metadata • 213,000 data rods covering 5 years over Greenland • 1 data rod contains 1825 pixels • Pixels = 388,725,000 each with 11 variables/pixel • Variables = 4.2 billion coded short integer values
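The sizes quoted on this slide follow from simple multiplication. The short check below assumes 365 observations per year (one per day, leap days ignored), which is what the slide's figure of 1825 pixels per rod implies; the constant names are made up for this example. The product for stored values works out to roughly 4.28 billion, close to the slide's rounded figure of 4.2 billion.

```python
RODS = 213_000             # data rods over Greenland (from the slide)
DAYS_PER_YEAR = 365        # assumption: one pixel per day, no leap days
YEARS = 5                  # 1995-1999
VARIABLES_PER_PIXEL = 11   # from the slide

pixels_per_rod = DAYS_PER_YEAR * YEARS
total_pixels = RODS * pixels_per_rod
total_values = total_pixels * VARIABLES_PER_PIXEL

print(f"pixels per rod: {pixels_per_rod:,}")
print(f"total pixels:   {total_pixels:,}")
print(f"total values:   {total_values:,}")
```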

  20. Example Analysis Using Object Databases • All queries run on a single processor, single thread • Example #1: Queries and plots on single database • Example #2: Queries and plots on multiple databases • Example #3: Advanced Spatiotemporal Analysis • 1 data rod contains 1825 pixels • Pixels = 388,725,000 each with 11 variables/pixel • Variables = 4.2 billion coded short integer values • We will move to multi-thread, multiprocessor once we have the design finalized (this is a research project)

  21. Using Single AVHRR Object Database Time Test • Single processor under load • 5-year plots returned in 2-10 seconds. • Cached data plots returned in ½ second. • Images in 10 seconds

  22. Multi Data Rod Selection • Seven locations selected across 5 years simultaneously • Selected brightness temperature and albedo output • Again, cached queries are much faster

  23. AVHRR albedo statistics, May average, 1981 – 2005. Example analysis of Greenland across 5 databases, using 5 five-year rods and statistics (1 min, or 5 secs cached): • Camp Century: Mean: 0.801, Std. dev.: 0.077 • Summit Station: Mean: 0.819, Std. dev.: 0.069 • Swiss Camp: Mean: 0.817, Std. dev.: 0.070 • GISP Ice Core Camp: Mean: 0.802, Std. dev.: 0.071 Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.

  24. Temporal Analysis of Single Rods • Descriptive Statistical functions • Spatiotemporal data selection • Filtering by value • Anomaly detection • Also: • Image generation • Inter-database data fusion
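The single-rod operations listed on slide 24 (descriptive statistics, filtering by value, anomaly detection) can be sketched on a plain list of albedo values. The function names and the z-score rule below are assumptions made for this illustration; the slides do not say which anomaly test the project uses.

```python
from statistics import mean, pstdev


def describe(values):
    """Return (mean, population standard deviation) of a rod's values."""
    return mean(values), pstdev(values)


def filter_by_value(values, low, high):
    """Return (time_index, value) pairs with low <= value <= high."""
    return [(i, v) for i, v in enumerate(values) if low <= v <= high]


def anomalies(values, z_threshold=2.0):
    """Flag values more than z_threshold standard deviations from the mean.

    A simple z-score test, used here only as a placeholder rule.
    """
    mu, sigma = describe(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > z_threshold]


if __name__ == "__main__":
    albedo = [0.80] * 9 + [0.40]      # made-up series with one low outlier
    print(describe(albedo))
    print(anomalies(albedo))
```

Since every rod is independent, functions like these could be applied rod by rod, or attached to the rod objects as methods in the way slide 18 describes.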

  25. Broad Spatiotemporal Analysis (This took some time) • Statistical analysis repeated at every grid cell. • Intersection of surface mask database and AVHRR database: only pixels on the ice sheet were processed. • Bad data filtered out. • Multivariate: cloud mask used to exclude cloudy pixels from albedo averages. • All 2 billion objects queried and analyzed
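The steps on slide 25 can be sketched as one loop over grid cells: skip cells outside the ice-sheet mask, drop bad values, exclude cloudy pixels, then average what remains. The data layout (a dict of `(albedo, cloudy)` tuples per cell), the function name, and the valid range of 0 to 1 are all assumptions for this example rather than details from the project.

```python
def mean_clear_sky_albedo(rods, ice_mask, valid_range=(0.0, 1.0)):
    """Average cloud-free albedo for every rod that lies on the ice sheet.

    rods:      dict mapping (row, col) -> list of (albedo, cloudy) tuples
    ice_mask:  set of (row, col) cells flagged as ice sheet
    Returns a dict mapping (row, col) -> mean albedo.
    """
    low, high = valid_range
    result = {}
    for cell, series in rods.items():
        if cell not in ice_mask:          # intersection with the surface mask
            continue
        clear = [a for a, cloudy in series
                 if not cloudy            # cloud mask excludes cloudy pixels
                 and a is not None
                 and low <= a <= high]    # bad data filtered out
        if clear:
            result[cell] = sum(clear) / len(clear)
    return result


if __name__ == "__main__":
    rods = {
        (0, 0): [(0.8, False), (0.9, True), (-9.99, False), (0.6, False)],
        (0, 1): [(0.5, False)],           # not on the ice sheet
    }
    print(mean_clear_sky_albedo(rods, ice_mask={(0, 0)}))
```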

  26. Analysis Example: Sea Ice Temporal Query • We would like to remove clouds from the image (clouds move faster than ice, so find the minimum albedo for open water) • Moving 8-day window through datarod • Minimum albedo in temporal window • Pseudocode example query: datarod = database.getDatarod(row, col); albedo = datarod.getMinAlbedo(t, t+7) [Figure: datarod time series of pixels, with an 8-day window marked t1 to t8]
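A runnable sketch of the moving-window query from slide 26 is shown here for a single rod held as a list of daily albedo values. The function names are hypothetical (they loosely mirror the slide's `getMinAlbedo(t, t+7)` pseudocode), and `None` is used as an assumed marker for missing days.

```python
def min_albedo(series, t, window=8):
    """Minimum albedo over days t .. t+window-1, ignoring missing values."""
    values = [a for a in series[t:t + window] if a is not None]
    return min(values) if values else None


def cloud_filtered_series(series, window=8):
    """Slide the window along the rod, one day at a time.

    Clouds are bright and move quickly, so the darkest value in each
    window is taken as the best estimate of the surface beneath them.
    """
    return [min_albedo(series, t, window)
            for t in range(len(series) - window + 1)]


if __name__ == "__main__":
    # Made-up rod: mostly cloudy (bright) days with two clear, dark days.
    rod = [0.9, 0.1, 0.8, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9, 0.2]
    print(cloud_filtered_series(rod))
```

Running the same function over every rod in a scene gives one value per grid cell, which is how a composite image like the one on the following slide could be assembled.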

  27. Analysis result: Sea Ice Detection • Technique for removing clouds from the image • Composite image created from Data Rods’ time series • Lowest AVHRR albedo over an 8-day period • Remaining objective: exclude lingering clouds [Figure: one of the original images]

  28. Analysis Potential: Rapid Data Fusion • Loss of AMSR-E decreases sea ice detection capability • Data Rods AVHRR/SSM/I product fusion may fill the gap • Can be validated against AMSR-E sea ice record • AVHRR 8-day (high resolution sea ice detection – still some clouds) + SSM/I (cloud free with good sea ice detection but low resolution) = Fused product (high-res sea ice extent, no clouds)

  29. Performing this lake detection analysis conventionally took 6 months (downloading, gridding and image analysis). With Data Rods, the analysis was done in 2 days (single thread, single processor)

  30. What’s Next: Ongoing Efforts • Newest version of ODB software has multi-threaded capability, which takes advantage of multiprocessor machines to reduce query times • Investigating Data Rod performance on the Janus supercomputer with Pan-Arctic extent • User interface to the Data Rod database

  31. Creating 1000s of Databases for Use with Massive Parallel Machines • Each database is small enough to be held in memory for each CPU (uses MPI calls) • Each database covers 5° x 5° x 25 years of Data Rods • Each database is capped (fixed for minimal changes) • Changes are added to the present-year database for each 5° x 5° tile
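One way to picture the 5° x 5° partitioning on slide 31 is a function that maps a latitude/longitude to a tile index, so each tile can live in its own database. This is a sketch under assumptions: the slides do not describe the actual tiling or naming scheme, and `tile_index`, `TILE_DEG`, and the row/column convention are invented for this example.

```python
import math

TILE_DEG = 5                      # tile size in degrees (from the slide)
N_ROWS = 180 // TILE_DEG          # 36 latitude bands
N_COLS = 360 // TILE_DEG          # 72 longitude bands


def tile_index(lat, lon):
    """Return (row, col) of the 5-degree tile containing a point.

    Rows count north from 90S; columns count east from 180W.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    lon = (lon + 180.0) % 360.0 - 180.0          # wrap into [-180, 180)
    row = min(math.floor((lat + 90.0) / TILE_DEG), N_ROWS - 1)
    col = math.floor((lon + 180.0) / TILE_DEG)
    return row, col


if __name__ == "__main__":
    print(tile_index(72.58, -38.46))   # a point in central Greenland
    print(N_ROWS * N_COLS)             # number of tiles for global coverage
```

Full global coverage at this tile size would need 36 x 72 = 2592 tiles, which is consistent with the slide's reference to thousands of databases.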

  32. Creating 1000s of Databases for Use with Massive Parallel Machines • With this database it should be possible to perform analysis at Internet speeds • Multi-sensor analysis is relatively simple • We are starting the database loading now • 100 TB database testing will occur over the summer

  33. Summary • We can now perform high-speed time-series analysis on the server side without downloads • Scalable, massive remote sensing databases • Accelerated analysis compared to traditional “search, order and transmission” methods • Interactions across data sets – data fusion • Developing UI and additional analysis tools • Allow users interactive access to the data

  34. NSIDC Data Rods Project. Thank You. The Data Rods project is funded by the National Science Foundation through grant ARC 0941442. Interested in testing Data Rods? Please contact us at: david.gallaher@nsidc.org
