Working with Big Data in the Geosciences - Finding the Needle in the Haystack • Sangmi Pallickara • Computer Science Department • Colorado State University • sangmi@cs.colostate.edu
Big Data in Geosciences • Volume • Velocity • Variety
Storage must be distributed over a collection of machines • Avoid central coordinators • Cope with failures • Preserve data locality without introducing storage imbalances and the accompanying query hotspots • Support range queries and fast ingest of new data
Galileo Design Considerations • Symmetric storage nodes • No special-function or “controller” nodes • Storage and retrievals may go to any node, and will be forwarded to the targeted node(s) • Incremental scale-up • Failure-resiliency • Accounts for geospatial component in data
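The "any node accepts a request and forwards it" idea can be sketched in a few lines. This is a minimal illustration, not Galileo's actual code: the `Node` class, the `make_cluster` helper, and the hash-modulo placement rule are all assumptions made for the example, since the slides do not say how the target node is chosen.

```python
import hashlib


class Node:
    """One storage node. Every node runs identical logic (no controller)."""

    def __init__(self, name):
        self.name = name
        self.peers = []    # all nodes in the cluster, including this one
        self.blocks = {}   # this node's local block store

    def owner_of(self, key):
        # Hypothetical placement rule: hash the key onto the peer list.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.peers)
        return self.peers[index]

    def store(self, key, data):
        """Store locally if this node owns the key, otherwise forward."""
        target = self.owner_of(key)
        if target is self:
            self.blocks[key] = data
            return self.name
        return target.store(key, data)

    def retrieve(self, key):
        """Read locally if this node owns the key, otherwise forward."""
        target = self.owner_of(key)
        if target is self:
            return self.blocks[key]
        return target.retrieve(key)


def make_cluster(size):
    """Build `size` symmetric nodes that all know about each other."""
    nodes = [Node(f"node-{i}") for i in range(size)]
    for node in nodes:
        node.peers = nodes
    return nodes


cluster = make_cluster(4)
holder = cluster[0].store("block-0001", b"payload")  # client picks any node
print("stored on", holder)
print(cluster[3].retrieve("block-0001"))             # any node can serve the read
```

Because every node computes the same owner for a key, a client can contact whichever node is convenient. A hash-modulo rule like this one reshuffles most keys when the node count changes, so a system that needs incremental scale-up would use a different placement scheme.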
Galileo key features • Support for large numbers (10^9) of small files • High throughput storage and retrieval • Data is multidimensional with multiple types • Time-series data • Support for exact match and range queries (with wildcards) along multiple dimensions • Support for multiple data formats • netCDF, BUFR, HDF 4/5, and data from the Defense Meteorological Satellite Program
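To make "exact match and range queries (with wildcards) along multiple dimensions" concrete, here is a small sketch of the query semantics. The `Range` class, the `WILDCARD` marker, the field names, and the sample records are all hypothetical. The linear scan only illustrates what a query means, not how Galileo indexes or evaluates it.

```python
from dataclasses import dataclass
from typing import Optional

WILDCARD = "*"  # hypothetical marker: matches any value in that dimension


@dataclass(frozen=True)
class Range:
    """Inclusive numeric range; None leaves that side unbounded."""
    low: Optional[float] = None
    high: Optional[float] = None

    def contains(self, value):
        if self.low is not None and value < self.low:
            return False
        if self.high is not None and value > self.high:
            return False
        return True


def matches(record, query):
    """True if the record satisfies every dimension named in the query."""
    for name, wanted in query.items():
        if isinstance(wanted, str) and wanted == WILDCARD:
            continue                      # wildcard: no constraint
        if name not in record:
            return False
        value = record[name]
        if isinstance(wanted, Range):
            if not wanted.contains(value):
                return False              # range query failed
        elif value != wanted:
            return False                  # exact match failed
    return True


def search(records, query):
    return [r for r in records if matches(r, query)]


# Made-up illustrative metadata records (not taken from any real dataset).
records = [
    {"latitude": 40.5, "longitude": -105.0, "temperature": 280.0, "snow_depth": 0.0},
    {"latitude": 39.7, "longitude": -104.9, "temperature": 295.5, "snow_depth": 0.0},
    {"latitude": 64.8, "longitude": -147.7, "temperature": 250.2, "snow_depth": 0.4},
]

query = {
    "latitude": Range(39.0, 41.0),  # range query
    "longitude": WILDCARD,          # wildcard
    "snow_depth": 0.0,              # exact match
}
for hit in search(records, query):
    print(hit)
```

With these sample records the query selects the first two entries: both fall inside the latitude range and report zero snow depth.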
Planned/Ongoing deployments for Galileo • International Centre for Radio Astronomy Research • Australian SKA Pathfinder telescope • ~ 1 PB of time-series data • CSU Atmospheric Sciences & Precision Wind (Boulder) • Short-term wind forecast predictions • CSU Civil & Environmental Engineering department • Sustainable management of watershed systems • Climate.org
Related work • Google File System • BigTable • Distributed hash table (DHT) based systems • Pastry, Chord, Dynamo, and CAN • SciDB • MongoDB
Dataset used in performance evaluations • Sourced from NOAA NAM Project • Dimensions/Features: • Geospatial: Latitude, Longitude • Time Series: Start Time, End Time • Temperature • Relative Humidity • Wind Speed • Snow Depth • Composed of 1 billion files (8 TB)
Storage Throughput • A block is about 8 KB of data • 56,000 blocks per second in a system with 48 nodes
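The figures on this slide and the dataset slide can be combined into a quick back-of-the-envelope check. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 TB = 1000^4 bytes) and, for the ingest-time estimate only, that each file is stored as a single block. Neither assumption is stated in the slides.

```python
# Figures quoted in the slides
BLOCK_BYTES = 8 * 1000            # "about 8 KB" per block (decimal KB assumed)
BLOCKS_PER_SECOND = 56_000        # storage rate for the whole system
NODE_COUNT = 48
DATASET_BYTES = 8 * 1000**4       # 8 TB (decimal TB assumed)
FILE_COUNT = 1_000_000_000        # 1 billion files

# Aggregate and per-node storage throughput
aggregate_bytes_per_second = BLOCK_BYTES * BLOCKS_PER_SECOND
per_node_bytes_per_second = aggregate_bytes_per_second / NODE_COUNT

# Average file size implied by the dataset description
average_file_bytes = DATASET_BYTES / FILE_COUNT

# Assumption: one block per file, so ingest time = files / block rate
ingest_hours = FILE_COUNT / BLOCKS_PER_SECOND / 3600

print(f"aggregate: {aggregate_bytes_per_second / 1e6:.0f} MB/s")   # 448 MB/s
print(f"per node:  {per_node_bytes_per_second / 1e6:.1f} MB/s")    # 9.3 MB/s
print(f"avg file:  {average_file_bytes:.0f} bytes")                # 8000 bytes
print(f"ingest:    {ingest_hours:.2f} hours")                      # 4.96 hours
```

The implied average file size (8 TB / 1 billion files = 8 KB) matches the quoted block size, which is why the one-block-per-file assumption is a reasonable simplification here.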
Thank you! • Galileo • http://galileo.cs.colostate.edu • Sangmi Pallickara • sangmi@cs.colostate.edu • http://www.cs.colostate.edu/~sangmi