LSST Data Management Overview
Jeff Kantor, LSST Corporation
LSST Conceptual Design Review, September 17-20, 2007, Tucson, AZ
Data Management is a distributed system that leverages world-class facilities and cyber-infrastructure
• Mountain Summit/Base Facility (Cerro Pachon and La Serena, Chile): 25 TFLOPS, 150 TB
• Long-Haul Communications (Chile to U.S., and within the U.S.): 2.5 Gbps average, 10 Gbps peak
• Archive Center (NCSA, Champaign, IL): 100 to 250 TFLOPS, 75 PB
• Data Access Centers (two in the U.S., one in Chile): 45 TFLOPS, 87 PB
(1 TFLOPS = 10^12 floating-point operations/second; 1 PB = 2^50 bytes, or ~10^15 bytes)
LSST Data Management provides a unique national resource for research & education
• Astronomy and astrophysics
  • The scale and depth of the LSST database are unprecedented in astronomy
  • Provides calibrated databases for frontier science
  • Breaks new ground with its combination of depth, width, and epochs per field
  • Enables science that cannot be anticipated today
• Cyber-infrastructure and computer science
  • Requires a multi-disciplinary approach to solving challenges
  • Massively parallel image data processing
  • Peta-scale data ingest and data access
  • Efficient scientific and quality analysis of peta-scale data
DM system complexity exists but overall is tractable
• Complexities we have to deal with in DM:
  • Very high data volumes (transfer, ingest, and especially query)
  • Advances in scale of algorithms for photometry, astrometry, PSF estimation, moving object detection, shape measurement of faint galaxies
  • Provenance recording and reprocessing
  • Evolution of algorithms and technology
• Complexities we DON'T have to deal with in DM:
  • Tens of thousands of simultaneous users (e.g. online stores)
  • Fusion of remote sensing data from many sources (e.g. earthquake prediction systems)
  • Millisecond or faster time constraints (e.g. flight control systems)
  • Very deeply nested multi-level transactions (e.g. banking OLTP systems)
  • Severe operating environment-driven hardware limitations (e.g. space-borne instruments)
  • Processing that is highly coupled across the entire data set with a large amount of inter-process communication (e.g. geophysics 3D Kirchhoff migration)
Performance - nightly processing timeline for a visit meets the alert latency requirement
[Timeline figure: for each of the two exposures in a visit, the milestones are exposure begins, shutter close, readout complete, transfer to Base complete, and image processing/detection complete, followed (after exposure 2) by association complete and alert generation complete. Stage durations shown on the figure: 15 s, 2 s, 6 s, 3 s, 20 s, 10 s, 10 s. T0 starts the 60-second latency timer; alert generation completes at T0 + 51 s.]
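To make the latency budget concrete, here is a minimal sketch that totals the stage durations. Only the 60 s requirement and the T0 + 51 s endpoint are stated explicitly on the slide; the mapping of the figure's duration labels to stages below is an assumption.

```python
# A minimal check of the visit alert-latency budget. The stage-to-duration
# mapping is a hypothetical reading of the figure labels, not a confirmed
# LSST specification.

REQUIREMENT_S = 60  # alert latency requirement, seconds after T0

stages_after_t0 = {
    "shutter close":               2,
    "readout":                     6,
    "transfer to Base":            3,
    "image processing/detection": 20,
    "association":                10,
    "alert generation":           10,
}

total = sum(stages_after_t0.values())
print(f"alerts complete at T0 + {total} s")   # 51 s, matching the slide
print(f"margin: {REQUIREMENT_S - total} s")   # 9 s of slack
assert total <= REQUIREMENT_S
```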
Computing needs show moderate growth
[Chart: projected computing capacity over time for the Archive Center, Base, and Data Access Center, with a trend line for the Archive Center.]
Database Volumes
• Detailed spreadsheet-based analysis done
• Expecting:
  • 6 petabytes of data; 14 petabytes of data + indexes
  • All tables combined: ~16 trillion rows (16 × 10^12)
  • Largest table: 3 trillion rows (3 × 10^12)
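As a quick sanity check, the stated totals imply the average row sizes below; these per-row figures are derived here, not quoted on the slide.

```python
# Back-of-the-envelope check of the stated database volumes.

PB = 10**15                # the slide's convention: 1 PB ~ 10^15 bytes

data_bytes  = 6 * PB       # data only
total_bytes = 14 * PB      # data + indexes
rows        = 16e12        # ~16 trillion rows across all tables

print(f"avg row (data only):    {data_bytes / rows:5.0f} bytes")   # ~375 B
print(f"avg row (with indexes): {total_bytes / rows:5.0f} bytes")  # ~875 B
print(f"index overhead: {(total_bytes - data_bytes) / data_bytes:.0%}")  # ~133%
```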
Large RDBMS Systems - Data Volumes
[Table: data volumes of large production RDBMS systems; all numbers based on publicly available data.]
Large RDBMS Systems - Number of Rows
[Table: row counts of large production RDBMS systems; all numbers based on publicly available data.]
Long-haul communications are feasible
• Over 2 terabytes/second of dark fiber capacity available
• The only new fiber required is Cerro Pachon to La Serena (~100 km)
• 2.4 gigabits/second needed from La Serena to Champaign, IL
• Quotes from carriers include 10 gigabits/second burst for failure recovery
• Specified availability is 98%
• Clear channel, protected circuits
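To put those rates in perspective, here is a rough sketch; the 10-hour observing night and the outage catch-up model are assumptions for illustration, not figures from the slide.

```python
# A rough feasibility check of the quoted link rates.

avg_gbps  = 2.4            # required La Serena -> Champaign average rate
peak_gbps = 10.0           # carrier-quoted burst rate for failure recovery

night_s = 10 * 3600        # assume ~10 hours of observing per night
tb_per_night = avg_gbps * night_s / 8 / 1e3    # Gbit -> GByte -> TByte
print(f"sustained transfer per night: ~{tb_per_night:.1f} TB")   # ~10.8 TB

# After an outage, the backlog drains at (peak - avg) while new data arrive.
backlog_h = 24
catchup_h = backlog_h * avg_gbps / (peak_gbps - avg_gbps)
print(f"a 24 h backlog clears in ~{catchup_h:.1f} h at burst rate")  # ~7.6 h
```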
The DM reference design uses layers for scalability, reliability, evolution

Application Layer (scientific layer: Data Products, Pipelines, Application Framework)
• Pipelines constructed from reusable, standard "parts", i.e. the Application Framework
• Data Product representations standardized
• Metadata extendable without schema change
• Object-oriented; Python and C++ custom software

Middleware Layer (Data Access, Distributed Processing, User Interface; System Administration, Operations, Security)
• Portability to clusters, grid, and other platforms
• Provides standard services so applications behave consistently (e.g. recording provenance)
• Kept "thin" for performance and scalability
• Open source and off-the-shelf software, custom integration

Infrastructure Layer (distributed platform: Computing, Communications, Storage, Physical Plant)
• Different parts specialized for real-time alerting vs. peta-scale data access
• Off-the-shelf commercial hardware and software, custom integration
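As an illustration of the layer boundaries, here is a minimal sketch; MiddlewareAPI and Stage are hypothetical names, not the actual LSST classes. Science code lives in the application layer and records provenance through a thin middleware service rather than touching storage or infrastructure directly.

```python
# A minimal sketch of the layering idea, under the assumptions above.

class MiddlewareAPI:
    """Thin middleware facade: standard services, no science logic."""
    def record_provenance(self, stage: str, inputs, outputs) -> None:
        print(f"[provenance] {stage}: {inputs} -> {outputs}")

class Stage:
    """Application Framework 'part': reusable wrapper around science code."""
    def __init__(self, name, func, middleware):
        self.name, self.func, self.mw = name, func, middleware

    def __call__(self, data):
        result = self.func(data)                      # run the science step
        self.mw.record_provenance(self.name, data, result)
        return result

mw = MiddlewareAPI()
pipeline = [Stage("calibrate", lambda d: d + ["calibrated"], mw),
            Stage("detect",    lambda d: d + ["sources"], mw)]

data = ["raw exposure"]
for stage in pipeline:   # the pipeline is just composed standard parts
    data = stage(data)
```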
Application Layer - open, accessible data products with fully documented quality
Application Layer - pipelines process raw data to products
[Diagram: raw data from Data Acquisition flows through the Nightly Processing Pipelines, Data Release Pipelines, and Calibration Products Pipeline (Image Processing, Detection, Association, Moving Object, Alert, Deep Detection, Image Coaddition, Photometric Calibration, Astrometric Calibration, Classification, and Data QA pipelines), all built on the Application Framework and the Middleware API. User tools (query, data quality analysis, monitoring) operate on the resulting data products: Engineering/Facility Data Archive, Calibration Data Products, Image Archive, Source Catalog, Object Catalog, Orbit Catalog, Alert Archive.]
Distributed Processing Framework - middleware for massively parallel pipelines
Policy-based automated management of:
• Parallel process startup and control
• Data staging and data flow
• Event logging
• Security
• Checkpointing
• Pipeline provenance recording
[Diagram: the Image Processing, Detection, and Association pipelines are decomposed into processing stages, each replicated across parallel processing slices, all running on the Middleware API's Distributed Processing Framework.]
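To illustrate the stages-and-slices execution model, here is a minimal sketch; the stage functions and the one-entry policy are stand-ins, and the real framework drives this from configuration while also handling the control, logging, and checkpointing listed above.

```python
# Each stage runs the same code over many data "slices" (e.g. image tiles)
# in parallel, with a barrier between stages.

from multiprocessing import Pool

def image_processing(tile):
    return f"calibrated({tile})"

def detection(tile):
    return f"sources({tile})"

STAGES = [image_processing, detection]   # one pipeline: ordered stages
POLICY = {"slices": 4}                   # parallelism set by policy, not code

def run_pipeline(tiles):
    with Pool(POLICY["slices"]) as pool:     # one worker per slice
        for stage in STAGES:                 # stages run in order
            tiles = pool.map(stage, tiles)   # each stage fans out over slices
    return tiles

if __name__ == "__main__":
    print(run_pipeline([f"tile{i}" for i in range(4)]))
```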
Data Access Framework - middleware for peta-scale data storage
Policy-based automated management of:
• Location transparency
• Parallel data ingest (persistence) with indexing
• Storage schema partitioning and clustering
• Data quality metrics collection
• Data product provenance recording
[Diagram: the Image Processing, Detection, and Association pipelines persist data to the Image Archive, Source Catalog, and Object Catalog through the Middleware API's Data Access Framework.]
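As one illustration of policy-driven partitioning and clustering during ingest, here is a minimal sketch; the RA-stripe scheme and all names are hypothetical, chosen only to show the routing idea, not the actual LSST storage schema.

```python
# Sources are routed to partitions by sky position so that spatially close
# rows cluster together and partitions can be loaded/indexed in parallel.

from collections import defaultdict

POLICY = {"ra_stripes": 18}   # partition the sky into 18 RA stripes of 20 deg

def partition_key(ra_deg: float) -> int:
    """Map right ascension to a partition (stripe) number."""
    width = 360 / POLICY["ra_stripes"]
    return int(ra_deg // width)

def ingest(sources):
    """Group rows by partition before parallel load and indexing."""
    partitions = defaultdict(list)
    for src in sources:
        partitions[partition_key(src["ra"])].append(src)
    return partitions

batch = [{"id": 1, "ra": 10.2, "dec": -30.1},
         {"id": 2, "ra": 10.4, "dec": -30.0},
         {"id": 3, "ra": 200.9, "dec": 12.3}]
print({k: [s["id"] for s in v] for k, v in ingest(batch).items()})
```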
User Interface Services - middleware for open data access
Policy-based, open interface services that automate:
• Mapping queries to the "best" server and parallelizing them
• Services and protocols compliant with Virtual Observatory and other astronomy data standards
[Diagram: end-user tools (query, data quality analysis, monitoring) access the Image Archive, Source Catalog, and Object Catalog through the Middleware API's User Interface Services.]
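To sketch the "best server" dispatch step, here is a minimal example; the replica list, the least-loaded scoring, and the per-partition fan-out are assumptions for illustration, not the actual routing policy.

```python
# Replicas are scored (here, by load) and the chosen replica fans the query
# out over its partitions for parallel execution.

REPLICAS = [                         # hypothetical catalog replicas
    {"host": "dac-us-1", "load": 0.7, "partitions": 8},
    {"host": "dac-us-2", "load": 0.2, "partitions": 8},
    {"host": "dac-cl-1", "load": 0.4, "partitions": 8},
]

def route(query: str):
    best = min(REPLICAS, key=lambda r: r["load"])   # "best" = least loaded
    return [(best["host"], part, query)             # parallelize by partition
            for part in range(best["partitions"])]

for host, part, q in route("SELECT * FROM Object WHERE ..."):
    print(f"{host}: partition {part}: {q}")
```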
Validating the design - Data Challenges

Validating the design - Data Challenge work products to date

Data Challenge 1 was very successful
More information
• Contact: Jeff Kantor (DM coordinator), jkantor@lsst.org