Introduction to LSST Data Management
Jeffrey Kantor
Data Management Project Manager
LSST Data Management Principal Responsibilities
• Archive Raw Data: Receive the incoming stream of images that the Camera system generates to archive the raw images.
• Process to Data Products: Detect and alert on transient events within one minute of visit acquisition. Approximately once per year create and archive a Data Release, a static self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for the Data Release.
• Publish: Make all LSST data available through an interface that uses community-accepted standards, and facilitate user data analysis and production of user-defined data products at Data Access Centers (DACs) and external sites.
LSST From the User’s Perspective
• A stream of ~10 million time-domain events per night, detected and transmitted to event distribution networks within 60 seconds of observation.
• A catalog of orbits for ~6 million bodies in the Solar System.
• A catalog of ~37 billion objects (20B galaxies, 17B stars), ~7 trillion observations (“sources”), and ~30 trillion measurements (“forced sources”), produced annually, accessible through online databases.
• Deep co-added images.
• Services and computing resources at the Data Access Centers to enable user-specified custom processing and analysis.
• Software and APIs enabling development of analysis codes.
[Slide annotations: Level 1, Level 2, Level 3]
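The catalog figures above imply some useful averages per object. A minimal sketch of that arithmetic, using only the approximate counts quoted on the slide; the variable names are illustrative and do not come from any LSST schema.

```python
# Approximate catalog sizes quoted on the slide.
GALAXIES = 20e9
STARS = 17e9
SOURCES = 7e12           # individual observations ("sources")
FORCED_SOURCES = 30e12   # measurements ("forced sources")

# Total number of catalogued objects (~37 billion).
objects = GALAXIES + STARS

# Average number of catalog rows per object.
sources_per_object = SOURCES / objects
forced_per_object = FORCED_SOURCES / objects

print(f"objects: {objects:.2e}")
print(f"sources per object: {sources_per_object:.0f}")
print(f"forced sources per object: {forced_per_object:.0f}")
```

With these rounded inputs, each object carries on average roughly 190 sources and roughly 810 forced sources.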
Data Management System Architecture
• Application Layer (LDM-151)
  • Scientific Layer
  • Pipelines constructed from reusable, standard “parts”, i.e. Application Framework
  • Data Products representations standardized
  • Metadata extendable without schema change
  • Object-oriented, python, C++ Custom Software
  • Diagram boxes: 02C.06.01 Science Data Archive (Images, Alerts, Catalogs); 02C.01.02.01, 02C.02.01.04, 02C.03, 02C.04 Alert, SDQA, Calibration, Data Release Productions/Pipelines; 02C.03.05, 02C.04.07 Application Framework; 02C.05 Science User Interface and Analysis Tools; 02C.01.02.02 - 03 SDQA and Science Pipeline Toolkits
• Middleware Layer (LDM-152)
  • Portability to clusters, grid, other
  • Provide standard services so applications behave consistently (e.g. provenance)
  • Preserve performance (<1% overhead)
  • Custom Software on top of Open Source, Off-the-shelf Software
  • Diagram boxes: 02C.06.02 Data Access Services; 02C.07.01, 02C.06.03 Processing Middleware; 02C.07.02 Infrastructure Services (System Administration, Operations, Security)
• Infrastructure Layer (LDM-129)
  • Distributed Platform
  • Different sites specialized for real-time alerting, data release production, peta-scale data access
  • Off-the-shelf, Commercial Hardware & Software, Custom Integration
  • Diagram boxes: 02C.07.04.01 Archive Site; 02C.07.04.02 Base Site; 02C.08.03 Long-Haul Communications; Physical Plant (included in above)
Data Management System Design (LDM-148)
Mapping Data Products into Pipelines
• 02C.01.02.01/02. Data Quality Assessment Pipelines
• 02C.01.02.04. Calibration Products Production Pipelines
• 02C.03.01. Instrumental Signature Removal Pipeline
• 02C.03.01. Single-Frame Processing Pipeline
• 02C.03.04. Image Differencing Pipeline
• 02C.03.03. Alert Generation Pipeline
• 02C.03.06. Moving Object Pipeline
• 02C.04.04. Coaddition Pipeline
• 02C.04.04/.05 Association and Detection Pipelines
• 02C.04.06. Object Characterization Pipeline
• 02C.04.03. PSF Estimation
• 02C.01.02.03. Science Pipeline Toolkit
• 02C.03.05/04.07 Common Application Framework
[Slide annotations: Level 1, Level 2, L3]
Data Management Applications Design (LDM-151)
Infrastructure: Petascale Computing, Gbps Networks
• The computing cluster at the LSST Archive at NCSA will run the processing pipelines.
  • Single-user, single-application data center
  • Commodity computing clusters.
  • Distributed file system for scaling and hierarchical storage
  • Local-attached, shared-nothing storage when high bandwidth needed
[Image: Archive Site and U.S. Data Access Center, NCSA, Champaign, IL]
• Long Haul Networks to transport data from Chile to the U.S.
  • 2x100 Gbps from Summit to La Serena (new fiber)
  • 2x40 Gbps for La Serena to Champaign, IL (path diverse, existing fiber)
[Image: Base Site and Chilean Data Access Center, La Serena, Chile]
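To put the link rates above in perspective, here is a rough sketch of the ideal transfer time over each leg. The payload size is an assumption made for illustration (about the size of one uncompressed 3.2-gigapixel, 16-bit exposure) and is not a figure from the slides. Only one path of each 2x pair is counted, and protocol overhead is ignored.

```python
def transfer_seconds(payload_bytes: float, link_gbps: float) -> float:
    """Ideal time to push payload_bytes through a link of link_gbps (10**9 bits/s)."""
    if payload_bytes < 0 or link_gbps <= 0:
        raise ValueError("payload must be >= 0 and link rate must be > 0")
    return payload_bytes * 8 / (link_gbps * 1e9)


# Link rates quoted on the slide; a single path of each pair is counted.
SUMMIT_TO_LA_SERENA_GBPS = 100.0
LA_SERENA_TO_CHAMPAIGN_GBPS = 40.0

# ASSUMPTION: hypothetical payload size, not a figure from the slides.
PAYLOAD_BYTES = 6.4e9

summit_leg = transfer_seconds(PAYLOAD_BYTES, SUMMIT_TO_LA_SERENA_GBPS)
long_haul_leg = transfer_seconds(PAYLOAD_BYTES, LA_SERENA_TO_CHAMPAIGN_GBPS)

print(f"Summit -> La Serena: {summit_leg:.3f} s")
print(f"La Serena -> Champaign: {long_haul_leg:.3f} s")
```

Under these assumptions the two legs take about 0.5 s and 1.3 s, which is small compared with the 60-second alert window quoted earlier in the deck.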
Middleware Layer: Isolating Hardware, Orchestrating Software
• Enabling execution of science pipelines on hundreds of thousands of cores.
  • Frameworks to construct pipelines out of basic algorithmic components
  • Orchestration of execution on thousands of cores
  • Control and monitoring of the whole DM System
• Isolating the science pipelines from details of underlying hardware
  • Services used by applications to access/produce data and communicate
  • "Common denominator" interfaces handle changing underlying technologies
Data Management Middleware Design (LDM-152)
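The idea of building pipelines out of basic algorithmic components can be sketched in a few lines. This is a toy illustration only, not the LSST middleware API: the `Pipeline` class, the stage functions, and the provenance list are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Data = Dict[str, float]
Stage = Callable[[Data], Data]


@dataclass
class Pipeline:
    """Runs stages in order and records which stages ran (a toy provenance log)."""
    stages: List[Stage]
    provenance: List[str] = field(default_factory=list)

    def run(self, data: Data) -> Data:
        for stage in self.stages:
            data = stage(data)
            self.provenance.append(stage.__name__)
        return data


def subtract_bias(data: Data) -> Data:
    # Hypothetical stage: remove a constant offset from the pixel value.
    return {**data, "pixel": data["pixel"] - data["bias"]}


def apply_gain(data: Data) -> Data:
    # Hypothetical stage: scale the pixel value by a gain factor.
    return {**data, "pixel": data["pixel"] * data["gain"]}


pipeline = Pipeline(stages=[subtract_bias, apply_gain])
result = pipeline.run({"pixel": 110.0, "bias": 10.0, "gain": 2.0})

print(result["pixel"])      # 200.0
print(pipeline.provenance)  # ['subtract_bias', 'apply_gain']
```

Because each stage has the same signature, stages can be reordered or reused in other pipelines, and the framework (here, the `run` loop) is the single place where services such as provenance recording are applied.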
Database and Science UI: Delivering to Users
• Massively parallel, distributed, fault-tolerant relational database.
  • To be built on existing, robust, well-understood technologies (MySQL and xrootd)
  • Commodity hardware, open source
  • Advanced prototype in existence (qserv)
• Science User Interface to enable access to and analysis of LSST data
  • Web and machine interfaces to LSST databases
  • Visualization and analysis capabilities
More: Talks by Becla, Van Dyk
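A massively parallel, shared-nothing database typically answers a query by sending it to every data partition and combining the partial results. The sketch below illustrates that general scatter-gather pattern only; it is not how qserv is implemented. The partitioning by right ascension, the function names, and the use of threads as stand-ins for worker nodes are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[int, float]  # (object_id, ra_degrees)


def partition_by_ra(rows: List[Row], n_chunks: int) -> Dict[int, List[Row]]:
    """Split rows into n_chunks equal-width right-ascension stripes."""
    width = 360.0 / n_chunks
    chunks: Dict[int, List[Row]] = {i: [] for i in range(n_chunks)}
    for row in rows:
        index = min(int(row[1] // width), n_chunks - 1)
        chunks[index].append(row)
    return chunks


def count_in_range(chunk: List[Row], ra_min: float, ra_max: float) -> int:
    """Per-partition work: count rows with ra_min <= ra < ra_max."""
    return sum(1 for _, ra in chunk if ra_min <= ra < ra_max)


def distributed_count(chunks: Dict[int, List[Row]], ra_min: float, ra_max: float) -> int:
    """Scatter the query to every partition, then gather and sum the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda c: count_in_range(c, ra_min, ra_max), chunks.values())
        return sum(partials)


# Synthetic rows for demonstration; threads stand in for worker nodes.
rows = [(i, (i * 37.0) % 360.0) for i in range(1000)]
chunks = partition_by_ra(rows, n_chunks=8)
total = distributed_count(chunks, 10.0, 50.0)

print(total == count_in_range(rows, 10.0, 50.0))  # True
```

The distributed answer matches a single-pass count over all rows because every row lives in exactly one partition, which is the property that lets such a system scale out across commodity nodes.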
Critical Prototypes: Algorithms and Technologies
• Algorithm Design
  • Approximately 60% of the software functional capability has been prototyped
  • Over 350,000 lines of C++ and Python coded, unit tested, integrated, run in production mode
  • Have released three terabyte-scale datasets, including single frame measurements, point source and galaxy photometry
  • Precursors leveraged: Pan-STARRS, SDSS, HSC
• Petascale Computing Design
  • Executed in parallel on up to 10k cores (TeraGrid/XSEDE and NCSA Blue Waters hardware) with scalable results
• Gigascale Network Design
  • Currently testing at up to 1 Gbps
  • Agreements in principle are in hand with key infrastructure providers (NCSA, FIU/AmPath, REUNA, IN2P3)
• Petascale Database Design
  • Conducted parallel database tests up to 300 nodes, 100 TB of data, 100% of scale for operations year 1
Data Management Scope is Defined and Requirements are Established
• Data Product requirements have been vetted with Science Collaborations multiple times and have successfully passed review (Jul ’13)
• Data quality and algorithmic assessments are far advanced, we understand the risks, and they have successfully passed review (Sep ’13)
• Hardware sizing has been refreshed based on latest scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
• Interfaces are defined to Phase 2 level
• Requirements and Final Design have been baselined (Data Management Technical Control Team)
• Traceability from OSS to DMSR has been verified
• All WBS elements have been estimated and scheduled in PMCS with scope and basis of estimate documented
Data Management ICDs needed for Construction start are at Phase 2 Level
[ICD status table; legend: √ under formal change control; in progress (Phase 1)]
ICDs on Confluence: http://ls.st/mmm
Docushare: http://ls.st/col-1033
Going Where the Talent is: Distributed Team
[Map labels: User Interfaces; Database Mgmt, I&T, and Science QA; Science Pipelines; Middleware; Infrastructure]
Data Management Organization
• DM Lead institutions are integrated into one project and are performing in their construction roles/responsibilities
LSST DM Leadership: Project Manager J. Kantor; Project Scientist M. Juric
Organization chart units: System Architecture; Alert Production; Survey Science Group; Science User Interface & Tools; Science Database & Data Acc Services; Data Release Production; Processing Services & Site Infrastructure; International Comms/Base Site
Names and institutions as listed on the chart: R. Lupton, J. Swinbank (Princeton); K-T. Lim, G. Dubois-Felsmann (SLAC); R. Lambert (NOAO); D. Petravick (NCSA); J. Becla (SLAC); X. Wu, D. Ciardi (IPAC); Connolly (UW)/OPEN; SSG Lead Scientist TBD; F. Economou (LSST)
Data Management Organization document-139
Leveraging national and international investments
• NSF/OCI Funded
  • Formal relationships continue with the IRNC-funded AmLight project, which is the lead entity in securing Chile - US network capacity for LSST
  • We have leveraged significant XSEDE and Blue Waters Service Unit and storage allocations for critical R&D phase prototypes and productions
  • Our LSST Archive Center and US Data Access Center will be hosted in the National Petascale Computing Facility at NCSA
  • A strong relationship has been established with the Condor Group at the University of Wisconsin and HTCondor is now in our processing middleware baseline
  • We have reused a wide range of open source software libraries and tools, many of which received seed funding from the NSF
• Other National/International Funded
  • We have participated in joint development of astronomical software with Pan-STARRS and HSC
  • We have fostered collaborative development of scientific database technology via the eXtremely Large Data Base (XLDB) conferences and collaborations with database developers (e.g. SciDB, MySQL, MonetDB)
• We have a deep process of community engagement to deliver products that are needed, and an architecture to allow the community to deliver their own tools
Data Management is Construction Ready
• The Data Management System is scoped and credibly estimated
  • Requirements have been baselined and are achievable (LSE-61)
  • Final Design baselined (LDM-148, -151, -152, -129, -135)
  • Approximately 60% of the software functional capability has been prototyped
  • Data and algorithmic assessments are far advanced and we understand the risks
  • Hardware sizing has been done based on scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
  • All lowest level WBS elements have been estimated and scheduled in PMCS with scope and basis of estimate documented
• All lead institutions are demonstrably integrated into one project and are performing in their construction roles/responsibilities
  • Core lead technical personnel are on board at all institutions
  • Agreements in principle are in hand with key technology and center providers (NCSA, NOAO, FIU/AmPath, REUNA)
• The software development process has been exercised fully
  • Have successfully executed eight software and data releases
  • Standard/formal processes, tools, environment exercised repeatedly and refined
  • Automated build and test environment is configured and exercised nightly/weekly
• Data Management PMCS plans current and complete