Introduction to LSST Data Management
Jeffrey Kantor, Data Management Project Manager
LSST Data Management: Principal Responsibilities
• Archive Raw Data: Receive the incoming stream of images generated by the Camera system and archive the raw images.
• Process to Data Products: Detect and alert on transient events within one minute of visit acquisition. Approximately once per year, create and archive a Data Release, a static, self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for that Data Release.
• Publish: Make all LSST data available through an interface that uses community-accepted standards, and facilitate user data analysis and production of user-defined data products at Data Access Centers (DACs) and external sites.
LSST From the User's Perspective
• A stream of ~10 million time-domain events per night, detected and transmitted to event distribution networks within 60 seconds of observation. (Level 1)
• A catalog of orbits for ~6 million bodies in the Solar System. (Level 1)
• A catalog of ~37 billion objects (20B galaxies, 17B stars), ~7 trillion observations ("sources"), and ~30 trillion measurements ("forced sources"), produced annually and accessible through online databases. (Level 2; an illustrative query sketch follows this list)
• Deep co-added images. (Level 2)
• Services and computing resources at the Data Access Centers to enable user-specified custom processing and analysis. (Level 3)
• Software and APIs enabling development of analysis codes. (Level 3)
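The relationship between objects, sources, and forced sources can be made concrete with a small relational sketch. The table and column names below (Object, Source, ForcedSource, objectId, ra, decl, psfFlux) are hypothetical placeholders for illustration, not the actual LSST schema.

```python
# Illustrative sketch of the Object / Source / ForcedSource relationship
# using an in-memory SQLite database. Table and column names are
# hypothetical placeholders, not the real LSST schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Object: one row per astrophysical object (detected in deep coadds).
cur.execute("CREATE TABLE Object (objectId INTEGER PRIMARY KEY, ra REAL, decl REAL)")
# Source: one row per independent detection of an object in a single visit.
cur.execute("CREATE TABLE Source (sourceId INTEGER PRIMARY KEY, objectId INTEGER, visit INTEGER, psfFlux REAL)")
# ForcedSource: one row per forced measurement at an object's position in each overlapping visit.
cur.execute("CREATE TABLE ForcedSource (objectId INTEGER, visit INTEGER, psfFlux REAL)")

cur.execute("INSERT INTO Object VALUES (1, 150.1, 2.2)")
cur.executemany("INSERT INTO Source VALUES (?, ?, ?, ?)",
                [(10, 1, 1001, 3.2e-29), (11, 1, 1002, 3.1e-29)])
cur.executemany("INSERT INTO ForcedSource VALUES (?, ?, ?)",
                [(1, 1001, 3.2e-29), (1, 1002, 3.1e-29), (1, 1003, 0.4e-29)])

# A typical user query: the forced-photometry light curve of one object.
for row in cur.execute(
        "SELECT f.visit, f.psfFlux FROM ForcedSource f "
        "JOIN Object o ON o.objectId = f.objectId "
        "WHERE o.objectId = ? ORDER BY f.visit", (1,)):
    print(row)
```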
Data Management System Architecture

Application Layer (LDM-151) — Scientific Layer
• Pipelines constructed from reusable, standard "parts", i.e. the Application Framework (a minimal composition sketch follows below)
• Data Product representations standardized
• Metadata extendable without schema change
• Object-oriented Python and C++ custom software
• WBS elements: 02C.06.01 Science Data Archive (Images, Alerts, Catalogs); 02C.01.02.01, 02C.02.01.04, 02C.03, 02C.04 Alert, SDQA, Calibration, Data Release Productions/Pipelines; 02C.03.05, 02C.04.07 Application Framework; 02C.05 Science User Interface and Analysis Tools; 02C.01.02.02-03 SDQA and Science Pipeline Toolkits

Middleware Layer (LDM-152)
• Portability to clusters, grid, and other platforms
• Provides standard services so applications behave consistently (e.g. provenance)
• Preserves performance (<1% overhead)
• Custom software on top of open-source, off-the-shelf software
• WBS elements: 02C.06.02 Data Access Services; 02C.07.01, 02C.06.03 Processing Middleware; 02C.07.02 Infrastructure Services (System Administration, Operations, Security)

Infrastructure Layer (LDM-129) — Distributed Platform
• Different sites specialized for real-time alerting, data release production, and peta-scale data access
• Off-the-shelf, commercial hardware and software with custom integration
• WBS elements: 02C.07.04.01 Archive Site; 02C.07.04.02 Base Site; 02C.08.03 Long-Haul Communications; Physical Plant (included in above)

Data Management System Design (LDM-148)
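A minimal sketch of the "pipelines from reusable parts" idea follows. The names here (Task, Exposure, IsrTask, DetectSourcesTask) are illustrative stand-ins, not the actual Application Framework API described in LDM-151; the point is that pipeline stages share a common interface and pass standardized data products whose metadata can grow without schema changes.

```python
# Minimal sketch of assembling a pipeline from reusable "parts".
# Class and method names are illustrative, not the LSST Application Framework API.
from dataclasses import dataclass, field


@dataclass
class Exposure:
    """A standardized data product with extensible metadata."""
    pixels: list
    metadata: dict = field(default_factory=dict)  # new keys require no schema change


class Task:
    """Common interface shared by every pipeline component."""
    def __init__(self, config=None):
        self.config = config or {}

    def run(self, exposure: Exposure) -> Exposure:
        raise NotImplementedError


class IsrTask(Task):
    """Toy instrumental-signature-removal stage."""
    def run(self, exposure: Exposure) -> Exposure:
        bias = self.config.get("bias", 0.0)
        exposure.pixels = [p - bias for p in exposure.pixels]
        exposure.metadata["isr"] = {"bias": bias}
        return exposure


class DetectSourcesTask(Task):
    """Toy detection stage."""
    def run(self, exposure: Exposure) -> Exposure:
        threshold = self.config.get("threshold", 5.0)
        exposure.metadata["detections"] = [p for p in exposure.pixels if p > threshold]
        return exposure


# A "pipeline" is just an ordered composition of tasks sharing the same interface.
pipeline = [IsrTask({"bias": 1.0}), DetectSourcesTask({"threshold": 5.0})]
exp = Exposure(pixels=[3.0, 8.0, 12.0])
for task in pipeline:
    exp = task.run(exp)
print(exp.metadata)
```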
Mapping Data Products into Pipelines
• 02C.01.02.01/02 Data Quality Assessment Pipelines
• 02C.01.02.04 Calibration Products Production Pipelines
• 02C.03.01 Instrumental Signature Removal Pipeline
• 02C.03.01 Single-Frame Processing Pipeline
• 02C.03.04 Image Differencing Pipeline
• 02C.03.03 Alert Generation Pipeline
• 02C.03.06 Moving Object Pipeline
• 02C.04.04 Coaddition Pipeline
• 02C.04.04/.05 Association and Detection Pipelines
• 02C.04.06 Object Characterization Pipeline
• 02C.04.03 PSF Estimation
• 02C.01.02.03 Science Pipeline Toolkit
• 02C.03.05/04.07 Common Application Framework
These pipelines produce the Level 1, Level 2, and Level 3 data products.
Data Management Applications Design (LDM-151)
Infrastructure: Petascale Computing, Gbps Networks

Archive Site and U.S. Data Access Center (NCSA, Champaign, IL)
• The computing cluster at the LSST Archive at NCSA will run the processing pipelines
• Single-user, single-application data center
• Commodity computing clusters
• Distributed file system for scaling and hierarchical storage
• Local-attached, shared-nothing storage where high bandwidth is needed

Long-Haul Networks to transport data from Chile to the U.S. (a rough transfer-time estimate follows below)
• 2x100 Gbps from Summit to La Serena (new fiber)
• 2x40 Gbps from La Serena to Champaign, IL (path-diverse, existing fiber)

Base Site and Chilean Data Access Center (La Serena, Chile)
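A back-of-the-envelope calculation illustrates why these link capacities matter for the nightly data flow. The visit size used below (~12.8 GB, assuming a 3.2-gigapixel camera, 2 bytes per pixel, and two exposures per visit) is an assumption for illustration, not a figure quoted in this presentation.

```python
# Back-of-the-envelope transfer time for one visit over the long-haul links.
# The visit size is an assumed, illustrative value (not from this document):
# 3.2 Gpixel camera x 2 bytes/pixel x 2 exposures per visit ~ 12.8 GB.
visit_bytes = 3.2e9 * 2 * 2
visit_bits = visit_bytes * 8

for name, gbps in [("Summit -> La Serena (2x100 Gbps)", 200),
                   ("La Serena -> Champaign (2x40 Gbps)", 80)]:
    seconds = visit_bits / (gbps * 1e9)
    print(f"{name}: ~{seconds:.1f} s per visit (ignoring protocol overhead)")
```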
Middleware Layer: Isolating Hardware, Orchestrating Software
• Enabling execution of science pipelines on hundreds of thousands of cores
  • Frameworks to construct pipelines out of basic algorithmic components
  • Orchestration of execution on thousands of cores (a scaled-down sketch follows below)
  • Control and monitoring of the whole DM System
• Isolating the science pipelines from details of the underlying hardware
  • Services used by applications to access/produce data and communicate
  • "Common denominator" interfaces handle changing underlying technologies
Data Management Middleware Design (LDM-152)
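The orchestration idea can be sketched, scaled down to a single machine: the science code only supplies a "process one unit of data" callable, while the surrounding layer decides how the work is distributed and records basic provenance. The names and structure below are illustrative, not the LDM-152 interfaces.

```python
# Minimal, single-machine sketch of middleware-style orchestration:
# the science code only defines process_ccd(); the "middleware" fans the
# work out over a worker pool and records simple provenance.
# Names are illustrative, not the actual LDM-152 design.
from multiprocessing import Pool
import time


def process_ccd(ccd_id):
    """Stand-in for a science pipeline run on one CCD's worth of data."""
    time.sleep(0.01)                      # pretend to do real work
    return {"ccd": ccd_id, "status": "ok"}


def orchestrate(ccd_ids, workers=4):
    """Distribute per-CCD work and collect provenance for the run."""
    start = time.time()
    with Pool(workers) as pool:
        results = pool.map(process_ccd, ccd_ids)
    provenance = {"units": len(results), "wall_seconds": time.time() - start}
    return results, provenance


if __name__ == "__main__":
    results, prov = orchestrate(range(16))
    print(prov)
```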
Database and Science UI: Delivering to Users
• Massively parallel, distributed, fault-tolerant relational database
  • To be built on existing, robust, well-understood technologies (MySQL and xrootd)
  • Commodity hardware, open source
  • Advanced prototype in existence (qserv); a simplified sketch of its chunked query model follows below
• Science User Interface to enable access to and analysis of LSST data
  • Web and machine interfaces to LSST databases
  • Visualization and analysis capabilities
More: talks by Becla, Van Dyk
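The shared-nothing, spatially partitioned query model behind a qserv-like database can be sketched as follows: the sky is divided into chunks, each chunk lives on a separate node (here, a separate in-memory database), and one user query is rewritten into per-chunk subqueries whose results are merged. The chunking scheme, table, and column names are simplified illustrations, not the actual qserv implementation.

```python
# Simplified sketch of a qserv-like spatially partitioned query: each sky
# "chunk" is a separate shared-nothing database; a user query is run per
# chunk and the partial results are merged. Schema and chunking here are
# illustrative, not the real qserv design.
import sqlite3

# One in-memory database per sky chunk (stand-ins for worker nodes).
chunks = {}
for chunk_id, rows in {
    0: [(1, 10.1, -5.0), (2, 10.4, -5.2)],
    1: [(3, 95.0, 20.0)],
}.items():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Object (objectId INTEGER, ra REAL, decl REAL)")
    db.executemany("INSERT INTO Object VALUES (?, ?, ?)", rows)
    chunks[chunk_id] = db


def which_chunks(ra_min, ra_max):
    """Toy chunk index: chunk 0 covers ra < 90, chunk 1 covers ra >= 90."""
    hit = []
    if ra_min < 90:
        hit.append(0)
    if ra_max >= 90:
        hit.append(1)
    return hit


def query_box(ra_min, ra_max, dec_min, dec_max):
    """Rewrite one user query into per-chunk subqueries and merge the results."""
    results = []
    for cid in which_chunks(ra_min, ra_max):
        results += chunks[cid].execute(
            "SELECT objectId, ra, decl FROM Object "
            "WHERE ra BETWEEN ? AND ? AND decl BETWEEN ? AND ?",
            (ra_min, ra_max, dec_min, dec_max)).fetchall()
    return results


print(query_box(10.0, 11.0, -6.0, -4.0))
```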
Critical Prototypes: Algorithms and Technologies
• Algorithm Design
  • Approximately 60% of the software functional capability has been prototyped
  • Over 350,000 lines of C++ and Python coded, unit tested, integrated, and run in production mode
  • Three terabyte-scale datasets released, including single-frame measurements and point source and galaxy photometry
  • Precursors leveraged: Pan-STARRS, SDSS, HSC
• Petascale Computing Design
  • Executed in parallel on up to 10k cores (TeraGrid/XSEDE and NCSA Blue Waters hardware) with scalable results
• Gigascale Network Design
  • Currently testing at up to 1 Gbps
  • Agreements in principle are in hand with key infrastructure providers (NCSA, FIU/AmPath, REUNA, IN2P3)
• Petascale Database Design
  • Conducted parallel database tests on up to 300 nodes with 100 TB of data, 100% of the scale needed for operations year 1
Data Management Scope is Defined and Requirements are Established
• Data Product requirements have been vetted with the Science Collaborations multiple times and successfully passed review (Jul '13)
• Data quality and algorithmic assessments are far advanced, the risks are understood, and they successfully passed review (Sep '13)
• Hardware sizing has been refreshed based on the latest scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
• Interfaces are defined to Phase 2 level
• Requirements and Final Design have been baselined (Data Management Technical Control Team)
• Traceability from the OSS to the DMSR has been verified
• All WBS elements have been estimated and scheduled in PMCS, with scope and basis of estimate documented
Data Management ICDs needed for Construction start are at Phase 2 level; each ICD is either under formal change control (√) or in progress (Phase 1).
ICDs on Confluence: http://ls.st/mmm | Docushare: http://ls.st/col-1033
Going Where the Talent Is: Distributed Team
• User Interfaces
• Database
• Mgmt, I&T, and Science QA
• Science Pipelines
• Middleware
• Infrastructure
Data Management Organization
LSST DM Leadership: Project Manager J. Kantor; Project Scientist M. Juric
• DM lead institutions are integrated into one project and are performing in their construction roles/responsibilities
Leadership areas: System Architecture; Alert Production; Survey Science Group; Science User Interface & Tools; Science Database & Data Access Services; Data Release Production; Processing Services & Site Infrastructure; International Comms/Base Site
Area leads and institutions: R. Lupton, J. Swinbank (Princeton); K-T. Lim, G. Dubois-Felsmann (SLAC); R. Lambert (NOAO); D. Petravick (NCSA); J. Becla (SLAC); X. Wu, D. Ciardi (IPAC); Connolly (UW/OPEN); SSG Lead Scientist: TBD; F. Economou
LSST Data Management Organization (document-139)
Leveraging National and International Investments
NSF/OCI funded:
• Formal relationships continue with the IRNC-funded AmLight project, the lead entity in securing Chile - US network capacity for LSST
• We have leveraged significant XSEDE and Blue Waters service unit and storage allocations for critical R&D-phase prototypes and productions
• The LSST Archive Center and US Data Access Center will be hosted in the National Petascale Computing Facility at NCSA
• A strong relationship has been established with the Condor group at the University of Wisconsin, and HTCondor is now in our processing middleware baseline
• We have reused a wide range of open-source software libraries and tools, many of which received seed funding from the NSF
Other national/international funded:
• We have participated in joint development of astronomical software with Pan-STARRS and HSC
• We have fostered collaborative development of scientific database technology via the eXtremely Large Data Base (XLDB) conferences and collaborations with database developers (e.g. SciDB, MySQL, MonetDB)
• We have a deep process of community engagement to deliver the products that are needed, and an architecture that allows the community to deliver their own tools
Data Management is Construction Ready
• The Data Management System is scoped and credibly estimated
• Requirements have been baselined and are achievable (LSE-61)
• Final Design has been baselined (LDM-148, -151, -152, -129, -135)
• Approximately 60% of the software functional capability has been prototyped
• Data and algorithmic assessments are far advanced and the risks are understood
• Hardware sizing has been done based on scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
• All lowest-level WBS elements have been estimated and scheduled in PMCS, with scope and basis of estimate documented
• All lead institutions are demonstrably integrated into one project and are performing in their construction roles/responsibilities
• Core lead technical personnel are on board at all institutions
• Agreements in principle are in hand with key technology and center providers (NCSA, NOAO, FIU/AmPath, REUNA)
• The software development process has been exercised fully: eight software and data releases have been successfully executed
• Standard/formal processes, tools, and environment have been exercised repeatedly and refined
• The automated build and test environment is configured and exercised nightly/weekly
• Data Management PMCS plans are current and complete