NPACI Summer Institute
Data Management Introduction: Real-life Experiences with Data Grids
Reagan Moore and Arcot Rajasekar
(moore, sekar)@sdsc.edu
Summer Institute Agenda
- Prepared presentations
- Preferred presentations on data access
- Your input
- Introduction to data-intensive computing: basic concepts and topics
Monday
- Introduction to data-intensive computing
- SRB introduction and overview
- mySRB web interface
- inQ browser interface
- Shell commands
- Presentations

Tuesday
- Building collections using the SRB
- GridPort interface to the SRB - automated collection building
- Databases - creating a catalog
- Teragrid data model
- Database federation
- Presentations

Wednesday
- Introduction to grid computing
- MGRID - case study
- Grid in 1 hour
- Telescience portal
- MCELL
- Introduction to parallel I/O
- NVO
- APST tutorial
- Presentations

Thursday
- Data mining
- Data visualization

Friday
- Introduction to ROCKS
- Presentations
Introduction to Data Intensive Computing
- Survey of data management requirements
- Data-intensive computing: the common approach
- Basic concepts
- What's real? Production systems
Basic Tenet
- Data without a context is only useful to the owner
- Context is used to automate discovery, access, and use
- Data context can include a wide variety of types of information: provenance, descriptive, data model, structural, authenticity
Second Basic Tenet
- Fundamental concepts (computer science abstractions) underlie all of the data management requirements
- These fundamental concepts can be implemented as generic software
Data Management Requirements: Which Do You Need?
- Data collecting: sensor systems, object ring buffers, and portals
- Data organization: collections, which manage data context
- Data sharing: data grids, which manage heterogeneity
- Data publication: digital libraries, which support discovery
- Data preservation: persistent archives, which manage technology evolution
- Data analysis: processing pipelines and knowledge generation
Common Infrastructure
- Demonstrate how real-time data management, digital libraries, persistent archives, and analysis pipelines can be built on data grids
- Examine the capabilities needed for each environment
- Present real-world examples from multiple scientific disciplines
NSF Infrastructure Programs
- Partnership for Advanced Computational Infrastructure (PACI): data grid - Storage Resource Broker
- Distributed Terascale Facility (DTF/ETF): compute, storage, and network resources
- Digital Library Initiative, Phase II (DLI2): publication, discovery, access
- Information Technology Research (ITR) projects: SCEC (Southern California Earthquake Center), GEON (GeoSciences Network), SEEK (Science Environment for Ecological Knowledge), GriPhyN (Grid Physics Network), NVO (National Virtual Observatory)
- National Middleware Initiative (NMI): hardening of grid technology (security, job execution, grid services)
- National Science Digital Library (NSDL): support for education curricula modules
Federal Infrastructure Programs
- NASA: Information Power Grid (IPG); Advanced Data Grid (ADG); Data Management System - Data Assimilation Office; integration of DODS with the Storage Resource Broker data grid; Earth Observing Satellite (EOS) data pools; Consortium of Earth Observing Satellites (CEOS) data grid
- Library of Congress: National Digital Information Infrastructure and Preservation Program (NDIIPP)
- National Archives and Records Administration and National Historical Public Records Commission: prototype persistent archives
- NIH: Biomedical Informatics Research Network data grid
- DOE: Particle Physics Data Grid
SDSC Collaborations
- Hayden Planetarium simulation & visualization
- NVO - Digital Sky Project (NSF)
- ASCI - Data Visualization Corridor (DOE)
- Particle Physics Data Grid (DOE), GriPhyN (NSF)
- Information Power Grid (NASA)
- Biomedical Informatics Research Network (NIH)
- Knowledge Network for BioComplexity (NSF)
- Molecular science - JCSG, AfCS
- Visual Embryo Project (NLM)
- RoadNet (NSF)
- Earth system sciences - CEED, Bionome, SIO Explorer
- Advanced Data Grid (NASA)
- Hyper LTER Grid Portal (NPACI)
- TeraScale Computing (NSF)
- Long Term Archiving Project (NARA)
- Education - Transana (NPACI)
- NSDL - National Science Digital Library (NSF)
- Digital libraries - ADL, Stanford, UMichigan, UBerkeley, CDL, ...
- 31 additional collaborations
Data Grid Concepts
- Logical name space: global persistent identifiers
- Storage repository abstraction: standard operations supported on storage systems
- Information repository abstraction: standard operations to manage collections in databases
- Access abstraction: standard interface to support alternate APIs
- Latency management mechanisms: aggregation, parallel I/O, replication, caching
- Security interoperability: GSSAPI, inter-realm authentication, collection-based authorization
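As a concrete illustration of the first concept, a logical name space can be thought of as a catalog that maps a global, location-independent identifier to its physical replicas. The sketch below is hypothetical - the class, method names, and URLs are invented for illustration and are not the SRB API:

```python
# Hypothetical sketch of a logical name space: a catalog mapping
# global identifiers to physical replica locations, as a data grid
# metadata catalog (e.g. MCAT) might. Names here are illustrative.

class LogicalNameSpace:
    def __init__(self):
        self._catalog = {}  # logical name -> list of physical locations

    def register(self, logical_name, physical_location):
        """Map a logical name to one more physical replica."""
        self._catalog.setdefault(logical_name, []).append(physical_location)

    def resolve(self, logical_name):
        """Return all physical replicas known for a logical name."""
        return list(self._catalog.get(logical_name, []))

ns = LogicalNameSpace()
ns.register("/nvo/2mass/image001", "hpss://sdsc.edu/arch/img001")
ns.register("/nvo/2mass/image001", "file://caltech.edu/cache/img001")
print(ns.resolve("/nvo/2mass/image001"))
```

Because clients only ever see the logical name, replicas can be added, moved, or dropped without changing any application code - this is what makes the location transparency described later possible.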
Logical Name Space Example - Hayden Planetarium
- Generate a "fly-through" of the evolution of the solar system
- Access data distributed across multiple administrative domains
- Gigabyte files; total data size of 7 TB
- Very tight production schedule - 3 months
Hayden Data Flow
[Diagram: simulation data (2.5 TB, UniTree) at NCSA and production parameters, movies, and images flowing among AMNH (New York, SGI), SDSC (IBM SP2, GPFS, HPSS with 7.5 TB), CalTech, BIRN, and UVa visualization]
Logical Name Space
- Global, location-independent identifiers for digital entities
- Organized as a collection hierarchy
- Attributes mapped to the logical name space and managed in a database
- Types of system metadata: physical location of the file; owner, size, creation time, update time; access controls
Identifiers
- Logical name: global identifier for a virtual organization
- Unique identifier: handle or OID, unique across virtual organizations
- Descriptive name: descriptive attributes for discovery
- Physical name: name of the physical entity, which varies between locations
Data Access
[Diagram: federated SRB server model with peer-to-peer brokering. An application's read request, by logical name or attribute condition, goes to an SRB agent; the MCAT provides (1) logical-to-physical mapping, (2) identification of replicas, and (3) access and audit control; the SRB servers then spawn servers for parallel data access to the physical resources R1 and R2]
Storage Repository Abstraction
- Set of operations used to manipulate data
- Manages data collections stored in:
  - Archives (HPSS, UniTree, ADSM, DMF)
  - Hierarchical resource managers (tapes, tape robots)
  - File systems (Unix, Linux, Mac OS X, Windows)
  - FTP sites
  - Databases (Oracle, DB2, Postgres, SQLServer, Sybase, Informix)
  - Virtual object ring buffers
Storage Repository Abstraction (continued)
- Byte-level access: Unix semantics
- Latency management: bulk operations
- Object-oriented storage: movement of the application to the data
- Access to heterogeneous systems: protocol conversion
Byte-Level Operations
- Unix file system operations:
  - creat(), open(), close(), unlink()
  - read(), write(), seek(), sync()
  - stat(), fstat(), chmod()
  - mkdir(), rmdir(), opendir(), closedir(), readdir()
- Application drivers:
  - Management of file structure at a remote site
  - Paging systems for visualization
  - Pre-fetch (partial file read)
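A storage repository driver built on this operation set can be sketched as follows. The `PosixDriver` interface is invented for illustration (real SRB drivers differ); it simply wraps a local file system, but it shows how byte-level access and partial file reads (pre-fetch) map onto Unix semantics:

```python
# Illustrative storage-driver sketch: the Unix-style operation set
# listed above, implemented over a local file system. The interface
# is hypothetical, not the actual SRB driver API.
import os

class PosixDriver:
    def create(self, path):                 # creat()
        open(path, "wb").close()

    def write(self, path, offset, data):    # seek() + write()
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def read(self, path, offset, length):   # partial file read (pre-fetch)
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def stat(self, path):                   # stat()
        return os.stat(path).st_size

    def unlink(self, path):                 # unlink()
        os.remove(path)

d = PosixDriver()
d.create("demo.dat")
d.write("demo.dat", 0, b"hello world")
print(d.read("demo.dat", 6, 5))   # partial read returns b'world'
d.unlink("demo.dat")
```

The same interface could sit in front of an archive, an FTP site, or a database blob store - the point of the abstraction is that callers never see which.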
Hayden Conclusions
- The SRB was used as the central repository for all original, processed, and rendered data
- Location transparency was crucial for data storage, data sharing, and easy collaboration
- The SRB was successfully used for a commercial project under an "impossible" production deadline dictated by the marketing department
- Collaboration across sites was made feasible with the SRB
Latency Management Example - ASCI (DOE)
- Demonstrate the ability to load collections at terascale rates: large numbers of digital entities, terabyte-sized data
- Optimize interactions with HPSS (the High Performance Storage System): server-initiated I/O, parallel I/O
Latency Management
- Bulk data load
- Bulk data access
- Bulk registration of files
- Aggregation into a container
- Extraction from a container
- Staging, required by the hierarchical resource manager
- Status, required by the hierarchical resource manager
SRB Latency Management
[Diagram: latency-management mechanisms along the path from source to destination - remote proxies and staging, data aggregation in containers, prefetch, caching, client-initiated streaming I/O, parallel I/O, replication, and server-initiated I/O]
ASCI Data Flow
[Diagram: data movement across three hosts - applications and SRB clients talk to an SRB server with a local data cache and file system; the SRB server consults the MCAT (Oracle) and stores data in HPSS]
ASCI Small Files
- Ingesting a very large number of small files into the SRB is time-consuming if the files are ingested one at a time
- Bulk ingestion improves performance
- Ingestion breaks down into two parts: registration of the files with the MCAT, and the I/O operations (file I/O and network data transfer)
- Multi-threading was used for both the registration and the I/O operations
- The Sbload utility was created for this purpose; it reduced the ASCI benchmark time for ingesting ~2,100 files from ~2.5 hours to ~7 seconds
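The idea behind a bulk-ingestion tool of this kind can be sketched as follows: register the whole batch with the catalog in one operation instead of one call per file, and overlap the per-file transfers with a pool of worker threads. Everything here is a stand-in - `register_batch` and `transfer_worker` are invented names, not the real Sbload or SRB calls:

```python
# Sketch of multi-threaded bulk ingestion: one bulk catalog
# registration, plus a thread pool draining a queue of transfers.
# All functions are stand-ins for the real registration and I/O.
import threading
import queue

def register_batch(catalog, names):
    # one bulk catalog operation instead of one round-trip per file
    catalog.extend(names)

def transfer_worker(q, transferred):
    while True:
        item = q.get()
        if item is None:          # sentinel: shut the worker down
            break
        transferred.append(item)  # stand-in for file I/O + network send
        q.task_done()

files = [f"file{i:04d}" for i in range(2100)]
catalog, transferred = [], []
register_batch(catalog, files)    # bulk MCAT-style registration

q = queue.Queue()
threads = [threading.Thread(target=transfer_worker, args=(q, transferred))
           for _ in range(8)]
for t in threads:
    t.start()
for name in files:
    q.put(name)
q.join()                          # wait until every transfer is done
for _ in threads:
    q.put(None)
for t in threads:
    t.join()
print(len(catalog), len(transferred))
```

The speedup reported on the slide comes from exactly these two changes: amortizing catalog round-trips across the batch, and keeping many transfers in flight at once.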
Latency Management Example - Digital Sky Project
- 2MASS (2 Micron All Sky Survey): Bruce Berriman, John Good, and Wen-Piao Lee (IPAC, Caltech)
- NVO (National Virtual Observatory): Tom Prince (Caltech), Roy Williams (CACR, Caltech), John Good (IPAC, Caltech)
- SDSC - SRB: Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore
Digital Sky Data Ingestion
[Diagram: 10 TB of input tapes from the telescopes are read at IPAC, CalTech and ingested at SDSC through the SRB on a Sun E10K with a data cache; the star catalog is held in Informix (Sun) and the images in HPSS (800 GB)]
Digital Sky - 2MASS
- http://www.ipac.caltech.edu/2mass
- The input data was originally written to DLT tapes in the order seen by the telescope
- 10 TB of data, 5 million files
- Ingestion took nearly 1.5 years - almost continuous reading of tapes retrieved from a closet, one at a time
- Images were aggregated into 147,000 containers by the SRB
Containers
- Images sorted by spatial location, so retrieving one container accesses related images
- Minimizes the impact on the archive name space (HPSS stores 680 TB in 17 million files)
- Minimizes the distribution of images across tapes
- Bulk unload by transport of containers
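The spatial-sorting step can be sketched as binning each image by its sky coordinates and grouping every bin into one container, so that neighboring images end up on the same tape. The 10-degree cell size and all names below are invented for illustration; the actual 2MASS binning scheme is not described on the slide:

```python
# Illustrative container aggregation: group images into containers by
# spatial bin (here, 10-degree cells of RA/Dec, an invented bin size)
# so one container holds spatially related images.

def container_key(ra_deg, dec_deg, cell=10.0):
    """Assign an image to a spatial cell on the sky."""
    return (int(ra_deg // cell), int(dec_deg // cell))

def aggregate(images):
    """Group (name, ra, dec) records into per-cell containers."""
    containers = {}
    for name, ra, dec in images:
        containers.setdefault(container_key(ra, dec), []).append(name)
    return containers

images = [("imgA", 12.3, -4.5), ("imgB", 14.9, -8.0), ("imgC", 201.0, 33.2)]
containers = aggregate(images)
print(containers)   # imgA and imgB share a cell; imgC is elsewhere
```

With 5 million files collapsed into 147,000 containers, the archive's name space and tape layout both see roughly 30x fewer objects.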
Digital Sky Web-Based Data Retrieval
[Diagram: web clients at IPAC, CalTech and JPL retrieve an average of 3,000 images a day through the SRB (Sun E10K) at SDSC, backed by the Informix catalog and the 10 TB HPSS archive]
Moving the Application to the Data
- Execution of defined operations directly at the storage system:
  - Metadata extraction from files
  - Extraction of a file from a container
  - Validation of a digital signature
  - Data subsetting
  - Data filtering
  - Server-initiated parallel I/O streams
  - Encryption as a property of the data file
  - Compression as a property of the data file
Remote Proxies
- Task: extract an image cutout from the Digital Palomar Sky Survey (image size 1 GB)
- Shipping the image to the server and extracting the cutout there took 2-4 minutes (5-10 MB/sec)
- A remote proxy instead performed the cutout directly on the storage repository, extracting it with partial file reads
- Image cutouts were returned in 1-2 seconds
- Remote proxies are a mechanism to aggregate I/O commands
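The partial-file-read trick at the heart of the proxy can be sketched as follows: for each row of the requested window, seek directly to that row's slice and read only those bytes, rather than transferring the whole gigabyte image. The flat row-major layout and the tiny image size below are assumptions for illustration:

```python
# Sketch of cutout extraction by partial file reads: seek to each
# needed row slice instead of reading the whole image. A flat
# row-major byte layout is assumed for illustration.
import io

WIDTH, HEIGHT = 100, 100   # illustrative small image (1 byte/pixel)

def cutout(f, x0, y0, w, h):
    """Read only the bytes of the requested window via seek() + read()."""
    rows = []
    for y in range(y0, y0 + h):
        f.seek(y * WIDTH + x0)   # jump straight to this row's slice
        rows.append(f.read(w))
    return b"".join(rows)

# synthetic image: pixel value = byte offset mod 256
data = bytes((y * WIDTH + x) % 256 for y in range(HEIGHT) for x in range(WIDTH))
f = io.BytesIO(data)
patch = cutout(f, 10, 20, 4, 2)
print(len(patch))   # only 8 bytes read, not the whole image
```

Running this logic next to the storage system is what turns a 2-4 minute transfer into a 1-2 second response: the bytes that never leave the repository cost nothing.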
Real-Time Data Example - RoadNet Project
- Manage interactions with a virtual object ring buffer (VORB)
- Demonstrate federation of object ring buffers (ORBs)
- Demonstrate integration of archives, VORBs, and file systems
- Support queries on objects in VORBs
Heterogeneous Systems
- Database blob access
- Database metadata access
- Object ring buffer access
- Archive access
- Hierarchical resource manager access
- HTTP access
- Preferred APIs: Python, Java, C library, shell commands, OAI, WSDL, OGSA, HTTP, DLL
Federated VORB Operation
[Diagram: a request for sensor data, by logical name with stream characteristics, arrives at a VORB server in San Diego. The VORB catalog (VCAT) provides (1) logical-to-physical mapping (the physical sensors are identified), (2) identification of replicas (ORB1 and ORB2 are identified as sources of the required data), and (3) access and audit control. ORB1 at Nome is checked and found to be down, so the system automatically contacts ORB2 through the VORB server at Nome, gets the sensor data from Boston, and formats and transfers it]
Information Abstraction Example - Data Assimilation Office
- HSI has implemented a metadata schema in the SRB/MCAT:
  - Origin: host, path, owner, uid, gid, perm_mask, [times]
  - Ingestion: date, user, user_email, comment
  - Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data
  - Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, production status, temporal/spatial coverage, location, resolution, quality
- Fully compatible with the GCMD
Data Management System: Software Architecture
DODS Access Environment Integration
Peer-to-Peer Federated Systems
- Consistency constraints in federations
- Cross-registering a digital entity from one collection into another: who controls the access control lists? who updates the metadata?
- Grid bricks versus tape archives
- Persistent collections
Grid Services
- Execution of a service creates state information
- Map the state information onto the logical name space
- Associate the state information with a digital entity
- Manage the state information in a registry
- Consistency constraints govern how state information is managed when changes are made
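A minimal sketch of such a registry: each service invocation records its state against a logical name, and later updates can be applied synchronously or held for deferred application. The class, field names, and update modes below are invented to illustrate the idea, not an actual grid services interface:

```python
# Hypothetical state registry for grid services: state is keyed by
# logical name; updates may be applied at once or deferred, echoing
# the synchronous/asynchronous/deferred modes named on a later slide.
import time

class StateRegistry:
    def __init__(self):
        self._state = {}      # logical name -> state entry
        self._deferred = []   # updates waiting to be applied

    def record(self, logical_name, **state):
        self._state[logical_name] = dict(state, updated=time.time())

    def update(self, logical_name, mode="synchronous", **changes):
        if mode == "deferred":
            self._deferred.append((logical_name, changes))
        else:
            self._state[logical_name].update(changes)

    def flush(self):
        """Apply all deferred updates in order."""
        for name, changes in self._deferred:
            self._state[name].update(changes)
        self._deferred.clear()

reg = StateRegistry()
reg.record("/svc/ingest/job42", status="running", done=0.4)
reg.update("/svc/ingest/job42", mode="deferred", status="complete", done=1.0)
reg.flush()
print(reg._state["/svc/ingest/job42"]["status"])
```

Keying the state by logical name is what lets the same discovery and access machinery used for files apply to service state as well.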
SDSC Storage Resource Broker & Metadata Catalog - Access Abstraction
[Diagram: access APIs (application C/C++ libraries, Unix shell, Linux I/O, OAI, WSDL, DLL/Python, Java, NT browsers, GridFTP) pass through consistency management and authorization/authentication to the prime server, which implements the logical name space, latency management, data transport, and metadata transport; a catalog abstraction connects to databases (DB2, Oracle, Sybase, SQLServer, Informix) and a storage abstraction connects to archives (HPSS, ADSM, UniTree, DMF), file systems (Unix, NT, Mac OSX), databases (DB2, Oracle, Postgres), servers, and the HRM]
Logical Name Spaces
- User distinguished name: certificate authority
- Resource logical name (the resources themselves): MDS, grid services registry
- File logical name: replica catalog
- Application abstraction (functions on resources): virtual data grid, grid services registry
Mappings on the Name Space
- Define a logical resource name as a list of physical resources
- Replication: a write to the logical resource completes when all physical resources have a copy
- Load balancing: a write to the logical resource completes when a copy exists on the next physical resource in the list
- Fault tolerance: a write to the logical resource completes when copies exist on "k" of the "n" physical resources
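The three completion policies can be sketched in a few lines. Physical resources are plain dicts here and the function name is invented; a real data grid would issue remote writes, but the completion logic is the same:

```python
# Sketch of the three write-completion policies for a logical
# resource backed by a list of physical resources. The dicts stand
# in for remote storage systems.

def write_logical(resources, name, data, policy="replication", k=1):
    succeeded = 0
    for res in resources:
        try:
            res[name] = data           # stand-in for a physical write
            succeeded += 1
        except Exception:
            continue                   # skip an unavailable resource
        if policy == "load-balance" and succeeded == 1:
            return True                # one copy is enough
        if policy == "fault-tolerant" and succeeded >= k:
            return True                # k of n copies is enough
    return succeeded == len(resources)  # replication: all n copies

stores = [{}, {}, {}]
assert write_logical(stores, "f1", b"x", policy="fault-tolerant", k=2)
print(sum("f1" in s for s in stores))   # 2 copies were written
```

Because the policy is a property of the logical resource, applications get replication, load balancing, or k-of-n durability without changing a line of client code.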
Consistency Constraints
- Service state information: partial completion of a service
- State information updates in the registry: synchronous, asynchronous, or deferred
- Service composition: order of execution
- Virtual organization (needs definition): specification of a set of registries
- Federation of virtual organizations: relationships between registries