The GRID Adventures: SDSC's Storage Resource Broker and Web Services in Digital Library Applications. Arcot Rajasekar , Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU San Diego Supercomputer Center University of California, San Diego. Staff Reagan Moore Chaitan Baru.
The GRID Adventures: SDSC's Storage Resource Broker and Web Services in Digital Library Applications Arcot Rajasekar, Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU San Diego Supercomputer Center University of California, San Diego
Staff Reagan Moore Chaitan Baru Data and Knowledge Systems • Data Mining Lab (Tony Fountain) • Advanced Query Processing Lab (Amarnath Gupta) • Knowledge-Based Integration Lab (Bertram Ludäscher) • Data Grid Lab (Arcot Rajasekar) • Spatial Information Systems Lab (Ilya Zaslavsky) • + 2-3 programmers in each lab, + graduate and undergraduate students • Now: connecting research with production databases and data grid solutions
Overview • Intro • SDSC and NPACI • Part I: technologies • What is Data Grid • Data, Information, and Knowledge Infrastructures at SDSC/DICE • SDSC Storage Resource Broker, with examples • MIX (Mediation of Information Using XML), and Knowledge-Based Mediation • Part II: case studies • BIRN: the First Operational Data Grid • Web Services Demos • Persistent Archives at SDSC • Summary
A Distributed National Laboratory for Computational Science and Engineering
1st Teraflops System for US Academia • 1 TFLOPs IBM SP • 144 8-processor compute nodes • 12 2-processor service nodes • 1,176 Power3 processors at 222 MHz • Initially > 640 GB memory (4 GB/node), upgrade to > 1 TB later • 6.8 TB switch-attached disk storage • Largest SP with 8-way nodes • High-performance access to HPSS Nov 1999
Bioinformatics Infrastructure for Large-Scale Analyses • Next-generation tools for accessing, manipulating, and analyzing biological data • Biology, Stanford University • DICE, SDSC • Analysis of Protein Data Bank, GenBank and other databases • Accelerate key discoveries for health and medicine • Supporting and leveraging new data grid projects, such as BIRN in biology
SRB Part I: technologies What is Data Grid Data, Information, and Knowledge Infrastructures at SDSC/DICE SDSC Storage Resource Broker MIX (Mediation of Information Using XML), and Knowledge-Based Mediation
What are Data Grids? • Power Grid Analogy • Multiple power generators • Complex transmission networks with switching • Simple Usage Interface – plug and play • Guaranteed Supply - Meeting of demands (peak and lull) • Complex cost function • More than one data provider • Best movement of data across computer networks • Seamless Access to Data with good ‘Finding Aids’ • Guarantee of Data Access • Access Control, Quotas & Complex Usage Costing
Data Grids Data Grid - linking multiple data collections Separate name spaces Separate schema Separate administration domains Heterogeneous database instances Database A Data grid Database B The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
Federated Digital Libraries Virtual Data Grid - linking multiple data collections Ability to execute processes to recreate derived data Database A Services Virtual Data Grid Database B Services The virtual data grid integrates data grid and digital library technology to manage processes
Why Data Grids: Data Handling Problems • Large Datasets; Large Number of Datasets; Scaling • Distributed, Heterogeneous Storage • Virtualization & Transparency • Collaboration, Access Control, Authentication, Security • Replication, Coherency, Synchronization • Fault Tolerance and Load Distribution • Scheduling, Caching & Data Placements • Data Migration over Time & Space • Data/Collection Curation • Uniform Name Space • Handling Legacy Data and Data/Resource Evolution • User-friendly Interfaces – foster collaborations
Why Data Grids: Metadata Problems • Types of Metadata – Relational to XML to unstructured • Standardized to User-defined Metadata • Large Number of Attributes; • Large Size; Scaling • Federation - integration over space • Evolution - integration over time • Evolution - integration over contexts • Discovery and Search • Presentation – user friendly • Extraction and Maintenance
DAKS Data Management Hierarchy • Model-Based Information Management • Rule-based ontology mapping, conceptual-level mediation - CMIX • Information Mediation • Data federation across multiple libraries - MIX • Digital Library • Interoperable services for information discovery and presentation - SDLIP • Data Collection • Tools for managing data set collections on databases - MCAT • Data Handling • Systems for data retrieval from remote storage - SRB • Persistent Archives • Storage of data collections for 30+ years
Distributed Storage Resources (database systems, archival storage systems, file systems, ftp, http, …) SRB as a Solution • The Storage Resource Broker is a middleware • It virtualizes resource access • It mediates access to distributed heterogeneous resources • It uses a MetaCATalog to facilitate the brokering • It integrates data and metadata MCAT Application SRB Server HRM DB2, Oracle, Illustra, ObjectStore HPSS, ADSM, UniTree UNIX, NTFS, HTTP, FTP
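The brokering idea on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SRB API: the class names `Catalog`, `InMemoryResource`, and `Broker`, the resource names, and the paths are all hypothetical. It shows only the core pattern, in which a client asks for a logical name and the broker consults a catalog to find which storage resource holds the object and where.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog: logical name -> (resource, physical path)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        self.entries[logical_name] = (resource, physical_path)

    def lookup(self, logical_name):
        return self.entries[logical_name]


class InMemoryResource:
    """Hypothetical storage driver; a real one would wrap a file system, archive, or database."""

    def __init__(self):
        self._blobs = {}

    def put(self, path, data):
        self._blobs[path] = data

    def get(self, path):
        return self._blobs[path]


class Broker:
    """Routes reads and writes by logical name, hiding which resource holds the bytes."""

    def __init__(self, catalog, resources):
        self.catalog = catalog
        self.resources = resources

    def ingest(self, logical_name, resource, physical_path, data):
        self.resources[resource].put(physical_path, data)
        self.catalog.register(logical_name, resource, physical_path)

    def read(self, logical_name):
        resource, path = self.catalog.lookup(logical_name)
        return self.resources[resource].get(path)


# Two hypothetical resources of different kinds behind one name space.
resources = {"unix-disk": InMemoryResource(), "tape-archive": InMemoryResource()}
broker = Broker(Catalog(), resources)
broker.ingest("/home/demo/image1.fits", "tape-archive", "/hpss/c0042/obj17", b"pixels")
data = broker.read("/home/demo/image1.fits")
```

The client code never names the archive or the physical path when reading, which is the sense in which resource access is virtualized.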
Application Resource, Mthd, User Java, NT Browsers Prolog Predicate C, C++, Linux I/O Unix Shell Metadata Extraction Web User Defined SRB Remote Proxies MCAT Databases DB2, Oracle, Sybase Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX HRM Dublin Core DataCutter Application Meta-data Solution SRB SDSC Storage Resource Broker & Meta-data Catalog
[Figure: SRB Space, showing Clients, SRB servers, and DR, DL, and MC nodes.] Legend: DR - Data Repository, DL - Digital Library, MC - Meta Catalog
MySRB: Web-based Access to the SRB • Browse in Hierarchical Collections • Registration of (remote) Legacy Files & Directories • Registration of SQL Objects • Registration of URLs • Data Movement Operations • Ingest & Re-Ingest, Delete, Unlink • Replicate, Copy, Move, S-Link • Access Control Operations • Read, Write, Own, Curate, Annotate, … • Ticket-based Access • Version Control Operations • Read Lock, Write Lock, Unlock • Check In, Check Out
Metadata Management in MySRB • Types of Metadata • System-level Metadata • Size, resource, owner, date, access control, … • User-defined Metadata • for data & collections • <name,value,unit> triples • No limit on the number of metadata attributes • Support for Collection-level schemas • Comments, default values, drop-down lists • Support for Standardized Schemas • (e.g. Dublin Core) • Annotations • Supports textual annotations • Annotator, date, context also registered
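The <name,value,unit> triple model for user-defined metadata can be illustrated with a minimal Python sketch. It assumes an in-memory store; the class `MetadataStore`, its methods, and the object names are hypothetical and are not the MySRB or MCAT interface.

```python
from collections import defaultdict


class MetadataStore:
    """Toy store of user-defined metadata as (name, value, unit) triples per object."""

    def __init__(self):
        self._triples = defaultdict(list)

    def annotate(self, obj, name, value, unit=""):
        # No fixed schema: any number of triples may be attached to an object.
        self._triples[obj].append((name, value, unit))

    def attributes(self, obj):
        return list(self._triples[obj])

    def find(self, name, predicate):
        """Return objects having an attribute `name` whose value satisfies `predicate`."""
        return sorted(
            obj
            for obj, triples in self._triples.items()
            if any(n == name and predicate(v) for n, v, _ in triples)
        )


store = MetadataStore()
store.annotate("/demo/scan1.img", "exposure", 30.0, "s")
store.annotate("/demo/scan1.img", "species", "rat")
store.annotate("/demo/scan2.img", "exposure", 5.0, "s")
long_exposures = store.find("exposure", lambda v: v > 10)
```

Because attributes are free-form triples rather than fixed columns, discovery becomes a search over attribute names and values, which is what `find` does here.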
SRB Projects • Digital Libraries • UCB, UMich, UCSB, Stanford, CDL • NSF NSDL - UCAR / DLESE • NASA Information Power Grid • DOE ASCI Data Visualization Corridor • Astronomy • National Virtual Observatory • 2MASS Project (2 Micron All Sky Survey) • Particle Physics • Particle Physics Data Grid (DOE) • GriPhyN • SLAC Synchrotron Data Repository • Medicine • Visible Embryo (NLM) • Earth Systems Sciences • ESIPS • LTER • Persistent Archives • NARA • LOC • Neuro Science & Molecular Science • TeleScience, Brain Images, BIRN • JCSG (SSRL/SLAC), AfCS, …
Large Data Project Examples • Astronomy: • National Virtual Observatory • Integrate 18 sky surveys (ITR proposal) • 2MASS Project (2 Micron All Sky Survey) • 10 TB; 5 million files • Co-locate Images for Spatial Access • Data Mining across entire collection • Replicate to Caltech HPSS • Particle Physics: • Particle Physics Data Grid (DOE) • GriPhyN (NSF ITR project) • CERN LHC 1 PB/yr (1 billion objects) • Multi-Lab integration • SLAC Synchrotron Data Repository
National Virtual Observatory Data Grid 1. Portals and Workbenches 2.Knowledge & Resource Management Bulk Data Analysis Metadata View Data View Catalog Analysis 3. Standard APIs and Protocols Concept space 4.Grid Security Caching Replication Backup Scheduling Information Discovery Metadata delivery Data Discovery Data Delivery 5. Standard Metadata format, Data model, Wire format 6. Catalog Mediator Data mediator Catalog/Image Specific Access Compute Resources Catalogs Data Archives Derived Collections 7.
Digital Sky Data Ingestion Data Cache SRB SUN E10K star catalog Informix SUN HPSS 800 GB …. input tapes from telescopes 10 TB SDSC IPAC CALTECH
Digital Sky Data Ingestion: The input data was on tapes in a random (temporal…) order. Ingestion took nearly 1.5 years of almost continuous, round-the-clock operation, using 4 parallel streams at 4 MB/sec per stream. Total: 10+ TB, 5 million 2 MB images in 147,000 containers. SRB performed a spatial sort on data insertion, because scientists view and analyze data by neighborhood. The disk cache (800 GB) for the HPSS containers was used for this. Ingestion speed was limited by input tape reads: only two tapes per day could be read. The workflow incorporated persistence features to deal with network outages and other failures. The C API was used for fine-grained control and to manipulate and insert metadata into the Informix catalog at IPAC Caltech. http://www.ipac.caltech.edu/2mass
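The figures quoted for the ingestion can be cross-checked with a little arithmetic. The sketch below uses only the numbers on the slide, with two assumptions: decimal units (1 TB = 1,000,000 MB) and a 365-day year. The variable names are illustrative.

```python
# Figures taken from the slide.
IMAGES = 5_000_000
IMAGE_MB = 2
CONTAINERS = 147_000
STREAMS = 4
STREAM_MB_PER_S = 4
INGEST_DAYS = 1.5 * 365        # "nearly 1.5 years", assuming 365-day years
TAPES_PER_DAY = 2
SECONDS_PER_DAY = 86_400

total_mb = IMAGES * IMAGE_MB                      # 10,000,000 MB
total_tb = total_mb / 1_000_000                   # 10 TB (decimal units assumed)

images_per_container = IMAGES / CONTAINERS        # about 34 images
container_mb = total_mb / CONTAINERS              # about 68 MB

peak_mb_per_s = STREAMS * STREAM_MB_PER_S         # 16 MB/s across all streams
days_at_peak = total_mb / peak_mb_per_s / SECONDS_PER_DAY        # about 7 days

average_mb_per_s = total_mb / (INGEST_DAYS * SECONDS_PER_DAY)    # about 0.2 MB/s

# Upper bound on tapes read if two tapes were read on every single day.
max_tapes = TAPES_PER_DAY * INGEST_DAYS

print(f"{total_tb:.0f} TB, {images_per_container:.0f} images/container, "
      f"{container_mb:.0f} MB/container")
print(f"{days_at_peak:.1f} days at peak rate vs {INGEST_DAYS:.0f} days observed "
      f"({average_mb_per_s:.2f} MB/s average, at most {max_tapes:.0f} tapes)")
```

At the full network rate the whole collection would have moved in about a week, so the sustained average of roughly 0.2 MB/s over 1.5 years is consistent with the slide's statement that tape reads, not the streams, were the bottleneck.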
DigSky Conclusion • SRB can handle a large number of files • Metadata access delay is still under ½ sec • Replication of large collections • Single command for geographical replication • On-the-fly sorting (out-of-tape sorting) • Availability of data otherwise not possible • Near-line access to 5 million files (10 TB) • Successfully used in web-access & large scale analysis (daily)
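On-the-fly spatial sorting can be pictured as assigning each arriving image to a container keyed by its sky neighborhood, so that images arriving in tape (temporal) order end up co-located by position. The sketch below is an illustration only: the one-degree cell size, the function names, and the file names are assumptions, and the slides do not describe the actual partitioning scheme used for 2MASS.

```python
from collections import defaultdict


def sky_cell(ra_deg, dec_deg, cell_deg=1.0):
    """Map a sky position to a grid cell (hypothetical equal-angle grid)."""
    return (int(ra_deg // cell_deg), int((dec_deg + 90.0) // cell_deg))


def assign_containers(images, cell_deg=1.0):
    """Group images by sky cell as they arrive, regardless of arrival order."""
    containers = defaultdict(list)
    for name, ra, dec in images:
        containers[sky_cell(ra, dec, cell_deg)].append(name)
    return dict(containers)


# Images arrive in tape order: (name, right ascension, declination).
arrivals = [
    ("t001.img", 10.2, -5.1),
    ("t002.img", 200.7, 33.3),
    ("t003.img", 10.9, -5.8),
]
containers = assign_containers(arrivals)
```

Here the first and third images land in the same container even though another image arrived between them, which is the property that makes neighborhood queries cheap later.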
Demonstration • Go to MySRB • For Additional Information: http://www.npaci.edu/dice/srb srb@sdsc.edu
Mediation of Information using XML (MIX) XML Query XML • Export: • Schema & Metadata • (DTD, RDF,…) • Capabilities XML View Document(s) XML View Document(s) XML View Document(s) Wrapper Wrapper Data Source (eg. home ads) Native XML Database Legacy Source
A Typical Mediation Scenario User Interface Query Results Mediator (integrated views over heterogeneous sources) Query “fragment” Query “fragment” Convert incoming query and outgoing data Wrapper Wrapper Wrapper SQL Database GIS HTML
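The mediator and wrapper roles in this scenario can be sketched as follows. This is a minimal illustration, not the MIX implementation: `ListWrapper` and `Mediator` are hypothetical names, the sources are in-memory lists standing in for an SQL database and a GIS, and the "query" is reduced to equality selections.

```python
class ListWrapper:
    """Hypothetical wrapper: translates a selection into a scan of one source's rows."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, **equals):
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in equals.items())]


class Mediator:
    """Sends the same query fragment to every wrapper and merges the partial results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, **equals):
        results = []
        for source, wrapper in self.wrappers.items():
            for row in wrapper.query(**equals):
                # Tag each row with its origin so the integrated view stays traceable.
                results.append({"source": source, **row})
        return results


mediator = Mediator({
    "sql": ListWrapper([{"zip": "92037", "kind": "home"},
                        {"zip": "92101", "kind": "home"}]),
    "gis": ListWrapper([{"zip": "92037", "kind": "parcel"}]),
})
answer = mediator.query(zip="92037")
```

A real wrapper would convert the incoming fragment into the source's own language (SQL, a GIS call, an HTML scrape) and convert the results back; only that conversion step differs between sources.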
The Home Buyer Scenario Web Client XMAS Query Results (XML) MIXm Mediator “Homes” mediator Data Data “Neighborhood” mediator National test scores Data “Schools” mediator Home info (real estate) Community info (name, ZIP) Crime info (ZIP, stats) N’hood info (demographics) Schools info (address, size) School district info (scores,spending,ZIP) www.sandag.cog.ca.us www.sannet.gov www.realtor.com www.asd.com www.homeadvisor.msn.com
An XML Query (XMAS) $C:<*.condo> <address zip=$Z/> </condo> AT www.condo.com AND $S:<*.school type=elementary> <address zip=$Z/> </school> AT schools.org <folder> $C $S for $S </folder> for $C <condosAndSchools> <folder> <condo> <address ... zip=92037> <price>$170k OBO</price> <bedrooms>2</bedrooms> </condo> <school> <name>La Jolla High</name> <address … zip=92037> </school> <school>…</school> </folder> ... <RealEstateAgent> <name>J. Smith</name> <condos> <condo> <address ... zip=92037> <price>$170k OBO</price> <bedrooms>2</bedrooms> </condo> <condos> </RealEstateAgent>
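The effect of the query on this slide, a join of condos and elementary schools on ZIP code with one folder per condo, can be reproduced in plain Python for readers unfamiliar with XMAS. This sketch assumes both sources are already available as well-formed XML strings; the sample data, the function name `condos_and_schools`, and the school names are made up, and the code says nothing about how the MIX mediator evaluates XMAS.

```python
import xml.etree.ElementTree as ET

CONDOS = """<condos>
  <condo><address zip="92037"/><price>$170k OBO</price><bedrooms>2</bedrooms></condo>
  <condo><address zip="92101"/><price>$210k</price><bedrooms>3</bedrooms></condo>
</condos>"""

SCHOOLS = """<schools>
  <school type="elementary"><name>Example Elementary</name><address zip="92037"/></school>
  <school type="high"><name>Example High</name><address zip="92037"/></school>
</schools>"""


def condos_and_schools(condos_xml, schools_xml):
    """Group each condo with the elementary schools that share its ZIP code."""
    condos = ET.fromstring(condos_xml)
    schools = ET.fromstring(schools_xml)
    answer = ET.Element("condosAndSchools")
    for condo in condos.iter("condo"):
        zip_code = condo.find("address").get("zip")
        matches = [
            s for s in schools.iter("school")
            if s.get("type") == "elementary"
            and s.find("address").get("zip") == zip_code
        ]
        if matches:  # the join condition: keep condos with at least one school
            folder = ET.SubElement(answer, "folder")
            folder.append(condo)
            folder.extend(matches)
    return answer


answer = condos_and_schools(CONDOS, SCHOOLS)
print(ET.tostring(answer, encoding="unicode"))
```

The shared variable `$Z` in the XMAS query corresponds to the `zip_code` comparison here, and the nested grouping corresponds to building one `folder` element per condo.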
Home Buyer GUI (Answers) Generated XMAS Query XML Answer Document
User Query Mediator W1 W2 W3 S1 S2 S3 Our Research • In what query language does the user pose a query? • How does the query engine of the mediator rewrite the query? • How does the mediator combine/restructure/post-process partial results? • What data model and query transformation scheme should the wrappers use for different source types? For details: http://www.npaci.edu/DICE/MIX XMAS XML
New MIX Challenges from Scientific Applications • Complex Data • SDSC’s Scientific Data Applications (current/planned, e.g. Neurosciences: NCMIR, NIH BIRN, Earth sciences: GEON, GeoGrid, ...) show that syntactic/structural integration is insufficient for ... Complex Multiple-World Mediation Problems: • complex, disjoint, seemingly unrelated data • “hidden semantics” in complex, indirect relationships => Semantic (aka Model/Knowledge-Based) Mediation • lift mediation to the level of conceptual models (CMs) • use domain experts’ knowledge formalized as rules over CMs => Specialized Extensions • temporal, geospatial, statistical, DQ/accuracy... operations => Extend Mediation Scope and Power via Deductive Rules
An Unresolved Challenge: How do nerve cells change as we learn and remember? A multi-resolution study of the rat hippocampus at Boston University
Dendritic spine morphology and its variations density = #spines/length Reconstructions from the Synapse Lab, Boston University
Hypothesis • Distribution of spines changes • with learning • Each spine type performs a different task in information transmission Next Questions • Does anyone else have corroborative evidence for these observations? • Are these observations true in other comparable parts of the brain? • Is this consistent with the distribution of Calcium-binding proteins? Observations • Spine density, size, shape and PSD vary with maturity • Spine neck geometry controls peak Calcium amount • Calcium flow parameters depend on the different subclasses of spines
Example for Formalizing Domain Knowledge: Domain Map for SYNAPSE and NCMIR • Domain expert knowledge: Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release). [Figure: domain map, with equivalent Description Logic facts] • A domain map comprises • Description Logic facts ... • concepts ("classes") • roles ("associations") • derived properties ... • ... expressed as logic rules (e.g. F-logic)
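A domain map of this kind can be approximated as a small graph of facts plus a derived rule. The Python sketch below encodes some of the expert statements from this slide as "has" and "isa" edges and derives a transitive containment property. It is a stand-in for the idea only: the slides name F-logic as the rule language, and the dictionary layout, concept identifiers, and function names here are assumptions.

```python
# Facts: role edges ("has") and concept membership ("isa"), read off the expert statements.
HAS = {
    "purkinje_cell": {"dendrite"},
    "pyramidal_cell": {"dendrite"},
    "dendrite": {"branch"},
    "branch": {"spine"},
    "spine": {"ion_binding_protein"},
}
ISA = {"spine": {"ion_regulating_component"}}


def has_transitive(concept):
    """Derived property: everything reachable from `concept` through 'has' edges."""
    seen = set()
    stack = [concept]
    while stack:
        for part in HAS.get(stack.pop(), ()):
            if part not in seen:
                seen.add(part)
                stack.append(part)
    return seen


def contains_kind(concept, kind):
    """Rule: a concept contains a `kind` if some transitive part is an instance of it."""
    return any(kind in ISA.get(part, ()) for part in has_transitive(concept))


purkinje_parts = has_transitive("purkinje_cell")
pyramidal_regulates = contains_kind("pyramidal_cell", "ion_regulating_component")
```

No source states directly that pyramidal cells contain ion-regulating components; the fact follows from chaining the registered statements, which is the kind of inference a mediator needs for cross-source queries.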
FL rule proc. LP rule proc. GCM GCM GCM Mediator Engine CM S1 CM S2 CM S3 XSB Engine Graph proc. CM-Wrapper CM-Wrapper CM-Wrapper XML-Wrapper XML-Wrapper XML-Wrapper S3 S1 S2 Extended Mediator Architecture for Semantic Mediation USER/Client CM (Integrated View) Domain Map DM Integrated View Definition IVD CM Plug-In CM Queries & Results (exchanged in XML) Logic API (capabilities)
Part II: case studies BIRN Web Services Persistent Archives
NIH is Funding a Brain Imaging Federated Repository Biomedical Informatics Research Network (BIRN) NIH Plans to Expand to Other Organs and Many Laboratories Part of the UCSD CRBS (Center for Research on Biological Structure) National Partnership for Advanced Computational Infrastructure
Surface atlas, Van Essen Lab stereotaxic atlas LONI MCell, CNL, Salk CCB, Montana SU NCMIR, UCSD Infrastructure for Sharing Neuroscience Data • SOURCES: • NCMIR, U.C. San Diego • Caltech Neuroimaging • Center for Imaging Science, Johns Hopkins • Center for Computational Biology, Montana State • Laboratory of Neuro Imaging (LONI), UCLA • Computational Neurobiology Laboratory, Salk Inst. • Van Essen Laboratory, Washington University • … • Data Management Infrastructure (DAKS/NPACI) • MIX Mediation in XML • MCAT information discovery • SRB data handling • HPSS storage • ... Knowledge-based GRID infrastructure ? ? ? ? Data Management Infrastructure (“Data Grid”) GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS
??? Integrated View ??? ??? Integrated View Definition ??? ???Mediator ??? The Need for Semantic Integration Cross-source queries What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? Cross-source relationships are modeled Semantic (knowledge-based) mediation services Data, relationships, constraints are modeled (CMs) Wrapper Wrapper Wrapper Wrapper Web protein localization morphometry neurotransmission CaBP, Expasy
Purkinje Cell layer of Cerebellar Cortex Molecular layer of Cerebellar Cortex Fragment of dendrite Hidden Semantics: Protein Localization <protein_localization> <neuron type=“purkinje cell” /> <protein channel=“red”> <name>RyR</> …. </protein> <region h_grid_pos=“1” v_grid_pos=“A”> <density> <structure fraction=“0.8”> <name>spine</> <amount name=“RyR”>0</> </> <structure fraction=“0.2”> <name>branchlet</> <amount name=“RyR”>30</> </>
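To show what a consumer of such a record has to do, the sketch below parses a well-formed version of the fragment and pulls out the per-structure amounts for one protein. The completed closing tags, straight quotes, and the function name `amounts_by_structure` are assumptions made so the example parses; the slide shows only an abbreviated excerpt.

```python
import xml.etree.ElementTree as ET

# Well-formed completion of the slide's excerpt (closing tags are assumed).
DOC = """<protein_localization>
  <neuron type="purkinje cell"/>
  <protein channel="red"><name>RyR</name></protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8"><name>spine</name><amount name="RyR">0</amount></structure>
      <structure fraction="0.2"><name>branchlet</name><amount name="RyR">30</amount></structure>
    </density>
  </region>
</protein_localization>"""


def amounts_by_structure(xml_text, protein):
    """Return {structure name: (fraction, amount)} for one protein."""
    root = ET.fromstring(xml_text)
    result = {}
    for structure in root.iter("structure"):
        amount = structure.find(f"amount[@name='{protein}']")
        if amount is not None:
            result[structure.findtext("name")] = (
                float(structure.get("fraction")),
                float(amount.text),
            )
    return result


ryr = amounts_by_structure(DOC, "RyR")
```

The markup says nothing about how spines and branchlets relate to a Purkinje cell dendrite; that relationship is the "hidden semantics" a domain map has to supply before this record can be joined with other sources.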
Mediation Services: Source Registration (System Issues) Source Data Type Query Capability Result Delivery Access Protocol ARC XML QL DOOD SQL tree file table HTTP JDBC SRB Tuple-at-a-time Stream Set-at-a-time SPJ Selections Binary for Viewer
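A registration record covering the four system-level dimensions on this slide might look like the sketch below. The grouping of values under each dimension is one reading of the flattened diagram, and the class `SourceRegistration`, its fields, and the example source are hypothetical rather than part of any actual mediator.

```python
from dataclasses import dataclass

# Allowed values, grouped as read from the slide (the grouping is an assumption).
DATA_TYPES = {"tree", "file", "table"}
ACCESS_PROTOCOLS = {"HTTP", "JDBC", "SRB"}
RESULT_DELIVERY = {"tuple-at-a-time", "set-at-a-time", "stream"}


@dataclass(frozen=True)
class SourceRegistration:
    """What a source declares to the mediator when it registers."""
    name: str
    data_type: str
    query_capabilities: frozenset
    result_delivery: str
    access_protocol: str

    def __post_init__(self):
        checks = [
            (self.data_type, DATA_TYPES, "data type"),
            (self.result_delivery, RESULT_DELIVERY, "result delivery"),
            (self.access_protocol, ACCESS_PROTOCOLS, "access protocol"),
        ]
        for value, allowed, label in checks:
            if value not in allowed:
                raise ValueError(f"unknown {label}: {value!r}")

    def supports(self, capability):
        """Lets a planner check whether a query fragment can be pushed to this source."""
        return capability in self.query_capabilities


morphometry = SourceRegistration(
    name="morphometry-db",
    data_type="table",
    query_capabilities=frozenset({"SQL", "selections"}),
    result_delivery="set-at-a-time",
    access_protocol="JDBC",
)
```

Declaring capabilities up front lets the mediator decide which parts of a query to delegate to a source and which to evaluate itself.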
Mediation Services: Source Registration (Semantics Issues) • Domain Map Registration • provide concept space/ontology • … as a private object (“myANATOM”) • … merge with others (give “semantic bridges”) • … and check for conflicts • Conceptual Model Registration • schema: classes, associations, attributes • domain constraints • “put data into context” (linking data to the domain map)