Data and Knowledge Grids. Chaitan Baru, Co-Director, Data and Knowledge Systems, SDSC. Introduction. SDSC is the leading-edge site of NPACI. SDSC is one of the nodes in the TeraGrid.
Data and Knowledge Grids Chaitan Baru Co-Director, Data and Knowledge Systems SDSC
Introduction • SDSC is the leading-edge site of NPACI • SDSC is one of the nodes in the TeraGrid • SDSC, via NPACI thrust areas, works with a number of applications: Earth System Science, Neuroscience, Molecular Biology, Digital Sky, … • SDSC works on a number of non-NPACI (including industry) projects • The DAKS program receives 80% of its funding from non-NPACI sources • The SDSC DAKS program co-leads the data activities in Cal-(IT)2 via the SDSC/Cal-(IT)2 Data and Knowledge Engineering Lab
Introduction • The SDSC Data and Knowledge Systems (DAKS) program is unique in the nation. It supports: • Computer Science R&D • Applications-driven research • Development of robust software systems • Production data and visualization systems • Involved in Grid-based computing… • (very) High speed networking, fewer, high-performance nodes, “big”, possibly complex, data • …also, Internet-based computing • Web clients, Web databases and mediation, Web services, e.g. the Information Integration Testbed (I2T) Project • Web-based grid computing
DAKS Technology Layers • Layers (as listed on the slide): Applications (Ecoinformatics, environmental science…); Visualization; Data Mining, Simulation Modeling, Analysis, Data Fusion; Knowledge-Based Integration; Advanced Query Processing; Grid Storage (Curated Database); Filesystems, Database Systems; High speed networking; Networked Storage (SAN); Storage hardware • Sensornets (real-time data, video streams): ROADNet, ActiveCampus, Monitoring Health of Civil Infrastructure
Information Integration Testbed (NSF Digital Government/ITR grants) • Clients • “Parameterized” views • Resource discovery • Service discovery • XML-based Mediator • Mediation of geospatial information • Accuracy, resolution issues • [Diagram labels: XML queries, XML, UDDI, WSDL, SOAP, Java Servlets, Sociology Workbench, Stats Server, XML Metadata files, Oracle DBMS]
Community Grid Projects • GriPhyN—Grid Physics Network (NSF ITR) • NVO—National Virtual Observatory (NSF ITR) • BIRN—Biomedical Informatics Research Network (NCRR/NIH) • GEON—GEOsciences Network
GriPhyN: The LIGO Project • Use of COTS DBMS • 1000 channels of data, every 2-3 seconds • Store raw data and basic “products” • Filtering • Request for data by channels/time (GB-TB), returning a result • Data analysis • Request for “full sweep” of data (10’s-100’s TB) • Recalibrate data
Digital Sky Projects: National Virtual Observatory (NVO) • [Diagram labels: Digital images, Image Analysis, Sky Catalogs, Load into DBMS, Catalog A, Catalog B, Correlate across Catalogs, Data mining, Result]
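The “correlate across catalogs” step on the NVO slide can be illustrated with a positional cross-match. This is a minimal sketch under stated assumptions, not the NVO method: the slide does not say how catalogs are correlated, so the `(id, ra, dec)` tuple layout in degrees, the function names, and the matching radius are all hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, radius_deg):
    """For each source in A, pair it with the nearest source in B within radius_deg.

    Both catalogs are lists of (id, ra_deg, dec_deg) tuples (assumed layout).
    Returns a list of (id_a, id_b) pairs; A sources with no match are omitted.
    """
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation) of the closest candidate so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0]))
    return matches
```

The nested loop compares every pair, which is only workable for small inputs; at the catalog sizes the slide implies, a spatial index or a DBMS-side join would be needed instead.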
BIRN • Integrating data from different brain mapping research sites • UCSD, UCLA, Caltech, Duke, Mass General, Harvard • Mouse and human brain • BIRN Data/Knowledge Grid • High-speed networking • Access to distributed data • Semantic mediation • Intra-species and inter-species queries • Visualization and analysis tools
Example of BIRN Federation • Query: Are there changes in axon diameter, and/or number, in the optic nerve of EAE animals, before the development of gross structural changes? • Integrated View, defined by an Integrated View Definition • Mediator • Wrappers, one per source • Sources: Web (CaBP, Expasy), Electron microscopy, Histology, MRI
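The mediator/wrapper arrangement on the federation slide can be sketched as follows. This is an illustration of the general pattern, not BIRN's implementation: the `Wrapper` and `Mediator` classes, the shared record format, and the use of a Python predicate as the "query" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class Wrapper:
    """Adapts one source to a shared record format (hypothetical interface)."""
    source_name: str
    fetch: Callable[[], Iterable[Record]]   # returns records in the source's native form
    translate: Callable[[Record], Record]   # maps native fields to the shared schema

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        results = []
        for native in self.fetch():
            record = dict(self.translate(native))  # copy so the source data is untouched
            record["source"] = self.source_name    # keep provenance for the caller
            if predicate(record):
                results.append(record)
        return results


class Mediator:
    """Answers a query over the integrated view by fanning out to every wrapper."""

    def __init__(self, wrappers: Iterable[Wrapper]):
        self.wrappers = list(wrappers)

    def integrated_view(self, predicate: Callable[[Record], bool]) -> List[Record]:
        merged: List[Record] = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(predicate))
        return merged
```

Each wrapper hides the native field names of its source, so the mediator and the client only ever see the shared schema. A real mediator would also push query conditions down to the sources rather than filtering after retrieval.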
BIRN Layered Architecture • Presentation/Visualization/Application Layer: allows exploration and manipulation of images and volumes • Data Integration Layer (Mediator): allows query access to descriptive and computed information from multiple sources • Computational Grid • Virtual Data Grid (SRB): provides file and collection-level access to any data from any source • Network Layer
GEON • An outcome of the Geoinformatics community workshops • GEON Geoscience Research Themes • Earth's Surface: The Critical Interface Among Humans, Water, the Atmosphere, and Tectonics • Biodiversity: Geoscience and Evolution • Exploring the 4D Architecture of Continents • GEON Information Technology Research • GEON “Deep” Data Modeling and Semantic Mediation of 4D data sets • 4D Visualization and Augmented Reality • Data grids and distributed computing
GEON Participants • Geosciences: R. Arrowsmith, Arizona State University; N. Christensen, University of Wisconsin; M. Crawford, Bryn Mawr; C. Duffy, Pennsylvania State University; C. Flessa, University of Arizona; A. Gary, University of Utah; B. Huber, Smithsonian Institution; R. Keller, University of Texas El Paso; A. Levander, Rice University; M. Liu, University of Missouri; C. Marshall, Harvard University; D. McLaughlin, Massachusetts Institute of Technology; C. Meertens, UNAVCO; J. Oldow, University of Idaho; D. Seber, Cornell University; A.K. Sinha, Virginia Tech; W. Snyder, Boise State University; H. Staudigel, Scripps Institution of Oceanography; H. Wang, University of Wisconsin • Information Technology: M. Bailey, San Diego Supercomputer Center; C. Baru, San Diego Supercomputer Center; B. Ludaescher, San Diego Supercomputer Center; P. Papadopoulos, San Diego Supercomputer Center; Y. Papakonstantinou, University of California San Diego; T. Smith, University of California Santa Barbara • Education and Outreach: M. Marlino, Digital Library for Earth System Education (DLESE)
GEON Partners • Government: USGS, NASA, NOAA, NGDC, State Geologists Association • Academia: IRIS, Cal-(IT)2 • Industry: ESRI, Oracle, Sun, Panoram
GEON Information Integration Example • Biodiversity: The Paleobiology Database (Charles Marshall, Harvard) • Selection criteria: Where, When, Who • “Locality” <name> • [Diagram labels: Species A, Species B, Biological Attributes, Paleoenvironment, Synonymy, Tectonic Setting, Museum holdings, Paleogeography, Phylogeny, Paleolatitude, Mineralogy, International Timescale #1, Sequence Stratigraphy, Body Mass, International Timescale #2, Geochemistry, Lithology]
Complex Multiple-World Integration Scenarios • Current database integration approaches address only • Structural/Schema Conflicts • common semistructured data model (XML) • schema transformations/integration (XML queries & transforms) • Limited Query Capabilities • capability-based rewriting (e.g., TSIMMIS) • These scenarios are “one-world” (e.g. electronic parts catalogs) or simple multiple-world (e.g. “home buyer”) • Problem: Semantic mediation in complex multiple worlds • complex, disjoint, seemingly unrelated data • “hidden semantics” in complex, indirect relationships
Augmented Reality Facility (ARF) • Simulation of database information overlaid on ground reality (Photograph of San Elijo Lagoon, San Diego County, CA)
Scaling the “Network” • Technology: hardware, software • Disseminating “best practices” • Keeping technologies and technological skills up to date
A Common Opportunity • Creating the Data Institute • Common distributed cyberinfrastructure for science communities • Much commonality in IT problems across domains • Support for training of scientists and data managers (“wetware”) • Training in DBMS, GIS, Web, Wireless, Taxonomic DB, Metadata • IT state-of-the-art moves quickly • Dedicated, funded center to develop/modify existing technology • Some requirements of science applications are not directly addressed by commercial technology • “Riding the market” • Leverage industry linkages and commercial technology
A Common Opportunity • Creating the Data Institute • Information clearinghouse/digital library • Leverage what SDSC/Cal-(IT)2 is already doing • Long-term preservation/sustenance of data and software tools • Leverage SDSC’s work with the National Archives and Records Administration (NARA), Library of Congress (LoC), and California Digital Library (CDL) • National Ecological Data Archive • Create sustained community services • E.g. Science UDDI (Universal Description, Discovery, and Integration)
GEON architecture [diagram labels] • GEOSCIENCES Community • GEON Discovery Center (Portal) • Thematic Views: Earth's Surface, Biodiversity, 4D Continental Architecture • Disciplinary Views: Geophysics, Petrology, Tectonics, Geology, Paleontology, ... • Virtual Collections • GEON Participants: Dataset Providers, Tool Providers, Collection Providers, Mediation Teams (View Providers) • GEON Interdisciplinary Themes, Integrated Views • Services: Visualization, Digital Library, Collaboration • Knowledge-Based Integration / Semantic Mediation: domain maps, process maps • GEON Collections • GEON Data Grid Services: authentication, distributed data management, persistent archives • Storage / Networks / Computers: HPSS, SAN, Linux Clusters • Organizations shown: USGS, IRIS, ADEPT/ADL, NASA, UNAVCO, DLESE, NGDC/NOAA, NSDL
Structural vs. Model-Based Mediation • Structural mediation: raw data exposed as (XML) objects and XML elements • Structural constraints (DTDs): Parent, Child, Sibling, ... e.g. A = (B*|C),D; B = ... • No domain constraints • Integrated-DTD := XQuery(Src1-DTD,...) • Model-based mediation: raw data exposed through conceptual models (CMs) • Classes, Relations, is-a, has-a, ... (e.g. classes C1, C2, C3 and relation R) • Logical domain constraints (IF ... THEN ... rules) • DOMAIN MAP • Integrated-CM := CM-QL(Src1-CM,...)
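The role of a domain map in model-based mediation can be illustrated with a small is-a hierarchy. This is a sketch of the idea only: the `DomainMap` class, the `concept` field on records, and the expansion strategy are assumptions for illustration, and nothing here reproduces the CM-QL language named on the slide.

```python
from collections import defaultdict


class DomainMap:
    """A toy domain map holding only is-a relations between concepts."""

    def __init__(self):
        self._children = defaultdict(set)  # parent concept -> direct sub-concepts

    def add_is_a(self, child, parent):
        """Record that `child` is-a `parent`."""
        self._children[parent].add(child)

    def descendants(self, concept):
        """Return `concept` plus every concept that is-a `concept`, transitively."""
        seen = {concept}
        stack = [concept]
        while stack:
            for child in self._children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


def select_by_concept(records, domain_map, concept):
    """Select records whose `concept` tag falls anywhere under `concept`.

    Sources that label data with different, more specific terms are still
    retrieved, because the query is expanded through the domain map first.
    """
    wanted = domain_map.descendants(concept)
    return [r for r in records if r["concept"] in wanted]
```

A purely structural mediator would see only the tag strings and could not tell that a record labelled `granite` answers a query about `igneous_rock`; the is-a relation supplies that link.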
SDSC Storage Resource Broker & Metadata Catalog • Application interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web • SRB with MCAT • Metadata: Resource, User; User Defined; Application Meta-data; Dublin Core • Other labels: Third-party copy, Remote Proxies, HRM, DataCutter • Storage systems: Databases (DB2, Oracle, Sybase); Archives (HPSS, ADSM, UniTree, DMF); File Systems (Unix, NT, Mac OSX)
TeraGrid: 13.6 TF, 6.8 TB memory, 79 TB internal disk, 576 TB network disk • ANL: 1 TF, 0.25 TB memory, 25 TB disk • Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk • NCSA: 6+2 TF, 4 TB memory, 240 TB disk • SDSC: 4.1 TF, 2 TB memory, 225 TB SAN • Systems shown: 574p IA-32 Chiba City, 256p HP X-Class, 128p HP V2500, 128p Origin, 92p IA-32, 1024p IA-32, 320p IA-64, 1176p IBM SP Blue Horizon (1.7 TFLOPs), 15xxp Origin, Sun Server, 2 x Sun E10K, HR Display & VR Facilities • Storage: HPSS, HPSS 300 TB, UniTree • DTF core switch/routers: Cisco 65xx Catalyst Switch (256 Gb/s crossbar), Juniper M160, Extreme Blk Diamond • Networks: vBNS, Abilene, MREN, Calren, ESnet, HSCC, NTON, Starlight (Chicago & LA) • Links: OC-3, OC-12, OC-12 ATM, OC-48, GbE, Myrinet
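The headline TeraGrid totals can be cross-checked against the per-site figures on the slide. The short check below assumes that the network disk total of 576 is in TB, like the per-site disk figures, and that NCSA's "6+2 TF" counts as 8 TF; the variable names are illustrative.

```python
import math

# Per-site figures as quoted on the slide: (compute in TF, disk in TB).
sites = {
    "ANL": (1.0, 25),
    "Caltech": (0.5, 86),
    "NCSA": (6 + 2, 240),   # slide gives "6+2 TF"
    "SDSC": (4.1, 225),     # slide gives "225 TB SAN"
}

total_tf = sum(tf for tf, _ in sites.values())
total_disk_tb = sum(disk for _, disk in sites.values())

# Compare against the slide's headline totals (13.6 TF, 576 network disk).
tf_matches = math.isclose(total_tf, 13.6)
disk_matches = total_disk_tb == 576
```

Both comparisons hold, so the headline compute and network disk figures are the sums of the four sites' contributions.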
SDSC “node” configured to be best site for data-oriented computing in the world • TeraGrid Backbone (40 Gbps) • Argonne: 1 TF, 0.25 TB memory, 25 TB disk • Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk • NCSA: 8 TF, 4 TB memory, 240 TB disk • SDSC: 4.1 TFLOP, 2 TB memory, ~25 TB internal disk, ~225 TB network disk • SDSC systems: Blue Horizon IBM SP (1.7 TFLOPs), 2 x Sun E10K, Sun, HPSS 300 TB, Myrinet Clos Spine • External networks: vBNS, Abilene, Calren, ESnet