Welcome and Cyberinfrastructure Overview • MSI Cyberinfrastructure Institute, June 26-30, 2006 • Anke Kamrath, Division Director, San Diego Supercomputer Center • kamratha@sdsc.edu
The Digital World: Entertainment • Shopping • Information
Science is a Team Sport: GAMESS • Geosciences • Data Management and Mining • Astronomy • Physics (QCD) • Modeling and Simulation • Life Sciences
Cyberinfrastructure – A Unifying Concept • Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + "glue" (integrating software, systems, and organizations) • NSF's "Atkins Report" provided a compelling vision for integrated cyberinfrastructure
A Deluge of Data • Today data comes from everywhere: • "Volunteer" data • Scientific instruments • Experiments • Sensors and sensornets • Computer simulations • New devices (personal digital devices, computer-enabled clothing, cars, …) • And is used by everyone: • Researchers, educators • Consumers • Practitioners • General public • Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access
Using Data as a Driver: SDSC Cyberinfrastructure • Community databases and data collections: data management, mining, and preservation (e.g., SRB) • Data-oriented HPC resources: high-end storage, large-scale data analysis, simulation, and modeling • Data-oriented tools, software applications, and community codes (e.g., Biology Workbench) • Data and computational science education and training (e.g., Summer Institute) • Collaboration, service, and community leadership for data-oriented projects
Impact on Technology: Data and Storage are Integral to Today's Information Infrastructure • Today's "computer" is a coordinated set of hardware, software, and services providing an "end-to-end" resource: wireless sensors, field instruments, networks, compute, data storage, and visualization • Cyberinfrastructure captures how the research and education community has redefined "computer" • Data and storage are an integral part of today's "computer"
Building a National Data Cyberinfrastructure Center • Goal: SDSC's Data Cyberinfrastructure should "extend the reach" of the local research and education environment by providing: • Access to community and reference data collections • More capable and/or higher-capacity computational resources • Community codes, middleware, software tools and toolkits • Multi-disciplinary expertise • Long-term scientific data preservation
Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges • Applications span two axes: Compute (more FLOPS) and Data (more BYTES) • Traditional HPC applications push FLOPS; data-oriented research applications push BYTES; home, lab, campus, and desktop applications sit at the low end of both
Today's Research Applications Span the Spectrum • [Chart: applications plotted on Compute (more FLOPS) vs. Data (more BYTES) axes] • Data-oriented environments (data management, extreme I/O): Climate, SCEC simulation and visualization, ENZO simulation and visualization, EOL, NVO, turbulence fields • Lend themselves to the Grid: GridSAT, Seti@Home, CiPres • Could be targeted efficiently on the Grid: MCell, CFD • Difficult to target efficiently on the Grid: protein folding/MD, CPMD, QCD, GAMESS, turbulence reattachment length • Traditional HPC and home/lab/campus/desktop applications (e.g., EverQuest) anchor the compute-intensive and low-end corners
Working with Compute and Data: Simulation, Analysis, Modeling • Simulation of a magnitude 7.7 earthquake on the lower (southern) San Andreas Fault • Physics-based dynamic source model: a mesh of 1.8 billion cubes with spatial resolution of 200 m • Builds on 10 years of data and models from the Southern California Earthquake Center • Simulated the first 3 minutes of the earthquake: 22,728 time steps of 0.011 second each • Simulation generates 45+ TB of data • Resources required • Computers and systems: 80,000 hours on DataStar; 256 GB memory p690 used for testing, p655s used for the production run, TeraGrid used for porting; 30 TB global parallel file system (GPFS); run-time 100 MB/s data transfer from GPFS to SAM-QFS; 27,000 hours of post-processing for high-resolution rendering • Storage: SAM-QFS archival storage, HPSS backup, SRB collection with 1,000,000 files • People: 20+ people for IT support, 20+ people in domain research
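As a rough sanity check on these numbers, the sketch below estimates, under stated assumptions (decimal units, a sustained 100 MB/s archive link), how long it takes to drain 45 TB of output to SAM-QFS and the average output volume per time step:

```python
# Back-of-envelope estimates for the TeraShake-scale run described above.
# Assumptions: 45 TB total output (decimal units), a sustained 100 MB/s
# GPFS-to-SAM-QFS link, and 22,728 time steps as stated on the slide.

TOTAL_OUTPUT_BYTES = 45e12      # 45 TB of simulation output
ARCHIVE_RATE = 100e6            # 100 MB/s from GPFS to SAM-QFS
TIME_STEPS = 22_728             # 0.011 s each, ~3 minutes simulated

archive_seconds = TOTAL_OUTPUT_BYTES / ARCHIVE_RATE
print(f"Archiving time: {archive_seconds / 86_400:.1f} days")    # ~5.2 days

per_step = TOTAL_OUTPUT_BYTES / TIME_STEPS
print(f"Average output per time step: {per_step / 1e9:.1f} GB")  # ~2.0 GB
```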
Big Data & Big Compute: Simulating an Earthquake, Step 1 • The southern San Andreas Fault • Divide Southern California into "blocks" • For each block, get all the data on ground surface composition, geological structures, fault information, etc.
Big Data & Big Compute: Simulating an Earthquake, Step 2 • Map the blocks onto the processors (brains) of the computer (a minimal code sketch of this block-to-processor mapping follows step 3 below) • SDSC's DataStar: one of the 25 fastest computers in the world
Big Data & Big Compute: Simulating an Earthquake, Step 3 • Run the simulation using current information on fault activity and the physics of earthquakes
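To make steps 1 and 2 concrete, here is a minimal, hypothetical sketch of cutting a 3D region into blocks and dealing them out to processors. This is illustrative only, not the actual TeraShake code (a parallel finite-difference solver), and the toy grid sizes are assumptions chosen for readable output:

```python
# Illustrative only: split a 3D grid of cells into equal blocks and assign
# each block to a processor rank, the basic idea behind steps 1 and 2.
# The real TeraShake run decomposed ~1.8 billion 200 m cells; the toy
# numbers below just keep the output short.
from itertools import product

CELLS = (8, 8, 4)        # grid cells in x, y, z (toy size)
BLOCKS = (2, 2, 1)       # how many blocks along each axis
NUM_PROCS = 4            # one block per processor here

def block_ranges(cells, blocks):
    """Yield (x, y, z) index ranges, one tuple of ranges per block."""
    for bx, by, bz in product(*(range(b) for b in blocks)):
        yield tuple(
            range(c * i // b, c * (i + 1) // b)
            for c, b, i in zip(cells, blocks, (bx, by, bz))
        )

# Round-robin assignment of blocks to processor ranks.
for rank, ranges in enumerate(block_ranges(CELLS, BLOCKS)):
    proc = rank % NUM_PROCS
    sizes = [len(r) for r in ranges]
    print(f"block {rank} -> processor {proc}, {sizes} cells per axis")
```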
Big Data & Big Compute: Simulating an Earthquake, Step 4 • The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics • How much data was output? 47 TeraBytes, which is: • 4+ times the printed materials in the Library of Congress! or • The amount of music in 2,000+ iPods! or • 10,000 copies of a typical DVD movie! • Managing the data: where to store it? • In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 TeraBytes), roughly 1,000 times the printed materials in the Library of Congress
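A quick sanity check of these comparisons, assuming the deck's own yardsticks (10 TB for the printed Library of Congress, from the "How much Digital Data?" slide) plus two assumed figures: roughly 20 GB per 2005-era iPod and 4.7 GB per single-layer DVD:

```python
# Sanity-check the slide's size comparisons (decimal units throughout).
# Yardsticks: 10 TB Library of Congress (the deck's own figure);
# 20 GB iPod and 4.7 GB DVD are our assumptions.
OUTPUT = 47e12          # 47 TB of TeraShake output
HPSS_CAPACITY = 10e15   # 10 PB tape library

LOC = 10e12             # printed Library of Congress
IPOD = 20e9             # a typical 2005-era iPod
DVD = 4.7e9             # single-layer DVD

print(f"Output vs. Library of Congress: {OUTPUT / LOC:.1f}x")         # ~4.7x
print(f"Output in iPods:                {OUTPUT / IPOD:,.0f}")         # ~2,350
print(f"Output in DVDs:                 {OUTPUT / DVD:,.0f}")          # ~10,000
print(f"HPSS vs. Library of Congress:   {HPSS_CAPACITY / LOC:,.0f}x")  # 1,000x
```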
How long would TeraShake take on your desktop computer? Roughly 72 centuries!
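The slide does not show its arithmetic, but a hedged reconstruction is straightforward: 72 centuries is about 63 million hours, which follows if a single 2006-era desktop is on the order of 800 times slower, in aggregate, than the run's 80,000 DataStar hours. The slowdown factor below is our assumption, chosen only to show how such an estimate is built, not a published number:

```python
# Hedged reconstruction of the "72 centuries" claim. The 80,000
# DataStar hours come from the resources slide; the ~790x aggregate
# desktop slowdown is an assumption, not a measured figure.
DATASTAR_HOURS = 80_000          # allocated hours for the run
DESKTOP_SLOWDOWN = 790           # assumed aggregate slowdown factor

desktop_hours = DATASTAR_HOURS * DESKTOP_SLOWDOWN
years = desktop_hours / (24 * 365.25)
print(f"Desktop estimate: {years:,.0f} years (~{years / 100:.0f} centuries)")
```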
Better Neurosurgery Through Cyberinfrastructure • PROBLEM: Neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue • The brain deforms during surgery, so surgeons must align preoperative brain images with intra-operative images to get the best opportunity for intra-surgical navigation • Radiologists and neurosurgeons at Brigham and Women's Hospital, Harvard Medical School are exploring transmission of 30-40 MB brain images (generated during surgery) to SDSC for analysis and alignment • Transmission is repeated every hour during a 6-8 hour surgery; transmission and output must take on the order of minutes • A finite element simulation on a biomechanical model of volumetric deformation is performed at SDSC; output results are sent back to BWH, where updated images are shown to surgeons
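The "order of minutes" budget is easy to check: the sketch below estimates one-way transfer time for a 40 MB image at a few assumed network rates (the slide does not state the actual link speeds):

```python
# Why link speed matters for the hourly surgical image round-trip.
# The 40 MB image size is from the slide; the link rates are assumptions.
IMAGE_BYTES = 40e6  # upper end of the 30-40 MB intra-operative images

for label, bits_per_sec in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    seconds = IMAGE_BYTES * 8 / bits_per_sec
    print(f"{label:>8}: {seconds:6.1f} s one way")
# 10 Mbps -> 32 s; 100 Mbps -> 3.2 s; 1 Gbps -> 0.32 s
# Even at modest rates the transfer fits a minutes-scale budget, leaving
# most of that budget for the finite element alignment computed at SDSC.
```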
Community Data Repository: SDSC DataCentral • Provides "data allocations" on SDSC resources to the national science and engineering community: • Data collection and database hosting • Batch-oriented access • Collection management services • First broad program of its kind to support research and community data collections and databases • Comprehensive resources: • Disk: 400 TB accessible via HPC systems, Web, SRB, GridFTP • Databases: DB2, Oracle, MySQL • SRB: collection management • Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP • 24/7 operations, collection specialists • DataCentral infrastructure includes: Web-based portal, security, networking, UPS systems, web services and software tools • Example allocated data collections include: • Bee Behavior (Behavioral Science) • C5 Landscape DB (Art) • Molecular Recognition Database (Pharmaceutical Sciences) • LIDAR (Geoscience) • AMANDA (Physics) • SIO_Explorer (Oceanography) • Tsunami and Landsat Data (Earthquake Engineering) • Terabridge (Structural Engineering)
Data Cyberinfrastructure Requires a Coordinated Approach • A layered, interoperable, integrated stack, with a question at each level: • Applications (medical informatics, biosciences, ecoinformatics, …): How do we combine data, knowledge, and information management with simulation and modeling? • Visualization: How do we represent data, information, and knowledge to the user? • Data mining, simulation, modeling, analysis, data fusion: How do we detect trends and relationships in data? • Knowledge-based integration, advanced query processing: How do we obtain usable information from data? • Grid storage, file systems, database systems: How do we collect, access, and organize data? • Hardware (high-speed networking, networked storage/SAN, sensornets, instruments, storage hardware, HPC): How do we configure computer architectures to optimally support data-oriented computing?
Working with Data: Data Integration for New Discovery • Data integration in the biosciences: disciplinary databases span scales from atoms to organisms (medicinal chemistry, genomics, proteomics, cell biology, physiology, anatomy), linked to users by software to access and federate data • Data integration in the geosciences: complex "multiple-worlds" mediation across geophysical, geochronologic, and geochemical data, foliation maps, and geologic maps • Example questions: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What is the distribution and U/Pb zircon ages of A-type plutons in VA, and how does it relate to host rock structures?
Data Preservation • Many science, cultural, and official collections must be sustained for the foreseeable future • Critical collections must be preserved: • Community reference data collections (e.g., the Protein Data Bank) • Irreplaceable collections (e.g., field data such as tsunami reconnaissance) • Longitudinal data (e.g., PSID, the Panel Study of Income Dynamics) • No plan for preservation often means that data is lost or damaged • "…the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come." ("Preserving Our Digital Heritage", Library of Congress)
How much Digital Data*? • 1 low-resolution photo = 100 KiloBytes • 1 novel = 1 MegaByte • iPod Shuffle (up to 120 songs) = 512 MegaBytes • Printed materials in the Library of Congress = 10 TeraBytes • 1 human brain at the micron level = 1 PetaByte • SDSC HPSS tape archive = 6 PetaBytes • All worldwide information in one year = 2 ExaBytes • (* rough/average estimates)
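These yardsticks differ by many orders of magnitude; the quick sketch below (decimal units, using only the slide's own rough estimates) prints the jump between each rung of the ladder:

```python
# The slide's estimates, normalized to bytes (decimal units) so the
# jumps in scale are visible at a glance. All figures are the slide's
# own rough/average estimates, not precise measurements.
ESTIMATES = [
    ("Low-resolution photo",        100e3),
    ("Novel",                       1e6),
    ("iPod Shuffle",                512e6),
    ("Library of Congress (print)", 10e12),
    ("Human brain (micron level)",  1e15),
    ("SDSC HPSS tape archive",      6e15),
    ("World's info in one year",    2e18),
]

prev = None
for name, size in ESTIMATES:
    ratio = f"  ({size / prev:,.0f}x the previous item)" if prev else ""
    print(f"{name:<28} {size:>8.0e} bytes{ratio}")
    prev = size
```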
Key Challenges for Digital Preservation • What should we preserve? • What materials must be "rescued"? • How do we plan for preservation of materials by design? • How should we preserve it? • Formats • Storage media • Stewardship: who is responsible? • Who should pay for preservation? The content generators? The government? The users? • Who should have access? • Print media provides easy access for long periods of time but is hard to data-mine; digital media is easier to data-mine but requires managing the evolution of media and resource planning over time
SDSC Cyberinfrastructure Community Resources • DATA ENVIRONMENT • 1 PB storage-area network (SAN) • 10 PB StorageTek tape library • DB2, Oracle, MySQL • Storage Resource Broker (SRB) • HPSS • 72-CPU Sun Fire 15K • 96-CPU IBM p690s • Support for 60+ community data collections and databases • Data management, mining, analysis, and preservation • http://datacentral.sdsc.edu/ • COMPUTE SYSTEMS • DataStar: 2,396 Power4+ processors in IBM p655 and p690 nodes, 10 TB total memory, up to 2 GB/s I/O to disk • TeraGrid Cluster: 512 Itanium2 IA-64 processors, 1 TB total memory • Intimidata: only academic IBM Blue Gene system, 2,048 PowerPC processors, 128 I/O nodes • http://www.sdsc.edu/user_services/ • SCIENCE AND TECHNOLOGY STAFF, SOFTWARE, SERVICES • User services • Application/community collaborations • Education and training • SDSC Synthesis Center • Community software, toolkits, portals, codes • http://www.sdsc.edu/
Thank You kamratha@sdsc.edu www.sdsc.edu