One Hundred Years of Data
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair, UC San Diego
The Digital World: Entertainment, Shopping, Information
How much Data is There?*
• 1 low-resolution photo = 100 KiloBytes
• 1 novel = 1 MegaByte
• iPod Shuffle (up to 120 songs) = 512 MegaBytes
• Printed materials in the Library of Congress = 10+ TeraBytes
• 1 human brain at the micron level = 1 PetaByte
• SDSC HPSS tape archive = 6 PetaBytes
• All worldwide information in one year = 2 ExaBytes
* Rough/average estimates
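To make the scale comparison above concrete, here is a minimal Python sketch (an illustrative addition, not part of the original talk) that converts the slide's rough estimates to bytes, assuming decimal (SI) units, and sorts them from smallest to largest:

```python
# Rough scale comparison using the estimates from the slide above.
# All figures are approximate, as the slide itself notes.

UNITS = {"KB": 10**3, "MB": 10**6, "TB": 10**12, "PB": 10**15, "EB": 10**18}

estimates = {
    "1 low-resolution photo":                 100 * UNITS["KB"],
    "1 novel":                                  1 * UNITS["MB"],
    "iPod Shuffle (up to 120 songs)":         512 * UNITS["MB"],
    "Printed materials, Library of Congress":  10 * UNITS["TB"],
    "1 human brain at the micron level":        1 * UNITS["PB"],
    "SDSC HPSS tape archive":                   6 * UNITS["PB"],
    "All worldwide information in one year":    2 * UNITS["EB"],
}

# Print the items in order of size to show the ~13 orders of magnitude spanned.
for name, size in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name:42s} ~{size:.1e} bytes")
```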
Research, Education, and Data
• Arts and Humanities: Japanese Art Images – 70.6 GB
• Life Sciences: JCSG/SLAC – 15.7 TB
• Astronomy: NVO – 100+ TB
• Engineering: TeraBridge – 800 GB
• Geosciences: SCEC – 153 TB
• Physics: Projected LHC Data – 10 PB/year
Data-oriented Science and Engineering Applications Driving the Next Generation of Technology Challenges
[Chart: examples and application subclasses plotted by data needs (more bytes) versus compute needs (more FLOPS). Data-oriented science and engineering applications (TeraShake, PDB applications, NVO) sit high on the data axis; traditional HPC applications (molecular modeling) sit high on the compute axis; home, lab, campus, and desktop applications (Everquest, Quicken) sit low on both.]
Data Stewardship • What is required for stewardship of data for the science and engineering community? • Who needs it? • How does data drive new discovery? • What facilities are required? • What’s involved in preserving data for the foreseeable future?
The Protein Data Bank (PDB): A Resource for the Global Biology Community
• Largest repository on the planet for structural information about proteins
• Provides free worldwide public access 24/7 to accurate protein data
• Maintained by the Worldwide PDB (wwPDB); administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman
• Growth of yearly/total structures in the PDB: from roughly 500 or fewer structures per year in 1976-1990 to more than 5,000 structures in 2006 alone, and more than 36,000 total structures
• Molecule of the Month: glucose oxidase, an enzyme used to make the measurement of glucose (e.g., in monitoring diabetes) fast, easy, and inexpensive
How Does the PDB Work?
• The PDB is accessible over the Internet and serves 10,000 users a day (> 200,000 hits)
• H. Berman estimated that in 2005, more than $1B of research funding was spent to generate the data that were collected, curated, and distributed by the PDB
• Data are collected, annotated, and validated at one of 3 worldwide PDB sites (Rutgers in the US); the infrastructure required includes 20 highly trained personnel and significant computational, storage, and networking capabilities
• The PDB portal is served by a cluster in the SDSC machine room; the system is designed with multiple failover capabilities to ensure 24/7 access and 99.99% uptime, and requires 20 TB of storage at SDSC
[Architecture diagram: a web user interface (SearchLite, SearchFields, Query Result Browser, Structure Explorer) and an FTP download tree sit above a database integration layer connecting the core DB, flat files, derived data, BMCD, and keyword search, plus a CORBA interface for remote applications; new queries and new tools are added over time.]
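To give a concrete sense of the "free worldwide public access 24/7" described above, here is a minimal sketch that fetches one structure file from today's RCSB PDB download service; the download URL and the example entry ID (4HHB, hemoglobin) are assumptions for illustration and reflect the current service rather than the 2006-era portal shown in the diagram.

```python
# Minimal sketch: retrieve one PDB entry over HTTP and do a quick sanity check.
# Assumption: the RCSB download host (files.rcsb.org) and entry "4HHB" are
# illustrative choices, not part of the original slide.
import urllib.request

entry_id = "4HHB"  # hemoglobin, used here only as an example
url = f"https://files.rcsb.org/download/{entry_id}.pdb"

with urllib.request.urlopen(url) as response:
    pdb_text = response.read().decode("utf-8")

# Count ATOM records as a rough check that a real structure came back.
atom_lines = [line for line in pdb_text.splitlines() if line.startswith("ATOM")]
print(f"{entry_id}: {len(atom_lines)} ATOM records")
```

The point of the sketch is only that any user, anywhere, can retrieve curated structure data over the public Internet; behind that simple request sits the curation pipeline, failover design, and 20 TB of storage described above.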
Supporting and Sustaining the PDB
• Consortium funding (NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS)
• Industrial support (Advanced Chemistry Development Inc., IBM, Sybase, Compaq, Silicon Graphics, Sun Microsystems)
• Multiple sites: wwPDB – RCSB (USA), PDBj (Japan), MSD-EBI (Europe)
• Tool development: data extraction and preparation, data format conversion, data validation, dictionary and data management, tools supporting the OMG CORBA standard for macromolecular structure data, etc.
• Path-forward tool development, graphical user interfaces: Ligand – what other entries contain this? Chain – what other entries have chains with > 90% sequence identity? Residue – what is the environment of this residue?
Data Stewardship • What is required for stewardship of data for the science and engineering community? • Who needs it? • How does data drive new discovery? • What facilities are required? • What’s involved in preserving data for the foreseeable future?
Earthquake Simulations: How Dangerous Is the San Andreas Fault?
• Major earthquakes on the San Andreas Fault, 1680-present: 1680 (M 7.7), 1857 (M 7.8), 1906 (M 7.8) – when and where is the next?
• Researchers use geological, historical, and environmental data to simulate massive earthquakes; these simulations are critical for understanding seismic movement and assessing potential impact
• Simulation results provide new scientific information enabling better estimation of seismic risk; emergency preparation, response, and planning; and design of the next generation of earthquake-resistant structures
• These results can help save many lives and billions of dollars in economic losses
TeraShake Simulation: simulation of a magnitude 7.7 earthquake on the southern (lower) San Andreas Fault
• Physics-based dynamic source model – simulation on a mesh of 1.8 billion cubes with a spatial resolution of 200 m
• Builds on 10 years of data and models from the Southern California Earthquake Center
• Simulated the first 3 minutes of a magnitude 7.7 earthquake; 22,728 time steps of 0.011 second each
• Simulation generates 45+ TB of data
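As a rough cross-check on the mesh size quoted above, the following Python sketch reproduces the ~1.8 billion cubes; the 200 m spacing comes from the slide, while the 600 km x 300 km x 80 km simulation volume is an assumption introduced here for illustration.

```python
# Back-of-the-envelope mesh-size check for TeraShake.
# Assumption: a 600 x 300 x 80 km simulation volume (illustrative only);
# the 200 m spacing and ~1.8 billion cubes come from the slide.
dx = 200.0                   # spatial resolution in meters
domain_km = (600, 300, 80)   # assumed domain dimensions in km

cells_per_axis = [int(d * 1000 / dx) for d in domain_km]  # 3000, 1500, 400
total_cells = 1
for n in cells_per_axis:
    total_cells *= n

print(cells_per_axis, f"-> {total_cells / 1e9:.1f} billion cubes")  # ~1.8 billion
```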
SCEC Data Requirements: resources must support a complicated orchestration of computation and data movement
• 240 processors on SDSC DataStar for 5 days, with 1 TB of main memory
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec to the parallel file system
• 10-20 TB of data archived per day
• "Fat" nodes with 256 GB of memory on DataStar for pre-processing and post-run visualization
• Data parking of 100s of TBs for many months
• The next-generation simulation will require even more resources: researchers plan to double the temporal/spatial resolution of TeraShake
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." – Bernard Minster, Scripps Institution of Oceanography
Behind the Scenes – Enabling Infrastructure for TeraShake
• Computers and Systems
  • 80,000 hours on 240 processors of DataStar
  • 256 GB-memory p690 used for testing; p655s used for the production run; TeraGrid used for porting
  • 30 TB global parallel file system (GPFS)
  • Run-time 100 MB/s data transfer from GPFS to SAM-QFS
  • 27,000 hours of post-processing for high-resolution rendering
• People
  • 20+ people involved in information technology support
  • 20+ people involved in geoscience modeling and simulation
• Data Storage
  • 47 TB archival tape storage on Sun StorEdge SAM-QFS
  • 47 TB backup on the High Performance Storage System (HPSS)
  • SRB collection with 1,000,000 files
• Funding
  • SDSC cyberinfrastructure resources for TeraShake funded by NSF
  • The Southern California Earthquake Center is an NSF-funded geoscience research and development center
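A back-of-the-envelope Python sketch, using only the figures quoted above, shows that moving 47 TB over the 5-day run corresponds roughly to the 100 MB/s run-time GPFS-to-SAM-QFS transfer rate listed, and how long the archive takes at the quoted 10-20 TB/day:

```python
# Back-of-the-envelope rates derived from the TeraShake figures above.
output_tb = 47.0               # total output archived (TB)
run_days = 5.0                 # DataStar run length (days)
archive_tb_per_day = (10, 20)  # quoted archival rate range (TB/day)

# Average sustained output rate over the run, in MB/s (decimal units).
avg_rate_mb_s = output_tb * 1e6 / (run_days * 24 * 3600)
print(f"Average output rate over the run: ~{avg_rate_mb_s:.0f} MB/s")  # ~109 MB/s

# Time to drain the full output to tape at the quoted archival rates.
for rate in archive_tb_per_day:
    print(f"Days to archive {output_tb:.0f} TB at {rate} TB/day: {output_tb / rate:.1f}")
```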
Data Partner: The Data-Oriented Supercomputer
• A "balanced system" provides support for tightly coupled and I/O-intensive applications
• Grid platforms are not a strong option: data must be local to computation, I/O rates exceed WAN capabilities, and continuous, frequent I/O is latency-intolerant
• Scalability: need high-bandwidth, large-capacity local parallel file systems and archival storage
[Plot: DoD applications plotted by spatial versus temporal locality on an IBM Power3, with Linpack, Overflow, STREAM, and RandomAccess as reference points; stride-0 data ignored.]
Data Stewardship • What is required for stewardship of data for the science and engineering community? • Who needs it? • How does data drive new discovery? • What facilities are required? • What’s involved in preserving data for the foreseeable future?
National Data Cyberinfrastructure Resources at SDSC
• Data collections, archival, and storage systems – support for community data collections and databases (http://datacentral.sdsc.edu/)
  • 1.4 PB storage-area network (SAN)
  • 6 PB StorageTek tape library
  • HPSS and SAM-QFS archival systems
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • 72-CPU Sun Fire 15K
  • IBM p690s – HPSS, DB2, etc.
• Data-oriented compute systems (http://www.sdsc.edu/user_services/)
  • DataStar: 15.6 TFLOPS Power4+ system, 7.125 TB total memory, up to 4 GB/s I/O to disk, 115 TB GPFS filesystem
  • TeraGrid Cluster: 524 Itanium2 IA-64 processors, 2 TB total memory, plus 12 2-way data nodes
  • Blue Gene Data: first academic IBM Blue Gene system, 2,048 PowerPC processors, 128 I/O nodes
• Science and technology staff, software, and services – data management, mining, analysis, and preservation (http://www.sdsc.edu/)
  • User services
  • Application/community collaborations
  • Education and training
  • SDSC Synthesis Center
  • Data-oriented community software, toolkits, portals, and codes
National Data Repository: SDSC DataCentral
• First broad program of its kind to support national research and community data collections and databases
• "Data allocations" provided on SDSC resources
• Data collection and database hosting
• Batch-oriented access and collection management services
• Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.
• Web-based portal access
Working with Data: Data Integration for New Discovery
• Data integration in the biosciences: software to access and federate disciplinary databases for users across scales, from atoms (medicinal chemistry) through bio-polymers (genomics), organelles (proteomics), cells (cell biology), organs (physiology), and organisms (anatomy)
• Data integration in the geosciences: complex "multiple-worlds" mediation across geophysical, geochronologic, and geochemical data, foliation maps, and geologic maps
• Example questions: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What is the distribution and U/Pb zircon ages of A-type plutons in VA, and how does it relate to host rock structures?
Services, Tools, and Technologies Key for Data Integration and Management
• Data systems: SAM-QFS, HPSS, GPFS, SRB
• Data services: data migration/upload, usage, and support (SRB); database selection and schema design (Oracle, DB2, MySQL); database application tuning and optimization; portal creation and collection publication; data analysis (e.g., Matlab) and mining (e.g., WEKA)
• DataCentral data-oriented toolkits and tools: Biology Workbench, Montage (astronomy mosaicking), Kepler (workflow management), Vista volume renderer (visualization), etc.
100 Years of Data: What's involved in preserving data for the foreseeable future?
Who Cares about Digital Preservation?
• The public sector: digital state and federal records
• The private sector: the entertainment industry
• Researchers and educators
• Libraries: e.g., the UCSD Libraries
Many Science, Cultural, and Official Collections Must Be Sustained for the Foreseeable Future
• Critical collections:
  • Community reference data collections (e.g., the Protein Data Bank)
  • Irreplaceable collections (e.g., the Shoah collection)
  • Longitudinal data (e.g., PSID – Panel Study of Income Dynamics)
• No plan for preservation often means that data are lost or damaged
"… the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come." – "Preserving Our Digital Heritage", Library of Congress
Key Challenges for Digital Preservation
• What should we preserve? What materials must be "rescued"? How do we plan for preservation of materials by design?
• How should we preserve it? Formats; storage media; stewardship – who is responsible, and for how long?
• Who should pay for preservation? The content generators? The government? The users?
• Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine; digital media is easier to data-mine but requires management of evolving media and resource planning over time.
Preservation and Risk: less risk means more replicas, more resources, and more people
Chronopolis: An Integrated Approach to Long-term Digital Preservation
• New initiative: a consortium of SDSC, the UCSD Libraries, NCAR, and UMd working together on long-term preservation of digital collections
• Chronopolis provides a comprehensive approach to infrastructure for long-term preservation, integrating collection ingestion, access and services, and research and development for new functionality and adaptation to evolving technologies
• The business model, data policies, and management issues are critical to the success of the infrastructure
Chronopolis – Replication and Distribution
• 3 replicas of valuable collections are considered reasonable mitigation for the risk of data loss
• The Chronopolis Consortium will store 3 copies of preservation collections:
  • "Bright copy" – Chronopolis site supports ingestion, collection management, and user access
  • "Dim copy" – Chronopolis site supports a remote replica of the bright copy and supports user access
  • "Dark copy" – Chronopolis site supports a reference copy that may be used for disaster recovery but offers no user access
• Each site may play different roles for different collections
[Diagram: Chronopolis federation architecture across the SDSC, NCAR, and UMd sites, each holding a mix of bright, dim, and dark copies of collections C1 and C2.]
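As a purely hypothetical illustration of the bright/dim/dark policy described above (the class, site assignments, and collection names below are invented for this writeup and are not Chronopolis software), the replica roles might be represented like this:

```python
# Hypothetical sketch of the bright/dim/dark replica roles described above.
# Site-to-collection assignments are illustrative only.
from dataclasses import dataclass

@dataclass
class Replica:
    site: str
    role: str  # "bright": ingest + access; "dim": remote replica + access; "dark": disaster recovery only

    @property
    def user_accessible(self) -> bool:
        # Only bright and dim copies serve users; dark copies are held in reserve.
        return self.role in ("bright", "dim")

collections = {
    "C1": [Replica("SDSC", "bright"), Replica("NCAR", "dim"), Replica("UMd", "dark")],
    "C2": [Replica("UMd", "bright"), Replica("SDSC", "dim"), Replica("NCAR", "dark")],
}

for name, replicas in collections.items():
    access_sites = [r.site for r in replicas if r.user_accessible]
    print(f"{name}: {len(replicas)} copies, user access via {access_sites}")
```

Note how each site plays a different role for different collections, exactly as the slide describes.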
Data in the News
Newsworthy items about data:
• "Bank Data Loss May Affect Officials", Boston Globe, February 27, 2005 – data tapes lost with information on more than 60 U.S. senators and others
• "Data Loss Bug Afflicts Linux", ZDNet News, December 6, 2002 – "Programmers have found a bug … that, under unusual circumstances, could cause systems to drop data. …"
Newsworthy items about supercomputing:
• "Simulating Earthquakes for Science and Society", HPCWire, January 27, 2006 – simulation of a 7.7 earthquake on the lower San Andreas Fault
• "Japanese supercomputer simulates Earth", BBC, April 26, 2002 – "A new Japanese supercomputer … was switched on this month and immediately outclassed its nearest rival."
Data Preservation Requires a Different Sustainability Model than Supercomputing
The Branscomb Pyramid for Computing (circa 1993)
• High-end – Facilities: "leadership-class" facilities maintained by national labs and centers, with a substantive professional workforce. Applications: community codes and professional software maintained by large groups of professionals (NASTRAN, PowerPoint, WRF, Everquest)
• Mid-range (campus, research lab) – Facilities: mid-range university and research lab facilities maintained by professionals and non-professionals. Applications: community software and highly-used project codes developed and maintained by some professionals and academics (CHARMM, GAMESS, etc.)
• Small-scale (home) – Facilities: private, home, and personal facilities supported by users or their proxies. Applications: research and individual codes supported by developers or their proxies
The "Berman" Pyramid for Data (circa 2006)
• High-end – Facilities: national-scale data repositories, archives, and libraries; high-capacity, high-reliability environments maintained by a professional workforce. Collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
• Mid-range (campus, library, data center) – Facilities: local libraries and data centers, and commercial data storage; medium capacity, medium-high reliability, maintained by professionals. Collections: research data collections developed and maintained by some professionals and academics
• Small-scale (home) – Facilities: private repositories supported by users or their proxies; low-medium reliability, low capacity. Collections: personal data collections supported by developers or their proxies
What's the Funding Model for the Data Pyramid?
• High-end – Facilities: national-scale data repositories, archives, and libraries (high capacity, high reliability, professional workforce). Collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
• Mid-range – Facilities: local libraries and data centers; commercial data storage (medium capacity, medium-high reliability, maintained by professionals). Collections: research data collections developed and maintained by some professionals and academics
• Small-scale – Facilities: private repositories supported by users or their proxies (low-medium reliability, low capacity). Collections: personal data collections supported by developers or their proxies. Here, commercial opportunities are emerging
Commercial Opportunities at the Low End • Cheap commercial data storage is moving us from a “napster model” (data is accessible and free) to an “iTunes model” (data is accessible and inexpensive)
Amazon S3 (Simple Storage Service)
• Storage for rent:
  • Storage is $0.15 per GB per month
  • $0.20 per GB data transfer (to and from)
  • Write, read, and delete objects of 1 GB-5 GB (the number of objects is unlimited); access controlled by the user
• For $2.00+, you can store for one year: lots of high-resolution family photos, multiple videos of your children's recitals, personal documentation equivalent to up to 1,000 novels, etc.
• Should we store the NVO with Amazon S3? The National Virtual Observatory (NVO) is a critical reference collection for the astronomy community, holding data from the world's large telescopes and sky surveys.
A Thought Experiment
• What would it cost to store the SDSC NVO collection (100 TB) on Amazon?
  • 100,000 GB × $2 (1 ingest, no accesses, plus storage for a year) = $200K/year
  • 100,000 GB × $3 (1 ingest, an average of 5 accesses per GB stored, plus storage for a year) = $300K/year
• Not clear:
  • How many copies Amazon stores
  • Whether the format is well-suited for the NVO
  • Whether the usage model would make the costs of data transfer, ingest, access, etc. infeasible
  • Whether Amazon constitutes a "trusted repository"
  • What happens to your data when you stop paying, etc.
• What about the CERN LHC collection (10 PB/year)?
  • 10,000,000 GB × $2 (1 ingest, no accesses per item, plus storage for a year) = $20M/year
(The sketch below shows how these estimates follow from the quoted prices.)
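These estimates follow directly from the prices quoted on the previous slide ($0.15 per GB per month of storage plus $0.20 per GB per transfer). A short Python sketch reproducing the arithmetic:

```python
# Reproduce the slide's back-of-the-envelope S3 estimates from the quoted 2006 prices.
STORAGE_PER_GB_MONTH = 0.15   # $ per GB per month
TRANSFER_PER_GB = 0.20        # $ per GB transferred (in or out)

def yearly_cost(gb, transfers_per_gb):
    """One year of storage plus the given number of transfers per GB stored."""
    return gb * (12 * STORAGE_PER_GB_MONTH + transfers_per_gb * TRANSFER_PER_GB)

nvo_gb = 100_000       # ~100 TB NVO collection
lhc_gb = 10_000_000    # ~10 PB/year of projected LHC data

print(f"NVO, 1 ingest, no accesses:    ${yearly_cost(nvo_gb, 1):,.0f}/year")   # ~$200K
print(f"NVO, 1 ingest + 5 accesses/GB: ${yearly_cost(nvo_gb, 6):,.0f}/year")   # ~$300K
print(f"LHC, 1 ingest, no accesses:    ${yearly_cost(lhc_gb, 1):,.0f}/year")   # ~$20M
```

As the slide's "not clear" list emphasizes, these figures cover raw storage and transfer only; replication, curation, trusted-repository guarantees, and long-term access are not included.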
What Is the Business Model for the Upper Levels of the Data Pyramid?
• Top level – Facilities: national-scale data repositories, libraries, and archives. Collections: critical, valuable, and irreplaceable reference collections; very large collections. Business model: ?
• Middle level – Facilities: libraries and data centers. Collections: important research collections; large data collections. Business model: ?
• Low end – Facilities: personal repositories. Collections: personal data collections. Business model: commercial opportunities
Partnership Opportunities at the Middle Level
• Collections: important research collections and large data collections. Facilities: libraries and data centers. Opportunities here for creative public, private, and philanthropic partnerships (with commercial opportunities continuing at the low end)
• Creative investment opportunities:
  • Short-term investments: building collections, website and tool development, finite support for facilities and collections, transition support for media, formats, etc.
  • Longer-term investments: maintaining collections, maintaining facilities, evolving and maintaining software
• Public/private partnerships must ensure reliability and trust: Do you trust Amazon with your data? Google? Your university library? Your public library?
• How much are content generators willing to pay to store their data? How much are users willing to pay to use the data?
Public Support Needed at the Top (National-Scale) Level
• Collections: critical, valuable, and irreplaceable reference collections; very large data collections. Facilities: national-scale libraries, archives, and data repositories
• National-scale collections and facilities constitute critical infrastructure for the academic, public, and private sectors
• National-scale facilities must:
  • Be trusted repositories
  • Be highly reliable
  • Provide high-capacity, state-of-the-art storage
  • Have a 5-year, 50-year, 100+-year plan
  • Serve a national community, etc.
• Public leadership, funding, and engagement are critical for success
Thank You berman@sdsc.edu www.sdsc.edu