A Grand Challenge for the Information Age. Dr. Francine Berman Director, San Diego Supercomputer Center Professor and High Performance Computing Endowed Chair, UC San Diego. The Fundamental Driver of the Information Age is Digital Data. Education. Entertainment. Shopping. Health.
A Grand Challenge for the Information Age
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair, UC San Diego
The Fundamental Driver of the Information Age is Digital Data
• Education
• Entertainment
• Shopping
• Health
• Information
• Business
Digital Data Critical for Research and Education
Data Integration: complex “multiple-worlds” mediation
• Data at multiple scales in the Biosciences
  – What genes are associated with cancer?
  – What parts of the brain are responsible for Alzheimer’s?
  – Disciplinary databases: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry
  – Scales: Organisms, Organs, Cells, Organelles, Bio-polymers, Atoms
  – Data access and use by users
• Data from multiple sources in the Geosciences
  – Where should we drill for oil?
  – What is the impact of global warming?
  – How are the continents shifting?
  – Sources: Geo-Physical, Geo-Chronologic, Geo-Chemical, Foliation Map, Geologic Map
Today’s Presentation
• Data Cyberinfrastructure Today
  – Designing and developing infrastructure to enable today’s data-oriented applications
• Challenges in Building and Delivering Capable Data Infrastructure
• Sustainable Digital Preservation
  – Grand Challenge for the Information Age
Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today’s Data-Oriented Applications
Today’s Data-oriented Applications Span the Spectrum
Designing Infrastructure for Data:
• Data and High Performance Computing
• Data and Grids
• Data and Cyberinfrastructure Services
[Figure: application classes placed along three axes, DATA (more BYTES), COMPUTE (more FLOPS), and NETWORK (more BW): home, lab, campus, and desktop applications; data-intensive applications; compute-intensive HPC applications; data-intensive and compute-intensive HPC applications; Grid applications; Data Grid applications]
Data and High Performance Computing
• For many applications, “balanced systems” must be developed to support codes that are both data-intensive and compute-intensive. These are codes for which:
  – Grid platforms are not a strong option
  – Data must be local to computation
  – I/O rates exceed WAN capabilities
  – Continuous and frequent I/O is latency-intolerant
• Scalability is key
• Need high-bandwidth and large-capacity local parallel file systems, archival storage
[Figure: axes DATA (more BYTES) and COMPUTE (more FLOPS), showing data-intensive applications, compute-intensive HPC applications, and data-intensive and compute-intensive HPC applications]
Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands
Information courtesy of the Southern California Earthquake Center
Data and HPC: What you see is what you’ve measured
• Three systems using the same processor and number of processors
  – AMD Opteron, 64 processors, 2.2 GHz
  – Difference is in the way the processors are interconnected
• Systems compared:
  – Cray XD1: custom interconnect
  – Dalco Linux Cluster: Quadrics interconnect
  – Sun Fire Cluster: Gigabit Ethernet interconnect
• HPC Challenge benchmarks measure different machine characteristics
  – Linpack and matrix multiply are computationally intensive
  – PTRANS (matrix transpose), RandomAccess, bandwidth/latency tests, and other tests begin to reflect stress on the memory system
• FLOPS alone are not enough. Appropriate benchmarks are needed to rank and bring visibility to the more balanced machines critical for today’s applications.
Information courtesy of Jack Dongarra
Data and Grids
• Data applications were some of the first applications which
  – required Grid environments
  – could naturally tolerate longer latencies
• Grid model supports key data application profiles
  – Compute at site A with data from site B
  – Store data collection at site A with copies at sites B and C
  – Operate instrument at site A, move data to site B for storage, post-processing, etc.
• CERN data providing key driver for grid technologies
Data Services Key for TeraGrid Science Gateways
• Science Gateways provide common application interface for science communities on TeraGrid
• Data services key for Gateway communities
  – Analysis
  – Visualization
  – Management
  – Remote access, etc.
[Images: NVO, LEAD, GridChem]
Information and images courtesy of Nancy Wilkins-Diehr
Unifying Data over the Grid – the TeraGrid GPFS-WAN Effort
• User wish list
  – Unlimited data capacity (everyone’s aggregate storage almost looks like this)
  – Transparent, high-speed access anywhere on the Grid
  – Automatic archiving and retrieval
  – No latency
• TeraGrid GPFS-WAN effort focuses on providing “infinite” (SDSC) storage over the grid
  – Looks like local disk to grid sites
  – Uses automatic migration with a large cache to keep files always “online” and accessible
  – Data automatically archived without user intervention
Information courtesy of Phil Andrews
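The idea of a large cache in front of an archive, with automatic migration and transparent recall, can be illustrated with a toy model. The Python sketch below is only that: the class name `MigratingStore`, the least-recently-used eviction policy, and counting capacity in files rather than bytes are all assumptions made for illustration, and none of it describes how GPFS-WAN is actually implemented.

```python
from collections import OrderedDict


class MigratingStore:
    """Toy model: a bounded disk cache in front of an unbounded archive."""

    def __init__(self, cache_capacity: int):
        self.cache_capacity = cache_capacity  # assumed: counted in files, not bytes
        self.cache = OrderedDict()            # name -> data, least recently used first
        self.archive = {}                     # name -> data, stands in for tape

    def write(self, name: str, data: bytes) -> None:
        # Every write is archived without user intervention.
        self.archive[name] = data
        self._cache_put(name, data)

    def read(self, name: str) -> bytes:
        if name in self.cache:
            # Cache hit: mark the file as most recently used.
            self.cache.move_to_end(name)
            return self.cache[name]
        # Cache miss: recall from the archive so the file still looks "online".
        data = self.archive[name]  # raises KeyError if the file was never written
        self._cache_put(name, data)
        return data

    def _cache_put(self, name: str, data: bytes) -> None:
        self.cache[name] = data
        self.cache.move_to_end(name)
        # Evict least recently used files; they remain safe in the archive.
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)


if __name__ == "__main__":
    store = MigratingStore(cache_capacity=2)
    for name in ("a", "b", "c"):
        store.write(name, name.upper().encode())
    print("cached:", list(store.cache))
    print("recalled:", store.read("a"))
```

From the user's point of view, `read` behaves the same whether the file is cached or archived; only the latency would differ in a real system, which is why the wish-list item "no latency" is the hard one.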
Data Services – Beyond Storage to Use
What services do users want?
• How can I combine my data with my colleague’s data?
• How should I organize my data?
• How do I make sure that my data will be there when I want it?
• What are the trends and what is the noise in my data?
• My data is confidential; how do I make sure that it is seen/used only by the right people?
• How should I display my data?
• How can I make my data accessible to my collaborators?
Services: Integrated Environment Key to Usability
• Database selection and schema design
• Portal creation and collection publication
• Data analysis
• Data mining
• Data hosting
• Preservation services
• Domain-specific tools
  – Biology Workbench
  – Montage (astronomy mosaicking)
  – Kepler (workflow management)
• Data visualization
• Data anonymization, etc.
[Figure: Integrated Infrastructure layers: Data Access; Data Manipulation; Data Management (file systems, database systems, collection management, data integration, etc.); Data Storage. Many data sources: computers, instruments, sensor-nets, simulation, analysis, modeling, visualization]
Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data
• Broad program to support research and community data collections and databases
• DataCentral services include:
  – Public data collections and database hosting
  – Long-term storage and preservation (tape and disk)
  – Remote data management and access (SRB, portals)
  – Data analysis, visualization, and data mining
  – Professional, qualified 24/7 support
• DataCentral resources include:
  – 1 PB on-line disk
  – 25 PB StorageTek tape library capacity
  – 540 TB Storage-area Network (SAN)
  – DB2, Oracle, MySQL
  – Storage Resource Broker
  – GPFS-WAN with 700 TB
[Images: PDB – 28 TB; Web-based portal access]
Data Visualization is Key
• SCEC earthquake simulations
• Visualization of cancer tumors
• Prokudin-Gorskii historical images
Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, U.S. Library of Congress
Infrastructure Should be Non-memorable
• Good infrastructure should be
  – Predictable
  – Pervasive
  – Cost-effective
  – Easy-to-use
  – Reliable
  – Unsurprising
• What’s required to build and provide useful, usable, and capable data Cyberinfrastructure?
Building Capable Data Cyberinfrastructure: Incorporating the “ilities”
• Scalability
• Interoperability
• Reliability
• Capability
• Sustainability
• Predictability
• Accessibility
• Responsibility
• Accountability
• …
Reliability
• How can we maximize data reliability?
  – Replication, UPS systems, heterogeneity, etc.
• How can we measure data reliability?
  – Network availability = 99.999% uptime (“5 nines”)
  – What is the equivalent number of “0’s” for data reliability?
Reliability: What can go wrong
Information courtesy of Reagan Moore
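To make the “5 nines” figure concrete, the Python sketch below converts an availability expressed as a number of nines into a yearly downtime budget. The second function is a purely hypothetical analogue for data: it assumes each stored object has some independent annual loss probability, which is one possible framing of the slide's open question and not a metric taken from the talk. All names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted by an availability with `nines` nines.

    nines=5 corresponds to 99.999% uptime.
    """
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def expected_annual_losses(n_objects: int, annual_loss_probability: float) -> float:
    """Hypothetical data analogue: expected number of objects lost per year,
    assuming independent losses with the same probability per object."""
    return n_objects * annual_loss_probability


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
    print("expected losses:", expected_annual_losses(1_000_000_000, 1e-9))
```

Five nines allows roughly 5.26 minutes of downtime per year. The data case is harder to state because loss is permanent rather than a temporary outage, which is part of why the slide poses it as a question.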
Responsibility and Accountability
• What are reasonable expectations between users and repositories?
• What are reasonable expectations between federated partner repositories?
• What are appropriate models for evaluating repositories?
• What incentives promote good stewardship? What should happen if/when the system fails?
• Who owns the data?
• Who takes care of the data?
• Who pays for the data?
• Who can access the data?
Good Data Infrastructure Incurs Real Costs
Capability Costs
• Reliability increased by up-to-date and robust hardware and software for
  – Replication (disk, tape, geographically)
  – Backups, updates, syncing
  – Audit trails
  – Verification through checksums, physical media, network transfers, copies, etc.
• Data professionals needed to facilitate
  – Infrastructure maintenance
  – Long-term planning
  – Restoration and recovery
  – Access, analysis, preservation, and other services
  – Reporting, documentation, etc.
Capacity Costs
• Most valuable data must be replicated
• SDSC research collections have been doubling every 15 months
• SDSC storage is 25 PB and counting. Data is from supercomputer simulations, digital library collections, etc.
Information courtesy of Richard Moore
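A doubling time of 15 months translates into a yearly growth factor, which is often easier to plan against. The Python sketch below works through that arithmetic. The function names are illustrative, and applying the doubling rate to a 25 PB starting point is an assumption made only for the example: the slide states that research collections double every 15 months, not that total storage does.

```python
def annual_growth_factor(doubling_months: float = 15.0) -> float:
    """Growth factor over 12 months for a given doubling time in months."""
    return 2.0 ** (12.0 / doubling_months)


def projected_size(current_size: float, months_ahead: float,
                   doubling_months: float = 15.0) -> float:
    """Size after `months_ahead` months, assuming steady exponential growth."""
    return current_size * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    print(f"annual growth factor: {annual_growth_factor():.3f}")
    # Illustrative only: 25 PB is used as a hypothetical starting size.
    for years in (1, 2, 5):
        size = projected_size(25.0, 12 * years)
        print(f"after {years} year(s): {size:.1f} PB")
```

Doubling every 15 months corresponds to growth of roughly 74% per year, so capacity costs compound quickly even before replication is taken into account.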
Economic Sustainability
• Making Infinite Funding Finite
  – Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
  – Creative partnerships help create sustainable economic models
• Funding models shown:
  – Relay funding
  – User fees, recharges
  – Consortium support
  – Endowments
  – Hybrid solutions
[Image: Geisel Library at UCSD]
How much Digital Data is there?
• 5 exabytes of digital information produced in 2003
• 161 exabytes of digital information produced in 2006
  – 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.)
  – 75% is replicated (emails forwarded, backed-up transaction records, movies in DVD format)
• 1 zettabyte aggregate digital information projected for 2010
Points of reference:
• 1 novel = 1 MegaByte
• iPod (up to 20K songs) = 80 GB
• U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital”
• SDSC HPSS tape archive = 25+ PetaBytes
Source: “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010”, IDC Whitepaper, March 2007
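The three figures above imply very steep year-over-year growth. The Python sketch below computes the compound annual growth rate between them and the born-digital share for 2006. It assumes decimal units (1 zettabyte = 1000 exabytes) and that the 2003, 2006, and 2010 figures are measured comparably; the function name is illustrative.

```python
EXABYTES_PER_ZETTABYTE = 1000.0  # assumes decimal (SI) prefixes


def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    eb_2003, eb_2006 = 5.0, 161.0
    eb_2010 = 1.0 * EXABYTES_PER_ZETTABYTE

    print(f"2003 to 2006: {cagr(eb_2003, eb_2006, 3):.0%} per year")
    print(f"2006 to 2010: {cagr(eb_2006, eb_2010, 4):.0%} per year")

    born_digital_2006 = 0.25 * eb_2006
    replicated_2006 = 0.75 * eb_2006
    print(f"2006 born digital: {born_digital_2006:.2f} EB, replicated: {replicated_2006:.2f} EB")
```

Under these assumptions the 2006 to 2010 projection works out to a bit under 60% growth per year, and the born-digital quarter of the 2006 total is about 40 exabytes.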
How much Storage is there?
• 2007 is the “crossover year” where the amount of digital information is greater than the amount of available storage
• Given the projected rates of growth, we will never have enough space again for all digital information
Source: “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010”, IDC Whitepaper, March 2007
Focus for Preservation: the “most valuable” data
• What is “valuable”?
  – Community reference data collections (e.g. UniProt, PDB)
  – Irreplaceable collections
  – Official collections (e.g. census data, electronic federal records)
  – Collections which are very expensive to replicate (e.g. CERN data)
  – Longitudinal and historical data
  – and others …
[Figure labels: Value, Cost, Time]
The Data Pyramid: A Framework for Digital Stewardship
• Preservation efforts should focus on collections deemed “most valuable”
• Key issues:
  – What do we preserve?
  – How do we guard against data loss?
  – Who is responsible?
  – Who pays? Etc.
Pyramid tiers (digital data collections and the repositories/facilities that hold them):
• National, International Scale: reference, nationally important, and irreplaceable data collections; National/International-scale data repositories, archives, and libraries
• “Regional” Scale: key research and community data collections; “Regional”-scale libraries and targeted data centers
• Local Scale: personal data collections; private repositories
Arrows alongside the pyramid: increasing value, increasing trust, increasing risk/responsibility, increasing stability, increasing infrastructure
Digital Collections of Community Value
• Key techniques for preservation: replication, heterogeneous support
[Figure: The Data Pyramid, with National/International Scale, “Regional” Scale, and Local Scale tiers]
A Conceptual Model for Preservation Data Grids
The Chronopolis Model
• Geographically distributed preservation data grid that supports long-term management and stewardship of, and access to, digital collections
• Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
• Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation
[Figure: Digital Information of Long-Term Value, supported by a Distributed Production Preservation Environment; Technology Forecasting and Migration; and Administration, Policy, Outreach]
Chronopolis Focus Areas and Demonstration Project Partners
• 2 Prototypes:
  – National Demonstration Project
  – Library of Congress Pilot Project
• Partners
  – SDSC/UCSD
  – U Maryland
  – UCSD Libraries
  – NCAR
  – NARA
  – Library of Congress
  – NSF
  – ICPSR
  – Internet Archive
  – NVO
• Chronopolis R&D, Policy, and Infrastructure focus areas:
  – Assessment of the needs of potential user communities and development of appropriate service models
  – Development of formal roles and responsibilities of providers, partners, users
  – Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
  – Development of appropriate cost and risk models for long-term preservation
  – Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
Demonstration Project information courtesy of Robert McDonald
National Demonstration Project – Large-scale Replication and Distribution
• Focus on supporting multiple, geographically distributed copies of preservation collections:
  – “Bright copy”: Chronopolis site supports ingestion, collection management, user access
  – “Dim copy”: Chronopolis site supports remote replica of bright copy and supports user access
  – “Dark copy”: Chronopolis site supports reference copy that may be used for disaster recovery but no user access
• Each site may play different roles for different collections
• Demonstration collections included:
  – National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey]
  – Copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB Web-accessible Data]
  – NCAR Observational Data [3 TB of Observational and Re-Analysis Data]
[Figure: Chronopolis federation architecture linking the Chronopolis sites NCAR, U Md, and SDSC; each site holds a mix of bright, dim, and dark copies of collections C1 and C2]
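The bright/dim/dark roles and the rule that a site may play different roles for different collections can be written down as a small data structure. The Python sketch below is a minimal illustration. The assignment of roles to NCAR, U Md, and SDSC is hypothetical (the slide text does not say which site holds which copy), and the consistency rules in `check_placement` (exactly one bright copy, at least one dark copy) are assumptions made for the example rather than stated Chronopolis policy.

```python
from enum import Enum


class Role(Enum):
    BRIGHT = "bright"  # ingestion, collection management, user access
    DIM = "dim"        # remote replica of the bright copy, user access
    DARK = "dark"      # reference copy for disaster recovery, no user access


# Which roles allow user access, as described on the slide.
USER_ACCESS = {Role.BRIGHT: True, Role.DIM: True, Role.DARK: False}

# Hypothetical placement: collection -> {site: role}.
PLACEMENT = {
    "C1": {"SDSC": Role.BRIGHT, "NCAR": Role.DIM, "U Md": Role.DARK},
    "C2": {"U Md": Role.BRIGHT, "SDSC": Role.DIM, "NCAR": Role.DARK},
}


def sites_with_user_access(collection: str) -> list:
    """Sites from which users can reach the given collection."""
    sites = PLACEMENT[collection]
    return sorted(site for site, role in sites.items() if USER_ACCESS[role])


def check_placement(placement: dict) -> list:
    """Return a list of problems under the assumed consistency rules."""
    problems = []
    for name, sites in placement.items():
        roles = list(sites.values())
        if roles.count(Role.BRIGHT) != 1:
            problems.append(f"{name}: expected exactly one bright copy")
        if Role.DARK not in roles:
            problems.append(f"{name}: no dark copy for disaster recovery")
    return problems


if __name__ == "__main__":
    for collection in PLACEMENT:
        print(collection, "user access at:", sites_with_user_access(collection))
    print("problems:", check_placement(PLACEMENT))
```

Keeping the role per collection, rather than per site, is what lets one site be the bright copy for one collection and the dark copy for another.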
SDSC/UCSD Libraries Pilot Project with U.S. Library of Congress
Goal: To “… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress’ requirements.”
• Historically important 600 GB Library of Congress image collection
  – Prokudin-Gorskii Photographs (Library of Congress Prints and Photographs Division), http://www.loc.gov/exhibits/empire/
  – Also a collection of web crawls from the Internet Archive
• Images over 100 years old with red, blue, green components (kept as separate digital files)
• SDSC stores 5 copies with dark archival copy at NCAR
• Infrastructure must support idiosyncratic file structure. Special logging and monitoring software developed so that both SDSC and Library of Congress could access information
Library of Congress Pilot Project information courtesy of David Minor
Pilot Projects Provided Invaluable Experience with Key Issues
• Technical Issues
  – How to address integrity, verification, provenance, authentication, etc.
• Legal/Policy Issues
  – Who is responsible?
  – Who is liable?
• Social Issues
  – What formats/standards are acceptable to the community?
  – How do we formalize trust?
• Infrastructure Issues
  – What kinds of resources (servers, storage, networks) are required?
  – How should they operate?
• Evaluation Issues
  – What is reliable?
  – What is successful?
• Cost Issues
  – What is cost-effective?
  – How can support be sustained over time?
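One common way to approach the integrity and verification question is a fixity check: record a cryptographic digest when an object is ingested, then periodically recompute it for every replica. The Python sketch below is a minimal illustration using SHA-256 from the standard library. It is an assumption-laden example, not the mechanism used in the pilot projects; the function names and the in-memory replicas are hypothetical stand-ins for files held at different sites.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(recorded_digest: str, replicas: dict) -> list:
    """Return the names of replicas whose content no longer matches
    the digest recorded at ingest time."""
    return sorted(
        name for name, data in replicas.items()
        if sha256_hex(data) != recorded_digest
    )


if __name__ == "__main__":
    original = b"example image bytes"
    recorded = sha256_hex(original)  # stored with the object's metadata at ingest

    # Hypothetical replicas; the third has suffered a single-byte change.
    replicas = {
        "copy-1": original,
        "copy-2": original,
        "copy-3": b"example image bytez",
    }
    print("damaged replicas:", audit_replicas(recorded, replicas))
```

A damaged replica found this way can be repaired from any copy that still matches the recorded digest, which is one reason replication and verification are usually deployed together.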
It’s Hard to be Successful in the Information Age without Reliable, Persistent Information
• Inadequate/unrealistic general solution: “Let X do it” where X is:
  – The Government
  – The Libraries
  – The Archivists
  – Google
  – The private sector
  – Data owners
  – Data generators, etc.
• Creative partnerships needed to provide preservation solutions with
  – Trusted stewards
  – Feasible costs for users
  – Sustainable costs for infrastructure
  – Very low risk for data loss, etc.
Blue Ribbon Task Force to Focus on Economic Sustainability
• International Blue Ribbon Task Force (BRTF-SDPA) to begin in 2008 to study issues of economic sustainability of digital preservation and access
• Support from
  – National Science Foundation (Office of CyberInfrastructure)
  – Library of Congress
  – Mellon Foundation
  – Joint Information Systems Committee
  – National Archives and Records Administration
  – Council on Library and Information Resources
[Figure: the USER surrounded by stakeholder sectors: University, State, College, Federal, Non-profit, Commercial, Local, International]
Image courtesy of Chris Greer
BRTF-SDPA
Charge to the Task Force:
• To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report)
• To identify and evaluate best practice regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises
• To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report)
• To provide a research agenda to organize and motivate future work
How you can be involved:
• Contribute your ideas (oral and written “testimony”)
• Suggest readings (website will serve as a community bibliography)
• Write an article on the issues for a new community (an important component will be to educate decision makers and the public about digital preservation)
Website to be launched this Fall. Will link from www.sdsc.edu
Many Thanks
• Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, Authors of the IDC Report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, Southern California Earthquake Center, David Minor, Amit Chourasia, U.S. Library of Congress, Moores Cancer Center, National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others …
www.sdsc.edu
berman@sdsc.edu