Scientific Data Preservation and Access Needs: Looking Forward Brad Hemminger bmh@ils.unc.edu School of Information and Library Science University of North Carolina at Chapel Hill
Three stories… • Astronomy • Medical Imaging • Genetics
Astronomy Data Growth • From glass plates to CCDs • detectors follow Moore’s law • The result: a data tsunami • available data doubles every two years • Telescope growth • 30X glass (concentration) • 3000X in pixels (resolution) • Single images • 16Kx16K pixels • Large Synoptic Survey Telescope • wide field imaging at 5 terabytes/night [Figure: area of 3+ m telescopes (m^2) and CCD area (megapixels)] Source: Alex Szalay/Jim Gray
Large Synoptic Survey Telescope (LSST) • Top project of the astronomy decadal survey • Celestial cinematography • 2 gigapixel detector for wide field imaging • Science • beyond the standard model • non-baryonic dark matter • non-zero Λ and neutrino oscillations • observation targets • near Earth object survey • weak lensing of wide fields • supernovae measurements • Features • 7 square degree field/6.9 meter effective aperture • > 5 TB of data/night from a mountain in Chile
Distributed Virtual Astronomy • Capabilities • homogeneous, multi-wavelength data • observations of millions of objects • mega-sky surveys (2MASS, SLOAN, …) • Initiatives • U.S. National Virtual Observatory (NVO) • Caltech, JHU, ALMA, HST, … • EU Astrophysical Virtual Observatory (AVO) • ESO, CNRS, CDS, … • Grid data mining and archives • discovering significant patterns • analysis of rich image/catalog databases • understanding complex astrophysical systems • integrated data/large numerical simulations HST Data Access
Biomedical Imaging Challenges Source: Chris Johnson, Utah and Art Toga, UCLA
Medical Imaging Needs • Many imaging modalities exist in medicine. Most are based on raster scanning (a pixel matrix represents a scanned image plane). • 2D slices (X-ray) • spatial series (volumes: CT, MRI, US) • time series (videos, heart studies, US) • Radiology example: multi-slice CT scanner. Each slice is 512x512 pixels, with up to a thousand slices in a study (0.5 gigabytes per study).
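The 0.5 gigabyte figure in the radiology example can be checked with a quick calculation. The sketch below assumes 2 bytes (16 bits) per pixel and no compression; neither is stated on the slide, and the function name is only illustrative.

```python
def ct_study_bytes(rows=512, cols=512, slices=1000, bytes_per_pixel=2):
    """Uncompressed size of a multi-slice CT study, in bytes.

    bytes_per_pixel=2 is an assumption (16-bit pixels); the slide gives
    only the matrix size and the slice count.
    """
    return rows * cols * slices * bytes_per_pixel


size = ct_study_bytes()
print(f"{size / 1e9:.2f} GB per study")  # 0.52 GB per study
```

With 1 byte per pixel the same study would be about a quarter of a gigabyte, so the slide's figure is consistent with 16-bit pixel storage.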
Genetics • Genetic sequences have the simplest representation: a series of characters (text) representing the nucleotide sequence. • There are many more complex objects, including images…
Data Heterogeneity and Complexity • Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), … [Figure: linked data types, including disease, gene sequence, phenotype, clinical trial, genome sequence, gene expression, drug, protein, protein structure, homology, protein sequence, P-P interactions, proteome] Source: Carole Goble (Manchester)
Gene Expression and Microarrays • Concurrent evaluation • expression levels for thousands of genes • Photolithography • up to 500K 10-20 micron cells • each containing millions of identical DNA molecules • Image capture and analysis • laser scanning and intensity calculation Source: Affymetrix
Why is it important to capture? • Previous research was documented in scientific literature and books. Increasingly, though, our theories and methods are based on empirical measurements: data gathered or sampled from our environment. Without a preserved record of this data, we cannot verify previous work or build on existing work.
Memex: Still Prescient “Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” Vannevar Bush “As We May Think,” 1945
“It seems reasonable to envision, for a time 10 or 15 years hence, a 'thinking center' that will incorporate the functions of present-day libraries together with anticipated advances in information storage and retrieval. The picture readily enlarges itself into a network of such centers, connected to one another by wide-band communication lines and to individual users by leased-wire services. In such a system, the speed of the computers would be balanced, and the cost of the gigantic memories and the sophisticated programs would be divided by the number of users.” J.C.R. Licklider “Man-Computer Symbiosis,” 1960
21st Century Challenges • The threefold way • theory and scholarship • experiment and measurement • computation and analysis • Supported by • distributed, multidisciplinary teams • multimodal collaboration systems • distributed, large scale data sources • leading edge computing systems • distributed experimental facilities • Socialization and community • multidisciplinary groups • geographic distribution • new enabling technologies • creation of 21st century IT infrastructure • sustainable, multidisciplinary communities • “Come as you are” response [Figure: Computation, Experiment, Theory]
Effect of Technology on Science • Theories • Computational models • Real World Observations/Measurements Most science areas had theories, with limited sensor measurements. With today’s technology we can increasingly acquire mountains of sensor measurements and we have computational models to check against.
Example Changes… • High Energy Physics: colliders, lasers • Astronomy: large CCD-based telescopes, virtual arrays, space telescopes • Medical Imaging: multitudes of scanning techniques with increasing resolution; computational anatomical models • Genetics: measurements of nucleotides, proteins, small molecules, etc.
Types of Challenges • Technical • Data storage, computing power, data ingest. • Knowledge (scholarly communications) • How do we share information (different terms, different languages)? • How do we preserve? • Medical imaging formats last 10-20 years (larger capital investment, clinical care) • A commercial sequencer's data format may no longer exist 2 years after the product was introduced (rapid technology changes, research-only settings)
Technical Challenges: The Data Tsunami • Many sources • agricultural • biomedical • environmental • engineering • manufacturing • financial • social and policy • historical • Many causes and enablers • increased detector resolution • increased storage capability • The challenge: extracting insight! We Are Here!
Sensor Data Overload Source: Robert Morris, IBM
Storage: Qualitative Change [Images: 5 MB in 1956; 1972; 80 GB in 2004]
Storage: in practical terms • Megabyte • a small novel • Gigabyte • a pickup truck filled with paper or a DVD • Terabyte: one thousand gigabytes – ~$1000 today • the text in one million books • entire U.S. Library of Congress is ~ten terabytes of text • Petabyte: one thousand terabytes • 1-2 petabytes equals all academic research library holdings • coming soon to a pocket near you! • soon routinely generated annually by many scientific instruments • Exabyte: one thousand petabytes • 5 exabytes of words spoken in the history of humanity • See www.sims.berkeley.edu/research/projects/how-much-info-2003/ Source: Hal Varian, UC-Berkeley
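To connect these units to the instruments described earlier in the talk, the sketch below converts the LSST rate of 5 terabytes per night into an annual volume. It uses the slide's decimal steps (one thousand per unit) and assumes, as an upper bound, that the telescope observes every night of the year; the function names are illustrative.

```python
KILO = 1000  # the slide uses decimal steps: 1 TB = 1000 GB, 1 PB = 1000 TB


def annual_volume_tb(tb_per_night, nights_per_year=365):
    """Terabytes produced per year at a fixed nightly rate.

    nights_per_year=365 is an assumption (no downtime), so the result
    is an upper bound rather than a measured figure.
    """
    return tb_per_night * nights_per_year


def terabytes_to_petabytes(tb):
    """Convert terabytes to petabytes using decimal units."""
    return tb / KILO


lsst_tb = annual_volume_tb(5)
print(lsst_tb, "TB/year")                           # 1825 TB/year
print(terabytes_to_petabytes(lsst_tb), "PB/year")   # 1.825 PB/year
```

Under these assumptions a single survey telescope reaches the petabyte-per-year range on its own.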
Knowledge Challenges: preservation requires standards • Storage formats • Media (CDROM, DVD, tapes) • File formats (PDF, JPEG, MPEG) • Standards that define meaning for a particular domain (metadata, controlled vocabularies, taxonomies). Examples from medical and biological sciences: MeSH, DICOM, MIAME, GO, caBIG.
Who pursues standards? • Users, i.e. scientists (GO, multitudes of domain-specific examples) • Manufacturers (companies making products) • Storage media (CDROM, DVD; second-generation DVD not quite yet) • Knowledge standards are pursued less frequently, and more often in conjunction with a push from the user community (DICOM, MIAME) • Government (MeSH, GenBank, caBIG)
Three Critical Steps to Success • The scholarly communities must develop standards for communicating knowledge, and for the long-term preservation of important descriptive information, e.g. taxonomies and controlled vocabularies. • Common public repositories (these can be many centers federated as one logical repository) must be set up to store, preserve and provide access (GenBank). • Change behavior (bring about usage) by • Mandate from the funding agency (NIH, NSF) • Requirement for publication (GenBank for sequences).
What role does the government play? • Has developed standards in areas where significant support was provided (medicine and science via NLM, e.g. Medline, MeSH, UMLS, GenBank, Entrez, etc.), with more planned (NIH caBIG, etc.). • Successful with high-cost shared resources (colliders (CERN), astronomy telescopes, etc.).
What role should the government play? • Should government-funded grants require deposit of scientific data in repositories? Of research papers in public repositories (PubMed)? • Should the government build and/or fund the other repositories and their maintenance? • Should individual grants receive more money to support the publication, annotation and deposit of research results? • What about the many other areas not addressed (Google, not the government, is digitizing literature)?
PITAC Report Contents • Computational Science: Ensuring America’s Competitiveness • A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness • Medieval or Modern? Research and Education Structures for the 21st Century • Multi-decade Roadmap for Computational Science • Sustained Infrastructure for Discovery and Competitiveness • Research and Development Challenges • Two key appendices • Examples of Computational Science at Work • Computational Science Warnings – A Message Rarely Heeded • Available at www.nitrd.gov
PITAC Recommendation • The Federal government must implement coordinated, long-term computational science programs that include funding for interconnecting the software sustainability centers, national data and software repositories, and national high-end leadership centers with the researchers who use those resources, forming a balanced, coherent system that also includes regional and local resources. • Such funding methods are customary practice in research communities that use scientific instruments such as light sources and telescopes, increasingly in data-centered communities such as those that use the genome database, and in the national defense sector.
Additional Challenges In addition to the lack of an overall semantically interoperable framework for multidisciplinary research and public repositories… • Motivation for research labs to adhere to standards, especially when storing and describing data (example: UNC stories). • Ownership, provenance • Privacy and security (IRB approval for future research) • Indexing, data mining
Challenges for Universities • Multiple cultures • arts, humanities and social sciences • sciences and engineering • Many scholarly communication approaches • books, monographs, journals, conferences • access time, priority and intellectual property • multiple media and expression • text, audio, video, artifacts, performances, … • primary and secondary source materials • professional societies and private publishers • Institutional repositories • multiple visions and roles • digital archives and/or alternative publication venues • research and education • access modes and goals, not just articles or books • longitudinal access and lifelong learning • what and how much to save • declining cost of storage and simplicity of deposit
Computing History • 1890-1945 • mechanical, relay • 7 year doubling • 1945-1985 • tube, transistor, … • 2.3 year doubling • 1985-2003 • microprocessor • 1 year doubling • Exponentials • chip transistor density: 2X in ~18 months • WAN bandwidth: 64X in two years • storage: 7X in two years • graphics: 100X in three years [Figure labels: Microcomputer Revolution; 4K bit core plane] Source: Jim Gray
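The exponentials on this slide are quoted as growth factors over different periods, which makes them hard to compare with the doubling times given for each era. Assuming steady exponential growth, a factor f over t years corresponds to a doubling time of t · ln 2 / ln f. The sketch below applies that conversion to the slide's four figures; the function name and dictionary are illustrative.

```python
import math


def doubling_time(factor, years):
    """Years needed to double, given growth by `factor` over `years`.

    Assumes steady exponential growth over the whole period.
    """
    return years * math.log(2) / math.log(factor)


# (growth factor, period in years), as quoted on the slide
rates = {
    "chip transistor density": (2, 1.5),
    "WAN bandwidth": (64, 2),
    "storage": (7, 2),
    "graphics": (100, 3),
}

for name, (factor, years) in rates.items():
    months = 12 * doubling_time(factor, years)
    print(f"{name}: doubles about every {months:.1f} months")
```

On these figures WAN bandwidth doubles roughly every 4 months, graphics about every 5.4 months and storage about every 8.5 months, all faster than the 18 months quoted for transistor density.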
Computing Power Trends http://www.transhumanist.com/volume1/moravec.htm
Example: Linking Genotype and Phenotype to Study Diseases [Figure labels: Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4, Ethnicity, Environment, Age, Gender, Identify Genes, Pharmacokinetics, Metabolism, Endocrine, Biomarker Signatures, Physiology, Proteome, Transcriptome, Immune, Morphometrics, Predictive Disease Susceptibility] Source: Terry Magnuson