Three stories… • Astronomy • Medical Imaging • Genetics
Astronomy Data Growth • From glass plates to CCDs • detectors follow Moore’s law • The result: a data tsunami • available data doubles every two years • Telescope growth • 30X glass (concentration) • 3000X in pixels (resolution) • Single images • 16Kx16K pixels • Large Synoptic Survey Telescope • wide field imaging at 5 terabytes/night [Figure: area of 3+ m telescopes (m^2) and CCD area (megapixels)] Source: Alex Szalay/Jim Gray
Large Synoptic Survey Telescope (LSST) • Top project of the astronomy decadal survey • Celestial cinematography • 2 gigapixel detector for wide field imaging • Science • beyond the standard model • non-baryonic dark matter • non-zero Λ and neutrino oscillations • observation targets • near Earth object survey • weak lensing of wide fields • supernovae measurements • Features • 7 square degree field/6.9 meter effective aperture • > 5 TB of data/night from a mountain in Chile
Distributed Virtual Astronomy • Capabilities • homogeneous, multi-wavelength data • observations of millions of objects • mega-sky surveys (2MASS, Sloan, …) • Initiatives • U.S. National Virtual Observatory (NVO) • Caltech, JHU, ALMA, HST, … • EU Astrophysical Virtual Observatory (AVO) • ESO, CNRS, CDS, … • Grid data mining and archives • discovering significant patterns • analysis of rich image/catalog databases • understanding complex astrophysical systems • integrated data/large numerical simulations [Figure: HST data access]
Biomedical Imaging Challenges Source: Chris Johnson, Utah and Art Toga, UCLA
Medical Imaging Needs • Medicine uses many imaging modalities. Most are based on raster scanning (a pixel matrix represents the scanned image). • Most are 2D slices; some are time series (videos). • A radiology example is the multi-slice CT scanner. Each slice is 512x512 pixels, with up to a thousand slices in a study (0.5 gigabytes per study).
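The 0.5 gigabyte figure follows from simple arithmetic. A minimal sketch, assuming 2 bytes (16 bits) per pixel, which the slide does not state; the function name `ct_study_bytes` is purely illustrative:

```python
def ct_study_bytes(rows=512, cols=512, slices=1000, bytes_per_pixel=2):
    """Uncompressed size of a multi-slice CT study, in bytes.

    bytes_per_pixel=2 is an assumption (16-bit samples); the slide
    gives only the matrix size and the slice count.
    """
    return rows * cols * slices * bytes_per_pixel


size = ct_study_bytes()
# 512 * 512 * 1000 * 2 = 524,288,000 bytes
print(f"{size / 1e9:.2f} GB per study")
```

With a different sample depth or fewer slices the total scales linearly, so the estimate is only as good as the assumed pixel size.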
Genetics • Genetic sequences are the simplest representation. Many more complex ones exist, including images.
Data Heterogeneity and Complexity • Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), … [Diagram labels: Disease Gene sequence Phenotype Clinical trial Genome sequence Gene expression Disease Gene expression Drug Protein Disease Protein Structure Disease homology Protein Sequence P-P interactions Proteome] Source: Carole Goble (Manchester)
Gene Expression and Microarrays • Concurrent evaluation • expression levels for thousands of genes • Photolithography • up to 500K 10-20 micron cells • each containing millions of identical DNA molecules • Image capture and analysis • laser scanning and intensity calculation Source: Affymetrix
Why is it important to capture? • Previous research was documented in scientific literature and books. Increasingly, though, our theories and methods are based on empirical measurements: data gathered or sampled from our environment. Unless a record of this data is preserved, we cannot verify previous work or build on existing work.
Memex: Still Prescient “Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” Vannevar Bush “As We May Think,” 1945
“It seems reasonable to envision, for a time 10 or 15 years hence, a 'thinking center' that will incorporate the functions of present-day libraries together with anticipated advances in information storage and retrieval. The picture readily enlarges itself into a network of such centers, connected to one another by wide-band communication lines and to individual users by leased-wire services. In such a system, the speed of the computers would be balanced, and the cost of the gigantic memories and the sophisticated programs would be divided by the number of users.” J.C.R. Licklider “Man-Computer Symbiosis,” 1960
21st Century Challenges • The threefold way • theory and scholarship • experiment and measurement • computation and analysis • Supported by • distributed, multidisciplinary teams • multimodal collaboration systems • distributed, large scale data sources • leading edge computing systems • distributed experimental facilities • Socialization and community • multidisciplinary groups • geographic distribution • new enabling technologies • creation of 21st century IT infrastructure • sustainable, multidisciplinary communities • “Come as you are” response [Figure labels: Computation, Experiment, Theory]
Example: Linking Genotype and Phenotype to Study Diseases Phenotype 1 Phenotype 2 Phenotype 3 Phenotype 4 Ethnicity Environment Age Gender Identify Genes Pharmacokinetics Metabolism Endocrine Biomarker Signatures Physiology Proteome Transcriptome Immune Morphometrics Predictive Disease Susceptibility Source: Terry Magnuson
Challenges • Technical • Data storage, computing power, data ingest. • Knowledge (scholarly communications) • How do we share information (different terms, different languages)? • How do we preserve it? • Medical imaging formats last 10-20 years (larger capital investment, clinical care) • A commercial sequencer's data format may no longer exist 2 years after the product is introduced (rapid technology changes, research-only settings)
Technical Data Challenges • Multitudes of input sources (data being generated) • Computing power requirements • Storage requirements
The Data Tsunami • Many sources • agricultural • biomedical • environmental • engineering • manufacturing • financial • social and policy • historical • Many causes and enablers • increased detector resolution • increased storage capability • The challenge: extracting insight! We Are Here!
Sensor Data Overload Source: Robert Morris, IBM
Computing History • 1890-1945 • mechanical, relay • 7 year doubling • 1945-1985 • tube, transistor, … • 2.3 year doubling • 1985-2003 • microprocessor • 1 year doubling • Exponentials • chip transistor density: 2X in ~18 months • WAN bandwidth: 64X in two years • storage: 7X in two years • graphics: 100X in three years [Figure labels: Microcomputer Revolution; 4K bit core plane] Source: Jim Gray
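The "N-fold in T years" figures can be put on the same footing as the doubling times quoted for each era. A minimal sketch, assuming steady exponential growth over the stated period (an idealization); the helper name `doubling_time_years` is illustrative:

```python
import math


def doubling_time_years(factor, years):
    """Years per doubling, assuming steady exponential growth
    by `factor` over a period of `years`."""
    return years * math.log(2) / math.log(factor)


# Growth figures quoted on the slide: (factor, period in years)
trends = {
    "chip transistor density": (2, 1.5),
    "WAN bandwidth": (64, 2),
    "storage": (7, 2),
    "graphics": (100, 3),
}

for name, (factor, years) in trends.items():
    months = 12 * doubling_time_years(factor, years)
    print(f"{name}: doubles every {months:.1f} months")
```

For instance, 64X in two years is six doublings, so WAN bandwidth doubles roughly every four months under this assumption.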
Computing Power Trends http://www.transhumanist.com/volume1/moravec.htm
Storage: Qualitative Change 1972 80 GB in 2004 5 MB in 1956
Storage: in practical terms • Megabyte • a small novel • Gigabyte • a pickup truck filled with paper or a DVD • Terabyte: one thousand gigabytes – ~$1000 today • the text in one million books • entire U.S. Library of Congress is ~ten terabytes of text • Petabyte: one thousand terabytes • 1-2 petabytes equals all academic research library holdings • coming soon to a pocket near you! • soon routinely generated annually by many scientific instruments • Exabyte: one thousand petabytes • 5 exabytes of words spoken in the history of humanity • See www.sims.berkeley.edu/research/projects/how-much-info-2003/ Source: Hal Varian, UC-Berkeley
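To relate these units to instrument output, a rough sketch that formats byte counts in decimal units (powers of 1000, as the slide defines them) and applies it to the 5 terabytes/night LSST figure from the astronomy slides. Observing on all 365 nights is an assumption that overstates a real schedule, and `human_size` is an illustrative helper:

```python
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]


def human_size(n_bytes):
    """Format a byte count using decimal (powers of 1000) units."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the value below 1000,
        # or at the largest unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Assumed: 5 terabytes every night for a full year.
nightly = 5 * 1000**4
print(human_size(nightly * 365))
```

Under these assumptions a single survey instrument lands in the petabyte-per-year range, consistent with the slide's remark about annual data volumes.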
Knowledge preservation requires standards • Storage formats • Media (CD-ROM, DVD, tapes) • File formats (PDF, JPEG, MPEG) • Standards that define meaning for a particular domain (metadata, controlled vocabularies, taxonomies). Examples from medical and biological sciences: MeSH, DICOM, MIAME, GO, caBIG.
Who pursues standards? • Users (scientists) (GO, multitudes of domain-specific examples) • Manufacturers (companies making products) • Storage media (CD-ROM, DVD; second-generation DVD not quite) • Knowledge standards are less frequent, and more often arise in conjunction with a push from the user community (DICOM, MIAME) • Government (MeSH, GenBank, caBIG)
What role does/should the government play? • Has developed standards in areas where significant support was provided (medicine and science via NLM, e.g. MEDLINE, MeSH, UMLS, GenBank, Entrez), with future efforts under way (NIH caBIG, etc.). • Successful with high-cost shared resources (colliders (CERN), astronomy telescopes, XXX). • Many other areas remain unaddressed, though (?).
Critical Steps to Success? • Standards for long-term preservation of important descriptive information should be developed by the community. Not the details of a lab machine's format, but assurance that the data generated follows national standards (controlled vocabularies). • Central repositories (federated OK) are set up to store, preserve and provide access. • Bring about usage by • mandates from funding agencies (NIH, NSF) • requirements for publication (GenBank for sequences).
Challenges • Lack of an overall semantically interoperable framework for multidisciplinary research. • Motivating research labs to adhere to standards, especially when storing and describing data (example: UNC stories). • Ownership, provenance and context • Privacy and security (IRB approval for future research) • Indexing, data mining.
University Data Challenges • Multiple cultures • arts, humanities and social sciences • sciences and engineering • Many scholarly communication approaches • books, monographs, journals, conferences • access time, priority and intellectual property • multiple media and expression • text, audio, video, artifacts, performances, … • primary and secondary source materials • professional societies and private publishers • Institutional repositories • multiple visions and roles • digital archives and/or alternative publication venues • research and education • access modes and goals, not just articles or books • longitudinal access and lifelong learning • what and how much to save • declining cost of storage and simplicity of deposit
PITAC Report Contents • Computational Science: Ensuring America’s Competitiveness • A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness • Medieval or Modern? Research and Education Structures for the 21st Century • Multi-decade Roadmap for Computational Science • Sustained Infrastructure for Discovery and Competitiveness • Research and Development Challenges • Two key appendices • Examples of Computational Science at Work • Computational Science Warnings – A Message Rarely Heeded • Available at www.nitrd.gov
PITAC Recommendation • The Federal government must implement coordinated, long-term computational science programs that include funding for interconnecting the software sustainability centers, national data and software repositories, and national high-end leadership centers with the researchers who use those resources, forming a balanced, coherent system that also includes regional and local resources. • Such funding methods are customary practice in research communities that use scientific instruments such as light sources and telescopes, increasingly in data-centered communities such as those that use the genome database, and in the national defense sector.