
Empowering Biomedical Research with National Big Data Cyberinfrastructure

This presentation highlights the role of a national big data cyberinfrastructure in supporting computational biomedical research. Delivered by Dr. Larry Smarr, the talk focuses on initiatives such as the OptIPuter project, the CAMERA Project, and the advancements in microbial metagenomics. Key concepts discussed include the utilization of supercomputing resources, enhancing data transfer capabilities with the Science DMZ model, and building a cyber community for advanced microbial ecology research. The presentation underscores the importance of improving campus networks to accelerate scientific advancements and provides insights into the future of big data infrastructure for research purposes.


Presentation Transcript


  1. “A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research” Invited Presentation Symposium on Computational Biology and Bioinformatics: Remembering John Wooley National Institutes of Health Bethesda, MD July 29, 2016 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD http://lsmarr.calit2.net

  2. John Wooley Drove Supercomputing for Biological Sciences

  3. John Wooley was a Scientific Founder of Calit2 John Wooley was the UCSD Layer Leader for DeGeM 220 UCSD & UCI Faculty Working in Multidisciplinary Teams With Students, Industry, and the Community The State Provides $100 M For New Buildings and Equipment LS Slide 2001 www.calit2.net

  4. NSF’s OptIPuter Project: Using Supernetworks to Meet the Needs of Data-Intensive Researchers 2003-2009 $13,500,000 OptIPortal– Termination Device for the OptIPuter Global Backplane Biomedical Big Data as Application Driver: Mark Ellisman, co-PI Calit2 (UCSD, UCI), SDSC, and UIC Leads—Larry Smarr PI Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent

  5. The OptIPuter LambdaGrid is Rapidly Expanding (LS Slide 2005) [Network map, 1 GE and 10 GE lambdas: StarLight Chicago, UIC EVL, U Amsterdam, NetherLight Amsterdam, PNWGP Seattle, NU, CAVEwave/NLR, NASA Ames, NASA Goddard, NASA JPL, ISI, SDSU, UCI, UCSD, CICESE via CUDI, CENIC Los Angeles GigaPOP, CENIC San Diego GigaPOP, CalREN-XD, CENIC/Abilene Shared Network] Source: Greg Hidley, Aaron Chin, Calit2

  6. Paul Gilna Ex. Dir. PI Larry Smarr John Wooley was a CAMERA co-PI & Chief Science Officer Announced January 17, 2006 $24.5M Over Seven Years

  7. Calit2 Microbial Metagenomics Cluster-Next Generation Optically Linked Science Data Server Source: Phil Papadopoulos, SDSC, Calit2 ~200TB Sun X4500 Storage 10GbE 512 Processors ~5 Teraflops ~ 200 Terabytes Storage 1GbE and 10GbE Switched/ Routed Core

  8. The CAMERA Project Established a Global Marine Microbial Metagenomics Cyber-Community Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis 4000 Registered Users From Over 80 Countries http://camera.calit2.net/

  9. Determining the Protein Structures of the Thermophilic Thermotoga Maritima Genome—Life at 80°C! LS Slide 2005 Extremely Thermostable -- Useful for Many Industrial Processes (e.g. Chemical and Food) 173 Structures (122 from JCSG) • 122 T.M. Structures Solved by JCSG (75 Unique In The PDB) • Direct Structural Coverage of 25% of the Expressed Soluble Proteins • Probably Represents the Highest Structural Coverage of Any Organism Source: John Wooley, JCSG Bioinformatics Core Project Director, UCSD

  10. John Wooley Organized a Series of International Workshops on Metagenomics and Thermotoga at Calit2

  11. Academic Research OptIPlanet Collaboratory: A 10Gbps “End-to-End” Lightpath Cloud HD/4k Live Video HPC Local or Remote Instruments End User OptIPortal National LambdaRail 10G Lightpaths Campus Optical Switch LS 2009 Slide Data Repositories & Clusters HD/4k Video Repositories

  12. So Why Don’t We Have a National Big Data Cyberinfrastructure? “Research is being stalled by ‘information overload,’ Mr. Bement said, because data from digital instruments are piling up far faster than researchers can study. In particular, he said, campus networks need to be improved. High-speed data lines crossing the nation are the equivalent of six-lane superhighways, he said. But networks at colleges and universities are not so capable. “Those massive conduits are reduced to two-lane roads at most college and university campuses,” he said. Improving cyberinfrastructure, he said, “will transform the capabilities of campus-based scientists.” -- Arden Bement, the director of the National Science Foundation May 2005

  13. DOE ESnet’s Science DMZ: A Scalable Network Design Model for Optimizing Science Data Transfers • A Science DMZ integrates 4 key concepts into a unified whole: • A network architecture designed for high-performance applications, with the science network distinct from the general-purpose network • The use of dedicated systems for data transfer • Performance measurement and network testing systems that are regularly used to characterize and troubleshoot the network • Security policies and enforcement mechanisms that are tailored for high performance science environments The DOE ESnet Science DMZ and the NSF “Campus Bridging” Taskforce Report Formed the Basis for the NSF Campus Cyberinfrastructure Network Infrastructure and Engineering (CC-NIE) Program Science DMZ Coined 2010 http://fasterdata.es.net/science-dmz/

  14. Based on Community Input and on ESnet’s Science DMZ Concept, NSF Has Funded Over 100 Campuses to Build Local Big Data Freeways Red 2012 CC-NIE Awardees Yellow 2013 CC-NIE Awardees Green 2014 CC*IIE Awardees Blue 2015 CC*DNI Awardees Purple Multiple Time Awardees 2012-2015 CC-NIE / CC*IIE / CC*DNI Programs Source: NSF

  15. Creating a “Big Data” Freeway on Campus: NSF-Funded Prism@UCSD and CHERuB Campus CC-NIE Grants CHERuB Prism@UCSD, PI Phil Papadopoulos, SDSC, Calit2, (2013-15) CHERuB, PI Mike Norman, SDSC

  16. NCMIR Brain Images in Calit2 VROOM: Allows for Interactive Zooming from Cerebellum to Individual Neurons NCMIR Connected Over Prism to Calit2/SDSC at 80 Gbps

  17. Calit2 3D Immersive StarCAVE OptIPortal: Enables Interactive Exploration of Protein Data Bank 15 Meyer Sound Speakers + Subwoofer Connected at 50 Gb/s to Quartzite 30 HD Projectors! Passive Polarization-- Optimized the Polarization Separation and Minimized Attenuation Source: Tom DeFanti, Greg Dawe, Calit2 Cluster with 30 Nvidia 5600 cards-60 GB Texture Memory

  18. The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System” Funded by NSF $5M Oct 2015-2020 • PI: Larry Smarr, UC San Diego Calit2 • Co-PIs: • Camille Crittenden, UC Berkeley CITRIS, • Tom DeFanti, UC San Diego Calit2, • Philip Papadopoulos, UC San Diego SDSC, • Frank Wuerthwein, UC San Diego Physics and SDSC Flash Disk to Flash Disk File Transfer Rate Source: John Hess, CENIC

  19. Pacific Research Platform Regional Collaboration: Multi-Campus Science Driver Teams • Jupyter Hub • Biomedical • Cancer Genomics Hub/Browser • Microbiome and Integrative ‘Omics • Integrative Structural Biology • Earth Sciences • Data Analysis and Simulation for Earthquakes and Natural Disasters • Climate Modeling: NCAR/UCAR • California/Nevada Regional Climate Data Analysis • CO2 Subsurface Modeling • Particle Physics • Astronomy and Astrophysics • Telescope Surveys • Galaxy Evolution • Gravitational Wave Astronomy • Scalable Visualization, Virtual Reality, and Ultra-Resolution Video

  20. PRP Transforms Big Data Microbiome and Integrated ‘Omics Science Knight 1024 Cluster In SDSC Co-Lo PNNL UC Davis LBNL Caltech 1.3Tbps Data Oasis 7.5PB, 200GB/s CHERuB 100Gbps 120Gbps Knight Lab Emperor & Other Vis Tools 10Gbps 40Gbps 12 Cores/GPU 128 GB RAM 3.5 TB SSD 48TB Disk 10Gbps NIC Gordon Prism@UCSD 64Mpixel Data Analysis Wall

  21. To Expand IBD Project the Knight/Smarr Labs Were Awarded ~ 1 Million Core-Hours on SDSC’s Comet Supercomputer • 8x Compute Resources Over Prior Study • Smarr Gut Microbiome Time Series • From 7 Samples Over 1.5 Years • To 50 Samples Over 4 Years • IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 Patients • 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank • 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD patients • New Software Suite from Knight Lab • Re-annotation of Reference Genomes, Functional / Taxonomic Variations • Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner

  22. We Used SDSC’s Comet to Uniformly Compute Protein-Coding Genes, RNAs, & CRISPR Annotations • We Downloaded from NCBI Over 60,000 Bacterial and Archaea Genomes • Required 5 Core-Hours Per Genome • 300,000 Core-Hours to Complete • Ran 24 Cores in Parallel • Over 400 Days Wall-Clock Time • Requires a Variety of Software Programs • Prodigal for Gene Prediction • Diamond for Protein Homolog Search Against UniRef db • Infernal for ncRNA Prediction • RNAMMER for rRNA Prediction • Aragorn for tRNA Prediction • Will Make These Results a New Community Database • Knight Lab, Calit2, SDSC Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD

  23. Cancer Genomics Hub (UCSC) is Housed in SDSC: Large Data Flows to End Users at UCSC, UCB, UCSF, … 1G 8G 30,000 TB Per Year Jan 2016 15G Data Source: David Haussler, Brad Smith, UCSC

  24. Creating a Distributed Cluster for Integrated Modeling of Large Macromolecular Machines • UCSF-10-100 Gbps Science DMZ • QB3@UCSF (~5000 cores), • Institute for Human Genetics (~1200 cores), • Cancer Center (~800 cores), • Molecular Structure Group (~1000 cores). • Coupled Via PRP to: • LBNL NERSC • SDSC • Bring Huge Datasets from Supercomputer Centers Back to UCSF Clusters for Analysis Requires CPU-months per computation Lead: Andrej Sali, UCSF

  25. 3D Reconstructions from NCMIR X-ray Microscopic Computed Tomography Facilitates Development of Bioinspired “Tough” Materials UCR researchers are modeling the teeth (radula) of marine snail, Cryptochiton stelleri, to engineer new biomimetic abrasion resistant composites NCMIR X-ray Microscope (XRM) Zeiss Versa 510 MicroCT reconstructions of Chiton radula. Chiton radula have evolved to incorporate an iron oxide mineral, magnetite, making them extremely hard and magnetic. Images courtesy of Steven Herrera, Ph.D., Kisailus Biomimetics and Nanostructured Materials Laboratory, UC Riverside Driving Improvements in Scientific Data Transfer UCSD/NCMIR Fiona/Data Transfer Node (DTN) UC Riverside Fiona/Data Transfer Node (DTN) PRP Facilitated Collaborative Data Transfer 10-100Gbps XRM Data Sets are 100+ GBs

  26. Next Step: Global Research Platform Building on CENIC/Pacific Wave and GLIF Current International GRP Partners

  27. Cell Image Library Designed For “Big Data” Leverages High Bandwidth Connected High Performance Storage and Computing Resources Mirror Cell Image Library Infrastructure and Data Management Workflows at Singapore’s NSCC Source: Mark Ellisman & Steve Peltier, NCMIR, UCSD

  28. Data “Wormhole” Facilitating Data Intensive Collaboration Between NSCC and the National Center for Microscopy and Imaging Research (NCMIR) at UC San Diego
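Several slides quote compute and bandwidth figures that can be checked with back-of-envelope arithmetic. The sketch below is illustrative only and is not from the presentation: the function names are hypothetical, it uses the slides' round numbers (slide 22: 60,000 genomes at 5 core-hours each on 24 cores; slide 23: 30,000 TB per year; slide 25: 100 GB data sets over 10 to 100 Gbps links), and it assumes perfect parallel scaling, decimal units, and fully utilized links, none of which hold exactly in practice.

```python
# Back-of-envelope checks for figures quoted on slides 22, 23, and 25.
# All function names are hypothetical; all results are idealized estimates.

def wall_clock_days(genomes: int, core_hours_per_genome: float, cores: int) -> float:
    """Idealized wall-clock days, assuming perfect scaling across cores."""
    total_core_hours = genomes * core_hours_per_genome
    return total_core_hours / cores / 24.0


def sustained_gbps(terabytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move a yearly volume (decimal TB)."""
    bits = terabytes_per_year * 1e12 * 8
    seconds_per_year = 365 * 24 * 3600
    return bits / seconds_per_year / 1e9


def transfer_seconds(gigabytes: float, link_gbps: float) -> float:
    """Ideal time to move a data set over a fully utilized link (decimal GB)."""
    return gigabytes * 8 / link_gbps


if __name__ == "__main__":
    # Slide 22: 60,000 genomes x 5 core-hours = 300,000 core-hours.
    print(f"{60_000 * 5:,} core-hours")                           # 300,000 core-hours
    print(f"{wall_clock_days(60_000, 5, 24):.0f} days on 24 cores")  # 521 days on 24 cores

    # Slide 23: 30,000 TB per year expressed as an average line rate.
    print(f"{sustained_gbps(30_000):.1f} Gbit/s average")         # 7.6 Gbit/s average

    # Slide 25: a 100 GB data set over 10 Gbps and 100 Gbps links.
    print(f"{transfer_seconds(100, 10):.0f} s at 10 Gbps")        # 80 s at 10 Gbps
    print(f"{transfer_seconds(100, 100):.0f} s at 100 Gbps")      # 8 s at 100 Gbps
```

The idealized 24-core estimate of roughly 521 days is consistent with the "Over 400 Days Wall-Clock Time" quoted on slide 22, and a yearly volume of 30,000 TB corresponds to an average of under 8 Gbit/s, so peak rates would need to be higher than that average.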
