The Future of Scientific Computing at Harvard. Alyssa A. Goodman Professor of Astronomy Director, Initiative in Innovative Computing. “The Heavy Red Bag”. How can computers advance (my) science?. A new collaborative scientific initiative at Harvard.
The Future of Scientific Computing at Harvard
Alyssa A. Goodman, Professor of Astronomy; Director, Initiative in Innovative Computing
“The Heavy Red Bag” How can computers advance (my) science?
Computational challenges are common across scientific disciplines
How to:
• Acquire, transmit, organize, and query new kinds of data?
• Apply distributed computing resources to solve complex problems?
• Derive meaningful insight from large datasets?
• Share, integrate, and analyze knowledge across geographically dispersed researchers?
• Visually represent scientific results so as to maximize understanding?
Opportunity to collaborate and apply insights from one field to another
Filling the “Gap” between Science and Computer Science
Scientific disciplines:
• Increasingly, core problems in science require computational solutions
• Typically hire or “home grow” computationalists, but often lack the expertise or funding to go beyond the immediate pressing need
Computer Science departments:
• Focused on finding elegant solutions to basic computer science challenges
• Often see specific, “applied” problems as outside their interests
Workflow IIC contact: AG, FAS
Workflow, a.k.a. The Scientific Method (in the Age of High-Speed Networks, Fast Processors, Mass Storage, and Miniature Devices) IIC contact: Matt Welsh, FAS
Harvard IIC Workflow: The Harvard Virtual Brain
Establishing a Harvard-wide Neuroscience Infrastructure
Participants: Faculty of Arts and Sciences • Harvard College • Division of Engineering • Harvard School of Public Health • Faculty of Medicine • Harvard Medical School • Affiliated Teaching Hospitals
Data Acquisition: MRI • PET • Microscopy • etc.
Distributed Data Storage
Data Processing: Analysis • Visualization • Integration • etc.
Information Access: Query • Statistical Analysis • Knowledge Management • etc.
IIC contact: David Kennedy, HMS/MGH
New technologies for measurement and simulation are transforming the “workflow.”
Biomedicine, pre-genomics: Manual/low throughput • Solitary • Limited by two hands • Analog
Biomedicine, genomics era: High throughput • Automated/networked • Highly scalable • Digital
Continuum: “Pure” Computer Science (e.g. Turing) ... “Computational Science” (missing at most universities) ... “Pure” Discipline Science (e.g. Galileo)
Workflow & Continuum For any particular scientific investigation: Where does, and could, “computational science” make improvements in this cycle?
Harvard Public Health “NOW” (Oct. 2004)
"In the past, experiments did not involve such large data sets," observed Dyann Wirth, professor of infectious diseases in the Department of Immunology and Infectious Diseases and member of the advisory group for the core. "There has been a dramatic change in the past five to 10 years in the amount and availability of genomic data [or the DNA sequences themselves] and functional genomic data, [or the sequences’ purpose]." In the past five years alone, the genomes of humans, rats, and the malaria parasite Plasmodium falciparum have been published, for example.
"One of the purposes of bioinformatics is to reduce the number of experiments that need to be done to achieve reliable information," said L.J. Wei, professor of biostatistics in the Department of Biostatistics and member of the advisory group for the core. "However, an issue right now is that there are huge data sets that can be run through different kinds of software programs, ending up with many data points. Unless we understand and use bioinformatics well, we may not even know which of those data points are important."
Filling the “computational science” gap: IIC
• Problem-driven approach: focusing effort on solving problems that will have the greatest impact and educational value
• Collaborative projects: combining disciplinary knowledge with computer science expertise
• Interdisciplinary effort: to ensure that best practices are shared across fields and that new tools and methodologies will be broadly applicable
• Links with industry: to draw on and learn from experience in applied computation
• Institutional funding: to ensure effort is directed towards key needs and not driven solely by narrow priorities of funding agencies
Numerical Simulation of Star Formation
• MHD turbulence gives “t=0” conditions; Jeans mass = 1 Msun
• 50 Msun, 0.38 pc, n_avg = 3 x 10^5 particles/cc
• Forms ~50 objects
• T = 10 K
• SPH, no B or L, G
• Movie = 1.4 free-fall times
Bate, Bonnell & Bromm 2002 (UKAFF)
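The free-fall time and Jeans mass quoted on this slide follow from the cloud's mean density and temperature. The sketch below is a rough cross-check, not the simulation authors' code. It assumes that 0.38 pc is the cloud diameter, that the cloud is a uniform sphere, and that the mean molecular weight is 2.46; none of these is stated on the slide. It also uses one common convention for the Jeans mass, and other conventions differ by factors of order unity.

```python
import math

# Physical constants in CGS units.
G = 6.674e-8       # gravitational constant [cm^3 g^-1 s^-2]
K_B = 1.381e-16    # Boltzmann constant [erg/K]
M_H = 1.673e-24    # mass of a hydrogen atom [g]
M_SUN = 1.989e33   # solar mass [g]
PC = 3.086e18      # parsec [cm]
YEAR = 3.156e7     # year [s]

MU = 2.46  # ASSUMED mean molecular weight; not stated on the slide


def mean_density(mass_msun, diameter_pc):
    """Mean mass density [g/cm^3] of a uniform sphere."""
    radius_cm = 0.5 * diameter_pc * PC
    volume_cm3 = (4.0 / 3.0) * math.pi * radius_cm ** 3
    return mass_msun * M_SUN / volume_cm3


def free_fall_time_yr(rho):
    """Free-fall time t_ff = sqrt(3*pi / (32*G*rho)), in years."""
    return math.sqrt(3.0 * math.pi / (32.0 * G * rho)) / YEAR


def jeans_mass_msun(rho, temperature_k, mu=MU):
    """Jeans mass in solar masses, using one common convention:
    M_J = (5 k T / (2 G mu m_H))^(3/2) * (4/3 pi rho)^(-1/2)."""
    thermal = 5.0 * K_B * temperature_k / (2.0 * G * mu * M_H)
    return thermal ** 1.5 / math.sqrt((4.0 / 3.0) * math.pi * rho) / M_SUN


# ASSUMPTION: 0.38 pc is the cloud diameter; 50 Msun and 10 K are from the slide.
rho = mean_density(50.0, 0.38)
t_ff = free_fall_time_yr(rho)
print(f"free-fall time ~ {t_ff:.2e} yr; movie spans ~ {1.4 * t_ff:.2e} yr")
print(f"Jeans mass ~ {jeans_mass_msun(rho, 10.0):.2f} Msun")
```

Under these assumptions the free-fall time comes out at roughly 2 x 10^5 years and the Jeans mass close to 1 Msun, in line with the figures on the slide.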
Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002. Goal: Statistical Comparison of “Real” and “Synthesized” Star Formation
Radio Spectral-line Observations of Interstellar Clouds Radio Spectral-Line Survey Alves, Lada & Lada 1999
Velocity from Spectroscopy. A telescope and spectrometer yield the observed spectrum. [Figure: observed spectrum, intensity vs. “velocity”] All thanks to Doppler.
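The “velocity” axis of such a spectrum comes from the Doppler shift of a spectral line away from its rest frequency. A minimal sketch of that conversion follows, using the radio convention v = c (f_rest - f_obs) / f_rest. The rest frequency is an assumption: the 13CO J=1-0 line near 110.201 GHz, chosen only for illustration. The function names are hypothetical.

```python
C_KMS = 299792.458  # speed of light [km/s]

# ASSUMED rest frequency: the 13CO J=1-0 line near 110.201 GHz.
REST_FREQ_GHZ = 110.201354


def radio_velocity_kms(observed_ghz, rest_ghz=REST_FREQ_GHZ):
    """Line-of-sight velocity [km/s] in the radio Doppler convention.

    Positive values mean the emitting gas is receding, so the line is
    observed at a lower frequency than its rest frequency.
    """
    return C_KMS * (rest_ghz - observed_ghz) / rest_ghz


def spectrum_velocity_axis(freqs_ghz, rest_ghz=REST_FREQ_GHZ):
    """Convert a list of spectrometer channel frequencies to velocities."""
    return [radio_velocity_kms(f, rest_ghz) for f in freqs_ghz]


# Three hypothetical channels around the rest frequency.
channels_ghz = [110.2050, REST_FREQ_GHZ, 110.1980]
for freq, vel in zip(channels_ghz, spectrum_velocity_axis(channels_ghz)):
    print(f"{freq:.4f} GHz -> {vel:+.2f} km/s")
```

The radio convention is linear in frequency, which is why evenly spaced spectrometer channels map to evenly spaced velocities.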
COMPLETE/FCRAO W(13CO) Barnard’s Perseus
“Astronomical Medicine” Excerpts from Junior Thesis of Michelle Borkin (Harvard College); IIC Contacts: AG (FAS) & Michael Halle (HMS/BWH/SPL)
IC 348
“Astronomical Medicine”: Before “Medical Treatment” vs. After “Medical Treatment”
3D Slicer Demo (available after talk) IIC contacts: Michael Halle & Ron Kikinis
IIC: Innovative Organizational Model
• Staffing: Highly accomplished academics and senior experts whose careers have been primarily in industry, working together
• Promotion/career path: Criteria for promotion will give equal weight to scholarly activities and to technological invention
• Culture: No “class” distinctions made between teaching and non-teaching faculty, scientists and engineers, artists and designers working in the visualization program
How IIC will Function: Overview
IIC Objectives:
• Enable new research for specific scientific discipline
• Generate new computational tools for broader application
Project selection: Identify and fund projects that are likeliest to have the greatest and broadest impact
Project execution: Pursue projects in a way that will yield the best outcome, enable shared learning, etc.
Dissemination of knowledge
Project Selection
Project proposals
• Role: Submit proposal in response to call for ideas
• Who participates: Any Harvard researcher (e.g., in genomics, fluid dynamics, epidemiology, neuroscience, nanoscience, comp bio, chemical biology, optics, geology, astronomy, quantum mechanics, etc.)
Program Advisory Committee
• Role: Evaluate/rank proposals for scientific merit: should this be a priority for IIC?
• Who participates: Harvard researchers representing broad interests of IIC stakeholders, plus IIC Director & Dir. of Research
IIC Management Team
• Role: Evaluate/prioritize proposals according to technical feasibility; assess resource needs
• Consists of: IIC Director • Dirs. of Res. & Adm/Ops • Heads of IIC branches
Project Execution
IIC Project Teams A, B, C, etc. Each team has:
• Project Manager: Responsible for project execution and metrics for tracking progress/performance; interfaces with IIC branch heads
• Discipline scientists: Scientists who “own” the problem and are committed to working with IIC staff to tackle it
• IIC staff: IIC staff scientists assigned to work on the project by relevant IIC branch heads. The same IIC staff member may serve on multiple IIC project teams
Dissemination of Knowledge
Seminars/colloquia
Publications: • Scientific journals • IIC white papers
Communities of practice: • Internal... • External…
Knowledge management system: • New tools • IIC process
Education is central to IIC’s mission
At Harvard:
• Undergraduate & graduate courses focused on “data-intensive science”
• New graduate certificate program, within existing Ph.D. programs
• Research opportunities at undergraduate, graduate, and postdoctoral levels
Beyond Harvard:
• New museum, highlighting the kind of science done at the IIC
IIC organization: research and education
[Organization chart] Provost; Dean, Physical Sciences; Assoc Provost; IIC Director; Dir of Admin & Operations; Dir of Research; Assoc Dir, Instrumentation; Assoc Dir, Visualization; Assoc Dir, Databases/Data Provenance; Assoc Dir, Distributed Computing; Assoc Dir, Analysis & Simulation; Dir of Education & Outreach; Education & Outreach staff; Project 1 (Proj Mgr 1); Project 2 (Proj Mgr 2); Project 3 (Proj Mgr 3); etc.; CIO (systems); Knowledge mgmt
Visualization: 3D Slicer (BWH Surgical Planning Lab) IIC contacts: Michael Halle & Ron Kikinis
“Image and Meaning” (Visualization) IIC contact: Felice Frankel (MIT) Work: Garstecki/Whitesides (FAS)
Distributed Computing: Semantics, Ontologies IIC Contact: Tim Clark (HMS/MGH)
Distributed Computing & Large Databases: Large Synoptic Survey Telescope
• Optimized for time domain: scan mode, deep mode
• 7 square degree field
• 6.5 m effective aperture
• 24th mag in 20 sec
• > 5 Tbyte/night
• Real-time analysis
• Simultaneous multiple science goals
IIC contact: Christopher Stubbs (FAS)
Astronomy: LSST, SDSS, 2MASS, MACHO, DLS. High Energy Physics: BaBar, Atlas, RHIC.
Quantity | LSST | SDSS | 2MASS | MACHO | DLS | BaBar | Atlas | RHIC
First year of operation | 2011 | 1998 | 2001 | 1992 | 1999 | 1998 | 2007 | 1999
Run-time data rate to storage (MB/sec) | 5000 peak, 500 avg | 8.3 | 1 | 1 | 2.7 | 60 (zero-suppressed) 6* | 540* | 120* (’03), 250* (’04)
Daily average data rate (TB/day) | 20 | 0.02 | 0.016 | 0.008 | 0.012 | 0.6 | 60.0 | 3 (’03), 10 (’04)
Annual data store (TB) | 2000 | 3.6 | 6 | 1 | 0.25 | 300 | 7000 | 200 (’03), 500 (’04)
Total data store capacity (TB) | 20,000 (10 yrs) | 200 | 24.5 | 8 | 2 | 10,000 | 100,000 (10 yrs) | 10,000 (10 yrs)
Peak computational load (GFLOPS) | 140,000 | 100 | 11 | 1.00 | 0.600 | 2,000 | 100,000 | 3,000
Average computational load (GFLOPS) | 140,000 | 10 | 2 | 0.700 | 0.030 | 2,000 | 100,000 | 3,000
Data release delay acceptable | 1 day moving, 3 months static | 2 months | 6 months | 1 year | 6 hrs (trans), 1 yr (static) | 1 day (max), <1 hr (typ) | Few days | 100 days
Real-time alert of event | 30 sec | none | none | <1 hour | 1 hr | none | none | none
Type/number of processors | TBD | 1GHz Xeon / 18 | 450MHz Sparc / 28 | 60-70MHz Sparc / 10 | 500MHz Pentium / 5 | Mixed / 5000 | 20GHz / 10,000 | Pentium / 2500
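The data-rate and daily-volume rows are linked by simple arithmetic: volume = rate x time. The sketch below shows that conversion for the LSST figures (500 MB/sec average, 20 TB/day). It assumes decimal units (1 TB = 10^6 MB), and the 10-hour observing night is a hypothetical value used only for illustration; neither is stated in the table.

```python
MB_PER_TB = 1_000_000   # ASSUMED decimal units: 1 TB = 10^6 MB
SECONDS_PER_HOUR = 3600


def volume_tb(rate_mb_per_s, hours):
    """Data volume [TB] written at a sustained rate for a given duration."""
    return rate_mb_per_s * hours * SECONDS_PER_HOUR / MB_PER_TB


def hours_at_rate(total_tb, rate_mb_per_s):
    """Hours of sustained writing needed to accumulate a given volume."""
    return total_tb * MB_PER_TB / rate_mb_per_s / SECONDS_PER_HOUR


# LSST row: 500 MB/sec average, 20 TB/day.
# The 10-hour night is a HYPOTHETICAL duration, not a figure from the table.
print(f"500 MB/s for 10 h -> {volume_tb(500, 10):.1f} TB")
print(f"20 TB at 500 MB/s -> {hours_at_rate(20, 500):.1f} h of writing")
```

Read this way, the 20 TB/day figure corresponds to roughly 11 hours of writing at the average rate, which is about one night of observing.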
Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002. Analysis & Simulations
Analysis & Simulations: Neural Net Models of Intelligence. Does Speed of Convergence in Neural Nets Predict Scores on Measures of “General Intelligence”? [Figure: pattern-completion test item] Select from the lower 8 the one that completes the pattern in the top 9. IIC contact: Stephen Kosslyn (Psychology)
(Easier) Analysis of Large Data Sets: Mendelian Disease Genes
From: large data files. Steps: reformat, merge, and filter. To: location of every known disease gene on the human genome.
Can a biologist get from here to there? Without programming?
[Figure: “OMIM on the genome”; chromosome number vs. position (MB)]
IIC contact: Eitan Rubin (FAS/CGR)
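For a sense of what “reformat, merge, and filter” means in practice, here is a minimal sketch of the kind of script a biologist would otherwise have to write by hand. It is not the IIC tool. The two inputs, their column names, and the gene and disorder names are made-up placeholders, not real OMIM or genome data.

```python
import csv
import io

# PLACEHOLDER inputs: a tab-delimited gene position file and a
# comma-delimited disease gene list. All names and values are made up.
GENE_POSITIONS_TSV = (
    "gene\tchrom\tstart_bp\n"
    "GENE_A\t1\t1500000\n"
    "GENE_B\t1\t98000000\n"
    "GENE_C\tX\t42000000\n"
)
DISEASE_GENES_CSV = (
    "gene,disorder\n"
    "GENE_A,Disorder 1\n"
    "GENE_C,Disorder 2\n"
    "GENE_Z,Disorder 3\n"
)


def disease_gene_locations(positions_tsv, disease_csv):
    """Return (gene, chromosome, position in Mb, disorder) tuples."""
    # Reformat: parse the position file and convert base pairs to Mb.
    positions = {}
    for row in csv.DictReader(io.StringIO(positions_tsv), delimiter="\t"):
        positions[row["gene"]] = (row["chrom"], int(row["start_bp"]) / 1e6)

    # Merge the two files on gene name; filter out disease genes that
    # have no known position.
    located = []
    for row in csv.DictReader(io.StringIO(disease_csv)):
        if row["gene"] in positions:
            chrom, pos_mb = positions[row["gene"]]
            located.append((row["gene"], chrom, pos_mb, row["disorder"]))
    return located


for gene, chrom, pos_mb, disorder in disease_gene_locations(
    GENE_POSITIONS_TSV, DISEASE_GENES_CSV
):
    print(f"{gene}: chromosome {chrom}, {pos_mb} Mb ({disorder})")
```

Each located gene becomes one point on a chromosome-versus-position plot like the one on the slide.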