Virtual Organizations: Building Interdisciplinary Collaborations Dan Reed reed@renci.org Chancellor’s Eminent Professor Vice Chancellor for IT University of North Carolina at Chapel Hill Director, Renaissance Computing Institute
Acknowledgments • Funding agencies • NIH • Carolina Center for Exploratory Genetic Analysis (CCEGA) • NSF • TeraGrid Science Gateways • State of North Carolina • RENCI and ancillary Bioportal support • RENCI staff • Alan Blatecky, Kevin Gamiel, Xiaojun Guan • Clark Jefferies, Howard Lander • John Magee, Ruth Marinshaw, Jeff Tilson • Lavanya Ramakrishnan • And a host of others …
21st Century Challenges • The threefold way • theory and scholarship • experiment and measurement • computation and analysis • Supported by • distributed, multidisciplinary teams • multimodal collaboration systems • distributed, large scale data sources • leading edge computing systems • distributed experimental facilities • Socialization and community • multidisciplinary groups • geographic distribution • new enabling technologies • creation of 21st century IT infrastructure • sustainable, multidisciplinary communities • “Come as you are” response [Figure: Computation, Experiment, Theory]
Exemplar 21st Century Challenges • Population growth in sensitive areas • severe weather sensitivity • national impact • geobiology and environment • economics and finance • sociology and policy • Economics and health care • longitudinal public health data • environmental interactions • genetic susceptibility • heart disease, cancer, Alzheimer's • privacy and insurance • public policy and coordination
Mean Onset of Alzheimer’s Disease • apolipoprotein (apo) • apoE2, apoE3 and apoE4 alleles • on chromosome 19 • apoE4 allele • 40% to 60% of Alzheimer's patients • not the only cause for Alzheimer’s • apo gene inheritance • ~25% inherit 1 copy of apoE4 allele • Alzheimer's risk increases 4X • 2% inherit 2 copies of apoE4 allele • Alzheimer's risk increases 10X
[Figure: proportion of each genotype unaffected (0 to 1.0) versus age at onset (60 to 85), one curve per apoE genotype: 2/3, 2/4, 3/3, 3/4, 4/4] Source: Alan Roses, GSK
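The two inheritance figures on this slide can be checked against each other with a short calculation. The sketch below assumes Hardy-Weinberg proportions, which the slide does not state, and uses only the slide's 2% two-copy figure as input; the variable names are illustrative.

```python
import math

# Assumption (not stated on the slide): genotype frequencies follow
# Hardy-Weinberg proportions for a single allele of frequency p.
two_copy_fraction = 0.02  # slide: 2% inherit two copies of apoE4

# Under Hardy-Weinberg, P(two copies) = p**2, so p is the square root.
p = math.sqrt(two_copy_fraction)

# P(exactly one copy) = 2 * p * (1 - p)
one_copy_fraction = 2 * p * (1 - p)

print(f"implied apoE4 allele frequency: {p:.3f}")
print(f"implied one-copy fraction:      {one_copy_fraction:.3f}")
```

Under that assumption the implied one-copy fraction comes out near 24%, which is consistent with the slide's "~25%".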
Big Questions [Figure: scales of biological inquiry, from DNA sequence (promoter, message), protein sequence and regulation, protein structure and protein/enzyme function, through multi-protein machines, metabolic pathways and regulatory networks, bacteria and cells, to organs, organisms and ecologies; associated methods: sequence annotation, homology based protein structure prediction, molecular simulations, data integration, network analysis, pathway simulations]
Genetics and Disease Susceptibility Phenotype 1 Phenotype 2 Phenotype 3 Phenotype 4 Ethnicity Environment Age Gender Identify Genes Pharmacokinetics Metabolism Endocrine Biomarker Signatures Physiology Proteome Transcriptome Immune Morphometrics Predictive Disease Susceptibility Source: Terry Magnuson, UNC
PITAC Report Contents • Computational Science: Ensuring America’s Competitiveness • A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness • Medieval or Modern? Research and Education Structures for the 21st Century • Multi-decade Roadmap for Computational Science • Sustained Infrastructure for Discovery and Competitiveness • Research and Development Challenges • Two key appendices • Examples of Computational Science at Work • Computational Science Warnings – A Message Rarely Heeded • Available at www.nitrd.gov
Life Science Lessons from Astronomy • Historically, discoveries accrued to those • with access to unique data • who built next generation telescopes • Two things changed • growing costs and complexity of telescopes • emergence of whole sky surveys • The result – virtual astronomy • discovering significant patterns • analysis of rich image/catalog databases • understanding complex astrophysical systems • integrated data/large numerical simulations
{Inter}national Virtual Observatory • Cluster Galaxy Morphology Analysis Portal • 1. User selects a cluster (web browser on user’s machine) • 2. Look up cluster in internally stored catalog of clusters • 3. X-ray and optical images retrieved via SIA interface (Chandra SIA, Skyview SIA, DSS SIA, CNOC SIA) • 4. User launches distributed analysis • 5. Initial galaxy catalog generated via Cone Search (NED Cone Search, CADC CNOC Cone Search) • 6. Image cutout pointers merged into catalog (DSS SIA) • 7. Morphological parameters calculated on grid for each galaxy (Morphology Calculation Service) • 8. User downloads final table and images for analysis & visualization Source: Ray Plante, NCSA
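The cone search step in this workflow selects catalog objects within a given angular radius of a sky position. A minimal in-memory sketch of that selection follows; the catalog rows, object names, and function names are hypothetical, and a real Virtual Observatory cone search is a remote service rather than a local filter.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows that lie within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Hypothetical toy catalog; positions are in degrees.
catalog = [
    {"name": "galaxy-a", "ra": 150.00, "dec": 2.20},
    {"name": "galaxy-b", "ra": 150.05, "dec": 2.25},
    {"name": "galaxy-c", "ra": 151.50, "dec": 3.00},
]

hits = cone_search(catalog, ra=150.0, dec=2.2, radius_deg=0.1)
print([row["name"] for row in hits])
```

In the portal described on the slide, the resulting galaxy list would then be merged with image cutout pointers before the morphology calculation runs.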
The Bioinformatics Challenge • Challenge • the rise of quantitative biology • burgeoning bioinformatics data • complex analysis and modeling problems • education and training in new technologies • Reality • diverse tools with idiosyncratic interfaces • steep learning curves • software development by diverse groups • distributed databases with diverse metadata • Need • integrated, easy-to-use toolset with standard interfaces • extensible mechanisms that hide idiosyncrasies • tool and bioinformatics training • The solution • bioinformatics infrastructure and coupled training
Need: Simple, Easy-To-Use Tools “Genome. Bought the book. Hard to read.” Eric Lander
Web and Social Processes • Google • it’s a search engine, it’s a verb, … • Blogs • published self-expression • Instant Messenger • social networks • Wireless messaging • semi-synchronous • Internet commerce • the dot.com boom/bust • eBay, Amazon • Spam, phishing, … • anti-social behavior
Benefits of Standards • Interoperability • Separation of concerns • Reuse • Independence • Dependability • Sharing • Commonality • Shared knowledge base • knowledge reuse • simplification (one hopes)
What’s A Grid/Web Service? • Web: uniform access to documents (http://) • it’s been 12 years! • Grid/Web Services: flexible, high-performance access to resources and services for distributed communities • computers • software catalogs • sensors and instruments • colleagues • data archives
Grid History: I-Way at SC’95 • A prototype national infrastructure • 17 sites, connected by • vBNS and six other ATM networks • 60 applications • Features • I-POPs for site access • Kerberos authentication • manual scheduling • distributed communication libraries • Experiences • led to Globus Grid toolkit • Concurrent industry needs • led to web services for B2B interoperation
Web Services: “Commercial Grids” • From browser-centric to service-centric • from human-computer to computer-computer • structured negotiation and response • Workflow creation and management • end-to-end service negotiation • inter-organizational interaction • Prerequisites • metadata standard for service descriptions • standard communication mechanisms • resource discovery and registration
eBay Web Services Architecture • Over 40% of eBay's listings are now via API calls Source: IBM
Web Services: A Definition • “A web service is … designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact … [using] its description using SOAP-messages, … using HTTP with an XML serialization ....” W3C Working Draft, August 2003 • Roles • Service Provider publishes to the Service Broker • Service Consumer locates the service through the Service Broker • Service Consumer invokes the Service Provider • SOAP (Simple Object Access Protocol) • WSDL (Web Services Description Language) • UDDI (Universal Description, Discovery and Integration)
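The publish, locate, and invoke roles named on this slide can be illustrated with a toy, in-process sketch. Everything here is hypothetical: `ServiceBroker` stands in for a UDDI-style registry, the description string stands in for a WSDL document, and a plain Python callable stands in for a SOAP endpoint. No real SOAP, WSDL, or UDDI library is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ServiceEntry:
    description: str                 # stands in for a WSDL interface description
    endpoint: Callable[..., object]  # stands in for a SOAP endpoint

class ServiceBroker:
    """Toy in-process registry, standing in for a UDDI-style broker."""

    def __init__(self) -> None:
        self._registry: Dict[str, ServiceEntry] = {}

    def publish(self, name: str, description: str,
                endpoint: Callable[..., object]) -> None:
        """Provider side: register a service under a name."""
        self._registry[name] = ServiceEntry(description, endpoint)

    def locate(self, name: str) -> ServiceEntry:
        """Consumer side: look up a service by name."""
        return self._registry[name]

# Service provider: a small bioinformatics-flavored operation.
def reverse_complement(sequence: str) -> str:
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))

broker = ServiceBroker()

# 1. Publish: the provider registers its description and endpoint.
broker.publish("reverse_complement",
               "reverse_complement(sequence: str) -> str",
               reverse_complement)

# 2. Locate: the consumer finds the service through the broker.
entry = broker.locate("reverse_complement")

# 3. Invoke: the consumer calls the provider directly.
result = entry.endpoint("ATGC")
print(entry.description, "->", result)
```

The point of the pattern is that the consumer needs only the broker and the published description, not prior knowledge of the provider's implementation.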
Technology Push Source: Gartner Group
European myGrid Architecture Source: www.mygrid.org
The Bioinformatics Challenges • Complex, multilevel models • integration and in silico designs • Information visualization • complexity and scale • Data models and ontologies • community definition • Data federation, storage and management • shared access and support • User access portals • web-based tool and service interfaces • Packaging, distribution and deployment • community building
Multilevel Cellular Models • Signaling networks • environmental triggers and behavior • e.g., cell lifecycle • different pathways in each tissue type • Metabolic networks • measurable products in pathway • many systems are steady state • negative feedback leads to stabilization • Protein interaction networks • localization of proteins that interact for function • protein-protein interactions for specific actions • Gene regulatory networks • many things affect gene product concentration • nucleic-nucleic, protein-nucleic interactions • Computing, physics, engineering and biology • control theory, mathematical models, phase spaces • from biological cartoons to predictive models • e.g., microRNAs and gene expression controls
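The claim that negative feedback leads to stabilization can be illustrated with a one-variable toy model. The sketch below is an assumption-laden example, not a model from the talk: a product inhibits its own synthesis, the rate law and all parameter values are hypothetical, and integration uses a simple forward Euler step.

```python
def simulate(x0, production_max=10.0, k_half=1.0, degradation=1.0,
             dt=0.01, steps=2000):
    """Integrate dx/dt = production_max / (1 + x / k_half) - degradation * x."""
    x = x0
    for _ in range(steps):
        # Negative feedback: more product means less new production.
        production = production_max / (1.0 + x / k_half)
        x += dt * (production - degradation * x)
    return x

# Start far below and far above the steady state.
from_low = simulate(x0=0.0)
from_high = simulate(x0=10.0)
print(from_low, from_high)
```

Both starting points settle at the same concentration, which is the stabilizing behavior the slide attributes to negative feedback in metabolic networks.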
Biological Models • Simulation and prediction • structures and dynamics • Reasoning and discovery • reverse engineering
[Figure: temporal scale (seconds), 10^-12 to 10^6: bond motion, catalysis, growth & division, diffusion, transcription, translation; spatial scale (nm^3), 10^0 to 10^12: metabolites, proteins, ribosomes, prokaryotes, eukaryotes]
Biophysical and Environmental Modeling Airway/flow Mucus Disease, Environment and Medicine Cilia Cell biochemistry and structure Proteomics Genomics Source: Ric Boucher, UNC
Disease Gene sequence Phenotype Clinical trial Genome sequence Gene expression Disease Gene expression Drug Protein Disease Protein Structure Disease homology Protein Sequence P-P interactions Data Heterogeneity and Complexity Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), … Proteome Source: Carole Goble (Manchester)
Sensor Data Overload • High resolution brain imaging • 4.5 petabytes (PB) per brain Sources: Chris Johnson, Utah; Art Toga, UCLA; Robert Morris, IBM
RENCI: What Is It? • Statewide objectives • create broad benefit in a competitive world • engage industry, academia, government and citizens • Four target areas • public benefit • supporting urban planning, disaster response, … • economic development • helping companies and people with innovative ideas • research engagement across disciplines • catalyzing new projects and increasing success • building multidisciplinary partnerships • education and outreach • providing hands on experiences and broadening participation • Mechanisms and approaches • partnerships and collaborations • infrastructure as needed to accomplish goals
Carolina Center for Exploratory Genetic Analysis (CCEGA) Interoperable Data Management Faculty, Staff & Students Driving Problems Promoting Mutual Awareness Experimental Genetics Portal Analysis Techniques Statistical & Computational Techniques Extant Data Models Virtuous Cycle Interdisciplinary Research & Education
CCEGA Participants • Coordination team • Dan Reed, RENCI • Terry Magnuson, CCGS • Alan Blatecky, RENCI • Kirk Wilhelmsen, CCGS • Eleven departments/institutes • Biostatistics • Cancer Center Genetics • Computer Science • Epidemiology • Genetics • Health Science Library • Information and Library Science • Pharmacy • RENCI • Statistics • Campus wide support from many sources • Project participants • Brad Hemminger, Information & Library Science • James Evans, Genetics • Kevin Gamiel, RENCI • Xiaojun Guan, RENCI • Barrie Hays, Health Science Library • Clark Jefferies, RENCI • Ethan Lange, Genetics • Andrew Nobel, Statistics • Karen Mohlke, Genetics • Kari North, Epidemiology • Susan Paulsen, Computer Science • Fernando Manuel Pardo, Genetics • Charles Perou, Cancer Center • Lavanya Ramakrishnan, RENCI • Jan Prins, Computer Science • Patrick Sullivan, Genetics • Lisa Susswein, Cancer Center • David Threadgill, Genetics • Alexander Tropsha, Pharmacy • K.T.L. Vaughan, Health Science Library • Fred Wright, Biostatistics • Wei Wang, Computer Science • Fei Zou, Biostatistics
Independent data management data security version control redundancy controlled access Data: From Lab and Clinic to Analysis ELSI Clinical ELSI Analysis Analysis Laboratory Integration & Informatics LAB Clinic Analysis • NIH CCEGA • Carolina Center for Exploratory Genetic Analysis Source: Brad Hemminger, UNC
GenBank Data Management and Information Viz Published Domain Literature Taxonomy Annotation Ontology Annotation ….. DB Schema Ontology Annotation Annotated Domain Literature Information Mining Module Information Visualization Module
From SNPs to HapMap • Single Nucleotide Polymorphisms (SNPs) • one in ~1200 bases differ across individuals • SNPs act as markers to locate genes • Common groups of SNPs are shared • i.e., form a haplotype • HapMap data sources • 90 Yoruba individuals (30 trios) from Nigeria (YRI) • 90 individuals (30 trios) of European descent from Utah (CEU) • 45 Han Chinese individuals from Beijing (CHB) • 45 Japanese individuals from Tokyo (JPT) • ~3,500,000 SNPs typed • basis for association studies for disease identification
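The step from SNPs to haplotypes can be shown on toy data. In the sketch below, the aligned sequences are invented for illustration and are not HapMap data; it finds the positions that vary across chromosomes (the SNPs) and then counts the distinct combinations of alleles at those positions (the haplotypes).

```python
from collections import Counter

# Hypothetical aligned chromosome segments, one string per chromosome.
chromosomes = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACCTACGAAC",
    "ACCTACGAAC",
    "ACGTACGTAC",
    "ACCTACGAAC",
]

# A position is a SNP here if more than one base is observed in its column.
snp_positions = [index for index, column in enumerate(zip(*chromosomes))
                 if len(set(column)) > 1]

# A haplotype is the combination of alleles a chromosome carries at the SNPs.
haplotypes = Counter("".join(chromosome[index] for index in snp_positions)
                     for chromosome in chromosomes)

print("SNP positions:", snp_positions)
print("haplotype counts:", dict(haplotypes))
```

Because the two SNPs in this toy example always travel together, only two haplotypes appear, so typing one SNP is enough to predict the other. That shared structure is what makes SNPs useful as markers in association studies.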
CCEGA HapMap Simulator • Synthetic data • disease models • model testing • mining bakeoffs
Carolina Bioportal • Three overlapping target groups • undergraduate education • graduate education and research • academic/industrial research • Features • access to common bioinformatics tools • extensible toolkit and infrastructure • OGCE and National Middleware Initiative (NMI) • leverages emerging international standards • remotely accessible or locally deployable • packaged and distributed with documentation • National reach and community • TeraGrid deployment • science gateway • Education and training • hands-on workshops • clusters, Grids, portals and bioinformatics
Application Interface Workflow service App Instance App Instance App Instance Open Grid Service Architecture Layer Data Management Service Registries and Name binding Security Policy Logging Accounting Service Administration & Monitoring Reservations And Scheduling Grid Orchestration Event/Message Service Resource Layer (from PCs to Supercomputers) Distributed Grid and Web Services Launch, configure and control Grid Portals Open Grid Service Infrastructure (web service component model) Online instruments Source: Dennis Gannon, Indiana
Bioportal Architecture Bioportal Interface Generator HTML Files PISE Application XML Description Application Processing • www.ncbioportal.org Velocity Files User Profile Job Submission Remote File Access Job Records Authentication, Grid Credential Application Databases Command Files Job History Database Application Processing OGCE User Databases MyProxy GridFTP Gatekeeper Local cluster • OGCE toolkit • used by cyberinfrastructure projects • LEAD, NEES, PACI, DOE, TeraGrid …
Putting the Technologies Together NC Bioportal OGCE Toolkit (Grid middleware) PISE (XML Wrapper) Tomcat (Apache servlet container) Chef (collaboration/standard portlets) Jakarta Jetspeed (enterprise portal) Bio Applications Turbine (web app framework) Velocity (template engine) Grid Portlets, CoG VMC Databases
Community Software Toolkit: Lessons • NSF PACI Alliance “In a Box” toolkits • cluster software (aka OSCAR) • Grid infrastructure (aka NMI) • Access Grid for distributed collaboration • tiled display walls for visualization • Distribution materials • software and training materials • CDs and web • Community workshops and training • Linux Clusters Institute • MSI HPC workshops • hands on training • Lowering the entry barrier • usage and deployment • Bioportal distribution • workshops, tutorials • training materials • road shows Bioportal Distribution
NC Bioportal: What’s Next • Engagement • workshops, experiences and deployments • Infrastructure • dynamic job scheduling across multiple sites • migration to OGCE 2.0 • fully automated database updates • workflow construction and processing • Portal tool suite • expanded applications and databases • phylogeny, morphology, microarray analysis, … • Training materials • additional modules based on user feedback • workshop materials packaged for self-study • Leverage national presence • TeraGrid/NCSA bioinformatics portal
The Vision of Grid/Web Services “… Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do.” • Book of Genesis [Figure: Pieter Bruegel, The Tower of Babel (1563)] We're Not There Yet ...
Interdisciplinary Collaborations • Appropriate reward structures • well-matched time constants • Intellectual equality • balanced recognition of contributions • Research/infrastructure distinctions • timelines and people needs differ • Confidentiality and openness • academic/industry collaboration perspectives • Intellectual property • background IP and differential disciplinary models
Some Thoughts on the Future • Grids/web services are not a panacea • we have seen this movie before • standards debates can be endless • make new mistakes, not the same old ones • code is shifted from modules to interfaces • Danger of “Death by CS Abstraction” • “all problems can be solved by another level of indirection” • Appropriate decomposition is a challenge • performance, usability, flexibility • Generality and extensibility really matter • incremental aggregation and interoperability • data management and federation • Better questions, not just private capabilities • limited by creativity not resources
The Cambrian Explosion • Most phyla appear • sponges, archaeocyathids, brachiopods • trilobites, primitive mollusks, echinoderms • Indeed, most appeared quickly! • Tommotian and Atdabanian • as little as five million years • Lessons for computing • it doesn’t take long when conditions are right • raw materials and environment • leave fossil records if you want to be remembered!