Xianfeng Jeff Chen, Ph.D., Research Investigator/Project Manager

Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System

Agenda Today
(1) Introduction
(2) Database Development
(3) Strategy on Knowledgebase Development

• VBI responsibility in Admin Center
• PRCs datatype and organism
• Proteomics data submission and storage work flow
• VBI computing system architecture (CPU and storage)
• VBI database system prototype and functionality
• VBI existing database schema and status
• Example Y2H schema for design logics and case study
• Proposed data integration and knowledgebase construction

Introduction

Proteomics Data Management
Tasks of Proteomics Data Management: RAW DATA (processed data)
• Data Storage & Visualization Tools (VBI)
• Data QA/QC, Interoperability (VBI/GU)
• Analysis, Annotation, & Curation (GU)
• SOP, LIMS, & Adm DB (SSS)

PRCs Major Data Type

Organization                          Major Data Type
University of Michigan                Microarray and mass spectrometry
Caprion                               Mass spectrometry
Harvard Proteomics Institute          Genomics and protein expression array
Albert Einstein College of Medicine   Mass spectrometry
PNNL                                  Mass spectrometry
Scripps                               NMR structural and X-ray crystal diffraction data
Myriad Genetics                       Yeast two-hybrid system

PRCs Organisms
• Einstein: Toxoplasma gondii, Cryptosporidium parvum
• Caprion: Brucella abortus
• Harvard: Bacillus anthracis (protein array), Vibrio cholerae
• Myriad: Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
• PNNL: Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
• Scripps: SARS CoV
• Michigan: Bacillus anthracis (TXP, MS) + host (human)

Proteomics Data Flow (diagram)
• Data types (data modeling with decomposition): 2D Gels, Protein Array, LC, Immunoaffinity Purification, Y2H, MS, MS/MS, NMR, X-Ray, Cryoelectron Microscopy, X-Ray Diffraction, etc.
• PRCs: converting to standard format, with QA & QC
• VBI: quality assurance and quality control of the standard format for each data type
• Relational Database
• Public Data Sources
• MIAME and MIAPE-like standards/SOP for data submission
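The QA & QC step on a standard-format submission can be illustrated with a minimal sketch. The required field names and the `check_submission` helper are assumptions made for illustration; the slides do not specify the actual submission standard or its fields.

```python
from typing import List

# Hypothetical required metadata fields; the slides do not list the real ones.
REQUIRED_FIELDS = ("prc", "organism", "data_type", "experiment_id")

# A subset of the data types named on the data-flow slide.
KNOWN_DATA_TYPES = {"2D gel", "protein array", "LC", "Y2H", "MS", "MS/MS", "NMR", "X-ray"}


def check_submission(record: dict) -> List[str]:
    """Return a list of QA/QC problems; an empty list means the record passes."""
    problems = []
    # Every required field must be present and non-blank.
    for name in REQUIRED_FIELDS:
        if not str(record.get(name, "")).strip():
            problems.append(f"missing field: {name}")
    # The declared data type must be one the repository knows how to store.
    data_type = record.get("data_type")
    if data_type and data_type not in KNOWN_DATA_TYPES:
        problems.append(f"unknown data type: {data_type}")
    return problems
```

A record would be loaded into the relational database only when the returned list is empty; otherwise the problems could be sent back to the submitting PRC.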

Database Development

VBI Computing System (diagram labels)
• LINUX Web Server: Gimli
• PC Users: Jeff, Wei, Chaitanya, Chengdong, Ranjan, Oswald, Bruno
• SUN (Solaris) server Elenwe: Project, Binary, Software, Data Storage (Proteomics, Genomics)
• Application Server
• 7 PRCs
• Networked File Server
• Proteomics: Chendong, Jeff, Wei, Ranjan, Chaitanya
• Relational Database Server: TUOR

System Development in Q3 of 2005
• Instances: Development, Test/Stage, Production
• Components: Web Interface, Database

Proteomics Database Project Websites
• Production: http://proteinbank.vbi.vt.edu/bprc
• Test: http://proteinbankdev.gepasi.org/bprc/
• Development: http://txue.bioinformatics.vt.edu:8080/bprc and http://wsun.vbi.vt.edu:8080/bprc/

Production Website Instance
Dynamically generated webpage
Functionalities:
• Account management
• File and doc management
• News group and news update
• Textual data display
• 2D gel image data display
• Table and record query
• Data uploading and simple submission
• HTTP data downloading
• SFTP file transfer

Database Query

Search By Experiment
• Select experiment
• Retrieve list of bait protein and nucleotide, prey protein and nucleotide
• Links to details of bait and prey

Search By Organism
• Example: Drosophila melanogaster
• Escherichia coli
• Saccharomyces cerevisiae
• Homo sapiens
• Drosophila melanogaster
• Helicobacter pylori
• Caenorhabditis elegans

Search By Data Type
• Proteomics
• Genomics
• Microarray
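The "search by experiment" query can be sketched as a join between an experiment table and an interaction table. This is only an illustration: the production system runs on Oracle, and the table names, column names, and placeholder rows below are hypothetical, with an in-memory SQLite database standing in for the real server.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE experiment (
            id INTEGER PRIMARY KEY, name TEXT, organism TEXT);
        CREATE TABLE interaction (
            experiment_id INTEGER REFERENCES experiment(id),
            bait TEXT, prey TEXT);
        """
    )
    conn.execute(
        "INSERT INTO experiment VALUES (1, 'demo-y2h', 'Saccharomyces cerevisiae')"
    )
    conn.executemany(
        "INSERT INTO interaction VALUES (?, ?, ?)",
        [(1, "baitA", "preyX"), (1, "baitA", "preyY")],
    )
    return conn


def pairs_for_experiment(conn: sqlite3.Connection, experiment_name: str):
    """Return (bait, prey) pairs recorded for the named experiment."""
    cursor = conn.execute(
        """
        SELECT i.bait, i.prey
        FROM interaction i
        JOIN experiment e ON e.id = i.experiment_id
        WHERE e.name = ?
        ORDER BY i.bait, i.prey
        """,
        (experiment_name,),
    )
    return cursor.fetchall()
```

Passing the experiment name as a bound parameter, rather than formatting it into the SQL string, keeps a web-facing query form safe from injection.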

Query for Scripps Sample Data
Search By Project/Experiment
• Scripps MS testing project
• Available peptide hit list
• Retrieve peak information and m/z & intensity list

Query for 2D Gel Data
Search By Experiment/Sample

Proteomics Database Architecture: Three Phases of Database Design
• Application Layer: process-oriented; stored procedures for the analysis pipeline
• Logical Layer: views (materialized views); final views
• Physical Layer: production design normalized with key-value pairs; multiple schemas of disparate data consolidated to one schema to remove redundancy
• Data types: 2D Gel, MS, LC, NMR, Y2H, X-Ray Diffraction, Cryoelectron Microscopy, Protein Array, Immunoaffinity Purification

Proteomics Database Architecture: Three Database Instances
• Phase 1: Individual dataset modeling. Disparate data with multiple schemas. Version 1, 0.5-1 year.
• Phase 2: Consolidation into a few schemas. A normalized data model implemented as key-value pairs, highly decomposed (physical layer). Version 2, 1-1.5 years.
• Phase 3: Analysis pipeline procedures. Logical layer with views for the user. Version 3, 2 years.

Instances: Development, Test/stage, Production
• Partially processed data
• Data enhanced with knowledge
• Interface less changeable
• Curated/annotated data
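A minimal sketch of a key-value physical layer with a view on top as the logical layer. The table, column, and key names are hypothetical, and SQLite is used only as a stand-in; it has no materialized views, so a plain view is shown where the Oracle design would likely use a materialized one.

```python
import sqlite3


def load_and_query():
    """Store attributes as key-value rows, then read them back through a view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Physical layer: one highly decomposed key-value table.
        CREATE TABLE sample_attribute (
            sample_id  INTEGER,
            attr_key   TEXT,
            attr_value TEXT,
            PRIMARY KEY (sample_id, attr_key));

        -- Logical layer: a view that pivots selected keys into columns.
        CREATE VIEW sample_view AS
        SELECT sample_id,
               MAX(CASE WHEN attr_key = 'organism'  THEN attr_value END) AS organism,
               MAX(CASE WHEN attr_key = 'data_type' THEN attr_value END) AS data_type
        FROM sample_attribute
        GROUP BY sample_id;
        """
    )
    conn.executemany(
        "INSERT INTO sample_attribute VALUES (?, ?, ?)",
        [
            (1, "organism", "Bacillus anthracis"),
            (1, "data_type", "Y2H"),
            (2, "organism", "Brucella abortus"),  # no data_type recorded
        ],
    )
    return conn.execute(
        "SELECT sample_id, organism, data_type FROM sample_view ORDER BY sample_id"
    ).fetchall()
```

New attributes become new rows rather than new columns, so the physical schema stays fixed while the views presented to users can change.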

Status of VBI Database Development

Schema            Development (Maturity)   Test/stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +

Default Tablespace: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.

Generic Experiment Data Components: Example of Database Design Logics
• Who (People)
• Where (Organization)
• Project (Goal)
• Materials and Methods (Metadata)
• Results (Raw Data)
• Conclusion and Hypothesis (Processed and Analyzed Data)

Y2H Data Component Modeling (diagram labels)
• People
• Experiment
• Sample
• Project
• Results
• Conclusion
• Hypothesis
• DNA/Protein Detail

Experiment Component Object Model
Objects: Experiment, Experiment Design, Design Description, Experiment Factor, Factor Value, Ontology Entry

Ontology entries handle two annotation cases:
1) The choices are diverse, and existing ontologies capture the information better.
2) The terms are essentially controlled vocabularies: limited in number of choices, but likely to grow in the future or to vary by technology type.
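The object model can be sketched as a handful of data classes. The attribute names and the relationships between classes are assumptions made for illustration; the slide names the objects but does not spell out their fields or associations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyEntry:
    """An annotation term, from an external ontology or a local vocabulary."""
    category: str
    value: str
    ontology: Optional[str] = None  # None marks a local controlled-vocabulary term


@dataclass
class FactorValue:
    """One level of an experimental factor, annotated by an ontology entry."""
    value: OntologyEntry


@dataclass
class ExperimentFactor:
    """A variable of the experiment, with the values it takes."""
    name: str
    category: OntologyEntry
    values: List[FactorValue] = field(default_factory=list)


@dataclass
class ExperimentDesign:
    """Design description plus the factors that vary in the experiment."""
    description: str
    factors: List[ExperimentFactor] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    design: ExperimentDesign
```

Routing every annotation through `OntologyEntry` means a vocabulary that grows, or that differs by technology type, adds rows of terms rather than new classes or columns.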

Y2H Partial Database Schema

Proteomics DB System Architecture
• Batch processing: data uploading; data validation; data analysis; data processing
• Languages and interfaces: Perl, Java JSP, CGI, Java JDBC, Perl DBI/DBD, ODBC
• Private File Server
• Oracle Relational Database
• Public File Server

System Architecture of Putative VBI Proteomics Knowledgebase: Data, Tool, Project, and Team Interoperability
Layers (each with its own security):
• Web Display and Data Visualization
• Application Layer
• Service-Oriented Middleware with Process Control (temporary data)
• Virtual Database/Warehouse
Data sources: Mass Spectrometry, Array Express, Two Component System, 2D Gel, Structure Data, Genomics Data

Strategy on Data Integration and Construction of Knowledge Warehouse

Biological Information Workflow (diagram)
• Outcomes: Diagnostics, Therapeutics & Vaccines; Target Discovery; Biological Research
• Knowledge Generation: Curation and Annotation of Data
• Knowledge Management: Cleaning, Processing Algorithms
• Data Management: Information Storage, Queries & DB Management

VBI PDC Project Phases

Phase       Period          Focus
Phase I     First 2 years   Knowledge generation
Phase II    3rd-4th years   Knowledge management
Phase III   5th year        Knowledge presentation

Bio-IT Scope / Data Integration
• Raw data management
• Schema development
• Data visualization
• Data standardization
• Integration at interface level
• Integration of data at DB level
• Interoperability of datasets
• Normalization and warehousing
• Predefined query
• Materialized view
• Comparative analysis
• Statistical analysis

Mapping the Proteome
(1) Yeast two-hybrid system
• Measures association between two proteins.
• Allows very high throughput.
(2) Mass spectrometry
• Allows identification of proteins within large complexes (2-100 proteins).
• Lower throughput.

Infer Complex Interaction Topology (diagram labels)
• Proteins
• Y2H Analysis: binary interactions
• MS Analysis: n-ary interactions
• Complex Interaction Model
• PO4
• Knowledgebase
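One simple way to combine the two evidence types is to ask which binary (Y2H) interactions fall inside a complex identified by MS. The sketch below is an illustrative assumption, not the method used in the project; `edges_within_complex` is a hypothetical helper name.

```python
from itertools import combinations


def edges_within_complex(members, binary_interactions):
    """Return the observed binary interactions whose partners both belong to the complex.

    members: iterable of protein identifiers co-purified in one MS complex.
    binary_interactions: iterable of (protein_a, protein_b) pairs from Y2H.
    """
    # Interactions are unordered, so compare them as frozensets.
    observed = {frozenset(pair) for pair in binary_interactions}
    return sorted(
        pair  # already in sorted order because members are sorted first
        for pair in combinations(sorted(members), 2)
        if frozenset(pair) in observed
    )
```

Member pairs with no supporting binary interaction remain candidates for indirect association within the complex, which is where a topology model would have to decide how the proteins are wired together.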

Bacillus anthracis Data Organization

Data                                              Source
(1) Completed genome (Ames, Ames Ancestor, a2012) NCBI, TIGR
(2) Yeast two-hybrid interaction data             Myriad Genetics
(3) Mass spectrometry                             Scripps and Caprion
(4) Microarray expression profiling               Univ. of Michigan
(5) Intraspecies and interspecies clustering      NCBI (COG) and TIGR
(6) Functional category assignment                GU (PIR)

Strategy for Knowledgebase Construction
(1) Annotation improvement
• Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
• Comparative genomics with two reference genomes: E. coli and yeast
(2) Identifying anchor points for data integration
• Known metabolic pathways (E. coli and yeast)
• Known signal transduction pathways
• Known gene regulation machinery
• Known protein-protein interaction map
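Phylogenetic profiling, the first non-homology-based method listed, compares the presence/absence pattern of genes across genomes: genes with similar patterns are candidates for a functional link. The sketch below uses Jaccard similarity as the comparison measure, which is an assumption chosen for simplicity; other measures are also used in practice.

```python
def profile_similarity(profile_a, profile_b) -> float:
    """Jaccard similarity of two presence/absence profiles.

    Each profile is the collection of genome names in which the gene
    has a detectable homolog.
    """
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0  # two empty profiles carry no evidence either way
    return len(a & b) / len(union)
```

A score near 1 means the two genes are gained and lost together across the compared genomes; the threshold for calling a link would have to be tuned on known pathways.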

Data Integration
• Lay down microarray data to add co-expression patterns to the gene network
• Lay down MS multiple-interaction data to expand the network
• Lay down Y2H interaction data and expand the network
• Anchor on the knowledge network of the reference genomes: E. coli and yeast
• Comparative genomics
• Improved annotation
• Genomics data
Putative Knowledgebase: No thing http://www.Bacillus_anthracis.org
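The layering idea can be sketched as a network in which each edge records the data sources that support it. The function name, the layer labels, and the input shape are hypothetical choices for illustration only.

```python
from collections import defaultdict


def build_network(layers):
    """Merge evidence layers into one network.

    layers: iterable of (source_name, edges), where edges is an iterable of
    (protein_a, protein_b) pairs. Returns a dict that maps each unordered
    protein pair to the set of sources supporting it.
    """
    network = defaultdict(set)
    for source, edges in layers:
        for a, b in edges:
            # frozenset makes (a, b) and (b, a) the same edge.
            network[frozenset((a, b))].add(source)
    return dict(network)
```

Layers would be supplied in the order on the slide (reference network first, then Y2H, MS, and microarray co-expression), and edges supported by several sources can then be ranked above edges seen in only one.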

Data Mining and Knowledge Augmentation (diagram labels)
• Microarray
• Literature
• MS analysis
• Y2H analysis

Acknowledgement

Name                  Role                                Organization
Dr. Jeff Chen         Project Manager/Investigator        VBI
Dr. Chendong Zhang    Senior Software Engineer            VBI
Dr. Steve Cammer      Bioinformatics Scientist            VBI
Dr. Oswald Crasta     Scientist and CI-Co-director        VBI
Susan Baker           DBA                                 VBI
Jiang Lu              DBA                                 VBI
Ranjan Jha            Software Engineer                   VBI
Qiang Yu              Software Engineer                   VBI
Jian Li               Software Engineer                   VBI
Wei Sun               Software Engineer                   VBI
Chaitanya Kommidi     Software Engineer                   VBI
Dr. Bruno Sobral      Co-PI                               VBI
Dr. Peter MacGarvey   Senior Bioinformatics Scientist     GU
Dr. Cathy Wu          Co-PI                               GU
Paula Yadvish         Web Coordinator                     SSS
Margaret Moore        PI                                  SSS
