By Xianfeng (Jeff) Chen Computational and Systems Biologist May 7, 2009

Bioinformatics Cyber-infrastructure for Genomics and Proteomics in Systems Biology By Xianfeng (Jeff) Chen Computational and Systems Biologist May 7, 2009

Agenda Today
• (1) Cyber-infrastructure and systems biology.
• (2) High performance computing and software for peptide/protein identification and quantification, data mining/target discovery, on mass spectrometry generated proteomics data.
• (3) Relational database management system, genome annotation methodology, systems biology data integration, biology knowledge generation and augmentation.

Section One: Cyber-infrastructure and Systems Biology
• Reductionist approach: one gene, one protein
• Systems approach: multiple genes, network analysis
• Cutting edge science and technology

Status of Technologies in Systems Biology

Cyber-infrastructure for Systems Biology
• http://www.communitytechnology.org/nsf_ci_report/
• “…. build new types of scientific and engineering knowledge environments and organizations to pursue research in new ways and with increased efficacy.
• …..new NSF funding of $1 billion per year is needed to achieve critical mass …….”
• Awards: 2008, $50 million; 2004, $85 million; 2004, $100 million

Supporting Cyber-infrastructure and Systems Biology Workflow
[Figure: workflow diagram; labels: Historic strong area, Supporting]

Cyber-knowledge System to Enable Genomics-based Predictive Medicine (DOE - Genomics: GTL Roadmap, p.52)

• Core Laboratory Facility: Data Generation
• Core Computational Facility: Data Processing, Storage, and Dissemination
• Cyber-infrastructure, Data Management, Data Analysis Pipeline, and Data Display
System Integration at Systems Biology Center
(1) LIMS for raw data & protocol
(2) Preprocessed data management
(3) High throughput computing
(4) Data validation and integration
(5) Knowledge representation
• Data Mining and Knowledge Discovery

Cyber-infrastructure Component (1): High Performance Computing --- Migration of Bio-Computing Capability
• Start point: PC, single-CPU computing
• Step 1: Unix, multiple-CPU computing
• Step 2: Cluster computing
• Adoption labels: 2-4 biological labs; 5-10 biological labs in US; most labs
• For large sets of data analysis

Cyber-infrastructure Component (2) : Integrated Knowledgebase System --- Case Study of National Biodefense Proteomics Data Center

Section Two: High Performance Computing and Proteomics ---- System Integration Case 1: UVa Proteomics Data Center
• High Performance and Throughput Computing
• Data Management

Computational Proteomics Software and Algorithms

Protein Database Search Engines
• Mascot: Matrix Science
• Sequest / Bioworks: Scripps/Thermo
• X! Tandem: the GPM
• Spectrum Mill: Agilent Technologies
• OMSSA: NCBI
• PEAKS: Bioinformatics Solutions Inc.
• Phenyx: GeneBio

Statistical Validation and Quantitation
• PeptideProphet: Institute for Systems Biology
• ProteinProphet: Institute for Systems Biology
• ASAPRatio, XPRESS, Libra: Institute for Systems Biology
• Scaffold Batch System: Proteome Software, Inc.
• SIEVE: Thermo
• Census: Scripps Research Institute

Open Data Standards
• FuGE and XAR: FHCRC, ICBC, ITMAT, & Manchester
• MIAPE: HUPO PSI and Collaborators
• mzXML, pepXML, protXML: Institute for Systems Biology
• MS1, MS2, SQT: Scripps Research Institute

Many more …

System Integration Case 2: National Biodefense Proteomics Data Center
http://www.proteomicsresource.org
Awarded $14 million

Proteomics Research Centers (PRC) and Their Major Data Types

PRC Organization: Major Data Types
(1) University of Michigan: Microarray and mass spectrometry
(2) Caprion Pharmaceuticals: Mass spectrometry
(3) Harvard Proteomics Institute: Genomics and protein expression array
(4) Albert Einstein College of Medicine: Mass spectrometry
(5) PNNL: Mass spectrometry
(6) Scripps: NMR structural, X-ray crystal diffraction data, and mass spectrometry
(7) Myriad Genetics: Yeast two-hybrid system

Proteomics Data Flow
• Data sources and data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray, cryoelectron microscopy, X-ray diffraction, etc.
• Data Modeling / Decomposition
• PRCs: Converting to Standard Format, with QA & QC
• VBI: Quality Assurance & Quality Control; Standard Format for Each Data Type
• Public Relational Database, with Quality Assurance & Quality Control
• MIAME and MIAPE-like Standards/SOP for Data Submission

Proteomics Database Architecture

Databases in Proteomics Data Center Search By Experiment/Sample

Strategies for Annotating Raw Data into Meaningful Knowledge
• Annotation improvement and interaction network analysis
(1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
(2) Comparative genomics with reference genomes: E. coli, yeast, Arabidopsis, and other model organisms.
• Identifying anchor points for data integration
(1) Known metabolic pathway;
(2) Known signal transduction pathway;
(3) Known gene regulation machinery;
(4) Known protein-protein interaction map.
BMC Bioinformatics 2006, 7 (Suppl 4):S18
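Phylogenetic profiling, one of the non-homology-based methods listed above, links genes whose presence/absence patterns across reference genomes are similar. The following is a minimal sketch of that idea, not the method used in the cited work: the gene names, the five-genome profiles, the Jaccard score, and the 0.6 threshold are all assumptions chosen for illustration.

```python
# Hypothetical presence/absence profiles across five reference genomes
# (1 = homolog present, 0 = absent). Gene names and values are made up.
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}


def jaccard(p, q):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold=0.6):
    """Return (gene1, gene2, score) for pairs scoring at least `threshold`."""
    names = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(names):
        for g2 in names[i + 1:]:
            score = jaccard(profiles[g1], profiles[g2])
            if score >= threshold:
                pairs.append((g1, g2, round(score, 2)))
    return pairs


if __name__ == "__main__":
    for g1, g2, score in linked_pairs(profiles):
        print(g1, g2, score)
```

Genes with identical profiles (geneA and geneB here) become candidates for a shared pathway or complex; a real analysis would use many more genomes and a statistical significance test rather than a fixed cutoff.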

Qualitative Data Integration and Knowledge Augmentation Based on Networks Biology

Quantitative Proteome Profiling --- The field is 2-3 years old
• Thermo SIEVE: scatter plot of 14 UVa raw files for validation of data quality and absolute quantification.
• Scaffold: proteome spectral counts for semi-quantification.

Search Engine Comparison at UVa Proteomics Data Center (1)
• Low annotation rates
• Few common annotations

Peptide/Protein Identifications with Various Protein Database Search Engines (2)
• X!Tandem missed
• OMSSA missed
• Sequest over-predicted

UVaPDC, MS/MS Search Engine Comparison (3)
• Common annotations
• Statistics on confidence values
• Spectral counts
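The comparison above rests on set operations over the identifications each engine reports: which peptides are common, which a given engine missed, and which only one engine reports. Here is a minimal sketch under stated assumptions; the peptide strings and per-engine hit sets are made up, and the function names are hypothetical rather than part of any search engine's API.

```python
from collections import Counter

# Hypothetical peptide identifications per search engine (made-up data).
hits = {
    "Sequest": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"},
    "X!Tandem": {"PEPTIDEA", "PEPTIDEB"},
    "OMSSA": {"PEPTIDEA", "PEPTIDEC"},
}


def consensus(hits, min_engines=2):
    """Peptides reported by at least `min_engines` engines."""
    votes = Counter(p for peptides in hits.values() for p in peptides)
    return {p for p, n in votes.items() if n >= min_engines}


def missed_by(hits, engine, min_engines=2):
    """Consensus peptides that `engine` did not report."""
    return consensus(hits, min_engines) - hits[engine]


def unique_to(hits, engine):
    """Peptides reported only by `engine` (candidate over-predictions)."""
    others = set().union(*(v for k, v in hits.items() if k != engine))
    return hits[engine] - others


if __name__ == "__main__":
    print("common to all:", sorted(consensus(hits, min_engines=len(hits))))
    for name in hits:
        print(name, "missed:", sorted(missed_by(hits, name)),
              "unique:", sorted(unique_to(hits, name)))
```

A peptide unique to one engine is not necessarily wrong; in practice the sets would be filtered by each engine's confidence score before comparison.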

Statistics and Summarization Capability of Scaffold --- The best feature of the software

Data Mining on Data Processed via Computational Approach Knowledge-based Discovery

Inference on Gene Network in Systems Biology
• Identified knowledge: (1) Y2H, (2) MS pull-down assay, (3) Co-expression assay.
• Knowledge inference: rate-limiting step; identified target/lead protein.
• Where are the significant regulatory steps impacting pathway expression?

Urinary Biomarker Identification --- EGFR Pathway Related Bladder Cancer --- Small-scale analysis
• Samples: urine from a healthy individual and from a patient with bladder cancer
• Fractions: urine exosomes, urine microparticles, ectosomes
• Analysis: LC-MS/MS, SEQUEST, spectral count analysis, Western blotting
[Figure: EGFR pathway diagram; labeled nodes: Mucin-4*, EGFR, Adenylate Cyclase, ATP, cAMP, Gα*, Gβ, Gγ, GTP, GDP, NRas*, EPS15, EPS8L1* or EPS8L2*, EDH1, Raf, MAPK, Cell Proliferation, MP Formation]
* Differentially expressed
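Spectral count analysis compares how many MS/MS spectra are assigned to each protein in two samples. The sketch below shows one common way to do this, a log2 ratio of total-normalized counts with a pseudocount; it is an illustration under assumptions, not the procedure behind the slide. The protein names and counts are made up, and the pseudocount of 1.0 is an arbitrary choice.

```python
import math


def log2_fold_changes(case, control, pseudocount=1.0):
    """log2 ratio of pseudocount-adjusted, total-normalized spectral counts.

    `case` and `control` map protein name -> spectral count. Proteins
    missing from one sample are treated as having zero spectra there.
    """
    proteins = sorted(set(case) | set(control))
    case_adj = {p: case.get(p, 0) + pseudocount for p in proteins}
    ctrl_adj = {p: control.get(p, 0) + pseudocount for p in proteins}
    case_total = sum(case_adj.values())
    ctrl_total = sum(ctrl_adj.values())
    return {
        p: math.log2((case_adj[p] / case_total) / (ctrl_adj[p] / ctrl_total))
        for p in proteins
    }


# Hypothetical spectral counts (made-up values, not data from the study).
control_counts = {"protA": 9, "protB": 1, "protC": 4}
patient_counts = {"protA": 9, "protB": 3, "protC": 2}

if __name__ == "__main__":
    changes = log2_fold_changes(patient_counts, control_counts)
    for protein, lfc in changes.items():
        print(f"{protein}: log2 fold change = {lfc:+.2f}")
```

Normalizing by the total count guards against differences in sample loading, and the pseudocount keeps the ratio defined when a protein is absent from one sample. Candidates flagged this way would still need orthogonal confirmation such as the Western blotting named above.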

Pattern Matching on Gene Signatures at Various Biological States --- Large-scale analysis
*** Query signatures are compared to reference gene/protein expression signatures for known perturbations or disease phenotypes (many-to-many association analysis).
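The many-to-many comparison described above can be sketched as scoring every query signature against every reference signature and ranking the references per query. This is a minimal sketch, assuming signatures are gene-to-value maps and using cosine similarity over shared genes as the score; the signature names, genes, values, and the choice of metric are assumptions, not details given on the slide.

```python
import math


def cosine(sig_a, sig_b):
    """Cosine similarity of two signatures over the genes they share."""
    shared = set(sig_a) & set(sig_b)
    dot = sum(sig_a[g] * sig_b[g] for g in shared)
    norm_a = math.sqrt(sum(sig_a[g] ** 2 for g in shared))
    norm_b = math.sqrt(sum(sig_b[g] ** 2 for g in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_signatures(queries, references):
    """Score every query against every reference, best match first."""
    results = {}
    for qname, query in queries.items():
        scores = [(rname, cosine(query, ref)) for rname, ref in references.items()]
        results[qname] = sorted(scores, key=lambda s: s[1], reverse=True)
    return results


# Hypothetical expression signatures (gene -> log ratio); values are made up.
references = {
    "perturbation_X": {"g1": 2.0, "g2": -1.0, "g3": 0.5},
    "disease_Y": {"g1": -2.0, "g2": 1.0, "g3": -0.5},
}
queries = {"sample_1": {"g1": 4.0, "g2": -2.0, "g3": 1.0}}

if __name__ == "__main__":
    for qname, ranked in match_signatures(queries, references).items():
        for rname, score in ranked:
            print(f"{qname} vs {rname}: {score:+.2f}")
```

A strongly negative score is informative too: it marks a reference state whose expression changes run opposite to the query.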

Section Three: Knowledge Base Establishment
Database Case 1: Soybean Upstream Regulatory Elements for Ongoing Regulatory Motif Annotation

Nominated Transcription Factor Involved in Stress Response
• Implicated in regulating wounding and jasmonate responses
• Soybean promoters: GmERFs, Gmubis, Gmcons, GmWRKYs, and more
• 10 promoters per month
[Figure: Group IX promoter; red dot = soybean ERF genes]

Ongoing Effort on Transcription Factor Binding Motifs ---- Identify genetic circuits of cell wall, starch, and lipid biosynthesis and degradation

Elucidation of Conserved Co-expression Networks via Data Integration with Expression Profiling Data
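One way to picture a conserved co-expression network: build a correlation network per species, then keep only the edges present in both. The sketch below assumes genes in the two species are already labeled with shared ortholog-group identifiers and uses a Pearson threshold of 0.9; the identifiers, expression values, and threshold are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def coexpression_edges(expr, threshold=0.9):
    """Gene pairs whose expression profiles correlate at >= threshold."""
    return {
        frozenset((g1, g2))
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= threshold
    }


def conserved_edges(expr_a, expr_b, threshold=0.9):
    """Edges present in both species' co-expression networks."""
    return coexpression_edges(expr_a, threshold) & coexpression_edges(expr_b, threshold)


# Hypothetical expression profiles keyed by ortholog group (made-up values).
species_a = {"og1": [1, 2, 3, 4], "og2": [2, 4, 6, 8], "og3": [4, 3, 2, 1]}
species_b = {"og1": [1, 3, 2, 5], "og2": [2, 6, 4, 10], "og3": [5, 1, 4, 2]}

if __name__ == "__main__":
    for edge in conserved_edges(species_a, species_b):
        print(" - ".join(sorted(edge)))
```

Only the og1/og2 edge survives in this toy example, since it is the only pair correlated in both datasets. Real data would need ortholog mapping, many more samples, and correction for multiple testing.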

Database Case 2: CGKB and TOBFAC Knowledge Bases
• BMC Bioinformatics. 2007, 8:129.
• BMC Bioinformatics. 2008, 9:53.

Genome Annotation Strategy (1): Homology-based Annotation
• High level coding region detection!
• 263,425 total cowpea gene space sequences (GSS).
BMC Genomics. 2008, 9:103.

Genome Annotation Strategy (2) : Metabolic Pathway Integration BMC Bioinformatics. 2007, 8:129.

Genome Annotation Strategy (3) : GO Integration with Distribution of Function Assignments BMC Genomics. 2008, 9:103.

Genome Annotation Strategy (4): Comparative Genomics at Genome-scale ---- Example of Medicago vs. cowpea BMC Genomics. 2008, 9:103.

Genome Annotation Strategy (5): Comparison at Gene Family Level --- WRKY and CONSTANS (CO) and CO-like Gene Families of Cowpea Transcription Factors • BMC Genomics. 2008, 9:103. • Plant Physiology. 2008, 147:280-295.

Genome Annotation Strategies: (6) Repeat, (7) Domain, (8) Gene Model BMC Bioinformatics. 2007, 8:129.

Genome Annotation Strategy (9) : Comparative Genomics on Network for Conserved Protein Complexes Conserved networks Comparative genome analysis

Published Protein-Protein Interactions (PPI) in Organisms: Example of Yeast PPI

Genome Annotation Strategy (10): Functional Validation of Genes of Interest Through Reverse Genetics Program

Acknowledgement
