Bioinformatics Cyber-infrastructure for Genomics and Proteomics in Systems Biology
By Xianfeng (Jeff) Chen, Computational and Systems Biologist
May 7, 2009
Agenda Today
(1) Cyber-infrastructure and systems biology.
(2) High performance computing and software for peptide/protein identification and quantification, data mining/target discovery, on mass spectrometry generated proteomics data.
(3) Relational database management system, genome annotation methodology, systems biology data integration, biology knowledge generation and augmentation.
Section One: Cyber-infrastructure and Systems Biology
Reductionist approach: one gene, one protein
Systems approach: multiple genes, network analysis
Cutting edge science and technology
Cyber-infrastructure for Systems Biology
• http://www.communitytechnology.org/nsf_ci_report/
• “…. build new types of scientific and engineering knowledge environments and organizations to pursue research in new ways and with increased efficacy.”
• “…..new NSF funding of $1 billion per year is needed to achieve critical mass …….”
• Awards shown: 2008, $50 million; 2004, $85 million; 2004, $100 million
Supporting Cyber-infrastructure and Systems Biology Workflow
Figure labels: Historic strong area; Supporting
Cyber-knowledge System to Enable Genomics-based Predictive Medicine (DOE - Genomics: GTL Roadmap, p.52)
Core Laboratory Facility: Data Generation
Core Computational Facility: Data Processing, Storage, and Dissemination
Cyber-infrastructure, Data Management, Data Analysis Pipeline, and Data Display
System Integration at Systems Biology Center
(1) LIMS for raw data & protocol
(2) Preprocessed data management
(3) High throughput computing
(4) Data validation and integration
(5) Knowledge representation
Data Mining and Knowledge Discovery
Cyber-infrastructure Component (1): High Performance Computing --- Migration of Bio-Computing Capability
Stages: Start point; Step 1; Step 2
Platforms: PC single-CPU computing; Unix multiple-CPU computing; cluster computing
Adoption labels: most labs; 5-10 biological labs in US; 2-4 biological labs
For analysis of large data sets
Cyber-infrastructure Component (2) : Integrated Knowledgebase System --- Case Study of National Biodefense Proteomics Data Center
Section Two: High Performance Computing and Proteomics
System Integration Case 1: UVa Proteomics Data Center
High Performance and Throughput Computing
Data Management
Computational Proteomics Software and Algorithms
Protein Database Search Engines
• Mascot (Matrix Science)
• Sequest / Bioworks (Scripps/Thermo)
• X! Tandem (the GPM)
• Spectrum Mill (Agilent Technologies)
• OMSSA (NCBI)
• PEAKS (Bioinformatics Solutions Inc.)
• Phenyx (GeneBio)
Statistical Validation and Quantitation
• PeptideProphet (Institute for Systems Biology)
• ProteinProphet (Institute for Systems Biology)
• ASAPRatio, XPRESS, Libra (Institute for Systems Biology)
• Scaffold Batch System (Proteome Software, Inc.)
• SIEVE (Thermo)
• Census (Scripps Research Institute)
Open Data Standards
• FuGE and XAR (FHCRC, ICBC, ITMAT, & Manchester)
• MIAPE (HUPO PSI and Collaborators)
• mzXML, pepXML, protXML (Institute for Systems Biology)
• MS1, MS2, SQT (Scripps Research Institute)
Many more ……..…
System Integration Case 2: National Biodefense Proteomics Data Center
http://www.proteomicsresource.org
Awarded $14 million
Proteomics Research Centers (PRC) and Their Major Data Types
(1) University of Michigan: microarray and mass spectrometry
(2) Caprion Pharmaceuticals: mass spectrometry
(3) Harvard Proteomics Institute: genomics and protein expression array
(4) Albert Einstein College of Medicine: mass spectrometry
(5) PNNL: mass spectrometry
(6) Scripps: NMR structural data, X-ray crystal diffraction data, and mass spectrometry
(7) Myriad Genetics: yeast two-hybrid system
Proteomics Data Flow
Data sources: PRCs
Data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray, cryoelectron microscopy, X-ray diffraction, etc.
Diagram labels: data modeling / decomposition; PRCs converting to standard format, with QA & QC; VBI quality assurance & quality control; standard format for each data type; public relational database
MIAME and MIAPE-like standards/SOP for data submission
Databases in Proteomics Data Center Search By Experiment/Sample
Strategies for Annotating Raw Data into Meaningful Knowledge
• Annotation improvement and interaction network analysis
(1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
(2) Comparative genomics with reference genomes: E. coli, yeast, Arabidopsis, and other model organisms.
• Identifying anchor points for data integration
(1) Known metabolic pathway; (2) Known signal transduction pathway; (3) Known gene regulation machinery; (4) Known protein-protein interaction map.
BMC Bioinformatics 2006, 7 (Suppl 4):S18
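Of the non-homology-based methods listed above, phylogenetic profiling is perhaps the easiest to illustrate: genes that are present and absent together across a set of reference genomes are treated as candidates for a functional link. The following is a minimal sketch only. The gene names, the presence/absence values, the Jaccard similarity measure, and the 0.6 threshold are all assumptions made for illustration, not taken from the slides.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) calls for each gene across five
# reference genomes. Names and values are invented for illustration.
PROFILES = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}


def jaccard(p, q):
    """Jaccard similarity of two presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold=0.6):
    """Return gene pairs whose profiles are at least `threshold` similar."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= threshold:
            pairs.append((g1, g2, round(score, 2)))
    return pairs


if __name__ == "__main__":
    for g1, g2, score in linked_pairs(PROFILES):
        print(g1, g2, score)
```

In practice the profiles would come from homology searches against many genomes, and a statistical score would usually replace the fixed threshold.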
Qualitative Data Integration and Knowledge Augmentation Based on Network Biology
Quantitative Proteome Profiling --- The field is 2-3 years old
Thermo SIEVE: scatter plot of 14 UVa raw files for validation of data quality and absolute quantification.
Scaffold: proteome spectral counts for semi-quantification.
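Spectral counts become comparable between proteins only after some normalization, because longer proteins tend to yield more spectra. One widely used option is the normalized spectral abundance factor (NSAF): each protein's count is divided by its length, then divided by the sum of those ratios over the sample. The sketch below uses NSAF as an illustrative stand-in; the slides do not say which normalization Scaffold applies, and the protein names, counts, and lengths are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for each protein.

    spectral_counts: protein -> number of matched spectra
    lengths:         protein -> sequence length in residues
    """
    # Spectral abundance factor: counts per residue.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectra were counted in this sample")
    # Normalize so the values sum to 1 within the sample.
    return {p: value / total for p, value in saf.items()}


# Hypothetical sample: names, counts, and lengths are invented.
COUNTS = {"P1": 10, "P2": 20, "P3": 30}
LENGTHS = {"P1": 100, "P2": 400, "P3": 300}

if __name__ == "__main__":
    for protein, value in sorted(nsaf(COUNTS, LENGTHS).items()):
        print(protein, round(value, 3))
```

Note that P2 has twice the raw count of P1 but ends up with half the NSAF, because it is four times as long.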
Search Engine Comparison at UVa Proteomics Data Center (1)
Low annotation rates; few common annotations
Peptide/Protein Identifications with Various Protein Database Search Engines (2)
X!Tandem missed; OMSSA missed; Sequest over-predicted
UVaPDC, MS/MS Search Engine Comparison (3)
Common annotations; statistics on confidence values; spectral counts
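The comparison summarized on these slides comes down to set operations over the identifications each engine reports. The sketch below shows one way to compute the common annotations, the identifications unique to one engine, and those missed by one engine but found by all others. The peptide strings are hypothetical placeholders, not UVa data, and the function names are invented for this example.

```python
# Hypothetical peptide identifications per search engine (invented data).
IDENTIFICATIONS = {
    "Sequest": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"},
    "X!Tandem": {"PEPTIDEA", "PEPTIDEB"},
    "OMSSA": {"PEPTIDEA", "PEPTIDEC"},
}


def common_identifications(ids):
    """Identifications reported by every engine."""
    return set.intersection(*ids.values())


def unique_to(engine, ids):
    """Identifications reported by `engine` and by no other engine."""
    others = set().union(*(s for name, s in ids.items() if name != engine))
    return ids[engine] - others


def missed_by(engine, ids):
    """Identifications reported by all other engines but not by `engine`."""
    others = [s for name, s in ids.items() if name != engine]
    return set.intersection(*others) - ids[engine]


if __name__ == "__main__":
    print("common:", sorted(common_identifications(IDENTIFICATIONS)))
    for name in IDENTIFICATIONS:
        print(name, "unique:", sorted(unique_to(name, IDENTIFICATIONS)),
              "missed:", sorted(missed_by(name, IDENTIFICATIONS)))
```

An identification reported by only one engine is not necessarily an over-prediction; deciding that would need the confidence statistics mentioned on the slide.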
Statistics and Summarization Capability of Scaffold --- The best feature of the software
Data Mining on Data Processed via Computational Approach Knowledge-based Discovery
Inference on Gene Network in Systems Biology
(1) Y2H, (2) MS pull down assay, (3) Co-expression assay.
Figure labels: Identified; Knowledge Inference; Rate-limiting step; Target/lead protein
Where are the significant regulatory steps impacting pathway expression?
Urinary Biomarker Identification --- EGFR Pathway Related Bladder Cancer --- Small scale analysis
Samples: urine from a healthy individual and from a patient with bladder cancer
Workflow labels: urine exosomes, urine microparticles, ectosomes; LC-MS/MS; SEQUEST; spectral count analysis; Western blotting (EPS8L2)
Pathway diagram labels: Mucin-4*, EGFR, NRas*, EPS15, EPS8L1* or EPS8L2*, EDH1, Raf, MAPK, Gα*, Gβ, Gγ, GTP, GDP, adenylate cyclase, ATP, cAMP, P, cell proliferation, MP formation
* Differentially expressed
Pattern Matching on Gene Signatures at Various Biological States --- Large-scale analysis
*** Query signatures are compared to reference gene/protein expression signatures for known perturbations or disease phenotypes (many-to-many association analysis).
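A many-to-many signature comparison can be sketched by scoring every query signature against every reference signature. The example below is only one possible scoring scheme, chosen for brevity: each signature maps a gene to +1 (up) or -1 (down), and the score is the fraction of shared genes that agree in direction minus the fraction that disagree. The signature names, gene names, and the scoring rule itself are assumptions, not the method used in the talk.

```python
def concordance(query, reference):
    """Score in [-1, 1]: +1 if all shared genes agree in direction,
    -1 if all disagree, 0.0 if no genes are shared."""
    shared = query.keys() & reference.keys()
    if not shared:
        return 0.0
    agree = sum(1 for gene in shared if query[gene] == reference[gene])
    return (2 * agree - len(shared)) / len(shared)


def match_all(queries, references):
    """Rank every reference signature for every query, best first."""
    results = {}
    for q_name, q_sig in queries.items():
        scored = [(concordance(q_sig, r_sig), r_name)
                  for r_name, r_sig in references.items()]
        results[q_name] = sorted(scored, reverse=True)
    return results


# Hypothetical signatures: +1 = up-regulated, -1 = down-regulated.
QUERIES = {
    "sample1": {"g1": 1, "g2": -1, "g3": 1},
    "sample2": {"g1": -1, "g2": 1, "g4": 1},
}
REFERENCES = {
    "perturbation_X": {"g1": 1, "g2": -1, "g3": 1, "g4": -1},
    "disease_Y": {"g1": -1, "g2": 1, "g3": -1},
}

if __name__ == "__main__":
    for q_name, ranking in match_all(QUERIES, REFERENCES).items():
        print(q_name, ranking)
```

Rank-based enrichment statistics are a common alternative when full expression profiles, rather than up/down calls, are available.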
Section Three: Knowledge Base Establishment
Database Case 1: Soybean Upstream Regulatory Elements for Ongoing Regulatory Motif Annotation
Nominated Transcription Factor Involved in Stress Response
Implicated in regulating wounding and jasmonate responses
Soybean promoters: GmERFs, Gmubis, Gmcons, GmWRKYs, and more; 10 promoters per month
Figure labels: Group IX Promoter; red dot = soybean ERF genes
Ongoing Effort on Transcription Factor Binding Motifs ---- Identify genetic circuits of cell wall, starch, and lipid biosynthesis and degradation
Elucidation of Conserved Co-expression Networks via Data Integration with Expression Profiling Data
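One simple way to picture a conserved co-expression network is to build a correlation network per species and keep only the gene pairs that are linked in both. The sketch below assumes Pearson correlation with a fixed cutoff and assumes that orthologous genes already share the same identifier across the two data sets; the expression values, gene names, and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Gene pairs whose absolute correlation reaches the threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def conserved_edges(expr_a, expr_b, threshold=0.9):
    """Edges present in the networks of both species."""
    return (coexpression_edges(expr_a, threshold)
            & coexpression_edges(expr_b, threshold))


# Hypothetical expression profiles, keyed by shared ortholog identifiers.
SPECIES_1 = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 2],
}
SPECIES_2 = {
    "geneA": [5, 6, 7, 9],
    "geneB": [10, 12, 14, 19],
    "geneC": [1, 5, 2, 4],
}

if __name__ == "__main__":
    print(sorted(conserved_edges(SPECIES_1, SPECIES_2)))
```

Real analyses would map orthologs explicitly and would typically test significance rather than rely on a single cutoff.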
Database Case 2: CGKB and TOBFAC Knowledge Bases
• BMC Bioinformatics. 2007, 8:129.
• BMC Bioinformatics. 2008, 9:53.
Genome Annotation Strategy (1): Homology-based Annotation
High level coding region detection!
263,425 total cowpea gene space sequences (GSS).
BMC Genomics. 2008, 9:103.
Genome Annotation Strategy (2) : Metabolic Pathway Integration BMC Bioinformatics. 2007, 8:129.
Genome Annotation Strategy (3) : GO Integration with Distribution of Function Assignments BMC Genomics. 2008, 9:103.
Genome Annotation Strategy (4): Comparative Genomics at Genome-scale --- Example of Medicago vs cowpea
BMC Genomics. 2008, 9:103.
Genome Annotation Strategy (5): Comparison at Gene Family Level --- WRKY and CONSTANS (CO) and CO-like Gene Families of Cowpea Transcription Factors • BMC Genomics. 2008, 9:103. • Plant Physiology. 2008, 147:280-295.
Genome Annotation Strategies: (6) Repeat, (7) Domain, (8) Gene Model
BMC Bioinformatics. 2007, 8:129.
Genome Annotation Strategy (9) : Comparative Genomics on Network for Conserved Protein Complexes Conserved networks Comparative genome analysis
Published Protein-Protein Interactions (PPI) in Organisms
Example of Yeast PPI
Genome Annotation Strategy (10): Functional Validation of Genes of Interest Through Reverse Genetics Program
Figure labels: 2008; My name