STRING Prediction of protein networks through integration of diverse large-scale data sets

STRING: Prediction of protein networks through integration of diverse large-scale data sets
Lars Juhl Jensen, EMBL Heidelberg

STRING integrates many types of evidence
• Genomic neighborhood
• Species co-occurrence
• Gene fusions
• Exp. interaction data
• Microarray expression data
• Database imports
• Literature co-mentioning

Integrating physical interaction screens
• Make binary representation of complexes
• Yeast two-hybrid data sets are inherently binary
• Calculate score from number of (co-)occurrences
• Calculate score from non-shared partners
• Calibrate against KEGG maps
• Combine evidence from experiments
• Infer associations in other species
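The first and third steps above can be sketched as follows. This is a minimal illustration, not the scoring scheme STRING actually uses: it assumes that every pair of proteins found in the same purification counts as one binary co-occurrence, and it uses a simple log-ratio of observed to expected co-occurrences as the raw score. The function names and the input layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def complexes_to_pairs(purifications):
    """Expand each purified complex into all unordered protein pairs.

    Assumption: every pair of co-purified proteins counts as one binary
    co-occurrence. Returns (pair_counts, protein_counts).
    """
    pair_counts = Counter()
    protein_counts = Counter()
    for members in purifications:
        unique = sorted(set(members))
        protein_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, protein_counts


def cooccurrence_score(pair, pair_counts, protein_counts, n_purifications):
    """Illustrative raw score: log of observed over expected co-occurrences.

    `pair` must be an alphabetically sorted tuple, matching the keys
    produced by complexes_to_pairs.
    """
    a, b = pair
    observed = pair_counts[pair]
    if observed == 0:
        return float("-inf")
    expected = protein_counts[a] * protein_counts[b] / n_purifications
    return math.log(observed / expected)
```

A raw score of this kind is not comparable across data sets until it has been calibrated against a reference such as the KEGG maps.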

Gene fusion: predicting physical interactions
• Detect multiple proteins matching to one protein
• Exclude overlapping alignments
• Calibrate against KEGG maps
• Infer associations in other species
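The detection and overlap-exclusion steps can be sketched as below. The sketch assumes that alignments are already available as tuples of (query protein, target protein, start, end), with inclusive coordinates on the target; the function name, the tuple layout, and the `max_overlap` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fusion_pairs(hits, max_overlap=0):
    """Find pairs of distinct proteins that align to separate regions
    of the same target protein (a candidate fusion).

    hits: iterable of (query, target, start, end), coordinates on the target.
    max_overlap: largest overlap (in residues) still treated as separate.
    Returns a set of ((query_a, query_b), target) tuples.
    """
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))

    pairs = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Length of the shared region; zero or negative means no overlap.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs.add((tuple(sorted((q1, q2))), target))
    return pairs
```

Excluding overlapping alignments keeps two homologs of the same domain from being mistaken for two fused genes.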

Mining microarray expression databases
• Re-normalize arrays by modern method to remove biases
• Build expression matrix
• Combine similar arrays by PCA
• Construct predictor by Gaussian kernel density estimation
• Calibrate against KEGG maps
• Infer associations in other species
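One way to read the kernel density step is sketched below. It assumes that each gene pair already has a single raw similarity value (for example an expression correlation), that separate Gaussian kernel density estimates are fitted to pairs known to be associated and pairs assumed not to be, and that the two densities are combined with Bayes' rule. The bandwidth, the prior, and all function names are hypothetical choices, not values from the slides.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian kernels
    centred on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kde_predictor(positive, negative, bandwidth=0.1, prior=0.5):
    """Build a predictor giving P(associated | raw value) from two KDEs.

    positive / negative: raw values for reference pairs of each class.
    prior: assumed prior probability that a pair is associated.
    """
    f_pos = gaussian_kde(positive, bandwidth)
    f_neg = gaussian_kde(negative, bandwidth)

    def probability(x):
        p = prior * f_pos(x)
        n = (1.0 - prior) * f_neg(x)
        # Fall back to the prior where both densities underflow to zero.
        return p / (p + n) if p + n > 0 else prior

    return probability
```

The PCA step that combines similar arrays is left out here; it would run before the raw similarity values are computed.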

Gene neighborhood: predicting co-expression
• Identify runs of adjacent genes with the same direction
• Score each gene pair based on intergenic distances
• Calibrate against KEGG maps
• Infer associations in other species
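The first two steps can be sketched as follows. The sketch assumes genes are given as (name, start, end, strand) tuples on one chromosome, that a run is broken by a strand change or by a gap larger than a hypothetical `max_gap`, and that the raw score for a pair is the sum of the intergenic distances separating the two genes (smaller means closer). These choices are illustrative assumptions, not details stated on the slide.

```python
def same_direction_runs(genes, max_gap=300):
    """Group genes into runs of adjacent genes on the same strand.

    genes: iterable of (name, start, end, strand).
    A run ends when the strand changes or the gap exceeds max_gap.
    """
    runs = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if runs:
            previous = runs[-1][-1]
            gap = gene[1] - previous[2]
            if gene[3] == previous[3] and gap <= max_gap:
                runs[-1].append(gene)
                continue
        runs.append([gene])
    return runs


def pair_distances(genes, max_gap=300):
    """Raw score for each gene pair within a run: the summed intergenic
    distance between them (overlapping genes contribute zero)."""
    scores = {}
    for run in same_direction_runs(genes, max_gap):
        for i in range(len(run)):
            total = 0
            for j in range(i + 1, len(run)):
                total += max(0, run[j][1] - run[j - 1][2])
                scores[(run[i][0], run[j][0])] = total
    return scores
```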

Co-mentioning in the scientific literature
• Associate abstracts with species
• Identify gene names in title/abstract
• Count (co-)occurrences of genes
• Test significance of associations
• Calibrate against KEGG maps
• Infer associations in other species
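The counting and significance steps can be sketched as below. The slide does not say which statistical test is used; this sketch assumes a one-sided hypergeometric test, and it assumes gene names have already been identified so that each abstract is simply a list of gene names. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_mentions(abstracts):
    """Count in how many abstracts each gene, and each gene pair, occurs.

    abstracts: iterable of lists of gene names found in one abstract.
    """
    gene_counts = Counter()
    pair_counts = Counter()
    for genes in abstracts:
        unique = sorted(set(genes))
        gene_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return gene_counts, pair_counts


def comention_pvalue(n_total, n_a, n_b, n_ab):
    """Hypergeometric P(X >= n_ab): probability of seeing at least n_ab
    shared abstracts if the n_b abstracts mentioning gene B were drawn at
    random from n_total abstracts, n_a of which mention gene A."""
    upper = min(n_a, n_b)
    total_ways = math.comb(n_total, n_b)
    tail = sum(
        math.comb(n_a, k) * math.comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / total_ways
```

A small p-value indicates that two genes are mentioned together more often than their individual frequencies would suggest.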

Phylogenetic profile: co-mentioning in genomes
• Align all proteins against all
• Calculate best-hit profile
• Join similar species by PCA
• Calculate PC profile distances
• Calibrate against KEGG maps
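The profile-distance step can be sketched as follows. The sketch assumes each protein already has a numeric profile with one value per species (or per principal component), and it uses plain Euclidean distance; the PCA step that joins similar species is omitted. The function names and the `max_distance` cutoff are hypothetical.

```python
import math


def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def close_profile_pairs(profiles, max_distance):
    """Return (distance, protein_a, protein_b) for every pair of proteins
    whose profiles are within max_distance, closest pairs first.

    profiles: dict mapping protein name to its profile vector.
    """
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = profile_distance(profiles[a], profiles[b])
            if d <= max_distance:
                pairs.append((d, a, b))
    return sorted(pairs)
```

Proteins that are present and absent in the same set of genomes end up with a small distance, which is the raw evidence for a functional association.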

Multiple evidence types from several species

Score calibration against a common reference
• Many diverse types of evidence
  • The quality of each is judged by very different raw scores
  • These are all calibrated against the same reference set
• Requirements for a reference
  • Must represent a compromise between all types of evidence
  • Broad species coverage
• Both a strength and a weakness
  • Scores for all evidence types are directly comparable
  • The type of interaction is currently not predicted
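Calibration against a common reference can be sketched as below. The sketch assumes raw scores are ranked and split into equal-sized bins, and that the calibrated score of a bin is the fraction of its pairs found in the reference set (for example, pairs on the same KEGG map). Because calibrated scores from different evidence types are then comparable, they can be combined; the `combine` function shows one common choice, which treats evidence types as independent and is an assumption here rather than something stated on the slide. All names and the binning scheme are hypothetical.

```python
def calibrate(scored_pairs, reference, n_bins=3):
    """Build a calibration table from raw scores.

    scored_pairs: list of (raw_score, pair); higher raw score = stronger.
    reference: set of pairs regarded as true associations.
    Returns a list of (lowest raw score in bin, fraction in reference),
    ordered from the strongest bin to the weakest.
    """
    ranked = sorted(scored_pairs, key=lambda sp: sp[0], reverse=True)
    if not ranked:
        return []
    size = -(-len(ranked) // n_bins)  # ceiling division
    table = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        hits = sum(1 for _, pair in chunk if pair in reference)
        table.append((chunk[-1][0], hits / len(chunk)))
    return table


def calibrated_score(table, raw_score):
    """Look up the calibrated score of the first bin the raw score reaches."""
    for lowest_raw, fraction in table:
        if raw_score >= lowest_raw:
            return fraction
    return 0.0


def combine(scores):
    """Combine calibrated scores, assuming independent evidence types:
    1 minus the product of (1 - score)."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining
```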

Getting more specific – generally speaking

Other possible improvements
• Bidirectionally transcribed gene pairs: a new genomic context method that may work on eukaryotes too [Korbel et al., Nature Biotechnology 2004]
• Information extraction from PubMed using shallow parsing [Saric et al., Proceedings of ACL 2004]
• Add more experiment types, e.g. protein expression levels
• Infer functional relations from feature similarity
• Hook up STRING with a robot

Acknowledgments
• The STRING team: Christian von Mering, Berend Snel, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Mathilde Foglierini, Peer Bork
• ArrayProspector web service: Julien Lagarde, Chris Workman
• NetView visualization tool: Sean Hooper
• Analysis of yeast cell cycle: Ulrik de Lichtenberg, Thomas Skøt, Anders Fausbøll, Søren Brunak
• Web resources: string.embl.de, www.bork.embl.de/ArrayProspector, www.bork.embl.de/synonyms

Thank you!
