Bioinformatics: Applications

ZOO 4903, Fall 2006, MW 10:30-11:45, Sutton Hall, Room 312. Jonathan Wren. Protein-Protein Interaction Networks

Lecture overview • What we’ve talked about so far • Proteins & their domains • Protein 3D structure • Overview • Proteins do not function in a vacuum • Methods of detecting protein-protein interactions (PPI) • Structure and types of networks • Behavior of networks

Cells are crowded places! Hoppert & Mayer, 1999, Prokaryotes. Am. Sci. 87:518

Importance of protein-protein interactions • Many cellular processes are regulated by multiprotein complexes • Distortions of protein interactions can cause diseases • Protein function can be predicted by knowing functions of interacting partners (“guilt by association”) [Figure: a comparison of sequence (GenBank) and protein-protein interaction data (DIP database). Adapted from S. Fields, FEBS, 2005]
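The "guilt by association" idea can be sketched as a majority vote over the annotations of a protein's interaction partners. This is only a minimal illustration: the function name `predict_function`, the protein labels and the annotations are all hypothetical, and real methods weigh evidence far more carefully.

```python
from collections import Counter

def predict_function(protein, interactions, annotations):
    """Return the most common annotation among a protein's annotated
    partners, or None if no partner is annotated (hypothetical helper)."""
    partners = set()
    for a, b in interactions:
        if a == protein and b != protein:
            partners.add(b)
        elif b == protein and a != protein:
            partners.add(a)
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Invented toy data: X is unannotated and has three annotated partners.
interactions = [("X", "A"), ("X", "B"), ("C", "X"), ("A", "B")]
annotations = {"A": "DNA repair", "B": "DNA repair", "C": "transport"}

predicted = predict_function("X", interactions, annotations)
print(predicted)
```

Because many reported interactions are false positives, a vote like this is best read as a hypothesis to test rather than an annotation.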

Types of protein-protein interactions (PPI)
• Obligate PPI: usually permanent; the protomers are not found as stable structures on their own in vivo
• Non-obligate PPI: the protomers can exist independently; the interaction may be stable or transient
• Stable (many enzyme-inhibitor complexes): dissociation constant Kd = [A][B] / [AB] of 10^-7 to 10^-13 M
• Transient:
  • Weak (electron transport complexes): Kd ~ mM-µM
  • Intermediate (antibody-antigen, TCR-MHC-peptide, signal transduction PPI): Kd ~ µM-nM
  • Strong (require a molecular trigger to shift the oligomeric equilibrium): Kd ~ nM-fM
• Examples:
  • Obligate heterodimer: human cathepsin D
  • Non-obligate transient homodimer: sperm lysin (interaction is broken and formed continuously)
  • Non-obligate permanent heterodimer: thrombin and rhodniin inhibitor
  • Bovine G protein: dissociates into Gα and Gβγ subunits upon GTP binding, but forms a stable trimer upon GDP binding
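To make the Kd ranges above concrete, here is a small sketch of the definition Kd = [A][B] / [AB] and of the fraction of A that is bound at a given free concentration of B. It assumes simple 1:1 binding at equilibrium; the function names and the concentrations are illustrative only.

```python
def dissociation_constant(free_a, free_b, complex_ab):
    """Kd = [A][B] / [AB], all concentrations in mol/L."""
    return free_a * free_b / complex_ab

def fraction_bound(free_b, kd):
    """Fraction of A in complex for 1:1 binding: [B] / (Kd + [B])."""
    return free_b / (kd + free_b)

# Invented equilibrium concentrations (mol/L).
kd = dissociation_constant(free_a=2e-6, free_b=5e-7, complex_ab=1e-6)
print(f"Kd = {kd:.1e} M")

# At the same free partner concentration (1 µM, chosen arbitrarily), a weak
# interaction is mostly dissociated and a strong one is almost fully bound.
for label, k in [("weak, Kd = 1 mM", 1e-3),
                 ("intermediate, Kd = 1 µM", 1e-6),
                 ("strong, Kd = 1 nM", 1e-9)]:
    print(label, round(fraction_bound(1e-6, k), 4))
```

When the free concentration of B equals Kd, exactly half of A is bound, which is why Kd is a convenient single-number summary of affinity.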

Multiple interactions: Guanine-nucleotide binding protein Adapted from Vetter & Wittinghofer, Science 2001

Multiple interactions: Guanine-nucleotide binding protein Question: How conserved are the interactive vs non-interactive portions of this protein? Adapted from Vetter & Wittinghofer, Science 2001

Protein evolution - gene duplication [Figure: a pair of duplicated proteins and their shared interactions, shown right after duplication and over time]

Methods of identifying PPIs • Experimental • Protein-protein arrays • Y2H assay • TAP assay • Computational/Inferential • Interolog analysis • Co-localization, co-expression • Correlated mutations • Text-mining

Interologs • Homolog • Common ancestors • Common 3D structure • Common active sites • Ortholog • Derived from Speciation • Paralog • Derived from Duplication • Interolog • Conserved Protein-Protein Interaction Thus, finding one PPI may yield dividends!
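Interolog transfer can be sketched as follows: if both partners of a known interaction have orthologs in a second species, the ortholog pair becomes a predicted interaction there. The function name, the protein identifiers and the ortholog table below are hypothetical, and the sketch assumes a one-to-one ortholog mapping.

```python
def predict_interologs(source_ppis, orthologs):
    """Map each source-species PPI onto the target species when both
    partners have an ortholog. Pairs are unordered (frozenset)."""
    predicted = set()
    for a, b in source_ppis:
        if a in orthologs and b in orthologs:
            predicted.add(frozenset((orthologs[a], orthologs[b])))
    return predicted

# Invented identifiers: y* = source species, h* = target species.
source_ppis = [("yA", "yB"), ("yB", "yC")]
orthologs = {"yA": "hA", "yB": "hB"}   # yC has no known ortholog

interologs = predict_interologs(source_ppis, orthologs)
print(interologs)
```

Only the yA-yB interaction is transferred here, because yC has no ortholog in the table. A transferred pair is a prediction, not an observation.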

Protein Arrays H Zhu et al (2000) “Analysis of yeast protein kinases using protein chips” Nature Genetics 26: 283-289

The Two-Hybrid System • Two hybrid proteins are generated with transcription factor domains: the bait protein is fused to the DNA-binding domain and the prey protein to the activation domain • Both fusions are expressed in a yeast cell that carries a reporter gene whose expression is under the control of binding sites for the DNA-binding domain [Figure labels: activation domain, prey protein, bait protein, binding domain, reporter gene]

The Two-Hybrid System • Interaction of bait and prey proteins localizes the activation domain to the reporter gene, thus activating transcription. • Since the reporter gene typically codes for a survival factor, yeast colonies will grow only when an interaction occurs. [Figure labels: activation domain, prey protein, bait protein, binding domain, reporter gene, reporter mRNA]

Genome-wide analysis by Y2H
• Matrix approach: a matrix of prey clones is added to the matrix of bait clones. Diploids where X and Y interact are selected based on the expression of a reporter gene.
• Library approach: one bait X is screened against an entire library. Positives are selected based on their ability to grow on specific substrates.
Large-scale screens:
• Uetz et al, Nature 2000: 957 putative interactions in yeast
• Rain et al, Nature 2001: 1,200 putative interactions in H. pylori
• Ho et al, Nature 2002: 3,617 putative interactions in yeast (mass spectrometry)
Adapted from B. Causier, Mass Spectrometry Reviews, 2004

Advantages of Y2H • In vivo technique; a good approximation of processes that occur in higher eukaryotes. • Transient interactions can be detected, and the affinity of an interaction can be estimated. • Can be used to detect potential interactions for genes not yet observed to be translated into proteins (e.g. rarely expressed) or for novel constructs (e.g. therapeutics) • Relatively fast and efficient.

Disadvantages of Y2H • Fusion of a protein into chimeras can change the structure of a target • Protein interactions can be different in yeast and in the organisms the genes came from • It is difficult to target extracellular proteins • It is hard to detect interactions between proteins that are active only in a complex • Proteins that interact in two-hybrid experiments may never interact in vivo

Tandem affinity purification method (TAP) • Target protein ORF is fused with the DNA sequences encoding TAP tag; • Tagged ORFs are expressed in yeast cells and form native complexes; • The complexes are purified by TAP method; • Components of each complex are found by gel electrophoresis or MS.

Tandem affinity purification method (TAP) • The TAP tag consists of two IgG-binding domains of Staphylococcus protein A and a calmodulin-binding peptide • 7123 interactions can be clustered into 547 complexes (Krogan et al, 2006) O. Puig et al, Methods, 2001

Differences and similarities between Y2H and MS-TAP • TAP permits protein complexes to be isolated, but cannot detect weak/transient PPIs • Both methods generate a lot of false positives; only ~50% of interactions are biologically significant • Y2H is an in vivo technique • MS can detect large stable complexes and networks of interactions

Text Mining • Searching Medline or PubMed for words or word combinations • Co-occurrence of terms is the simplest metric, but leads to a higher false-positive (FP) rate • NLP methods are more specific (e.g., “X binds to Y”; “X interacts with Y”; “X associates with Y” etc.), but such patterns are harder to detect, giving a higher false-negative (FN) rate • Normally requires a list of known gene names or protein names for a given organism
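The trade-off between the two text-mining strategies can be illustrated with a toy sketch. The protein names, the sentences and both helper functions are invented for illustration; real NLP systems use much richer grammars than this single regular expression.

```python
import re
from itertools import combinations

# Only the three phrasings quoted on the slide.
VERBS = r"(?:binds to|interacts with|associates with)"

def cooccurrence_pairs(sentence, names):
    """All pairs of known names that appear in the same sentence."""
    found = sorted(n for n in names
                   if re.search(rf"\b{re.escape(n)}\b", sentence))
    return set(combinations(found, 2))

def pattern_pairs(sentence, names):
    """Pairs matching an explicit 'X <verb phrase> Y' pattern."""
    alt = "|".join(re.escape(n) for n in sorted(names))
    regex = re.compile(rf"\b({alt})\b\s+{VERBS}\s+\b({alt})\b")
    return {tuple(sorted(m.groups())) for m in regex.finditer(sentence)}

names = {"ProtA", "ProtB", "ProtC"}
s1 = "ProtA interacts with ProtB in the nucleus."
s2 = "ProtA and ProtC were both measured."        # no interaction stated
s3 = "Binding of ProtB to ProtA was observed."    # interaction, other phrasing

for s in (s1, s2, s3):
    print(cooccurrence_pairs(s, names), pattern_pairs(s, names))
```

Co-occurrence reports a pair for the second sentence although no interaction is stated (a false positive), while the pattern matcher misses the third sentence because the phrasing is not in its list (a false negative).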

Pre-BIND • Used Support Vector Machine (SVM) to scan literature for PPIs • Precision, accuracy and recall of 92% for correctly classifying PPI abstracts • Estimated to capture 60% of all abstracted protein interactions for a given organism Donaldson et al. BMC Bioinformatics 2003 4:11
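For reference, precision, recall and accuracy are computed from a classifier's confusion counts as sketched below. The counts are invented purely to show the arithmetic and are not Pre-BIND's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)                   # predicted PPI abstracts that are correct
    recall = tp / (tp + fn)                      # true PPI abstracts that were found
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all abstracts classified correctly
    return precision, recall, accuracy

# Hypothetical counts for 100 abstracts.
precision, recall, accuracy = classification_metrics(tp=45, fp=5, tn=40, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```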

Drosophila interaction map. From: A Protein Interaction Map of Drosophila, Giot et al. Science 302, 1727-1736 (2003)

Comparing large scale data of protein-protein interactions • All methods except for Y2H and synthetic lethality technique are biased toward abundant proteins. • PPI are biased toward certain cellular localizations. • Evolutionarily conserved proteins have much better coverage in Y2H than the proteins restricted to a certain organism. Von Mering et al, Nature, 2002

Functional organization of yeast proteome: network of protein complexes • Essential gene products are more likely to interact with essential rather than nonessential proteins • Orthologous proteins interact with complexes enriched with orthologs Gavin et al, Nature, 2002

PPI Databases online
• DIP: http://dip.doe-mbi.ucla.edu/
• MIPS (small scale): http://mips.gsf.de/proj/ppi/
• BIND (PPI, Prot-DNA, Prot-SM): http://www.bind.ca (now owned by Unleashed)
• OPHID (predicted interactions): http://ophid.utoronto.ca/ophid/
• MINT - Molecular Interactions Database: http://mint.bio.uniroma2.it/mint/Welcome.do
• IntAct (EBI): http://www.ebi.ac.uk/intact/site/
• InterDom (domain interactions): http://interdom.lit.org.sg/
• STRING (EMBL): http://string.embl.de/

Interaction databases [Table not recoverable; legend of interaction types: Experiment (E), Structure detail (S), Predicted Physical (P), Predicted Functional (F), Curated (C), Homology modeling (H). * = member of the International Molecular Exchange (IMEx) consortium]

Comparing the DBs • High FP rate in high-throughput experiments • Disagreement between benchmark sets • Experimental PPI data is sparse relative to all PPIs, so dataset overlap is small and interactions are hard to confirm with multiple sources

PPI network properties Nodes & connections

Characteristics of networks • n = nodes, k = connections or “edges” • In biology, n refers to genes/proteins (and/or metabolites) while k refers to interactions [Figure: example graph with nodes labelled by their number of connections: k = 2, 2, 3 and 1]
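A PPI network is commonly stored as an adjacency structure built from a list of interacting pairs, from which each node's connection count k follows directly. The sketch below uses a hypothetical four-node graph chosen so that the counts come out as 2, 2, 3 and 1; the slide's actual drawing is not recoverable, and the helper names are illustrative.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected graph as {node: set of neighbours}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def degrees(graph):
    """Number of connections (k) for every node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

# Hypothetical interactions between proteins A-D.
edges = [("C", "A"), ("C", "B"), ("C", "D"), ("A", "B")]
graph = build_graph(edges)
k = degrees(graph)
print(k)
```

The degrees sum to twice the number of edges, since every interaction contributes one connection to each of its two partners.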

Examples of networks: Proximity-based interactions

Examples of networks: Distant interactions

Elementary features: node (n) diversity and dynamics

Elementary features: edge (k) diversity and dynamics

Elementary features: Network Evolution

Network properties • Network Structure Metrics • Average path length • Degree distribution(connectivity) • Clustering coefficient • Network Structure Types • Regular • Random • Small-world • Scale-free

Structural metrics: Path length & network diameter
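Path length and diameter can be computed with a breadth-first search from every node. This is a minimal sketch on a hypothetical four-node graph; it assumes an unweighted, undirected graph with at least one edge and averages only over pairs that are reachable from each other.

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest path length (in edges) from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def path_length_stats(graph):
    """Return (average shortest path length, diameter)."""
    lengths = []
    for node in graph:
        d = bfs_distances(graph, node)
        lengths.extend(v for other, v in d.items() if other != node)
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical graph: triangle A-B-C with D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
average, diameter = path_length_stats(graph)
print(average, diameter)
```

The diameter is the longest of the shortest paths, here the two steps needed to get from D to A or B.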

Structural Metrics: Degree distribution (connectivity)
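The degree distribution P(k) is the fraction of nodes that have exactly k connections. A minimal sketch on a hypothetical four-node graph (the helper name is illustrative):

```python
from collections import Counter

def degree_distribution(graph):
    """Return {k: fraction of nodes with degree k}, sorted by k."""
    counts = Counter(len(neighbours) for neighbours in graph.values())
    n = len(graph)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical graph: triangle A-B-C with D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
p_k = degree_distribution(graph)
print(p_k)
```

For real networks this distribution is usually plotted on log-log axes, where a power law appears as a straight line.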

Structural Metrics: Clustering coefficient
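A common definition of the local clustering coefficient of a node with k neighbours is the number of links among those neighbours divided by the maximum possible, k(k-1)/2. The sketch below assumes that definition (the slide's own formula is not recoverable) and assigns 0 to nodes with fewer than two neighbours, which is a convention rather than a rule.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for nodes with < 2 neighbours
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))

# Hypothetical graph: triangle A-B-C with D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
local = {node: clustering_coefficient(graph, node) for node in graph}
average = sum(local.values()) / len(local)
print(local, average)
```

Node A sits in a closed triangle and scores 1, while C scores lower because only one of its three neighbour pairs is linked.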

Network properties • Network Metrics • Average path length • Degree distribution(connectivity) • Clustering coefficient • Network Structures • Regular • Random • Small-world • Scale-free

Regular networks – fully connected

Regular networks – Lattice

Regular networks – Lattice: ring world

Random networks

Random Networks
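One standard way to generate a random network is the Erdős-Rényi G(n, p) model, in which every possible pair of nodes is linked independently with probability p. The slide does not say which model it has in mind, so treat the choice of model, the function name and the parameter values below as assumptions.

```python
import random

def random_graph(n, p, seed=None):
    """Erdős-Rényi G(n, p): link each node pair with probability p."""
    rng = random.Random(seed)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                graph[i].add(j)
                graph[j].add(i)
    return graph

n, p = 200, 0.05
g = random_graph(n, p, seed=1)
mean_degree = sum(len(neighbours) for neighbours in g.values()) / n
print(mean_degree, "expected about", p * (n - 1))
```

In this model node degrees cluster around the mean p(n - 1), so very highly connected nodes are rare.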

Small-world networks

Exponential network degree distribution

Scale-free networks • New nodes preferentially attach to highly connected ones • Term coined by A.L. Barabasi in 1998

Different network models: Barabási-Albert model of preferential attachment • At each step, a new node is added to the graph. • The new node is attached to one of the old nodes with probability proportional to that node's degree. • Degree distribution: power law, which appears as a straight line on a plot of ln(P(k)) against ln(k). Barabasi & Albert, Science, 1999
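The two steps of the preferential attachment model can be sketched directly. This simplified version attaches each new node by a single edge, as the slide describes, and starts from two connected nodes; the starting graph, the function name and the network size are assumptions.

```python
import random

def preferential_attachment(n, seed=None):
    """Grow a graph to n nodes; each new node links to one existing node
    chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    graph = {0: {1}, 1: {0}}   # assumed seed graph: one edge
    # Every node is listed in `stubs` once per edge end, so a uniform draw
    # from this list selects a node in proportion to its degree.
    stubs = [0, 1]
    for new in range(2, n):
        target = rng.choice(stubs)
        graph[new] = {target}
        graph[target].add(new)
        stubs.extend([new, target])
    return graph

g = preferential_attachment(500, seed=42)
degree_list = sorted((len(neighbours) for neighbours in g.values()), reverse=True)
print("largest degrees:", degree_list[:5])
print("nodes with degree 1:", degree_list.count(1))
```

Early nodes tend to accumulate many links while most late arrivals keep a single one, which is the "rich get richer" behaviour behind the power-law degree distribution.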

Properties of scale-free networks • Multiplying k by a constant does not change the shape of the distribution (scale-free distribution). From T. Przytycka • Small diameter • Tolerant of random errors, but vulnerable to targeted attacks on highly connected nodes • But: a sampled sub-network can appear scale-free even when the degree distribution of the underlying network is not.
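The scale-invariance statement can be checked in one line. Assuming a power-law degree distribution with normalisation constant C and exponent γ (symbols introduced here for illustration), rescaling k by a constant a only multiplies P(k) by a constant factor:

```latex
P(k) = C\,k^{-\gamma}
\quad\Longrightarrow\quad
P(ak) = C\,(ak)^{-\gamma} = a^{-\gamma}\,P(k),
\qquad
\ln P(k) = \ln C - \gamma \ln k .
```

The last form is why a power law is a straight line with slope -γ on a log-log plot. An exponential distribution does not have this property: P(ak) is not a constant multiple of P(k), so it has a characteristic scale.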

Difference between scale-free and random graph models • Random networks are homogeneous: most nodes have approximately the same number of links. • Scale-free networks have a number of highly connected vertices (hubs). Adapted from Jeong et al, Nature, 2000
