Protein Interaction Module Detection using Matrix-Based Graph Algorithms

Chris Ding, Lawrence Berkeley National Laboratory

Bioinformatics & Computational Biology • Computational genomics: molecular biology at the genomic level

Genomics Research • More than 100 genomes’ DNA sequenced • DNA microarray chip technology • Protein – protein interaction technology • Gene knock-out for gene regulatory network • Many high-throughput technologies • Bio-imaging (embryo imaging, EM) • Huge number of databases • GenBank, Protein Data Bank, SCOP, Pfam • Gene Ontology

A Genomics Research Trend • Large # of genomes have been sequenced • Traditional approach: predict genes, predict proteins, predict structures, predict functions • This structural genomics is inadequate • Protein interactions: a new approach

Protein – Protein Interactions • Proteins carry out tasks together with other proteins • 83% of proteins interact with others • Proteins interact in promoters • Multi-protein complexes (assemblies) • Synergistic interactions • Complex – complex cross-talk • Proteins work in a modular fashion • Gene regulation • Biological pathways • Most drugs block certain pathways • Major goal of research: detect protein modules

Protein Interactions • Antibody – antigen binding (DOE Genome to Life)

Protein interaction experiments • Two-hybrid Assay • Protein coordination in promoter region • Binary interactions • Capture transient and unstable interactions • Mass Spectrometry • TAP-MS: Tandem affinity purification • HMS-PCI: high throughput protein interaction id. • Use bait proteins • Capture multi-protein complexes • Problems: • Results do not agree • Lots of noise

Tandem-Affinity Purification with Mass-Spectrometry (TAP-MS) determines constituents of multi-protein complexes. Many baits are simultaneously processed to obtain many complexes. Gavin, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141-147. More reliable technology (Deng, et al).

Protein Interaction Experiments • Different experiments don’t agree: small overlap Salwinski and Eisenberg, 2003

Protein Interactions • A genome has 5000 proteins • Each interacts with ~5 others
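The two numbers on this slide imply a very sparse interaction graph. A quick back-of-the-envelope check, assuming each interaction is counted once per pair (the variable names are illustrative only):

```python
# Rough size of the interaction graph implied by the slide's numbers.
n_proteins = 5000
avg_partners = 5  # "~5 others" per protein, taken at face value

# Each interaction is shared by two proteins, so divide by 2.
n_interactions = n_proteins * avg_partners // 2
n_possible_pairs = n_proteins * (n_proteins - 1) // 2
density = n_interactions / n_possible_pairs

print(n_interactions, n_possible_pairs, density)
```

Under these assumptions roughly one pair in a thousand interacts, which is why sparse-graph methods are a natural fit.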

Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Cliques • Bi-cliques • Results

Bipartite Graph Model • Protein complex: p-nodes: proteins; c-nodes: protein complexes • Protein domain: p-nodes: proteins; c-nodes: protein domains

Unified Representation of Protein Complex Data • Input: protein complex data: B • Protein – protein network • Protein complex – protein complex network (Ding, He, Meraz, Holbrook, Proteins, 2004)
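The formulas for the two derived networks did not survive on this slide. A minimal sketch, assuming the usual one-mode projections of a bipartite incidence matrix (B times its transpose, and the transpose times B); the toy matrix is hypothetical:

```python
import numpy as np

# Hypothetical toy data: 4 proteins (rows) x 2 complexes (columns).
# B[i, a] = 1 if protein i is a member of complex a.
B = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
])

# Assumed projection: entry (i, j) counts complexes shared by proteins i and j.
protein_net = B @ B.T

# Assumed projection: entry (a, b) counts proteins shared by complexes a and b.
complex_net = B.T @ B

print(protein_net)
print(complex_net)
```

The diagonal of `protein_net` gives the number of complexes each protein belongs to, and the diagonal of `complex_net` gives each complex's size.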

Co-location of domains: Bridged Bipartite Graph (panels A, B) • Pfam domains match SCOP domains • Matching reaches 90% accuracy compared to direct match (Zhang, Chandonia, Ding, Holbrook, BMC Bioinformatics 2004)

Protein Interaction Module: densely connected subgraphs

Protein Interaction Modules • Find highly connected regions: • Graph clustering • Cliques • Bi-cliques

Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Cliques • Bi-cliques

Spectral Clustering: MinMaxCut • Minimize between-cluster similarities (weights) • Maximize within-cluster similarities (weights) (Ding, RECOMB’02)

Spectral Clustering Method (MinMaxCut) • Minimize similarity between A and B • Maximize similarity within A and within B • Cluster membership indicator • Minimizing leads to a solution given by an eigenvector • Cluster assignment
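The slide's equations are missing, so the following is only a sketch of a two-way spectral split in this spirit. It assumes the objective has the form s(A,B)/s(A,A) + s(A,B)/s(B,B), that the relaxed solution is the second generalized eigenvector of (D - W) q = λ D q, and that clusters are assigned by the sign of q. All three are assumptions, as are the function names.

```python
import numpy as np

def minmaxcut_objective(W, in_a):
    """Assumed objective: s(A,B)/s(A,A) + s(A,B)/s(B,B)."""
    in_a = np.asarray(in_a, dtype=bool)
    s_ab = W[in_a][:, ~in_a].sum()   # between-cluster similarity
    s_aa = W[in_a][:, in_a].sum()    # within-cluster similarity of A
    s_bb = W[~in_a][:, ~in_a].sum()  # within-cluster similarity of B
    return s_ab / s_aa + s_ab / s_bb

def spectral_bipartition(W):
    """Split nodes by the sign of the second generalized eigenvector."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                # node degrees (must be positive)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} W D^{-1/2} has the same spectrum as the generalized problem.
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    z = eigvecs[:, -2]               # second-largest eigenvalue of S
    q = d_inv_sqrt * z               # back to the generalized eigenvector
    return q >= 0                    # assumed rule: threshold at zero

# Two triangles joined by one weak edge.
W = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1

labels = spectral_bipartition(W)
score = minmaxcut_objective(W, labels)
print(labels, score)
```

Thresholding at zero is the simplest assignment rule; other cut points along the sorted eigenvector could be scored with `minmaxcut_objective` in the same way.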

Graph clustering examples

An NP-hard, intractable combinatorial optimization problem can be effectively solved by a simple eigenvector!

Spectral Clustering • 2-way clustering • K-way clustering • Recursive 2-way clustering • K-way relaxation (K eigenvectors) • Cluster self-aggregation and perturbation analysis • Characteristics • Principled approach • Clear and well-motivated clustering objective functions • Everything is proved rigorously • Based on well-established matrix/algebra theory • A rich framework (clustering, ordering, ranking, etc.) • State-of-the-art algorithm

Recursive MinMaxCut Clustering of Lymphoma Tissues (Alizadeh et al, 2000) • B cell lymphoma goes through different stages • 3 normal stages • 3 cancer stages • Key question: can we detect them automatically? (Ding, RECOMB’02)

Gene expression of lymphoma (Stanford) (Ding, RECOMB’02)

Spectral Clustering • 2-way clustering • K-way clustering • Recursive 2-way clustering • K-way relaxation (K eigenvectors) (principled) • Cluster self-aggregation and perturbation analysis • Characteristics • Principled approach • Clear and well-motivated clustering objective functions • Everything is proved rigorously • Based on well-established matrix/algebra theory • A rich framework (clustering, ordering, ranking, etc.) • State-of-the-art algorithm

Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Application to computing protein interaction modules • Cliques • Bi-cliques • Results

Clustering Protein Complex Graph • Input: protein complex data: B • Protein – protein network • Protein complex – protein complex network (Ding, He, Meraz, Holbrook, Proteins, 2004)

Computed Protein Clusters

Experimental Protein Complex Computed

Implications of discovered protein clusters on protein interactions: F-statistics • F-statistics of amino acids and physical properties across all protein clusters: statistical significance • Lys, Arg, Asp are most significant => electrostatic forces are the dominant surface factors influencing protein interactions • Surprise: secondary structure is not an important factor in protein module formation

Protein Secondary Structure • Alpha helix • Beta sheet • Coil regions

Protein Interaction Modules • Find highly connected regions • Clique: every node connects to every other node • k-core: subgraph in which every node has degree at least k (pictured example: every node connects to at least 3 others)
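The two definitions can be made concrete with a short sketch. It computes a k-core by repeatedly peeling off nodes of degree below k (a standard procedure, not necessarily the one used in the talk) and checks the clique property directly; the toy graph is hypothetical.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj maps each node to the set of its neighbors.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree is counted only among nodes that are still alive.
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

def is_clique(adj, nodes):
    """True if every node in `nodes` is adjacent to all the others."""
    nodes = set(nodes)
    return all(nodes - {v} <= adj[v] for v in nodes)

# Hypothetical toy graph: a 4-clique {a, b, c, d} with a tail a - e - f.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("a", "e"), ("e", "f")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core3 = k_core(adj, 3)
print(sorted(core3), is_clique(adj, core3))
```

In this example the 3-core happens to be a clique, but in general a k-core is a weaker requirement: every clique on k+1 nodes lies inside the k-core, not the other way around.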

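Motzkin-Straus Formalism for computing maximal cliques • Clique: every node connects to every other node • Clique computing is NP-hard; even approximating the maximum clique is hard • Motzkin-Straus Theorem: maximize x^T A x subject to \sum_i x_i = 1, x_i \ge 0, where A is the adjacency matrix; the maximum value is 1 - 1/\omega, with \omega the size of the maximum clique • Vector x is defined on all nodes of the graph • The L1 constraint enforces sparsity • Non-zero entries define the clique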

Generalized Motzkin-Straus Formalism • Vector on all nodes of the graph, subject to a generalized norm constraint • The L1 constraint enforces sparsity • Non-zero entries define the clique • Setting the constraint parameter to 1.05, we can compute the maximum clique better than with the standard value of 1.0 (Ding, Zhang, Holbrook, 2006)
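Since the slide's formulas are lost, the following is a sketch under stated assumptions rather than the authors' exact algorithm. It assumes the generalized problem is to maximize x^T A x subject to sum_i x_i^beta = 1 with x >= 0, and uses a multiplicative update that preserves that constraint. The function name, the tolerance and the toy graph are hypothetical.

```python
import numpy as np

def clique_by_multiplicative_update(A, beta=1.05, n_iter=500, tol=1e-3):
    """Heuristic clique search via an assumed generalized Motzkin-Straus form.

    Assumed problem: max x^T A x  s.t.  sum_i x_i**beta = 1, x >= 0.
    Assumed update:  x_i <- (x_i * (A x)_i / (x^T A x)) ** (1 / beta).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Uniform start that satisfies sum_i x_i**beta = 1.
    x = np.full(n, (1.0 / n) ** (1.0 / beta))
    for _ in range(n_iter):
        Ax = A @ x
        x = (x * Ax / (x @ Ax)) ** (1.0 / beta)
    # Entries that are negligible relative to the largest one count as zero.
    support = np.flatnonzero(x > tol * x.max()).tolist()
    return x, support

# Hypothetical toy graph: 4-clique {0, 1, 2, 3} plus a tail 3 - 4 - 5.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

x, support = clique_by_multiplicative_update(A)
print(np.round(x, 3), support)
```

With beta = 1 the update reduces to the classic replicator iteration on the simplex. Like any local method, it returns a local maximum that depends on the starting point, so it is a heuristic rather than an exact maximum-clique solver.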

Algorithm for computing clique • Solving the constrained quadratic programming problem • Initialize, then apply the update rule repeatedly • Theorem 1 (Correctness): the solution converges to a local maximum • Theorem 2 (Convergence): the iterative algorithm converges

Proof of Correctness • Constrained optimization theory: introduce the Lagrangian function • KKT optimality condition (complementary slackness) • Value of the Lagrange multiplier • At convergence, the update rule satisfies the KKT condition

Proof of Convergence Using Auxiliary Function (from Machine Learning) • G(x,x’) is an auxiliary function of L(x) if it bounds L(x) from below and equals L(x) at x = x’ • We maximize a lower bound: set the next iterate to the maximizer of G • L(x) is monotonically increasing and is bounded from above • Thus the algorithm converges
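The monotonicity claim follows from a short chain of inequalities. The formulas below use the standard auxiliary-function definition for a maximization problem, which is assumed here because the slide's own equations were not preserved:

```latex
% Assumed definition of an auxiliary function G for the objective L:
G(x, x') \le L(x), \qquad G(x, x) = L(x)

% Update: maximize the lower bound built at the current iterate x^{(t)}:
x^{(t+1)} = \arg\max_{x} \; G\bigl(x, x^{(t)}\bigr)

% Monotonicity: lower bound, then optimality of x^{(t+1)}, then tightness.
L\bigl(x^{(t+1)}\bigr) \;\ge\; G\bigl(x^{(t+1)}, x^{(t)}\bigr)
  \;\ge\; G\bigl(x^{(t)}, x^{(t)}\bigr) \;=\; L\bigl(x^{(t)}\bigr)
```

A monotonically non-decreasing sequence that is bounded from above converges, which gives the convergence of the objective values.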

Proof of Convergence (cont.) • Key: (1) find an auxiliary function, (2) find its global maximum • The auxiliary function and its first-order derivative • The second-order derivative is negative definite • Thus G(x,x’) is concave in x, and its global maximum is easily obtained

Partial list of Discovered Cliques

Subunits Of SRP (signal recognition particle) Complex Clique: Srp19, Srp14, Srp21, Srp54, Srp68, Srp72 The clique also includes a yeast protein SRP21, which is not found in mammalian SRP; forms a pre-SRP structure in the nucleolus that is translocated to the cytoplasm (Halic et al, 2004)

Signal Recognition Particle (SRP) helps proteins pass through the ER membrane • Figure labels: ribosome; network to transport proteins and lipids (Fig 27-33, Lehninger)

Cliques in a bipartite graph • Finding a complete block in the adjacency matrix • Similar to bi-clustering, widely used in bioinformatics • Example: gene expression profiles; a gene is relevant only for a certain subset of cellular processes, not all processes • Two types of maximal bi-cliques: • Maximum Node Bicliques: max |R|+|C| (perimeter) • Maximum Edge Bicliques: max |R|*|C| (area)
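The two criteria can select different blocks. A brute-force sketch on a hypothetical toy matrix makes the difference visible; it enumerates every row subset, so it is only meant for tiny inputs and is not the method of the talk.

```python
from itertools import combinations

import numpy as np

def best_bicliques(B):
    """Exhaustively find the max-node and max-edge bicliques of a 0/1 matrix.

    Returns two tuples of the form (score, rows, cols).
    """
    B = np.asarray(B, dtype=bool)
    n_rows = B.shape[0]
    best_node = (0, (), ())
    best_edge = (0, (), ())
    for r in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), r):
            # Columns that are 1 in every chosen row form the largest block.
            cols = tuple(np.flatnonzero(B[list(rows)].all(axis=0)).tolist())
            if not cols:
                continue
            perimeter = len(rows) + len(cols)
            area = len(rows) * len(cols)
            if perimeter > best_node[0]:
                best_node = (perimeter, rows, cols)
            if area > best_edge[0]:
                best_edge = (area, rows, cols)
    return best_node, best_edge

# Hypothetical toy matrix: one long row (1 x 7) and one square block (3 x 3).
B = np.zeros((4, 10), dtype=int)
B[0, :7] = 1
B[1:, 7:] = 1

best_node, best_edge = best_bicliques(B)
print(best_node)
print(best_edge)
```

Here the perimeter criterion prefers the long thin 1 x 7 block, while the area criterion prefers the 3 x 3 block.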

Bicliques in a 2D Dataset

DNA Gene Expression: Lymphoma Cancer (Alizadeh et al, 2000) • Axes: genes, tissue samples • Effects of feature selection: select 900 genes out of 4025 genes

Generalized Motzkin-Straus Theorem for maximal edge biclique • Given a bipartite graph with adjacency matrix B, compute the maximal edge bi-clique • Generalized Motzkin-Straus Theorem: one vector on the row nodes of the graph, one vector on the column nodes of the graph, subject to norm constraints • Non-zero entries define the biclique (Ding, Zhang, Holbrook, 2006)

Algorithm for computing bicliques • Solving the constrained quadratic programming problem • Apply the update rule repeatedly • Theorem 1. At convergence, the solution satisfies the KKT condition. • Theorem 2. L monotonically increases under the update; the algorithm converges.
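The update rule itself is not recoverable from the transcript. One plausible reading, offered strictly as an assumption, is to maximize x^T B y subject to sum_i x_i^beta = 1 and sum_j y_j^beta = 1 with non-negative x and y, using multiplicative updates that keep both constraints satisfied. The function name, the tolerance and the toy matrix are hypothetical.

```python
import numpy as np

def biclique_by_multiplicative_update(B, beta=1.05, n_iter=500, tol=1e-3):
    """Heuristic biclique search under an assumed bilinear formulation.

    Assumed problem: max x^T B y  s.t.  sum x_i**beta = sum y_j**beta = 1.
    Assumed updates: x_i <- (x_i (B y)_i / x^T B y) ** (1 / beta), and the
    analogous rule for y, both computed from the previous iterate.
    """
    B = np.asarray(B, dtype=float)
    n_rows, n_cols = B.shape
    x = np.full(n_rows, (1.0 / n_rows) ** (1.0 / beta))
    y = np.full(n_cols, (1.0 / n_cols) ** (1.0 / beta))
    for _ in range(n_iter):
        By, Btx = B @ y, B.T @ x
        value = x @ By
        x = (x * By / value) ** (1.0 / beta)
        y = (y * Btx / value) ** (1.0 / beta)
    # Entries negligible relative to the largest one are treated as zero.
    rows = np.flatnonzero(x > tol * x.max()).tolist()
    cols = np.flatnonzero(y > tol * y.max()).tolist()
    return rows, cols

# Hypothetical toy matrix: a 3 x 3 block of ones plus two stray entries.
B = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
])

rows, cols = biclique_by_multiplicative_update(B)
print(rows, cols)
```

A value of beta slightly above 1 matters in this sketch: with beta exactly 1 the bilinear objective is already maximized by any single edge, so larger blocks would not be preferred.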

A New Upper Bound on the size of the maximum-edge biclique • Using the generalized Motzkin-Straus theorem, derive a bound in terms of the largest singular value of B (Ding, Li, Jordan, 2007)
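The exact bound shown on the slide is not preserved. One bound of this kind that can be verified directly is |R|*|C| <= sigma_max(B)^2 for a 0/1 matrix B: taking normalized indicator vectors of R and C as test vectors gives u^T B v = sqrt(|R|*|C|), which cannot exceed the largest singular value. A small numerical check on a hypothetical matrix:

```python
import numpy as np

# Hypothetical toy matrix containing a 3 x 3 all-ones block.
B = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# Singular values are returned in descending order.
sigma_max = np.linalg.svd(B, compute_uv=False)[0]
area_bound = sigma_max ** 2

known_biclique_area = 3 * 3  # rows 0-2 x columns 0-2
print(known_biclique_area, area_bound)
```

The bound is cheap to compute and can be used to judge how far a biclique found by a local method might be from the optimum.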

Biclique Example • Solution vector x • Solution vector y
