Protein Interaction Module Detection using Matrix-Based Graph Algorithms Chris Ding Lawrence Berkeley National Laboratory
Bioinformatics & Computational Biology • Computational genomics: molecular biology at genomic level
Genomics Research • More than 100 genomes’ DNA sequenced • DNA microarray chip technology • Protein – protein interaction technology • Gene knock-out for gene regulatory networks • Many high-throughput technologies • Bio-imaging (embryo imaging, EM) • Huge number of databases • GenBank, Protein Data Bank, SCOP, Pfam • Gene Ontology
A Genomics Research Trend • A large number of genomes have been sequenced. • Traditional approach: predict genes, predict proteins, predict structures, predict functions • This structural-genomics approach is inadequate • Protein interactions: a new approach
Protein – Protein Interactions • Proteins carry out tasks together with other proteins • 83% of proteins interact with others • Proteins interact in promoters • Multi-protein complexes (assemblies) • Synergistic interactions • Complex – complex cross-talk • Proteins work in a modular fashion • Gene regulation • Biological pathways: most drugs block certain pathways • Major goal of research: detect protein modules
Protein Interactions Antibody – antigen binding DOE Genome to Life
Protein interaction experiments • Two-hybrid assay • Protein coordination in promoter region • Binary interactions • Captures transient and unstable interactions • Mass spectrometry • TAP-MS: tandem affinity purification with mass spectrometry • HMS-PCI: high-throughput mass spectrometric protein complex identification • Use bait proteins • Capture multi-protein complexes • Problems: • Results do not agree • Lots of noise
Tandem-Affinity Purification with Mass-Spectrometry (TAP-MS) determines constituents of multi-protein complexes. Many baits are processed simultaneously to obtain many complexes. (Gavin, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141-147.) More reliable technology (Deng, et al)
Protein Interaction Experiments • Different experiments don’t agree: small overlap (Salwinski and Eisenberg, 2003)
Protein Interactions • A genome has 5000 proteins. • Each interacts with ~5 others.
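The two figures on this slide imply a rough network size. The arithmetic below is a back-of-the-envelope sketch that assumes interactions are symmetric (each one is shared by two proteins); the variable names are illustrative only.

```python
n_proteins = 5000        # proteins in the genome (from the slide)
avg_partners = 5         # average interaction partners per protein (from the slide)

# Each undirected interaction is counted once from each endpoint, so halve the total.
n_interactions = n_proteins * avg_partners // 2

# Fraction of all possible protein pairs that actually interact.
n_pairs = n_proteins * (n_proteins - 1) // 2
density = n_interactions / n_pairs

print(n_interactions, density)
```

Under these assumptions the interaction graph is very sparse (about one pair in a thousand), which is why dense regions such as cliques stand out as candidate modules.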
Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Cliques • Bi-cliques • Results
Bipartite Graph Model • Protein complex: p-nodes: proteins; c-nodes: protein complexes • Protein domain: p-nodes: proteins; c-nodes: protein domains
Unified Representation of Protein Complex Data • Input: protein complex data: B • protein – protein network • protein complex – protein complex network (Ding, He, Meraz, Holbrook, Proteins, 2004)
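One common way to obtain both networks from a single bipartite matrix is by projection. The sketch below assumes B has proteins as rows and complexes as columns, so that B·Bᵀ links proteins and Bᵀ·B links complexes; the toy matrix and helper names are hypothetical and not taken from the slides.

```python
def transpose(X):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Hypothetical toy data: 3 proteins (rows) x 2 complexes (columns).
B = [[1, 0],
     [1, 1],
     [0, 1]]

# Entry (i, j): number of complexes that proteins i and j share.
protein_net = matmul(B, transpose(B))
# Entry (a, b): number of proteins that complexes a and b share.
complex_net = matmul(transpose(B), B)

print(protein_net)
print(complex_net)
```

The diagonal entries count how many complexes a protein belongs to (or how many proteins a complex contains); off-diagonal entries act as edge weights.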
Co-location of domains: Bridged Bipartite Graph (panels A, B) • Pfam domains matched to SCOP domains • Matching: reaches 90% accuracy compared to direct match (Zhang, Chandonia, Ding, Holbrook, BMC Bioinformatics 2004)
Protein Interaction Modules • Find highly connected regions: • Graph clustering • Cliques • Bi-cliques
Outline Protein interaction Interaction Data Graph models Spectral Clustering Cliques Bi-cliques
Spectral Clustering: MinMaxCut • min between-cluster similarities (weights) • max within-cluster similarities (weights) (Ding, RECOMB’02)
Spectral Clustering Method (MinMaxCut) • Minimize similarity between A and B • Maximize similarity within A and within B • Cluster membership indicator vector • Minimizing the relaxed objective leads to a solution given by an eigenvector • Cluster assignment read off from the eigenvector entries
An NP-hard, intractable combinatorial optimization problem can be effectively solved by a simple eigenvector!
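A minimal sketch of 2-way spectral clustering follows. It assumes the relaxed objective leads to the generalized eigenproblem (D − W) q = λ D q and that clusters are read off from the sign of the second eigenvector; this is a common formulation for this family of objectives, not a transcription of the slide's formulas. The function name `spectral_bisect` and the toy graph are hypothetical, and power iteration is used only to keep the example free of dependencies.

```python
import math

def spectral_bisect(W, iters=1000):
    """Split a weighted graph (no isolated nodes) by the sign of a relaxed indicator."""
    n = len(W)
    d = [sum(row) for row in W]                       # node degrees
    s = [1.0 / math.sqrt(di) for di in d]             # diagonal of D^{-1/2}
    # M = (I + D^{-1/2} W D^{-1/2}) / 2 has eigenvalues in [0, 1].
    M = [[(s[i] * W[i][j] * s[j] + (1.0 if i == j else 0.0)) / 2.0
          for j in range(n)] for i in range(n)]
    # The trivial top eigenvector of M is proportional to D^{1/2} 1.
    norm = math.sqrt(sum(d))
    v1 = [math.sqrt(di) / norm for di in d]
    z = [float(i + 1) for i in range(n)]              # deterministic start vector
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(z, v1))
        z = [a - dot * b for a, b in zip(z, v1)]      # remove the trivial component
        z = [sum(M[i][j] * z[j] for j in range(n)) for i in range(n)]
        nz = math.sqrt(sum(a * a for a in z))
        z = [a / nz for a in z]
    q = [s[i] * z[i] for i in range(n)]               # back to the generalized eigenvector
    A = [i for i in range(n) if q[i] < 0]
    B = [i for i in range(n) if q[i] >= 0]
    return A, B

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
W = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
A, B = spectral_bisect(W)
print(A, B)
```

The eigenvector only solves the relaxed problem; the sign split is a heuristic rounding step, so the result approximates the combinatorial optimum rather than guaranteeing it.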
Spectral Clustering • 2-way clustering • K-way clustering • Recursive 2-way clustering • K-way relaxation (K eigenvectors) • Cluster self-aggregation and perturbation analysis • Characteristics • Principled approach • Clear and well-motivated clustering objective functions • Everything is proved rigorously • Based on well-established matrix/algebra theory • A rich framework (clustering, ordering, ranking, etc.) • State-of-the-art algorithm
Recursive MinMaxCut Clustering of Lymphoma Tissues (Alizadeh et al, 2000) • B cell lymphoma goes through different stages • 3 normal stages • 3 cancer stages • Key question: can we detect them automatically? (Ding, RECOMB’02)
Gene expression of lymphoma (Stanford) (Ding, RECOMB’02)
Spectral Clustering • 2-way clustering • K-way clustering • Recursive 2-way clustering • K-way relaxation (K eigenvectors) (principled) • Cluster self-aggregation and perturbation analysis • Characteristics • Principled approach • Clear and well-motivated clustering objective functions • Everything is proved rigorously • Based on well-established matrix/algebra theory • A rich framework (clustering, ordering, ranking, etc.) • State-of-the-art algorithm
Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Application to computing protein interaction modules • Cliques • Bi-cliques • Results
Clustering Protein Complex Graph • Input: protein complex data: B • protein – protein network • protein complex – protein complex network (Ding, He, Meraz, Holbrook, Proteins, 2004)
Experimental Protein Complex Computed
Implications of discovered protein clusters for protein interactions: F-statistics • F-statistics of amino acids and physical properties across all protein clusters measure statistical significance • Lys, Arg, Asp are most significant => electrostatic forces are the dominant surface factors influencing protein interactions • Surprise: secondary structure is not an important factor in protein module formation
Protein Secondary Structure • Alpha helix • Beta sheet • Coil regions
Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Cliques • Bi-cliques • Results
Protein Interaction Modules • Find highly connected regions • cliques: every node connects to every other node • k-core: subgraph in which every node has degree ≥ k (in the 3-core example, every node connects to at least 3 others)
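A k-core can be found by repeatedly peeling off low-degree nodes. The sketch below assumes the common convention that every node in a k-core has at least k neighbors inside the subgraph; the function name `k_core` and the toy graph are hypothetical.

```python
def k_core(adj, k):
    """Return the node set of the k-core by repeatedly removing nodes with degree < k."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree counted only over neighbors that are still in the subgraph.
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.remove(v)
                changed = True
    return alive

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a tail 3-4-5.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
core3 = k_core(adj, 3)
print(core3)
```

Unlike clique finding, this peeling procedure runs in polynomial time, which makes k-cores a cheap first filter for dense regions.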
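Clique: every node connects to every other node • Motzkin-Straus formalism for computing maximal cliques • Clique computing is NP-hard; even approximating the maximum clique is hard. • Motzkin-Straus Theorem: $\max_x x^T A x$ s.t. $\sum_i x_i = 1,\ x_i \ge 0$, where $A$ is the adjacency matrix and $x$ is a vector on all nodes of the graph; the optimal value is $1 - 1/k$, where $k$ is the size of the maximum clique. • The L1 constraint enforces sparsity • Non-zero entries of the solution define the clique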
Generalized Motzkin-Straus Formalism • Vector x on all nodes of the graph, s.t. a constraint with a tunable parameter • L1 enforces sparsity • Non-zero entries define the clique • Setting the parameter to 1.05, we can compute the maximum clique better than with the standard approach (parameter = 1.0). (Ding, Zhang, Holbrook, 2006)
Algorithm for computing clique • Solving the constrained quadratic programming problem • Initialize the vector, then apply the update rule iteratively • Theorem 1. Correctness: the solution converges to a local maximum • Theorem 2. Convergence: the iterative algorithm converges
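The iteration can be sketched as a multiplicative update. The code assumes the generalized constraint has the form Σ x_i^β = 1 with x_i ≥ 0, and uses the update x_i ← (x_i (Ax)_i / xᵀAx)^(1/β), which keeps that constraint satisfied and whose fixed points satisfy the KKT condition. The symbol `beta`, this exact form, the threshold `tol`, and the toy graph are assumptions, not taken from the slides.

```python
def find_clique(A, beta=1.05, iters=500, tol=1e-3):
    """Multiplicative update for max x^T A x subject to sum(x_i**beta) == 1, x >= 0."""
    n = len(A)
    x = [n ** (-1.0 / beta)] * n                 # uniform start, satisfies the constraint
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        val = sum(x[i] * Ax[i] for i in range(n))            # current objective x^T A x
        x = [(x[i] * Ax[i] / val) ** (1.0 / beta) for i in range(n)]
    top = max(x)
    # With beta slightly above 1 the small entries are tiny rather than exactly zero,
    # so a relative threshold is used to read off the support.
    return [i for i in range(n) if x[i] > tol * top], x

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a tail 3-4-5.
A = [[0, 1, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]]
clique, x = find_clique(A)
print(clique)
```

With `beta=1.0` the same update reduces to the classical replicator-style iteration for the standard Motzkin-Straus problem. Like any local method, it can stop at a maximal clique that is not the maximum one.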
Proof of Correctness (Constrained Optimization Theory) • Introduce the Lagrangian function • KKT optimality condition (complementary slackness) • Lagrangian multiplier value • At convergence, the update rule satisfies the KKT condition
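Proof of Convergence Using Auxiliary Function (from Machine Learning) • G(x,x’) is an auxiliary function of L(x) if $G(x,x') \le L(x)$ and $G(x,x) = L(x)$ • We maximize a lower bound: set $x^{(t+1)} = \arg\max_x G(x, x^{(t)})$ • Then $L(x^{(t)}) = G(x^{(t)}, x^{(t)}) \le G(x^{(t+1)}, x^{(t)}) \le L(x^{(t+1)})$ • L(x) is monotonically increasing and is bounded from above. Thus the algorithm converges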
Proof of Convergence (cont) • Key: (1) find an auxiliary function, (2) find its global maximum • Take the first-order and second-order derivatives of the auxiliary function • The second-order derivative (Hessian) is negative definite • Thus G(x,x’) is concave in x, and the global maximum is easily obtained.
Subunits Of SRP (signal recognition particle) Complex Clique: Srp19, Srp14, Srp21, Srp54, Srp68, Srp72 The clique also includes a yeast protein SRP21, which is not found in mammalian SRP; forms a pre-SRP structure in the nucleolus that is translocated to the cytoplasm (Halic et al, 2004)
Signal Recognition Particle (SRP) helps proteins pass through the ER membrane. [Figure labels: ribosome; network to transport proteins and lipids. Fig 27-33, Lehninger]
Outline • Protein interaction • Interaction Data • Graph models • Spectral Clustering • Cliques • Bi-cliques • Results
Cliques in a bipartite graph • Finding a complete block in the adjacency matrix • Similar to bi-clustering, widely used in bioinformatics • Example. Gene expression profiles: a gene is relevant only for a certain subset of cellular processes, not all processes. • Two types of maximal bi-cliques: • Maximum Node Bicliques: max |R|+|C| (perimeter) • Maximum Edge Bicliques: max |R|*|C| (area)
DNA Gene Expression: Lymphoma Cancer (Alizadeh et al, 2000) • Effects of feature selection: select 900 genes out of 4025 genes [Figure axes: genes × tissue samples]
Generalized Motzkin-Straus Theorem for maximal edge biclique • Given a bipartite graph with adjacency matrix B, compute the maximal edge biclique. • Generalized Motzkin-Straus Theorem: vector x on row nodes of the graph, vector y on column nodes of the graph, with the objective maximized s.t. constraints on x and y • Non-zero entries define the biclique (Ding, Zhang, Holbrook, 2006)
Algorithm for computing bicliques • Solving the constrained quadratic programming problem • Update x and y iteratively • Theorem 1. At convergence, the solution satisfies the KKT condition. • Theorem 2. L monotonically increases under the update; the algorithm converges.
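The bipartite iteration can be sketched by alternating two multiplicative updates. The code assumes the objective xᵀBy with constraints Σ x_i^α = 1 and Σ y_j^β = 1 (x, y ≥ 0); the symbols `alpha` and `beta`, this exact form, the threshold `tol`, and the toy matrix are assumptions for illustration, not the slide's formulas.

```python
def find_biclique(B, alpha=1.05, beta=1.05, iters=500, tol=1e-3):
    """Alternating multiplicative updates for max x^T B y under power-sum constraints."""
    m, n = len(B), len(B[0])
    x = [m ** (-1.0 / alpha)] * m                # uniform start on row nodes
    y = [n ** (-1.0 / beta)] * n                 # uniform start on column nodes
    for _ in range(iters):
        By = [sum(B[i][j] * y[j] for j in range(n)) for i in range(m)]
        val = sum(x[i] * By[i] for i in range(m))            # x^T B y
        x = [(x[i] * By[i] / val) ** (1.0 / alpha) for i in range(m)]
        Bx = [sum(B[i][j] * x[i] for i in range(m)) for j in range(n)]
        val = sum(y[j] * Bx[j] for j in range(n))            # x^T B y with the new x
        y = [(y[j] * Bx[j] / val) ** (1.0 / beta) for j in range(n)]
    # Entries far below the largest one are treated as zero.
    rows = [i for i in range(m) if x[i] > tol * max(x)]
    cols = [j for j in range(n) if y[j] > tol * max(y)]
    return rows, cols

# Hypothetical toy matrix: a complete 3 x 3 block plus one extra row touching columns 2 and 3.
B = [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]]
rows, cols = find_biclique(B)
print(rows, cols)
```

With equal exponents, a biclique with |R| rows and |C| columns scores (|R||C|)^(1 − 1/α) under this objective, so larger area is favored, consistent with the maximum edge biclique criterion.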
A New Upper Bound on the size of maximum-edge biclique • Using the generalized Motzkin-Straus theorem, derive a bound in terms of the largest singular value of B (Ding, Li, Jordan, 2007)
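One bound of this type can be verified directly. The derivation below is a sketch under the assumption that x and y are L2-normalized indicator vectors of a biclique with row set R and column set C, and that σ₁(B) denotes the largest singular value of B; it is not necessarily the exact form shown on the slide.

```latex
x_i = \begin{cases} 1/\sqrt{|R|} & i \in R \\ 0 & \text{otherwise} \end{cases}
\qquad
y_j = \begin{cases} 1/\sqrt{|C|} & j \in C \\ 0 & \text{otherwise} \end{cases}

x^T B y = \frac{|R|\,|C|}{\sqrt{|R|\,|C|}} = \sqrt{|R|\,|C|}
\;\le\; \max_{\|u\|_2 = \|v\|_2 = 1} u^T B v = \sigma_1(B)

\Longrightarrow\quad |R|\,|C| \le \sigma_1(B)^2
```

The first equality holds because every entry of B inside the block R × C equals 1, so the area of any biclique is at most the squared largest singular value.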
Biclique Example • Solution vector x • Solution vector y