Laur Tooming. Graph mining in bioinformatics. Graphs in biology. Graphs are often used in bioinformatics for describing processes in the cell Vertices are genes or proteins The meaning of an edge depends on the type of the graph Protein-protein interaction Gene regulation.
Laur Tooming
Graph mining in bioinformatics
Graphs in biology
• Graphs are often used in bioinformatics for describing processes in the cell
• Vertices are genes or proteins
• The meaning of an edge depends on the type of the graph
  • Protein-protein interaction
  • Gene regulation
What we’re looking for
• We want to find sets of genes that have a biological meaning.
• Idea: find graph-theoretically relevant sets of vertices and find out if they are also biologically meaningful.
• Simple example: connected components
• A more advanced idea: graph clustering. Find subgraphs that have a high edge density.
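The two ideas on this slide can be sketched in a few lines. This is a minimal illustration and not code from the talk: the function names `connected_components` and `edge_density` and the toy gene labels are assumptions, and the graph is taken to be undirected and unweighted.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in neighbours[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


def edge_density(vertex_set, edges):
    """Fraction of possible vertex pairs inside `vertex_set` that are joined by an edge."""
    n = len(vertex_set)
    if n < 2:
        return 0.0
    inside = sum(1 for u, v in edges if u in vertex_set and v in vertex_set)
    return inside / (n * (n - 1) / 2)


# Hypothetical toy graph: five genes, three interactions.
genes = ["A", "B", "C", "D", "E"]
interactions = [("A", "B"), ("B", "C"), ("D", "E")]
for component in connected_components(genes, interactions):
    print(sorted(component), edge_density(component, interactions))
```

A candidate gene set with density close to 1 is nearly a clique; connected components alone only guarantee that the set is reachable, not that it is dense.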
Markov Cluster Algorithm (MCL)
• If a graph has cluster structure, random walks tend to remain inside a cluster for a long time
• The graph is modelled as a stochastic matrix: the entries in each column sum to 1
• a_ij is the probability that a random walk leaving vertex j goes to vertex i on the next step
• A bigger edge weight means a greater probability of choosing that edge
Source: Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://micans.org/mcl/
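As a small illustration of the stochastic matrix described here, the sketch below turns a weighted adjacency matrix into a column-stochastic one. The function name `column_stochastic` and the example weights are assumptions made for this illustration.

```python
def column_stochastic(weights):
    """Divide every column of a square weight matrix by its sum.

    Entry [i][j] of the result is the probability that a random walk
    leaving vertex j moves to vertex i on the next step.
    """
    n = len(weights)
    column_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    return [
        [weights[i][j] / column_sums[j] if column_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical path graph A - B - C with edge weights 3 (A-B) and 1 (B-C).
weights = [
    [0, 3, 0],
    [3, 0, 1],
    [0, 1, 0],
]
transition = column_stochastic(weights)
for row in transition:
    print(row)
```

In the column for B, the heavier edge to A receives three quarters of the probability and the lighter edge to C one quarter, which is the "bigger weight, greater probability" rule from the slide.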
Markov Cluster Algorithm (MCL)
• Two procedures, expansion and inflation, are applied alternately
• Expansion: matrix squaring
  • Considers longer random walks
• Inflation: raising each entry to a power r > 1, then rescaling each column so the matrix remains stochastic
  • Weakens weak edges and strengthens strong ones
• The process converges to a steady state, from which the clusters are read off
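The alternation of expansion and inflation can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not van Dongen's implementation: self-loops of weight 1 are added before normalising, a fixed number of iterations replaces a convergence test, and clusters are read from the rows that still hold non-negligible entries. The names `mcl`, `normalise_columns`, `threshold` and the example graph are hypothetical.

```python
def normalise_columns(matrix):
    """Rescale each column so that it sums to 1 (all-zero columns are left as they are)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [
        [matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def mcl(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    """Cluster a graph given as a square adjacency matrix; returns sorted vertex lists."""
    n = len(adjacency)
    # Assumption: add a self-loop to every vertex before normalising.
    m = [
        [float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalise_columns(m)

    for _ in range(iterations):
        # Expansion: square the matrix, i.e. consider random walks twice as long.
        m = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise each entry to a power and rescale the columns.
        m = normalise_columns([[x ** inflation for x in row] for row in m])

    # Each row that keeps non-negligible entries marks one cluster:
    # the columns with mass in that row are the cluster's members.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > threshold)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical example: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(two_triangles))
```

A larger `inflation` value tends to produce more, smaller clusters; a value closer to 1 tends to produce fewer, larger ones.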
Markov Cluster Algorithm (MCL)
Images from http://micans.org/mcl/ani/mcl-animation.html
Betweenness centrality clustering
• An edge between different clusters lies on many shortest paths from one cluster to another.
• An edge inside a cluster lies on fewer shortest paths, because there are more alternative paths inside a cluster.
• Betweenness centrality of an edge: the number of shortest paths in the graph that contain that edge.
• Remove the edges with the highest centrality from the graph to obtain a clustering.
• Optimisations:
  • Instead of all shortest paths, pick a sample of vertices and calculate shortest paths from them
  • Remove several edges at once
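The procedure can be sketched as below for an unweighted, undirected graph. Some choices here are assumptions rather than details from the slide: edge betweenness is computed exactly with Brandes-style accumulation (when several shortest paths connect a pair, each edge receives a fractional share), centralities are recomputed after every removal as in the Girvan-Newman approach, and removal stops once a requested number of components is reached. Neither optimisation from the slide is implemented. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph {vertex: set of neighbours}."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search counting shortest paths (sigma) from `source`.
        dist = {source: 0}
        sigma = {v: 0 for v in adj}
        sigma[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, pushing path counts onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                scores[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted from both of its endpoints.
    return {edge: value / 2 for edge, value in scores.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def cluster_by_betweenness(adj, target_clusters):
    """Remove the most central edge repeatedly until `target_clusters` components exist."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while len(components(adj)) < target_clusters:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical example: two triangles joined by the single edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(cluster_by_betweenness(graph, 2))
```

In this example the bridge 2-3 carries every shortest path between the two triangles, so it is the first edge to be removed.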
GraphWeb
• Web interface for analysing biological graphs
• Simple syntax for entering graphs
  • Multiple datasets
  • Directed edges
  • Edge weights
• Visualising graphs with GraphViz
• Finding biological meaning with g:Profiler
Example input:
ds1: A > B 10
ds2: A > B 4
ds1: B C 5
ds2: C > D 12
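To make the four example lines concrete, here is a small parser sketch. It is based only on what those lines appear to show and is not GraphWeb's own code or its documented grammar: the reading "dataset label, colon, source, optional `>` for a directed edge, target, optional weight" is an assumption, as are the function name `parse_edge_line` and the default weight of 1.0.

```python
def parse_edge_line(line):
    """Parse a line such as 'ds1: A > B 10'.

    Returns (dataset, source, target, directed, weight).
    Assumed reading: '>' marks a directed edge; a missing weight defaults to 1.0.
    """
    dataset, _, rest = line.partition(":")
    tokens = rest.split()
    directed = ">" in tokens
    tokens = [t for t in tokens if t != ">"]
    source, target = tokens[0], tokens[1]
    weight = float(tokens[2]) if len(tokens) > 2 else 1.0
    return dataset.strip(), source, target, directed, weight


example_lines = [
    "ds1: A > B 10",
    "ds2: A > B 4",
    "ds1: B C 5",
    "ds2: C > D 12",
]
for line in example_lines:
    print(parse_edge_line(line))
```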
Combining several datasets
• Whether or not there is an edge between two vertices is determined in biological experiments, which may sometimes give false results.
• For a given graph, different sources may give different information. Some sources may be more trustworthy than others.
• We would like to combine different sources and assess the trustworthiness of each edge in the resulting graph.
• Edge weight in the summary graph: a sum over datasets, where w(e, G_i) is the weight of edge e in dataset G_i and w(G_i) is the weight given to that dataset
  • w(e, G) = Σ_i w(e, G_i) · w(G_i)
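The summary weight can be computed directly from the formula. The sketch below is an illustration with assumed inputs: the dataset weights 1.0 and 0.5 are made-up trust values, an edge missing from a dataset is taken to contribute 0, and the function name `combine_datasets` is hypothetical.

```python
def combine_datasets(datasets, dataset_weights):
    """Compute w(e, G) as the sum over datasets of w(e, G_i) * w(G_i).

    `datasets` maps a dataset name to {edge: weight};
    `dataset_weights` maps a dataset name to its trust weight w(G_i).
    An edge absent from a dataset contributes nothing to the sum.
    """
    combined = {}
    for name, edges in datasets.items():
        for edge, weight in edges.items():
            combined[edge] = combined.get(edge, 0.0) + weight * dataset_weights[name]
    return combined


# Edge weights follow the GraphWeb example lines; the trust weights are invented.
datasets = {
    "ds1": {("A", "B"): 10, ("B", "C"): 5},
    "ds2": {("A", "B"): 4, ("C", "D"): 12},
}
trust = {"ds1": 1.0, "ds2": 0.5}
summary = combine_datasets(datasets, trust)
for edge, weight in sorted(summary.items()):
    print(edge, weight)
```

An edge reported by several trusted sources accumulates a high summary weight, while an edge seen only in a less trusted dataset stays weak and can be filtered out with a threshold.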