Optimal Network Alignment with Graphlet Degree Vectors Tijana Milenković (Department of Computing, Imperial College London and Department of Computer Science, University of California), Weng Leong Ng (Department of Computer Science, University of California), Wayne Hayes (Department of Computer Science, University of California and Department of Mathematics, Imperial College London), Nataša Pržulj (Department of Computing, Imperial College London) Cancer Informatics 2010 Presented by: Lila Shnaiderman
Motivation • Recent advances in experimental techniques: • yeast two-hybrid assays, • mass spectrometry of purified complexes, • genome-wide chromatin immunoprecipitation, • etc. • So, increasing amounts of biological network data are becoming available. • Comparative analyses of biological networks are expected to have as large an impact as comparative genomics on our understanding of: • biology • evolution • disease • So, meaningful network comparison across species has become one of the foremost problems in evolutionary and systems biology.
Background • Subgraph isomorphism problem: • Does one graph exist as an exact subgraph of another graph? • NP-complete • So, exact network comparison is computationally infeasible for large networks. • Network alignment: • The most common network comparison method. • A more general problem: • Find the best way to “fit” one graph into another, even when it is not an exact subgraph. • Unclear: • how to guide the alignment process • how to measure the “goodness” of an inexact fit • So, heuristic strategies must be sought.
Background – alignment types • Local alignment: • Used by the majority of existing methods. • Matches a small subnetwork from one network to one or more subnetworks in another network. • Can be ambiguous: one node may be matched to several nodes in the other network. • Global alignment: • Measures the overall similarity between two networks. • Aligns every node in the smaller network to exactly one node in the larger network. • Most existing methods incorporate some a priori information external to network topology, • such as protein sequence similarities in PPI networks. • Best known global network alignment algorithm based solely on network topology: • GRAph ALigner (GRAAL): uses a heuristic search strategy to quickly find approximate alignments.
Current solution: H-GRAAL • Hungarian-algorithm-based GRAAL • More expensive than GRAAL • Guaranteed to find optimal alignments relative to any fixed, deterministic cost function. • Relies solely and explicitly on a strong and direct measure of network topological similarity. • Applicable to any type of network. • Allows knowledge to be transferred between aligned networks.
Graphlet degree vectors (1) • Graphlet: a small connected induced subgraph of a larger network. [Figure: graphlets G0 to G8, with their orbits numbered 0 to 14]
Graphlet degree vectors (2) • Graphlet degree vector of node v: for each orbit, counts the number of graphlets that the node touches at that orbit (for all graphlets on 2 to 5 nodes). [Figure: node v touching orbit 0]
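Graphlet degree vectors (3) [Figure: worked example for node v, orbits 1 and 2]
Graphlet degree vectors (4) [Figure: worked example for node v, orbits 1 and 2, continued]
Graphlet degree vectors (4) [Figure: worked example for node v, orbits 3, 4 and 5]
Graphlet degree vectors (5) [Figure: worked example for node v, orbits 4 and 5]
Graphlet degree vectors (6) [Figure: worked example for node v, orbits 6, 7 and 8]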
Graphlet degree vectors (7) • What is the degree of node v (according to the vector)? • The vector is called the signature of node v. • There are 73 different orbits across all 2-5-node graphlets. [Figure: the signature of node v; orbits 9, 10 and 11 labeled]
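As a small illustration of how signature coordinates are counted, the sketch below computes only the first four coordinates (orbits 0 to 3, i.e. the graphlets on 2 and 3 nodes) for a single node. The function name `small_orbit_counts` and the adjacency-dictionary input are assumptions made for this example; a full signature covers all 73 orbits and needs a dedicated graphlet counter.

```python
from itertools import combinations


def small_orbit_counts(adj, v):
    """Return [orbit0, orbit1, orbit2, orbit3] for node v.

    adj maps each node to the set of its neighbours (undirected simple graph).
    Orbit 0: v is an end of an edge (this is the ordinary degree).
    Orbit 1: v is an end of an induced 3-node path.
    Orbit 2: v is the middle of an induced 3-node path.
    Orbit 3: v is a node of a triangle.
    """
    nbrs = adj[v]
    orbit0 = len(nbrs)
    # Two neighbours of v that are adjacent to each other close a triangle.
    orbit3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Every other pair of neighbours makes v the middle of an induced path.
    orbit2 = orbit0 * (orbit0 - 1) // 2 - orbit3
    # Paths v - a - b where b is neither v nor a neighbour of v.
    orbit1 = sum(1 for a in nbrs for b in adj[a] if b != v and b not in nbrs)
    return [orbit0, orbit1, orbit2, orbit3]


# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(small_orbit_counts(adj, "c"))  # [3, 0, 2, 1]
print(small_orbit_counts(adj, "d"))  # [1, 2, 0, 0]
```

The first coordinate is the plain degree of the node, which is why the signature generalizes the degree.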
Degree Vector - Signature • Many real-world networks: • have a small-world nature • So, the graphlet degree vector is an effective measure: • It looks at a network distance of up to 4 around a node • It captures a large portion of the network topology • Thus, comparing two signatures gives: • a highly constraining measure of local topological similarity between nodes.
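Signature similarity • For a node u ∈ G, $u_i$ is the ith coordinate of its signature vector. • Distance between the ith orbits of nodes u and v: $D_i(u,v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)}$ • $w_i$ is the weight of orbit i. • Accounts for dependencies between orbits • Higher weights go to orbits that are not affected by many other orbits • Questions: • Why log? • Why “+1”?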
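Distance and Similarity • Total distance, built from the per-orbit distances $D_i(u,v)$ and orbit weights $w_i$: $D(u,v) = \frac{\sum_{i=0}^{72} D_i(u,v)}{\sum_{i=0}^{72} w_i}$ • D(u,v) is in [0, 1) • 0 means: u and v have identical signatures • Similarity: S(u,v) = 1 − D(u,v)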
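The distance and similarity can be sketched in a few lines. This is a minimal sketch, assuming the per-orbit distance used in the GRAAL family of papers, $D_i = w_i |\log(u_i+1) - \log(v_i+1)| / \log(\max\{u_i, v_i\} + 2)$, normalized by the sum of the weights. The function names are hypothetical, and the uniform weights in the example are placeholders, not the published orbit weights.

```python
import math


def signature_distance(sig_u, sig_v, weights):
    """Weighted, normalized distance between two signature vectors."""
    total = 0.0
    for u_i, v_i, w_i in zip(sig_u, sig_v, weights):
        # The log damps the influence of very large counts; the +1 keeps
        # log() defined when a count is zero.
        diff = abs(math.log(u_i + 1) - math.log(v_i + 1))
        total += w_i * diff / math.log(max(u_i, v_i) + 2)
    return total / sum(weights)


def signature_similarity(sig_u, sig_v, weights):
    """S(u, v) = 1 - D(u, v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)


# Hypothetical 4-orbit signatures and placeholder uniform weights.
sig_u = [3, 0, 2, 1]
sig_v = [1, 2, 0, 0]
weights = [1.0, 1.0, 1.0, 1.0]

print(signature_distance(sig_u, sig_u, weights))  # 0.0 for identical signatures
print(round(signature_similarity(sig_u, sig_v, weights), 4))
```

Because each per-orbit term is strictly below its weight, the normalized distance stays below 1 no matter how different the two signatures are.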
H-GRAAL algorithm - definitions • G1 and G2 are networks: • |V(G1)| ≤ |V(G2)| • Alignment of G1 to G2: • a set of ordered pairs (u,v), u ∈ V(G1) and v ∈ V(G2) • no two ordered pairs share the same G1-node or the same G2-node. • Each pair is called an aligned pair. • Maximum alignment: • Every G1-node is in some aligned pair • From now on: alignment = maximum alignment
H-GRAAL algorithm • H-GRAAL: • Hungarian-algorithm-based GRAph ALigner • Produces an alignment: • of minimum total cost between the networks • total cost: summed over all aligned pairs • aligned pair cost: based on signature similarity • The cost of aligning u and v: $C(u,v) = 2 - ((1 - \alpha) \times T(u,v) + \alpha \times S(u,v))$, where $T(u,v) = \frac{\deg(u) + \deg(v)}{\mathrm{max\_deg}(G_1) + \mathrm{max\_deg}(G_2)}$ • Favors alignment of the densest parts of the networks • Reduced as the degrees of both nodes increase: higher-degree nodes with similar signatures provide a tighter constraint • α ∈ [0, 1]: weights the contribution of the signature similarity between u and v • 1 − α: weights the contribution of the node degrees.
Alignment Cost • Cost = 0: a pair of topologically identical nodes u and v. • Cost close to 2: a pair of topologically very different nodes. • Any problem with this formula? • T(u,v) is very low for most nodes: • There are only a few hubs (highly linked nodes), • so max_deg(G1) and max_deg(G2) are much larger than deg(u) and deg(v) for most nodes.
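A minimal sketch of the pair cost, assuming the form C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)) with T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)), as used in the GRAAL family of papers. The function name and argument names are hypothetical, and the numbers in the example are made up.

```python
def alignment_cost(deg_u, deg_v, max_deg_g1, max_deg_g2, similarity, alpha):
    """Cost of aligning u with v under the assumed formula.

    similarity is the signature similarity S(u, v); alpha in [0, 1] weights
    it against the degree term T(u, v).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    degree_term = (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)
    return 2.0 - ((1.0 - alpha) * degree_term + alpha * similarity)


# Hypothetical low-degree pair in two networks whose hubs have degree 10.
low = alignment_cost(2, 3, 10, 10, similarity=0.8, alpha=0.5)
# Hypothetical hub pair with the same signature similarity.
hub = alignment_cost(10, 10, 10, 10, similarity=0.8, alpha=0.5)
print(low, hub)
```

With equal signature similarity, the hub pair gets the lower cost, which illustrates why the degree term is small for most node pairs in a network with few hubs.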
Hungarian Algorithm • Solves the assignment problem in polynomial time: • Create a bipartite graph with node sets V(G1) and V(G2). • Edge (u,v) from V(G1) to V(G2): labeled with the node alignment cost. • Find a minimum-cost matching that covers every node of V(G1). • More than one optimal alignment is possible: • The particular alignment found depends heavily on the implementation details of the underlying Hungarian algorithm. • For example: the order in which the nodes are presented to the algorithm.
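The assignment step can be sketched with a compact O(n²m) Hungarian algorithm in pure Python. This is a textbook potentials-based variant written for illustration, not the H-GRAAL implementation; the function name `hungarian` and the cost matrix are assumptions. Rows stand for G1-nodes, columns for G2-nodes, and there must be at least as many columns as rows.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    cost is an n x m matrix (list of lists) with n <= m.
    Returns (assignment, total) where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the new row.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical 3 x 3 matrix of node alignment costs.
square = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
print(hungarian(square))  # ([1, 0, 2], 5)
```

When several assignments share the minimum cost, which one is returned depends on the row order and tie-breaking inside the loop, which is the implementation dependence noted on the slide.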
Finding a Few Optimal Alignments • Goal: learn about all possible optimal matchings. • Make H-GRAAL produce more alignments: • “Remove” (u,v): raise the alignment cost of a node pair (u,v) in the optimal alignment A0 to +∞. • Run H-GRAAL again. • If the alignment found has a higher cost than A0, “remove” a different edge instead. • If no alignment of optimal cost is found after trying to “remove” every edge, no more optimal alignments exist. • This process is too expensive: • $O(|V(G_1)|^3 \times |E(G_1)|)$ • A faster variant exists, $O(|V(G_1)|^2 \times |E(G_1)|)$ (based on the dynamic Hungarian algorithm). • My remark: still very slow (can take months…)
Few Optimal Alignments algorithm • Optimizing pair: • An aligned pair that appears in at least one optimal alignment. • The set of optimizing pairs: • Can be computed in at worst $O(n^4)$ time. • Can be easily parallelized. • My remark: too slow…
Few Optimal Alignments - Analysis • Significance of an aligned pair: • Judged by the number of optimizing pairs per G1-node u. • If (u,v) is the only optimizing pair for u, every optimal alignment contains (u,v), i.e., (u,v) is highly significant. • Core alignment: • the set of all such unique optimizing pairs. • A large core alignment means a stable alignment.
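The notions of optimizing pair and core alignment can be illustrated on a tiny cost matrix. The sketch below enumerates every alignment by brute force, which is only feasible for a handful of nodes; it is not the dynamic-Hungarian procedure described on the slides. All function names and the matrix are assumptions made for the example.

```python
from itertools import permutations


def optimal_alignments(cost, tol=1e-9):
    """Enumerate all minimum-cost alignments of an n x m matrix (n <= m)."""
    n, m = len(cost), len(cost[0])
    best, found = float("inf"), []
    for cols in permutations(range(m), n):
        total = sum(cost[i][cols[i]] for i in range(n))
        if total < best - tol:
            best, found = total, [cols]
        elif abs(total - best) <= tol:
            found.append(cols)
    return best, found


def optimizing_pairs(alignments):
    """Pairs (u, v) that appear in at least one optimal alignment."""
    return {(u, v) for cols in alignments for u, v in enumerate(cols)}


def core_alignment(pairs):
    """Pairs (u, v) where v is the only optimizing partner of u."""
    partners = {}
    for u, v in pairs:
        partners.setdefault(u, set()).add(v)
    return {(u, next(iter(vs))) for u, vs in partners.items() if len(vs) == 1}


# Hypothetical costs: rows 0 and 1 are interchangeable, row 2 is not.
cost = [[1, 1, 5], [1, 1, 5], [5, 5, 1]]
best, alignments = optimal_alignments(cost)
pairs = optimizing_pairs(alignments)
print(best, sorted(alignments))       # 3 [(0, 1, 2), (1, 0, 2)]
print(sorted(core_alignment(pairs)))  # [(2, 2)]
```

Here only node 2 has a unique optimizing partner, so the core alignment contains a single pair even though two optimal alignments exist.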
Measures of alignment quality (1) • Edge correctness (EC): • percentage of edges in one graph that are aligned to edges in the other graph. • The following measures require knowing the “true alignment”: • Node correctness (NC): • percentage of nodes in one network that are correctly aligned to nodes in the other network • Interaction correctness (IC): • percentage of interactions that are aligned correctly • IC is stricter than EC: • EC does not require that the alignment partners are the correct ones.
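Edge correctness and node correctness are simple to compute once an alignment is given as a mapping from G1-nodes to G2-nodes. The sketch below is a minimal illustration; the function names, the edge-list representation and the toy data are assumptions. It returns fractions, so multiply by 100 for percentages.

```python
def edge_correctness(edges_g1, edges_g2, alignment):
    """Fraction of G1 edges whose aligned endpoints form an edge in G2."""
    g2 = {frozenset(e) for e in edges_g2}  # undirected edges
    aligned = sum(
        1 for a, b in edges_g1 if frozenset((alignment[a], alignment[b])) in g2
    )
    return aligned / len(edges_g1)


def node_correctness(alignment, true_alignment):
    """Fraction of G1 nodes mapped to their partner in the true alignment."""
    correct = sum(1 for u, v in alignment.items() if true_alignment.get(u) == v)
    return correct / len(alignment)


# Hypothetical toy networks and alignments.
edges_g1 = [("a", "b"), ("b", "c")]
edges_g2 = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
alignment = {"a": "x", "b": "y", "c": "w"}
true_alignment = {"a": "x", "b": "y", "c": "z"}

print(edge_correctness(edges_g1, edges_g2, alignment))  # 0.5
print(node_correctness(alignment, true_alignment))
```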
Measures of alignment quality (2) • Usually the “true alignment” is not known • So, only EC can be measured • Two alignments can have similar ECs even though one is “good” and the other is “bad”, so EC alone is not enough. • To uncover regions of similar topology: • the aligned edges must cluster together and form large and dense connected subgraphs. • Common connected subgraph (CCS): • a connected subgraph that appears in both networks • A good alignment has: • large and dense CCSs • a large EC
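One way to look for a large common connected subgraph is to keep only the G1 edges that the alignment maps onto G2 edges and then take the largest connected component of what remains. The sketch below follows that idea; the function name `largest_ccs` and the toy data are assumptions, and this is an illustration rather than the procedure used in the paper.

```python
def largest_ccs(edges_g1, edges_g2, alignment):
    """Largest connected component of the aligned-edge subgraph of G1.

    Returns (nodes, number_of_edges) for that component.
    """
    g2 = {frozenset(e) for e in edges_g2}
    adj = {}
    for a, b in edges_g1:
        if frozenset((alignment[a], alignment[b])) in g2:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    best, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Depth-first search over conserved edges only.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    n_edges = sum(len(adj[node] & best) for node in best) // 2
    return best, n_edges


# Hypothetical toy data: a conserved triangle and one conserved isolated edge.
edges_g1 = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6)]
edges_g2 = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
alignment = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}

nodes, n_edges = largest_ccs(edges_g1, edges_g2, alignment)
print(sorted(nodes), n_edges)  # [1, 2, 3] 3
```

In this toy case four of the five G1 edges are conserved, but only three of them cluster into one component, which is the distinction between a high EC and a large CCS.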
Statistical Significance • Random alignment of real-world networks: • the probability of obtaining a given or better EC at random. • Null model of random alignment: • Random mapping g: E1 → V2 × V2. • n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|. • p = n2(n2 − 1)/2: the number of node pairs in G2 • EC = x%: the edge correctness of the given alignment • k = [m1 × x] (with x taken as a fraction): the number of edges from G1 that are aligned to edges in G2. • P: the probability of successfully aligning k or more edges by chance (the tail of the hypergeometric distribution): $P = \sum_{i=k}^{\min(m_1, m_2)} \frac{\binom{m_2}{i} \binom{p - m_2}{m_1 - i}}{\binom{p}{m_1}}$.
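The hypergeometric tail can be evaluated exactly with integer binomial coefficients. The sketch below assumes the null model described on the slide: the m1 edges of G1 land on distinct node pairs of G2 chosen at random, of which m2 are edges. The function name `ec_p_value` and the toy numbers are assumptions.

```python
from math import comb


def ec_p_value(n2, m1, m2, k):
    """Probability of aligning k or more of m1 edges to G2 edges by chance.

    n2: number of nodes in G2, m1: edges in G1, m2: edges in G2.
    """
    p = n2 * (n2 - 1) // 2  # number of node pairs in G2
    tail = sum(
        comb(m2, i) * comb(p - m2, m1 - i) for i in range(k, min(m1, m2) + 1)
    )
    return tail / comb(p, m1)


# Hypothetical toy case: G2 has 4 nodes and 3 edges, G1 has 2 edges.
print(ec_p_value(n2=4, m1=2, m2=3, k=1))  # 0.8
print(ec_p_value(n2=4, m1=2, m2=3, k=2))  # 0.2
```

Python integers do not overflow, so the same function works for network-sized inputs, although the sum then has many terms.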
More Statistical Significance Metrics • H-GRAAL’s alignment of random model networks: • Checks the significance of the alignment in comparison to alignments of random networks: • align two PPI networks, • align them with random networks, • compare the results. • Biological validation: • Find the number of aligned protein pairs sharing a Gene Ontology (GO) term. • Compute its statistical significance. • Significance of functional enrichments: • Align metabolic networks of different species. • Generate phylogenetic trees based on H-GRAAL’s ECs. • Compute their statistical significance.
Results (1) • H-GRAAL produces better alignments than GRAAL for all values of α. • Using only degrees (α = 0) gives poor results. • So, graphlet-based signatures are far more valuable than a measure based on degree alone.
Results (2) • The largest common connected sub-graph in the alignment of the yeast and human PPI networks • consisting of 1,290 interactions amongst 317 proteins. • This network appears, in its entirety, in the PPI networks of both species.
Results (3) • Statistics of H-GRAAL’s core yeast-human alignment for α = 0.5. • The percentage of yeast proteins, out of 2,390 of them, that participate in n “optimizing pairs”. • Shows the quality of H-GRAAL!
Results (4) • Comparison of the phylogenetic trees for protists and fungi • H-GRAAL’s and GRAAL’s trees are slightly different from the sequence-based one. • Sequence-based trees are built based on: • multiple alignment of gene sequences • whole genome alignments.
Results (5) • Multiple alignments have several problems: • They can be misleading due to gene rearrangements, inversions, transpositions, and translocations (at the substring level). • Different species might have an unequal number of genes or genomes of vastly different lengths. • Whole genome alignments can be misleading: • Noncontiguous copies of a gene or non-decisive gene order. • The trees are built incrementally from smaller pieces that are “patched” together probabilistically, so probabilistic errors are expected. • H-GRAAL’s and GRAAL’s trees have none of these problems. But: • PPI networks are noisy. • PPI networks are incomplete. • There is no reason to believe that the sequence-based tree or GRAAL’s tree should a priori be considered the correct one.
Conclusions • Presented the H-GRAAL algorithm for global alignment between networks. • Presented different statistics to evaluate the quality of an alignment. • Experimented with PPI networks and with other networks (metabolic networks). • Showed that H-GRAAL produces better alignments than GRAAL, the best previously known topology-based global alignment algorithm. • H-GRAAL can have a large influence on research into biological networks.