Optimal Network Alignment with Graphlet Degree Vectors

Tijana Milenković (Department of Computing, Imperial College London; Department of Computer Science, University of California), Weng Leong Ng (Department of Computer Science, University of California), Wayne Hayes (Department of Computer Science, University of California; Department of Mathematics, Imperial College London), Nataša Pržulj (Department of Computing, Imperial College London)
Cancer Informatics 2010
Presented by: Lila Shnaiderman

Motivation
• Recent advances in experimental techniques:
  • yeast two-hybrid assays
  • mass spectrometry of purified complexes
  • genome-wide chromatin immunoprecipitation
  • etc.
• As a result, increasing amounts of biological network data are becoming available.
• Comparative analyses of biological networks are expected to have as large an impact as comparative genomics on our understanding of:
  • biology
  • evolution
  • disease
• Meaningful network comparison across species is therefore one of the foremost problems in evolutionary and systems biology.

Background
• Subgraph isomorphism problem:
  • Does one graph exist as an exact subgraph of another graph?
  • NP-complete
  • So exact network comparison is computationally infeasible for large networks.
• Network alignment:
  • The most common network comparison method.
  • A more general problem: find the best way to “fit” one graph into another, even when it is not an exact subgraph.
  • Unclear:
    • how to guide the alignment process
    • how to measure the “goodness” of an inexact fit
  • So heuristic strategies must be sought.

Background – alignment types
• Local alignment:
  • The majority of existing methods.
  • Matches a small subnetwork from one network to one or more subnetworks in another network.
  • Can be ambiguous.
• Global alignment:
  • Measures the overall similarity between two networks.
  • Aligns every node in the smaller network to exactly one node in the larger network.
  • Most existing methods incorporate some a priori information external to network topology, such as protein sequence similarity in PPI networks.
• Best-known global network alignment algorithm based solely on network topology:
  • GRAph ALigner (GRAAL): uses a heuristic search strategy to quickly find approximate alignments.

Current solution: H-GRAAL
• Hungarian-algorithm-based GRAAL
• More computationally expensive than GRAAL
• Guaranteed to find optimal alignments relative to any fixed, deterministic cost function.
• Relies solely and explicitly on a strong and direct measure of network topological similarity.
• Applicable to any type of network.
• Allows knowledge to be transferred between aligned networks.

Graphlet degree vectors (1)
• Graphlet: a small connected induced subgraph of a larger network.
[Figure: graphlets G0 to G8 with their orbits numbered 0 to 14]

Graphlet degree vectors (2)
• Graphlet degree vector of node v: counts the number of graphlets that the node touches at each orbit (for all graphlets on 2 to 5 nodes).
[Figure: example for orbit 0 at node v]

Graphlet degree vectors (3) to (6)
[Figures: step-by-step example for node v, covering orbits 1 to 8]

Graphlet degree vectors (7)
• What is the degree of node v (according to the vector)?
• The vector of orbit counts is the signature of node v.
• There are 73 different orbits across all 2- to 5-node graphlets.
[Figure: orbits 9, 10 and 11 at node v]
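
As a minimal sketch of how a signature is built, the function below counts only the first four coordinates (orbits 0 to 3, which belong to the 2- and 3-node graphlets) for one node. The orbit numbering follows the usual convention (0: edge, 1: end of a 3-node path, 2: middle of a 3-node path, 3: triangle); the function name, the adjacency-set representation and the toy graph are assumptions made for illustration. A full signature would continue in the same way up to orbit 72.

```python
from itertools import combinations

def small_orbit_counts(adj, v):
    """Return [orbit0, orbit1, orbit2, orbit3] for node v.

    adj maps every node to the set of its neighbours (undirected graph).
    """
    nbrs = adj[v]
    orbit0 = len(nbrs)  # orbit 0 is the ordinary degree
    orbit1 = orbit2 = orbit3 = 0
    # Each pair of neighbours of v forms either a triangle with v
    # or an induced 3-node path with v in the middle.
    for a, b in combinations(nbrs, 2):
        if b in adj[a]:
            orbit3 += 1
        else:
            orbit2 += 1
    # Induced paths v - a - b (b not adjacent to v): v is an end node.
    for a in nbrs:
        for b in adj[a]:
            if b != v and b not in nbrs:
                orbit1 += 1
    return [orbit0, orbit1, orbit2, orbit3]

# Toy graph (hypothetical): triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(small_orbit_counts(adj, "c"))  # [3, 0, 2, 1]
```

Coordinate 0 of the result is the degree of the node, which is why the signature generalizes the plain node degree.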

Degree vector – signature
• Many real-world networks have a small-world nature.
• So the degree vector is an effective measure:
  • It looks at a network distance of up to 4 around a node.
  • It captures a large portion of the network topology.
• Thus, comparing two signatures gives a highly constraining measure of local topological similarity between nodes.

Signature similarity
• For a node u ∈ G, u_i is the ith coordinate of its signature vector.
• Distance between the ith orbits of nodes u and v:
  D_i(u,v) = w_i × |log(u_i + 1) − log(v_i + 1)| / log(max{u_i, v_i} + 2)
• w_i is the weight of orbit i:
  • Accounts for dependencies between orbits.
  • Assigns higher weights to orbits that are not affected by many other orbits.
• Questions:
  • Why log?
  • Why “+1”?

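Distance and similarity
• Total distance D(u,v): the sum of the per-orbit distances over all 73 orbits, divided by the sum of the orbit weights w_i.
  • D(u,v) lies in [0,1).
  • 0 means: u and v have identical signatures.
• Similarity: S(u,v) = 1 − D(u,v)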
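
The distance and similarity can be sketched in a few lines. The sketch assumes the per-orbit distance has the form w_i × |log(u_i + 1) − log(v_i + 1)| / log(max{u_i, v_i} + 2), as in the graphlet degree signature literature. The function names are hypothetical, and the orbit weights are passed in as a parameter because their values are not given on the slides; the uniform weights in the example are purely illustrative.

```python
import math

def orbit_distance(ui, vi, wi):
    """Weighted distance between the counts of one orbit for two nodes."""
    # The +1 keeps log() defined when a count is 0; the +2 keeps the
    # denominator positive even when both counts are 0.
    return wi * abs(math.log(ui + 1) - math.log(vi + 1)) / math.log(max(ui, vi) + 2)

def signature_distance(sig_u, sig_v, weights):
    """Total distance: per-orbit distances summed and normalised by the weights."""
    total = sum(orbit_distance(a, b, w) for a, b, w in zip(sig_u, sig_v, weights))
    return total / sum(weights)

def signature_similarity(sig_u, sig_v, weights):
    """Similarity S(u,v) = 1 - D(u,v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)

# Hypothetical 4-coordinate signatures with uniform weights (assumption).
sig_u = [3, 0, 2, 1]
sig_v = [1, 2, 0, 0]
weights = [1.0, 1.0, 1.0, 1.0]
print(signature_similarity(sig_u, sig_v, weights))
```

Under this assumed form, the logarithm damps differences between large counts, so two hubs whose counts differ by a few units are still treated as similar.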

H-GRAAL algorithm – definitions
• G1 and G2 are networks with |V(G1)| < |V(G2)|.
• Alignment of G1 to G2:
  • A set of ordered pairs (u,v), with u ∈ V(G1) and v ∈ V(G2).
  • No two ordered pairs share the same G1-node or the same G2-node.
  • Each pair is called an aligned pair.
• Maximum alignment: every G1-node is in some aligned pair.
• From now on: alignment = maximum alignment.

H-GRAAL algorithm
• H-GRAAL: Hungarian-algorithm-based GRAph ALigner
• Produces an alignment of minimum total cost between the networks:
  • Total cost: summed over all aligned pairs.
  • Aligned pair cost: based on signature similarity.
• The cost of aligning u and v:
  C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)), where T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2))
  • Favors alignment of the densest parts of the networks.
  • Is reduced as the degrees of both nodes increase: higher-degree nodes with similar signatures provide a tighter constraint.
  • α ∈ [0, 1] weights the contribution of the signature similarity between u and v.
  • 1 − α weights the contribution of the node degrees.

Alignment cost
• Cost = 0: a pair of topologically identical nodes u and v.
• Cost close to 2: a pair of topologically very different nodes.
• Any problem with this formula?
  • The degree term T(u,v) is very low for most nodes:
    • There is only a small number of hubs (highly linked nodes),
    • so max_deg(G1) and max_deg(G2) are much larger than deg(u) and deg(v) for most pairs.
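
A small sketch of the node-pair cost, assuming it takes the GRAAL-style form C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)) with degree term T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)). The function and parameter names are hypothetical, and the numbers in the example are made up to show how small the degree term is for ordinary (non-hub) nodes.

```python
def degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2):
    """T(u,v): combined degree of the pair relative to the largest degrees."""
    return (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)

def alignment_cost(deg_u, deg_v, max_deg_g1, max_deg_g2, similarity, alpha):
    """Cost of aligning u with v; lower cost means a better match.

    similarity is the signature similarity S(u,v) in (0, 1],
    alpha in [0, 1] balances signature similarity against degrees.
    """
    t = degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2)
    return 2 - ((1 - alpha) * t + alpha * similarity)

# Two low-degree nodes in networks whose hubs have degree 200 and 300 (made-up values).
print(degree_term(2, 3, 200, 300))                # 0.01
print(alignment_cost(2, 3, 200, 300, 0.8, 0.5))
```

With hubs that large, the degree term contributes almost nothing for typical pairs, which is the concern raised on the slide.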

Hungarian algorithm
• Solves the assignment problem in polynomial time:
  • Create a bipartite graph whose two node sets are V(G1) and V(G2).
  • Label each edge (u,v) from V(G1) to V(G2) with the cost of aligning u and v.
  • Find a minimum-cost matching that covers every node of V(G1).
• More than one optimal alignment is possible:
  • The particular alignment found depends heavily on the implementation details of the underlying Hungarian algorithm,
  • for example, the order in which the nodes are presented to the algorithm.
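
The assignment step can be sketched with a compact, pure-Python version of the Hungarian algorithm (the potentials-based O(n²m) formulation). This is an illustrative implementation, not the one used by H-GRAAL; the function name and the tiny cost matrix are assumptions. Rows stand for G1-nodes, columns for G2-nodes, and the matrix may have more columns than rows.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    cost is an n x m list of lists with n <= m.
    Returns (total_cost, assignment) where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j]: row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return total, assignment

# Hypothetical costs for 2 nodes of G1 (rows) and 3 nodes of G2 (columns).
cost = [[1.2, 0.4, 1.9],
        [0.3, 1.5, 1.1]]
total, assignment = hungarian(cost)
print(assignment)  # [1, 0]
```

When several assignments share the minimum cost, which one is returned depends on details such as the row order, which matches the remark about implementation dependence.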

Finding a few optimal alignments
• We can learn about all possible optimal matchings.
• Make H-GRAAL produce more alignments:
  • “Remove” (u,v): raise the alignment cost of a node pair (u,v) of an optimal alignment A0 already found to +∞.
  • Run H-GRAAL again.
  • If the alignment found has a higher cost than A0, “remove” a different edge instead.
  • If no alignment of optimal cost is found after trying to “remove” every edge, no more optimal alignments exist.
• This process has too high a complexity:
  • O(|V(G1)|^3 × |E(G1)|)
  • A fix exists with O(|V(G1)|^2 × |E(G1)|), based on the dynamic Hungarian algorithm.
• My remark: still very slow (can take months…)

Few optimal alignments – algorithm
• Optimizing aligned pair: a pair that appears in at least one optimal alignment.
• The set of optimizing pairs:
  • Can be computed in at worst O(n^4) time.
  • Can be easily parallelized.
• My remark: too slow…

Few optimal alignments – analysis
• Significance of an aligned pair:
  • Judged by the number of optimizing pairs per G1-node u.
  • If (u,v) is the only optimizing pair for u, every optimal alignment contains (u,v), i.e., (u,v) is highly significant.
• Core alignment: the set of all such unique optimizing pairs.
• A large core alignment means a stable alignment.
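
To make the definitions of optimizing pairs and the core alignment concrete, the sketch below enumerates every optimal alignment of a tiny cost matrix by brute force. This is deliberately not the dynamic-Hungarian approach mentioned on the slides; it only works for toy inputs. The function names and the integer cost matrix are assumptions chosen so that ties are exact.

```python
from itertools import permutations

def optimal_alignments(cost):
    """Return (best_cost, list of all minimum-cost alignments), by brute force.

    cost is n x m with n <= m; each alignment is a tuple cols with
    cols[i] = column aligned to row i. Integer costs keep ties exact.
    """
    n, m = len(cost), len(cost[0])
    best, winners = None, []
    for cols in permutations(range(m), n):
        total = sum(cost[i][cols[i]] for i in range(n))
        if best is None or total < best:
            best, winners = total, [cols]
        elif total == best:
            winners.append(cols)
    return best, winners

def optimizing_pairs(cost):
    """All pairs (u, v) that appear in at least one optimal alignment."""
    _, winners = optimal_alignments(cost)
    return {(i, j) for cols in winners for i, j in enumerate(cols)}

def core_alignment(cost):
    """Pairs (u, v) where v is the only optimizing partner of u."""
    partners = {}
    for i, j in optimizing_pairs(cost):
        partners.setdefault(i, set()).add(j)
    return {(i, next(iter(js))) for i, js in partners.items() if len(js) == 1}

# Hypothetical costs: row 1 is equally happy with column 1 or column 2.
cost = [[1, 5, 5],
        [5, 2, 2]]
print(sorted(optimizing_pairs(cost)))  # [(0, 0), (1, 1), (1, 2)]
print(sorted(core_alignment(cost)))    # [(0, 0)]
```

Row 0 has a single optimizing pair, so it belongs to the core; row 1 has two, so its placement is not stable across optimal alignments.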

Measures of alignment quality (1)
• Edge correctness (EC): percentage of edges in one graph that are aligned to edges in the other graph.
• The following two measures require knowing the “true alignment”:
  • Node correctness (NC): percentage of nodes in one network that are correctly aligned to nodes in the other network.
  • Interaction correctness (IC): percentage of interactions that are aligned correctly.
• IC is stricter than EC: EC does not require that the alignment partners are the correct ones.
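
EC and NC follow directly from their definitions. The sketch below assumes undirected edge lists and an alignment stored as a dictionary from G1-nodes to G2-nodes; the function names and the toy networks are hypothetical.

```python
def edge_correctness(edges1, edges2, alignment):
    """Percentage of G1 edges whose aligned endpoints are joined by an edge in G2."""
    target = {frozenset(e) for e in edges2}
    aligned = sum(
        1 for a, b in edges1
        if frozenset((alignment[a], alignment[b])) in target
    )
    return 100.0 * aligned / len(edges1)

def node_correctness(alignment, true_alignment):
    """Percentage of G1 nodes mapped to their partner in the known true alignment."""
    correct = sum(1 for u, v in alignment.items() if true_alignment.get(u) == v)
    return 100.0 * correct / len(alignment)

# Toy example: a 4-cycle aligned onto a 4-node path plus one extra edge.
edges1 = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
edges2 = [("x", "y"), ("y", "z"), ("z", "w"), ("w", "q")]
alignment = {"a": "x", "b": "y", "c": "z", "d": "w"}
true_alignment = {"a": "x", "b": "y", "c": "w", "d": "z"}

print(edge_correctness(edges1, edges2, alignment))   # 75.0
print(node_correctness(alignment, true_alignment))   # 50.0
```

The example also shows why EC alone can be misleading: three of four edges are conserved even though only half of the nodes are aligned to their true partners.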

Measures of alignment quality (2)
• Usually the “true alignment” is not known, so only EC can be measured.
• Two alignments can have similar ECs while one is “good” and the other is “bad”, so EC alone is not enough.
• To uncover regions of similar topology, the aligned edges must cluster together and form large and dense connected subgraphs.
• Common connected subgraph (CCS): a connected subgraph that appears in both networks.
• A good alignment has:
  • large and dense CCSs
  • a large EC
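
One way to look for a large CCS is to keep only the conserved edges (G1 edges whose aligned endpoints are also joined in G2) and take the largest connected component of what remains. The sketch below does exactly that; it is an assumption about how a CCS could be extracted, with hypothetical names and toy data, not the procedure used in the paper.

```python
def largest_common_connected_subgraph(edges1, edges2, alignment):
    """Return (nodes, edges) of the largest connected component of conserved edges.

    Nodes and edges are reported in terms of G1; size is measured in nodes.
    """
    target = {frozenset(e) for e in edges2}
    conserved = [
        (a, b) for a, b in edges1
        if frozenset((alignment[a], alignment[b])) in target
    ]
    adj = {}
    for a, b in conserved:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    best_nodes, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Depth-first search over conserved edges only.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        if len(component) > len(best_nodes):
            best_nodes = component
    best_edges = [(a, b) for a, b in conserved if a in best_nodes]
    return best_nodes, best_edges

# Toy example: edge c-d is not conserved, so the conserved edges split in two.
edges1 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
edges2 = [("x", "y"), ("y", "z"), ("w", "v")]
alignment = {"a": "x", "b": "y", "c": "z", "d": "w", "e": "v"}
nodes, edges = largest_common_connected_subgraph(edges1, edges2, alignment)
print(sorted(nodes), edges)  # ['a', 'b', 'c'] [('a', 'b'), ('b', 'c')]
```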

Statistical significance
• Random alignment of real-world networks: the probability of obtaining a given or better EC at random.
• Null model of random alignment:
  • Random mapping g: E1 → V2 × V2.
  • n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|.
  • p = n2 (n2 − 1)/2: the number of node pairs in G2.
  • EC = x%: the edge correctness of the given alignment.
  • k = [m1 × x/100]: the number of edges of G1 aligned to edges in G2.
• P: the probability of aligning k or more edges by chance (the tail of the hypergeometric distribution):
  P = 1 − Σ_{i=0}^{k−1} C(m2, i) × C(p − m2, m1 − i) / C(p, m1), where C(a, b) is the binomial coefficient.
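
The hypergeometric tail can be evaluated directly with exact integer binomials. The sketch below assumes the null model described above: the m1 edges of G1 land on distinct node pairs of G2 chosen uniformly among the p possible pairs, m2 of which are real edges. The function name and the toy numbers are assumptions.

```python
from math import comb

def alignment_p_value(n2, m1, m2, k):
    """Probability of aligning k or more of the m1 edges of G1 by chance.

    n2: number of nodes in G2, m2: number of edges in G2,
    k: number of G1 edges aligned to G2 edges in the given alignment.
    """
    p = n2 * (n2 - 1) // 2          # number of node pairs in G2
    total = comb(p, m1)             # ways to place the m1 edges on node pairs
    # Probability of fewer than k successes, summed with exact integers.
    below = sum(comb(m2, i) * comb(p - m2, m1 - i) for i in range(k))
    return 1 - below / total

# Toy numbers: G2 has 4 nodes (6 node pairs) and 3 edges; G1 has 2 edges.
print(alignment_p_value(n2=4, m1=2, m2=3, k=1))
```

For networks the size of real PPI networks the result can fall below double precision, in which case the value prints as 0.0 or loses accuracy; an exact rational type or a log-space computation would be needed there.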

More statistical significance metrics
• H-GRAAL’s alignment of random model networks:
  • Checks the significance of the alignment compared with alignments of random networks:
    • align two PPI networks,
    • align them with random networks,
    • compare the results.
• Biological validation:
  • Find the number of aligned protein pairs sharing a Gene Ontology (GO) term.
  • Compute its statistical significance.
• Significance of functional enrichments:
  • Align metabolic networks of different species.
  • Generate phylogenetic trees based on H-GRAAL’s ECs.
  • Compute their statistical significance.

Results (1)
• H-GRAAL produces better alignments than GRAAL for all values of α.
• Using only degrees (α = 0) gives bad results.
• So graphlet-based signatures are far more valuable than a measure based on degree alone.

Results (2)
• The largest common connected subgraph in the alignment of the yeast and human PPI networks consists of 1,290 interactions amongst 317 proteins.
• This subgraph appears, in its entirety, in the PPI networks of both species.

Results (3)
• Statistics of H-GRAAL’s core yeast-human alignment for α = 0.5.
• The percentage of yeast proteins, out of 2,390, that participate in n “optimizing pairs”.
• Shows the quality of H-GRAAL!

Results (4)
• Comparison of the phylogenetic trees for protists and fungi.
• H-GRAAL’s and GRAAL’s trees are slightly different from the sequence-based one.
• Sequence-based trees are built from:
  • multiple alignment of gene sequences
  • whole-genome alignments

Results (5)
• Multiple alignments have a few problems:
  • They can be misleading due to gene rearrangements, inversions, transpositions, and translocations (at the substring level).
  • Different species might have an unequal number of genes, or genomes of vastly different lengths.
• Whole-genome alignments can be misleading:
  • Noncontiguous copies of a gene or non-decisive gene order.
• The trees are built incrementally from smaller pieces that are “patched” together probabilistically, so probabilistic errors are expected.
• H-GRAAL’s and GRAAL’s trees have none of these problems. But:
  • There are noise problems.
  • PPI networks are incomplete.
• There is no reason to believe that the sequence-based tree or GRAAL’s one should a priori be considered the correct one.

Conclusions
• Presented the H-GRAAL algorithm for global alignment between networks.
• Presented different statistics to evaluate the quality of an alignment.
• Experimented with different PPI networks, and with networks other than PPI networks.
• Showed that H-GRAAL outperforms GRAAL, the best previously known global alignment algorithm based solely on network topology.
• H-GRAAL could have a large influence on research into biological networks.

Thank you for your attention!
