Querying Social Networks • Graph pattern matching • Incremental graph pattern matching • Graph similarity QSX (LN 9)
Querying Social Networks • Graph pattern matching (current part) • Incremental graph pattern matching • Graph similarity
Data in emerging applications: Graphs • Computer vision • Biology • Network traffic • Intelligence analysis • Semantic Web • Social networks
Graph Pattern Matching • Given a pattern graph P and a data graph G, decide whether G matches P, and if so, find all the matches of P in G. • How to define a match? • subgraph isomorphism • graph simulation • Widely employed in a variety of emerging real-life applications • social queries, social matching • biology and chemistry network querying • keyword search, proximity search, …
Subgraph isomorphism • Graph pattern matching: find all subgraphs of G that are isomorphic to P • A match is a bijective function f from the nodes of P to the nodes of a subgraph of G such that: • for each node u in P, u and f(u) have the same label; • there exists an edge (u, u’) in P if and only if there exists an edge (f(u), f(u’)) in G • A function: identical label matching, edge-to-edge relations (Figure: pattern P and data graph G over the labels A, B, D and E; v1 and v2 are nodes of G.)
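The definition on this slide can be checked directly by brute force. The sketch below is only an illustration of the definition, not an algorithm from the lecture: the function name `subgraph_isomorphisms` and the dictionary-based graph encoding are assumptions, and the enumeration is exponential, which is consistent with the intractability of the problem.

```python
from itertools import permutations

def subgraph_isomorphisms(p_labels, p_edges, g_labels, g_edges):
    """Yield every injective, label-preserving map f from P's nodes to G's nodes
    such that (u, u2) is an edge of P iff (f(u), f(u2)) is an edge of G."""
    p_nodes = list(p_labels)
    p_edge_set, g_edge_set = set(p_edges), set(g_edges)
    # Each permutation of len(p_nodes) distinct data nodes is one injective candidate.
    for image in permutations(g_labels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[f[u]] for u in p_nodes):
            continue  # labels must be identical
        if all(((u, u2) in p_edge_set) == ((f[u], f[u2]) in g_edge_set)
               for u in p_nodes for u2 in p_nodes):
            yield f

# Hypothetical toy input: pattern A -> B, data graph with one A node and two B nodes.
p_labels = {"a": "A", "b": "B"}
p_edges = [("a", "b")]
g_labels = {1: "A", 2: "B", 3: "B"}
g_edges = [(1, 2), (1, 3)]
matches = list(subgraph_isomorphisms(p_labels, p_edges, g_labels, g_edges))
```

Here `matches` holds one mapping per matching subgraph of the toy data graph.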
Graph Simulation • Graph pattern matching: find the maximum simulation relation • A binary relation R on the nodes of P and the nodes of G such that: • for each node u in P and each node v in G, if (u, v) is in R, then u and v have the same label; • for each edge (u, u’) in P and each (u, v) in R, there exists an edge (v, v’) in G such that (u’, v’) is in R • A relation: identical label matching, edge-to-edge relations • Capable enough? (Figure: pattern P and data graph G over the labels A, B, D and E; v1 and v2 are nodes of G.)
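The maximum simulation relation can be computed as a fixpoint: start from all label-compatible pairs and discard pairs that violate the edge condition until nothing changes. The following is a minimal sketch of that idea; the function name `max_simulation` and the input encoding are assumptions, and no attempt is made to reach the best known running time.

```python
def max_simulation(p_labels, p_edges, g_labels, g_edges):
    """Return the maximum simulation relation as a set of (pattern node, data node) pairs."""
    p_succ = {u: set() for u in p_labels}
    for u, u2 in p_edges:
        p_succ[u].add(u2)
    g_succ = {v: set() for v in g_labels}
    for v, v2 in g_edges:
        g_succ[v].add(v2)
    # Start with every pair whose labels agree.
    sim = {u: {v for v in g_labels if g_labels[v] == p_labels[u]} for u in p_labels}
    changed = True
    while changed:
        changed = False
        for u in p_labels:
            for v in list(sim[u]):
                # v must have, for every pattern edge (u, u2), a successor that simulates u2.
                if any(not (g_succ[v] & sim[u2]) for u2 in p_succ[u]):
                    sim[u].discard(v)
                    changed = True
    return {(u, v) for u in sim for v in sim[u]}

# Hypothetical toy input: pattern A -> B; data node 3 is labeled A but has no B successor.
relation = max_simulation({"a": "A", "b": "B"}, [("a", "b")],
                          {1: "A", 2: "B", 3: "A"}, [(1, 2)])
```

G matches P only if every pattern node keeps at least one partner in the returned relation; the sketch leaves that final check to the caller.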
Graph Pattern Matching: drug trafficking example • Find all matches of a pattern in a data graph: identify all suspects in the drug ring • The matches are captured by a relation, not a function, and are not allowed by a bijection • Pattern edges need a (bounded) edge-to-path mapping • Neither subgraph isomorphism nor graph simulation works • The quest for a new approach to graph pattern matching (Figure: drug-ring pattern and data graph; node labels include B, A1 … Am, AM, S, FW and W, with edge bounds 1 and 3.)
Complexity • Subgraph isomorphism: NP-complete, with a possibly exponential number of matching subgraphs • Graph simulation: PTIME, with a result bounded by |Vp||V| • Subgraph isomorphism is intractable, and has a large result • Real-life graphs are typically large • Efficient algorithms are necessary • A simulation-based solution is promising
Data Graphs: capturing data graphs in emerging applications • A data graph is a directed graph G = (V, E, fA) • fA(u) is a tuple (A1 = a1, ..., An = an) • attributes: label, keywords, blogs, comments … • Example: node AI carries (‘dept’=CS, ‘field’=AI), DB carries (‘dept’=CS, ‘field’=DB), Gen carries (‘dept’=Bio, ‘field’=Gen), Eco carries (‘dept’=Bio, ‘field’=Eco) (Figure: data graph over the nodes AI, Med, DB, Gen, Chem, Soc and Eco.)
Pattern Graphs: incorporating search conditions and bounds on hops • A pattern graph is defined as P = (Vp, Ep, fv, fe) • fv(u): a search condition, i.e., a conjunction of atomic formulas A op a with op in {<, ≤, =, ≠, >, ≥}, e.g. fv(CS): ‘dept’=CS • fe(u, u’): a bound on hops, either a positive integer k (bounded) or ∗ (unbounded) (Figure: pattern over the nodes Med, CS, Bio and Soc with edge bounds 2, 3 and ∗.)
Bounded Graph Simulation: a departure from traditional graph simulation • G = (V, E, fA) matches P = (Vp, Ep, fv, fe) via bounded simulation if there exists a binary relation S ⊆ Vp × V such that: • for each u ∈ Vp, there exists v ∈ V such that (u, v) ∈ S • for each (u, v) ∈ S, the attributes fA(v) satisfy the predicate fv(u) • for each (u, v) ∈ S and each edge (u, u’) in Ep, there exists a nonempty path from v to v’ in G such that (u’, v’) ∈ S and the path has length at most k if fe(u, u’) = k • In traditional graph simulation, (v, v’) must be an edge, and the example then has no matches (Figure: pattern over Med, CS, Bio and Soc with bounds 2, 3 and ∗; data graph over AI, Med, DB, Gen, Chem, Soc and Eco; the relation S links the two.)
Maximum Match and Result Graph • Maximum match: if G matches P, then there is a unique maximum match S • Result graph: a graph representation of S • SQL queries return relations, but these are graph queries (Figure: pattern over Med, CS, Bio and Soc, and the result graph over Med, DB, Gen, Eco and Soc with edges labeled by path lengths 1, 2 and 3.)
Identify the Maximum Match • Algorithm: Match • input: a data graph G and a pattern graph P • output: the maximum match S • Main ideas: • Initialize the match set of each pattern node, and a distance matrix • Repeatedly remove nodes that cannot make a match, so the match sets decrease monotonically • Return the maximum match S if G matches P, or an empty set otherwise • The algorithm runs in O(|V|^3) time (cubic time), in contrast to the intractability of subgraph isomorphism
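The three main ideas can be sketched in a few lines. This is a simplified illustration rather than the Match algorithm itself: it omits the bookkeeping (such as the premv sets) that the cubic-time bound relies on, and the names `distances` and `bounded_simulation`, the predicate callables and the use of `None` for the ∗ bound are all assumptions made for the example.

```python
from collections import deque

def distances(nodes, edges):
    """dist[v][w] = length of the shortest nonempty path from v to w (key absent if none)."""
    succ = {v: [] for v in nodes}
    for v, w in edges:
        succ[v].append(w)
    dist = {}
    for s in nodes:
        d, queue = {}, deque((w, 1) for w in succ[s])
        while queue:
            w, k = queue.popleft()
            if w in d:
                continue  # already reached by a path that is no longer
            d[w] = k
            queue.extend((x, k + 1) for x in succ[w])
        dist[s] = d
    return dist

def bounded_simulation(g_attrs, g_edges, p_preds, p_edges):
    """p_preds: pattern node -> predicate on attributes; p_edges: (u, u2) -> k or None for '*'."""
    dist = distances(g_attrs, g_edges)
    # Initialize the match set of each pattern node from its search condition.
    mat = {u: {v for v, a in g_attrs.items() if pred(a)} for u, pred in p_preds.items()}
    changed = True
    while changed:  # match sets only shrink, so this terminates
        changed = False
        for (u, u2), bound in p_edges.items():
            for v in list(mat[u]):
                ok = any(w in dist[v] and (bound is None or dist[v][w] <= bound)
                         for w in mat[u2])
                if not ok:
                    mat[u].discard(v)
                    changed = True
    if any(not m for m in mat.values()):
        return set()  # some pattern node has no match: G does not match P
    return {(u, v) for u in mat for v in mat[u]}

# Hypothetical toy data: cs1 reaches bio1 in two hops, cs2 reaches nothing.
g_attrs = {"cs1": {"dept": "CS"}, "m": {"dept": "Math"},
           "bio1": {"dept": "Bio"}, "cs2": {"dept": "CS"}}
g_edges = [("cs1", "m"), ("m", "bio1")]
p_preds = {"CS": lambda a: a["dept"] == "CS", "Bio": lambda a: a["dept"] == "Bio"}
S = bounded_simulation(g_attrs, g_edges, p_preds, {("CS", "Bio"): 2})
```

Tightening the bound on the pattern edge to 1 removes cs1 as well, and the function then returns the empty set.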
How it works • Initialization • step 1: AI is removed from mat(CS), and premv(Soc) = null • step 2: nothing can be removed from mat(CS), and premv(Bio) = null • step 3: nothing can be removed from mat(CS) or mat(Bio), and premv(Med) = null • step 4: nothing can be removed from mat(Med), and premv(CS) = null • Return the maximum match (Figure: pattern over Med, CS, Bio and Soc with bounds 2, 3 and ∗; data graph over AI, Med, DB, Gen, Chem, Soc and Eco.)
Extension: Adding edge types • Real-life graphs have multiple edge types • Example: the Essembly network, with edge types friends-allies, friends-nemeses, strangers-allies and strangers-nemeses
Querying Essembly network: an example • Pattern queries with multiple edge types (Figure: the Essembly network with edge types fa, fn, sa and sn, and a pattern P over Alice, biologists supporting cloning and doctors against cloning, with edge constraints such as fa+, fa<=2, sa<=2, fn and sn.)
Graph reachability queries: adding edge types • Real-life graphs usually bear different edge types, and regular expressions capture them • Data graph G = (V, E, fA, fC), where fC assigns a color (edge type) to each edge • Reachability query (RQ): Qr = (u1, u2, fu1, fu2, fe), where fu1 and fu2 are the search conditions on u1 and u2, and fe is drawn from a subclass of regular expressions: F ::= c | c≤k | c+ | FF • Qr(G): the set of node pairs (v1, v2) such that v1 and v2 satisfy fu1 and fu2, respectively, and there is a nonempty path from v1 to v2 whose edge colors match the pattern specified by fe • Example: from Job=‘biologist’, sp=‘cloning’ to Job=‘doctors’ via fa<=2 fn
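A reachability query over this restricted class of regular expressions can be evaluated one atom at a time, carrying forward the set of nodes reached so far. The sketch below assumes a hypothetical encoding: fe is a list of `(color, bound)` atoms, where `bound` is 1 for c, k for c≤k and `None` for c+, and edges are `(source, color, target)` triples. The function names are illustrative only.

```python
def step(frontier, color, bound, succ):
    """Nodes reachable from `frontier` via 1..bound edges of `color` (None = unbounded)."""
    reached, current, hops = set(), set(frontier), 0
    while current and (bound is None or hops < bound):
        nxt = {w for v in current for (c, w) in succ.get(v, ()) if c == color}
        current = nxt - reached   # only expand nodes not reached earlier
        reached |= current
        hops += 1
    return reached

def reachability_query(nodes, edges, pred1, pred2, fe):
    """Return all (v1, v2) with pred1(v1), pred2(v2) and a path v1 -> v2 matching fe."""
    succ = {}
    for v, c, w in edges:
        succ.setdefault(v, []).append((c, w))
    result = set()
    for v1, attrs in nodes.items():
        if not pred1(attrs):
            continue
        frontier = {v1}
        for color, bound in fe:   # concatenation FF: apply the atoms in order
            frontier = step(frontier, color, bound, succ)
        result |= {(v1, v2) for v2 in frontier if pred2(nodes[v2])}
    return result

# Hypothetical toy network with colors fa (friends-allies) and fn (friends-nemeses).
nodes = {"alice": {"job": "biologist"}, "bob": {}, "eve": {}, "frank": {},
         "carol": {"job": "doctor"}, "dave": {"job": "doctor"}}
edges = [("alice", "fa", "bob"), ("bob", "fn", "carol"), ("bob", "fa", "eve"),
         ("eve", "fa", "frank"), ("frank", "fn", "dave")]
is_bio = lambda a: a.get("job") == "biologist"
is_doc = lambda a: a.get("job") == "doctor"
bounded = reachability_query(nodes, edges, is_bio, is_doc, [("fa", 2), ("fn", 1)])
unbounded = reachability_query(nodes, edges, is_bio, is_doc, [("fa", None), ("fn", 1)])
```

With the bound fa≤2 only carol is reachable from alice, since dave requires three fa hops; replacing the bound by fa+ admits both doctors.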
Graph pattern queries • Graph pattern query (PQ): Qp = (Vp, Ep, fv, fe), where for each edge e = (u, u’), Qe = (u, u’, fv(u), fv(u’), fe(e)) is an RQ • Qp(G) is the maximum set {(e, Se)} such that: • for any edge e1 = (u1, u2) in Qp, Se1 is a subset of Qe1(G) and contains some pair (v1, v2) • for any edges e1 = (u1, u2) and e2 = (u2, u3), if (v1, v2) is in Se1, then there is a v3 such that (v2, v3) is in Se2 • RQ and bounded simulation are special cases of PQ • PQ vs. simulation and bounded simulation: • search conditions on query nodes • mapping edges to paths • constraining the edges on the path with a regular expression
Reachability and graph pattern query: examples • Graph pattern matching is in cubic time when edge types are incorporated (Figure: a reachability query from Job=‘biologist’, sp=‘cloning’ to Job=‘doctors’ via fa<=2 fn, and a pattern query over Id=‘Alice’, Job=‘biologist’, sp=‘cloning’ and Job=‘doctors’, dsp=‘cloning’ with edge constraints fa+, fa<=2, sa<=2, fn and sn.)
Querying Social Networks • Graph pattern matching • Incremental graph pattern matching (current part) • Graph similarity
The need for incremental graph pattern matching • Real-life graphs are dynamic, and the changes are typically small • Real-life graphs can be large, so cubic time may still be too costly • Compute the matches from scratch after every change? • These highlight the need for incremental pattern matching
Incremental Graph Pattern Matching • Given a pattern P, a graph G, the maximum match S = S(P, G) and updates δ to G, find the new maximum match S’ = S(P, G + δ), minimizing unnecessary recomputation • Affected area AFF: the changes in the input and the output • An incremental problem is bounded if its complexity is a function of |AFF| alone, i.e., it is determined only by the changes; this yields performance guarantees • Taking the distance matrix M as an input: • AFF1: the set of node pairs in G whose distance is changed • AFF2: the difference between S’ and S • Unit updates: a single edge deletion or edge insertion • Batch updates: a sequence of edge deletions and insertions
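To make the two affected areas concrete, the sketch below computes them straight from their definitions. It deliberately recomputes all distances from scratch, which is exactly what an incremental algorithm avoids, so it serves only as a reference for what AFF1 and AFF2 contain. The names `all_distances`, `aff1` and `aff2` are assumptions for this example.

```python
from collections import deque

def all_distances(nodes, edges):
    """dist[v][w] = length of the shortest nonempty path from v to w (key absent if none)."""
    succ = {v: [] for v in nodes}
    for v, w in edges:
        succ[v].append(w)
    dist = {}
    for s in nodes:
        d, queue = {}, deque((w, 1) for w in succ[s])
        while queue:
            w, k = queue.popleft()
            if w not in d:
                d[w] = k
                queue.extend((x, k + 1) for x in succ[w])
        dist[s] = d
    return dist

def aff1(nodes, edges, deleted_edge):
    """Node pairs whose distance differs before and after deleting one edge."""
    before = all_distances(nodes, edges)
    after = all_distances(nodes, [e for e in edges if e != deleted_edge])
    return {(v, w) for v in nodes for w in nodes
            if before[v].get(w) != after[v].get(w)}

def aff2(old_match, new_match):
    """Pairs that enter or leave the maximum match."""
    return old_match ^ new_match

# Hypothetical toy graph: a -> b -> c plus a shortcut a -> c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
changed_by_shortcut = aff1(nodes, edges, ("a", "c"))  # distance a..c grows from 1 to 2
changed_by_bc = aff1(nodes, edges, ("b", "c"))        # c becomes unreachable from b
```

Deleting the shortcut changes only the pair (a, c), and deleting (b, c) changes only (b, c), because a still reaches c directly.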
Incremental Algorithm for Unit Update • Algorithm: Match- • input: a graph G, a pattern P, the maximum match S(P, G), the distance matrix M and an edge e to be deleted from G • output: the new maximum match S’(P, G – e) • Main ideas: • Compute the affected area AFF1 by incrementally deriving the new distance matrix M’ • Identify the matches in S that are directly affected by AFF1 • Recursively find all matches that are affected by AFF1 • Return S’ and M’; the changes to S constitute AFF2 • Complexity: O(|AFF1||AFF2|^2) time for unit updates; for unit insertions, Match+ runs in O(|AFF1||AFF2|^2) time w.r.t. DAG patterns and data graphs
How the Incremental Algorithm Works • Given P, G, the maximum match S, and an edge (n3, n5) to be removed • step 1: identify AFF1: (n3, n5), (n3, n6), (n4, n5), (n4, n6) • step 2: identify those in S that are affected by AFF1: (n4, n1) and (n3, n4) • step 3: check the parent of n4, and remove n3 from mat(SE) • Unit updates can be efficiently handled incrementally (Figure: pattern over A, HR, SE and DM,‘golf’ with bounds 1, 2 and ∗; data graph with nodes n1 to n6.)
Incremental Algorithms for Batch Updates • Given P, G, the maximum match S, and updates δ, find S’ • Naïve method: process the updates in δ one by one • IncMatch: incrementally computes AFF1 and updates M by taking the entire δ as a batch • It runs in O(|AFF1||AFF2|^2) time, for DAG patterns and general data graphs • Batch updates can be efficiently handled incrementally
Querying Social Networks • Graph pattern matching • Incremental graph pattern matching • Graph similarity (current part)
Graph Matching: the problem • Given graphs G1 and G2, decide whether G1 matches G2, i.e., whether G1 is “similar to” G2 • How to define “similar”? • Widely used in a variety of emerging real-life applications: • Web mirror detection, Web site classification • Complex object identification • Plagiarism detection • Keyword search, proximity search, Web service composition, …
Graph similarity metrics: the state of the art • Structure-based metrics: • (Sub)graph homomorphism • Subgraph isomorphism • Maximum common subgraph • Edit distance • Graph simulation • All rely on identical label matching and edge-to-edge mappings/relations • Do they suffice in the real world?
Web site matching: a real-life application • Matching the two sites requires edge-to-path mappings • Graph homomorphism (subgraph isomorphism) is too restrictive! (Figure: two online-store site graphs G1 and G2, rooted at A.Home and B.Index, with pages such as books, audio, sports, digital, categories, textbooks, abooks, albums, CDs, DVDs, booksets, features, genres, arts, school and audio books.)
Basic notations • Objective: to capture semantic similarity • G = (V, E, L): a labeled directed graph • Similarity matrix M over G1 and G2: a matrix of size |V1| × |V2|, with M(u, v) the similarity score of nodes u and v • Similarity threshold ξ (Figure: the site graphs rooted at A.home and B.index, shown twice.)
P-Homomorphism • A P-homomorphism from G1 to G2 is a total mapping from V1 to V2 that: • preserves node similarity (w.r.t. a similarity matrix M and a threshold ξ) • maps edges to nonempty paths • P-homomorphism vs. graph homomorphism: • node similarity vs. label equality • edge-to-path mapping vs. edge-to-edge mapping • Graph homomorphism is a special case of P-homomorphism (Figure: is the A.home site graph P-hom to the B.index site graph? Further examples G1 to G4 over the labels A, B, C, D and E.)
1-1 P-Homomorphism • G1 is 1-1 P-homomorphic to G2 if there exists a 1-1 (injective) P-homomorphism from G1 to G2 • distinct nodes in V1 have distinct matches in V2 • 1-1 P-homomorphism vs. subgraph isomorphism: • node similarity vs. label equality • 1-1 edge-to-path mapping vs. bijective edge-to-edge mapping • Subgraph isomorphism is a special case of 1-1 P-homomorphism (Figure: is the A.home site graph 1-1 P-hom to the B.index site graph? Further examples G1, G2, G5 and G6 over the labels A, B, C, D and E, with nodes v1 and v2.)
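Both definitions can be verified for a given candidate mapping in a few lines. The sketch below assumes that a node pair preserves similarity when M(u, ρ(u)) ≥ ξ, and it uses a hypothetical encoding in which M is a nested dictionary; the names `is_p_hom` and `reachable_from` are illustrative.

```python
def reachable_from(start, succ):
    """All nodes reachable from `start` by a nonempty path."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ.get(v, ()))
    return seen

def is_p_hom(v1, e1, e2, rho, M, xi, injective=False):
    """Check that rho is a (1-1, if injective=True) P-homomorphism from G1 to G2."""
    if set(rho) != set(v1):
        return False                      # the mapping must be total on V1
    if injective and len(set(rho.values())) != len(rho):
        return False                      # distinct nodes need distinct matches
    if any(M.get(u, {}).get(rho[u], 0.0) < xi for u in v1):
        return False                      # node similarity must reach the threshold
    succ = {}
    for a, b in e2:
        succ.setdefault(a, []).append(b)
    # Every edge of G1 must map to a nonempty path in G2.
    return all(rho[b] in reachable_from(rho[a], succ) for a, b in e1)

# Hypothetical toy sites: the edge books -> textbook maps to the path book -> categories -> school.
v1 = ["books", "textbook"]
e1 = [("books", "textbook")]
e2 = [("book", "categories"), ("categories", "school")]
M = {"books": {"book": 1.0}, "textbook": {"school": 0.6}}
rho = {"books": "book", "textbook": "school"}
ok = is_p_hom(v1, e1, e2, rho, M, xi=0.5)
too_strict = is_p_hom(v1, e1, e2, rho, M, xi=0.7)
```

In the toy data the mapping is accepted at threshold 0.5 and rejected at 0.7, because the textbook/school similarity is only 0.6.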
Metrics for measuring graph similarity • Not every node in one graph can find a P-hom match in the other graph, so use a similarity metric based on the maximum number of node matches • Maximum cardinality • The cardinality of a P-hom mapping ρ from a subgraph G1’ = (V1’, E1’, L1’) of G1 to G2: Card(ρ) = |V1’| / |V1| • The maximum cardinality problem CPH (resp. CPH1-1): • Input: two graphs G1 and G2 • Output: the P-hom (resp. 1-1 P-hom) mapping ρ with the maximum Card(ρ) • MCS is a special case of CPH1-1
Example for CPH1-1 • Maximum cardinality metric: 4/5 = 0.8 (Figure: graphs G5 and G6 over the labels A, B, C, D and E, with nodes v1 and v2.)
Metrics for measuring graph similarity (cont.) • Similarity metric based on the overall weighted similarity of nodes, which favors important nodes • Overall similarity • The overall similarity of a P-hom mapping ρ from a subgraph G1’ of G1 to G2, where w(v) is the weight of node v: Sim(ρ) = \sum_{v ∈ V1’} w(v) · M(v, ρ(v)) / \sum_{v ∈ V1} w(v) • The maximum overall similarity problem SPH (resp. SPH1-1): • Input: two graphs G1 and G2 • Output: the P-hom (resp. 1-1 P-hom) mapping ρ with the maximum Sim(ρ)
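Both metrics are simple ratios once a mapping ρ is given. The sketch below evaluates them on made-up numbers; the function names and the toy weights and scores are assumptions for illustration and are not taken from the lecture's examples.

```python
def card(rho, v1):
    """Card(rho) = |V1'| / |V1|, where V1' is the set of nodes that rho maps."""
    return len(rho) / len(v1)

def sim(rho, v1, w, M):
    """Sim(rho) = sum over mapped v of w(v) * M(v, rho(v)), divided by the total weight of V1."""
    return sum(w[v] * M[v][rho[v]] for v in rho) / sum(w[v] for v in v1)

# Hypothetical toy input: three nodes in G1, two of them mapped.
v1 = ["a", "b", "c"]
w = {"a": 2, "b": 1, "c": 1}                 # node weights, total 4
M = {"a": {"x": 1.0}, "b": {"y": 0.5}}       # similarity scores of the mapped pairs
rho = {"a": "x", "b": "y"}
c = card(rho, v1)        # 2 mapped nodes out of 3
s = sim(rho, v1, w, M)   # (2*1.0 + 1*0.5) / 4
```

The heavier node "a" dominates the overall similarity, which is the sense in which the metric favors important nodes.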
Example for CPH and SPH C 0.6 1.0 6 A A B B B v1 v2 D E D E Maximum overall similarity metric : (1*1+6*1)/8 = 0.7 QSX (LN 9) G5 G6
Complexity results: Intractability • P-Hom and 1-1 P-Hom are intractable: both are NP-complete • NP-hard even when both G1 and G2 are DAGs • 1-1 P-Hom is NP-hard even when G1 is a tree and G2 is a DAG • by reductions from 3SAT and X3C • The decision problems of CPH, CPH1-1, SPH and SPH1-1 are NP-complete • by reductions from P-Hom and 1-1 P-Hom • NP-hard for DAGs • Approximation algorithms?
Complexity results: Approximation Hardness • P-Hom and 1-1 P-Hom are hard to approximate • Unless P = NP, CPH, CPH1-1, SPH and SPH1-1 are not approximable within O(1/n^{1-ε}) for any constant ε, where n is the number of nodes of the input graphs • Proof: by an approximation factor preserving reduction (AFP-reduction) from the maximum weighted independent set problem (MWIS) (Figure: an AFP-reduction (f, g, α) from Problem A to Problem B: f maps an instance x of A to an instance f(x) of B, and g maps a solution of f(x) back to a solution of x.)
Approximation Algorithms • P-Hom can be solved with a provable performance guarantee • Approximation bound: given two graphs G1 = (V1, E1, L1) and G2 = (V2, E2, L2), CPH, CPH1-1, SPH and SPH1-1 are all approximable within O(log^2(|V1||V2|) / (|V1||V2|)) • by AFP-reductions to the MWIS problem
Approximation algorithm for CPH • Algorithm compMaxCard(G1, G2, M, ξ), with a provable performance guarantee • Input: two graphs G1 = (V1, E1, L1) and G2 = (V2, E2, L2), a similarity matrix M, and a similarity threshold ξ • Output: a P-hom mapping from a subgraph of G1 to G2 • Key ideas: • initialize the matching list for each node in G1 • compute the transitive closure of G2 • starting from a match pair, recursively choose and include new matches in the match set, via a greedy strategy, until it can no longer be extended • Optimization: avoid operations on the product graph • Complexity: O(|V1|^3 |V2|^2 + |V1||E1||V2|^3)
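The key ideas can be illustrated with a much simpler greedy heuristic. The sketch below is not compMaxCard and carries no approximation guarantee: it only shows the matching lists, the transitive closure of G2 and a greedy extension of the match set. The name `greedy_p_hom`, the "fewest candidates first" ordering and the nested-dictionary encoding of M are assumptions.

```python
def greedy_p_hom(v1, e1, v2, e2, M, xi):
    """Greedily build a P-hom mapping from a subgraph of G1 to G2 (heuristic sketch)."""
    # Transitive closure of G2: reach[s] = nodes reachable from s by a nonempty path.
    succ2 = {v: [] for v in v2}
    for a, b in e2:
        succ2[a].append(b)
    reach = {}
    for s in v2:
        seen, stack = set(), list(succ2[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ2[v])
        reach[s] = seen

    def score(u, v):
        return M.get(u, {}).get(v, 0.0)

    # Matching list of each G1 node: candidates at or above the threshold, best first.
    cands = {u: sorted((v for v in v2 if score(u, v) >= xi),
                       key=lambda v, u=u: -score(u, v)) for u in v1}
    rho = {}
    for u in sorted(v1, key=lambda u: len(cands[u])):   # most constrained nodes first
        for v in cands[u]:
            trial = {**rho, u: v}
            # Keep the pair only if every G1 edge between mapped nodes still maps to a path.
            if all(trial[b] in reach[trial[a]]
                   for a, b in e1 if a in trial and b in trial):
                rho = trial
                break
    return rho

# Hypothetical toy sites; node "x" has a candidate but no consistent placement.
v1 = ["books", "textbook", "abook", "x"]
e1 = [("books", "textbook"), ("books", "abook"), ("textbook", "x")]
v2 = ["book", "categories", "school", "audiobooks"]
e2 = [("book", "categories"), ("categories", "school"), ("book", "audiobooks")]
M = {"books": {"book": 1.0}, "textbook": {"school": 0.6},
     "abook": {"audiobooks": 0.7}, "x": {"book": 0.9}}
rho = greedy_p_hom(v1, e1, v2, e2, M, xi=0.5)
```

The returned mapping covers three of the four G1 nodes, giving Card(ρ) = 3/4 on this toy input; a different visiting order could yield a different mapping.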
Algorithm compMaxCard: running example (Figure: G1 rooted at A.home with pages books, audio, textbook, abook and albums; G2 rooted at B.index with pages book, sports, digital, categories, bookset, CD, DVD, school, art, audiobooks, features, genres, album and albums.)
Algorithm compMaxCard: running example (cont.) • Step 1: initialize the matching list (the candidate set w.r.t. M and ξ) for each node in G1
Algorithm compMaxCard: running example (cont.) • Step 2: pick a node and select a matching pair (highlighted in the figure: bookset)
Algorithm compMaxCard: running example (cont.) • Step 3: recursively expand the matches (highlighted in the figure: categories, textbook, school)
Algorithm compMaxCard: running example (cont.) • Step 3 (continued): recursively expand the matches (highlighted in the figure: audiobooks, abook)
Summary and review • What is graph pattern matching? How is it defined? • What is subgraph isomorphism? Graph simulation? • What is bounded graph simulation? Why? • What is the complexity of bounded simulation? Of subgraph isomorphism? • Why do we need incremental graph pattern matching? • List five different measures of graph similarity • What is P-homomorphism? 1-1 P-homomorphism? Why? • What is the maximum cardinality problem? The maximum overall similarity problem? Their complexity? Approximation? • Homework: prepare your final project report