Querying Big Social Graphs (QSX, LN 8)
• Incremental graph pattern matching
• Query preserving graph compression
• Graph pattern matching using views
• Top-k graph pattern matching
• Distributed graph pattern matching

The complexity of graph pattern matching
Recall from the last lecture.

Real-life graphs are “big”
• Graph pattern matching:
  • Input: pattern Q and data graph G
  • Output: M(Q, G), the set of matches of Q in G
• Example: Facebook has 1B users and 140B links.
• Matching on graphs with millions of nodes and billions of edges? Too costly:
  • NP-complete for subgraph isomorphism
  • cubic time for bounded simulation
  • quadratic time for simulation
• Assuming SSDs (solid state drives) that read 6GB/s, how long does O(|G|) take?
  • when G is 1PB (10^15 B): 1.9 days
  • when G is 1EB (10^18 B): 5.28 years

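The two scan times on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes), a sustained 6 GB/s read rate, and a 365-day year; the constant and function names are illustrative only.

```python
# Back-of-the-envelope time for one linear pass over a graph on disk.
# Assumptions: 6 GB/s sustained reads, decimal units, 365-day year.
SCAN_RATE_BYTES_PER_SEC = 6 * 10**9
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(graph_bytes: int) -> float:
    """Seconds needed to read graph_bytes once at the assumed scan rate."""
    return graph_bytes / SCAN_RATE_BYTES_PER_SEC


petabyte_days = scan_seconds(10**15) / SECONDS_PER_DAY
exabyte_years = scan_seconds(10**18) / SECONDS_PER_YEAR

print(f"1 PB: {petabyte_days:.2f} days")
print(f"1 EB: {exabyte_years:.2f} years")
```

Even an algorithm that is linear in |G| therefore needs days to years on such inputs, before any quadratic or cubic matching cost is considered.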
To cope with the sheer size of social graphs
• Graph pattern matching:
  • Input: pattern Q and data graph G
  • Output: M(Q, G), the set of matches of Q in G
• How can we query big graphs? The cost of query processing is a function f(|G|, |Q|).
  • Reduce f? We can’t reduce the lower bound of the computation.
  • Reduce |Q|? This does not help much: |Q| is small anyway.
  • Reduce |G|? Yes! Make big data “small”!
• Techniques:
  • Incremental graph pattern matching
  • Query preserving graph compression
  • Graph pattern matching using views
  • Top-k graph pattern matching
  • Distributed graph pattern matching

Incremental graph pattern matching
• Real-life social graphs are dynamic: they constantly change by updates ∆G (5%/week in Web graphs).
• Re-compute M(Q, G⊕∆G) starting from scratch? Changes ∆G are typically small.
• Compute M(Q, G) once, and then incrementally maintain it.
• Incremental graph pattern matching:
  • Input: Q, G, M(Q, G) (the old output), and ∆G (changes to the input)
  • Output: ∆M (changes to the output) such that M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M, where M(Q, G⊕∆G) is the new output
• When changes ∆G to the data graph G are small, typically so are the changes ∆M to the output M(Q, G⊕∆G).
• Minimizing unnecessary recomputation (recall incremental XML publishing)

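To make the ⊕ notation concrete, the sketch below computes matches by plain graph simulation before and after an update and derives ∆M as a pair of added and removed matches. It deliberately recomputes from scratch, so it illustrates the specification M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M rather than an incremental algorithm. The naive fixpoint, the function names, and the tiny CTO/DB graph are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (pattern node, data node)


def simulation(q_labels: Dict[str, str], q_edges: Iterable[Pair],
               g_labels: Dict[str, str], g_edges: Iterable[Pair]) -> Set[Pair]:
    """Maximum graph simulation of Q in G, computed by a naive fixpoint."""
    q_edges = list(q_edges)
    succ: Dict[str, Set[str]] = {}
    for a, b in g_edges:
        succ.setdefault(a, set()).add(b)
    # Start from all label-compatible pairs, then prune unsupported ones.
    sim = {u: {v for v, lab in g_labels.items() if lab == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return {(u, v) for u in sim for v in sim[u]}


def apply_delta(old: Set[Pair], added: Set[Pair], removed: Set[Pair]) -> Set[Pair]:
    """M(Q, G) ⊕ ∆M, with ∆M given as (added, removed)."""
    return (old - removed) | added


# Hypothetical pattern: a CTO node with an edge to a DB node.
q_labels = {"cto": "CTO", "db": "DB"}
q_edges = [("cto", "db")]
g_labels = {"Ann": "CTO", "Don": "CTO", "Pat": "DB"}
g_edges = {("Ann", "Pat")}
delta_g = {("Don", "Pat")}  # ∆G: one edge insertion

old_matches = simulation(q_labels, q_edges, g_labels, g_edges)
new_matches = simulation(q_labels, q_edges, g_labels, g_edges | delta_g)
added, removed = new_matches - old_matches, old_matches - new_matches
print(sorted(added), sorted(removed))
```

Here a one-edge ∆G yields a one-pair ∆M; an incremental algorithm aims to find that pair without revisiting the rest of G.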
Complexity of incremental problems
• Incremental graph pattern matching:
  • Input: Q, G, M(Q, G), ∆G
  • Output: ∆M such that M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M
• Incremental algorithms? The cost of a batch algorithm is a function of |G| and |Q|.
• For incremental algorithms, use |CHANGED|, the size of changes in
  • the input: ∆G, and
  • the output: AFF, characterizing ∆M
• |CHANGED| captures the updating cost that is inherent to the incremental problem itself: the amount of work absolutely necessary for any incremental algorithm to perform.
• Bounded: is the cost expressible as f(|CHANGED|)?
• Optimal: is the cost in O(|CHANGED|)?
• Complexity analysis in terms of the size of changes
• G. Ramalingam, Thomas W. Reps: On the Computational Complexity of Dynamic Graph Problems. TCS 158(1&2), 1996

The affected area
• Result graphs Gr = (Vr, Er) for (bounded) simulation:
  • Vr: the nodes in G matching pattern nodes in Q
  • Er: the paths in G matching edges in Q (an edge-path relation)
• Gr is the result graph of Q in G; Gr’ is the result graph of Q in G⊕∆G.
• Affected area (AFF):
  • the difference between Gr and Gr’
  • the size of changes in the output
• |CHANGED| = |∆G| + |AFF|
• The complexity and boundedness analyses of incremental matching
[Figure: pattern Q with edge bounds *, 1, 2 and data nodes John (DB), Ann (CTO), Pat (DB), Mat (Bio), Bill (Bio).]

Incremental graph pattern matching: An example
[Figure: pattern Q over CTO, DB and Bio nodes with edge bounds *, 2, 1, 1; data graph G with nodes Ann (CTO), Don (CTO), John (DB), Pat (DB), Mat (Bio), Bill (Bio), Tom (Bio), Ross (Med); updates ∆G inserting edges e1 to e5; the result graph Gr and the affected area.]
Comparing the cost of incremental matching with its batch counterpart

Incremental simulation matching
• Input: Q, G, Msim(Q, G), ∆G
• Output: ∆M such that Msim(Q, G⊕∆G) = Msim(Q, G) ⊕ ∆M
• Updates:
  • Unit updates: a single edge deletion or insertion
  • Batch updates: a sequence of edge deletions and insertions
• Boundedness results:
  • unbounded even for unit updates and general patterns
  • optimal, in O(|AFF|) time, for
    • single-edge deletions and general patterns
    • single-edge insertions and DAG patterns
• Outperforms its batch counterpart by 50% for changes up to 10%

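The single-edge deletion case can be sketched as a local propagation. Deleting an edge can only shrink a simulation relation, so it is enough to re-check the matches of the deleted edge's source node and to propagate removals backwards. The code below is a simplified illustration under assumed data structures (adjacency maps `succ` and `pred` maintained alongside G, matches stored as a dict from pattern nodes to sets of data nodes); it is not the algorithm from the cited paper and makes no claim to the O(|AFF|) bound.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def delete_edge(q_edges: Iterable[Pair], succ: Dict[str, Set[str]],
                pred: Dict[str, Set[str]], sim: Dict[str, Set[str]],
                edge: Pair) -> Set[Pair]:
    """Delete `edge` from G and repair `sim` in place; return removed pairs."""
    src, dst = edge
    succ[src].discard(dst)
    pred[dst].discard(src)
    q_succ: Dict[str, Set[str]] = {}
    q_pred: Dict[str, Set[str]] = {}
    for u, u2 in q_edges:
        q_succ.setdefault(u, set()).add(u2)
        q_pred.setdefault(u2, set()).add(u)
    removed: Set[Pair] = set()
    # Only matches of the edge's source can lose support directly.
    work = deque((u, src) for u in sim if src in sim[u])
    while work:
        u, v = work.popleft()
        if v not in sim[u]:
            continue
        # v still matches u if every pattern edge (u, u2) has a witness.
        if all(succ.get(v, set()) & sim[u2] for u2 in q_succ.get(u, ())):
            continue
        sim[u].discard(v)
        removed.add((u, v))
        # Parents of v that matched a pattern parent of u must be re-checked.
        for up in q_pred.get(u, ()):
            for vp in pred.get(v, ()):
                if vp in sim[up]:
                    work.append((up, vp))
    return removed


# Hypothetical example: pattern cto -> db -> bio.
q_edges = [("cto", "db"), ("db", "bio")]
succ = {"Ann": {"Pat", "John"}, "Pat": {"Bill"}, "John": {"Mat"}}
pred = {"Pat": {"Ann"}, "John": {"Ann"}, "Bill": {"Pat"}, "Mat": {"John"}}
sim = {"cto": {"Ann"}, "db": {"Pat", "John"}, "bio": {"Bill", "Mat"}}

first = delete_edge(q_edges, succ, pred, sim, ("Pat", "Bill"))
second = delete_edge(q_edges, succ, pred, sim, ("John", "Mat"))
print(sorted(first), sorted(second))
```

The first deletion only invalidates Pat, because Ann is still supported by John; the second invalidates John and then Ann. In both cases the work touches only nodes around the removed matches, never the whole of G.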
Incremental bounded simulation
• Input: Q, G, Mbsim(Q, G), ∆G
• Output: ∆M such that Mbsim(Q, G⊕∆G) = Mbsim(Q, G) ⊕ ∆M
• Negative: both simulation and bounded simulation are unbounded even for unit updates.
• Boundedness result:
  • unbounded even for unit updates and path patterns
  • Path pattern: a graph pattern consisting of a single path
• Is it really that bad?

Semi-bounded results
• Semi-bounded: the cost is a PTIME function f(|CHANGED|, |Q|), where |Q| is small.
• Incremental simulation and incremental bounded simulation are both in O(|∆G|(|Q||AFF| + |AFF|^2)) time
  • for batch updates and general patterns
  • independent of |G|
• Incremental matching via bounded simulation outperforms its batch counterpart by 30% for changes up to 10%.

Incremental subgraph isomorphism
• Incremental subgraph isomorphism matching:
  • Input: Q, G, Miso(Q, G), ∆G
  • Output: ∆M such that Miso(Q, G⊕∆G) = Miso(Q, G) ⊕ ∆M
• Incremental subgraph isomorphism (decision problem):
  • Input: Q, G, M(Q, G), ∆G
  • Question: whether there exists a subgraph in G⊕∆G that is isomorphic to Q
• Boundedness and complexity:
  • Incremental matching via subgraph isomorphism is unbounded even for unit updates over DAG graphs for path patterns.
  • Incremental subgraph isomorphism is NP-complete even when G is fixed.
• Neither bounded nor semi-bounded: not semi-bounded unless P = NP.

Query preserving graph compression
• The cost of a batch matching algorithm: f(|G|, |Q|). It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?
• Query preserving compression <R, P> for a class L of queries:
  • For any graph G, Gc = R(G) (the compressed graph)
  • For any Q in L, Q(G) = P(Q, Gc) (post-processing)
• Compress graphs relative to a particular class of queries.
[Figure: R maps G to Gc; Q is evaluated on G to give Q(G) and on Gc to give Q(Gc); P relates Q(Gc) to Q(G).]

What is new about query preserving compression?
• Query preserving compression <R, P> for a class L of queries:
  • For any graph G, Gc = R(G)
  • For any Q in L, Q(G) = P(Q, Gc)
• Relative to a class L of queries of users’ choice
  • Better compression ratio: keeps only information about L queries
  • Reduction: 95% on average for reachability queries (whether a node can reach another)
• For any Q in L, Q(Gc) can be directly computed: no need to decompress Gc
  • Any algorithms and indexing structures for G can be used for Gc
  • In contrast to lossless compression, no need to restore the original graph G
• Gc is computed once for all queries Q in L, and incrementally maintained

Compression for bounded simulation
• Query preserving compression <R, P> for graph pattern matching:
  • R(G) in O(|E| log |V|) time
  • P(Q, Gc): linear time in the size of Q(G)
• Compression function R( ):
  • the maximum bisimulation relation on the nodes of G
  • an equivalence relation: nodes in Gc denote equivalence classes
• Post-processing function P( ):
  • makes use of the inverse of R( ): nodes in Q(Gc) are expanded to the nodes in their equivalence classes
• Reduction: 57% on average for graph pattern matching

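The compression function R can be sketched as follows: group nodes into bisimulation equivalence classes, then build Gc with one node per class. The code uses naive partition refinement over node labels and successor classes, which is simpler but slower than the O(|E| log |V|) bound quoted above. The function names and the small example graph are hypothetical, and the choice of successor-based bisimulation is an assumption.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def bisimulation_classes(labels: Dict[str, str],
                         edges: Iterable[Edge]) -> Dict[str, int]:
    """Map each node to the id of its bisimulation class (naive refinement)."""
    succ: Dict[str, Set[str]] = {v: set() for v in labels}
    for a, b in edges:
        succ[a].add(b)
    block: Dict[str, object] = dict(labels)  # initial partition: by label
    while True:
        # Two nodes stay together only if they are in the same block and
        # their successors fall into the same set of blocks.
        sig = {v: (block[v], frozenset(block[w] for w in succ[v]))
               for v in labels}
        ids: Dict[object, int] = {}
        refined = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == len(set(block.values())):
            return refined  # no block was split: the partition is stable
        block = refined


def compress(labels: Dict[str, str], edges: Iterable[Edge]):
    """R(G): return (class labels, class edges, class members)."""
    edges = list(edges)
    block = bisimulation_classes(labels, edges)
    members: Dict[int, Set[str]] = {}
    for v, b in block.items():
        members.setdefault(b, set()).add(v)
    c_labels = {b: labels[next(iter(vs))] for b, vs in members.items()}
    c_edges = {(block[a], block[b]) for a, b in edges}
    return c_labels, c_edges, members


# Hypothetical graph reusing node names from the slides.
labels = {"msa1": "MSA", "msa2": "MSA", "fa1": "FA", "fa2": "FA",
          "fa3": "FA", "c1": "C", "c2": "C", "c3": "C"}
edges = [("msa1", "fa2"), ("msa2", "fa3"), ("fa2", "c1"), ("fa3", "c2")]

c_labels, c_edges, members = compress(labels, edges)
classes = sorted(sorted(vs) for vs in members.values())
print(classes, len(c_edges))
```

In this example 8 nodes and 4 edges shrink to 4 classes and 2 edges. The `members` map plays the role of the inverse of R that P uses to expand matches found in Gc back to nodes of G.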
Compression for bounded simulation: example
• R(G): computes equivalence classes
• R(G): constructs Gc with equivalence classes
• P(Q, Gc): matches are expanded to the nodes in their equivalence classes
[Figure: G with nodes msa1, msa2, bsa1, bsa2, fa1, fa2, fa3, c1, …, ck; Gc with equivalence-class nodes MSAr, BSAr, FAr, FAr’, Cr, Cr’.]

Incremental graph compression
• Gc is computed once for all queries Q in L: compressed once and incrementally maintained, with no need to decompress Gc
• Input: G, Gc = R(G), ∆G
• Output: ∆Gc such that R(G⊕∆G) = R(G) ⊕ ∆Gc
• Boundedness and complexity:
  • unbounded even for unit updates
  • in O(|AFF|^2 + |Gc|) time
• Subgraph isomorphism?

Answering graph queries using views
• The cost of a matching algorithm: f(|G|, |Q|). Can we compute Q(G) without accessing G, i.e., independent of |G|?
• Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
  • Q and Q’ are equivalent: for any graph G, Q(G) = Q’(G)
  • Q’ only accesses V(G)
• View definitions: graph patterns
• Answering queries on big data:
  • Regardless of how big G is, the cost is “independent” of G
  • V(G) is often much smaller than G (4% to 12% on real-life data)
• The complexity is no longer a function of |G|

When can queries be answered using views?
• Query answering using views: given a query Q in a language L and a set V of views (view definitions), find another query Q’ such that
  • Q and Q’ are equivalent: for any graph G, Q(G) = Q’(G)
  • Q’ only accesses V(G)
• Can Q be answered using a set V of views?
  • A characterization: a sufficient and necessary condition
  • Containment checking: Q ⊑ V
• How expensive is it to determine whether Q ⊑ V? It is NP-complete for relational conjunctive queries, but efficient for graph patterns:
  • Quadratic time in |Q| and |V| for simulation
  • Cubic time for bounded simulation

Pattern query containment: example
[Figure: a pattern query and two views (View 1, View 2) over PM, PRG and DBA nodes, with edges e1 to e4.]
It takes 0.5 seconds to check containment of large cyclic patterns.

The complexity of query answering
• Input: pattern Q, a set V of views, and data graph G
• Output: M(Q, G)
• Quadratic time: O(|V(G)| |Q| + |V(G)|^2)
• In contrast,
  • Graph simulation: O((|V| + |VQ|)(|E| + |EQ|))
  • Bounded simulation: O(|V||E| + |EQ||V|^2 + |VQ||V|)
• V(G) is much smaller than G
• Substantially outperforms traditional matching methods, by 97%

Computing top-k matches
• Traditional graph pattern matching: compute M(Q, G)
  • It is expensive to compute when G is large
  • The result M(Q, G) is excessively large for users to inspect: it can be larger than G
  • 15% of social queries are to find matches of specific pattern nodes (for instance, recommendation), rather than the entire set M(Q, G)
• Top-k query answering:
  • Input: pattern Q, data graph G, and a positive integer k
  • Output: a top-ranked set of k matches of a designated node
• Early termination: return top-k matches without computing M(Q, G)

Graph pattern matching with output node
• Input: graph G = (V, E, fA), pattern Q = (VQ, EQ, fv, uo), where uo is the output node
• Output: Mu(Q, G, uo) = { v | (uo, v) ∈ M(Q, G) }, the matches of the output node
• Top-k query answering:
  • Input: pattern Q, data graph G, and a positive integer k
  • Output: top-k matches in Mu(Q, G, uo)
• Output: k nodes vs. M(Q, G)
[Figure: pattern Q over PM (marked *), PRG, DB and ST; data graph with pm1, …, pmn, prg1 to prg3, db1 to db3, st1, …, stm; top-2 matches highlighted.]

Ranking match results: Relevance
• Top-k query answering:
  • Input: pattern Q, data graph G, and a positive integer k
  • Output: top-k matches in Mu(Q, G, uo)
• Top-k graph pattern matching: social impact
[Figure: pattern over PM (marked *), PRG, DB and ST; data graph with pm1, …, pmn, prg1 to prg3, db1 to db3, st1, …, stm; top-2 relevant matches highlighted.]

Ranking match results: Diversity
• Top-k query answering:
  • Input: pattern Q, data graph G, and a positive integer k
  • Output: top-k matches in Mu(Q, G, uo)
• Diversity distances in the example: δd(pm1, pm2) = (m+5)/(m+6), δd(pm2, pm3) = 3/(m+2), δd(pm1, pm3) = 1
• Diversified top-k graph pattern matching: social diversity
[Figure: pattern over PM (marked *), PRG, DB and ST; data graph with pm1, …, pmn, prg1 to prg3, db1 to db3, st1, …, stm; top-2 diversified matches highlighted.]

The complexity
• Top-k query answering:
  • Input: pattern Q, data graph G, and a positive integer k
  • Output: top-k matches in Mu(Q, G, uo)
• Relevance alone: quadratic time, O((|V| + |Q|)(|E| + |V|))
• Diversification based on both relevance and diversity:
  • NP-complete (decision problem)
  • APX-hard
  • O((|V| + |Q|)(|E| + |V|)) time with approximation ratio 2
• Early termination: stop as soon as top-k matches are found, without computing Mu(Q, G, uo)
• Improves on traditional matching methods by 65%

Distributed graph pattern matching
• The cost of a batch matching algorithm: f(|G|, |Q|). Reduce the parameter?
• Divide and conquer:
  • partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
  • upon receiving a query Q,
    • evaluate Q(Gi) in parallel, i.e., evaluate Q on the smaller Gi
    • collect partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
• Social graphs are already geographically distributed
• Network traffic and response time: independent of |G|

Partial evaluation
• To compute f(x) = f(s, d), where s is the known part of the input and d is the yet unavailable input:
  • conduct the part of computation that depends only on s
  • generate a partial answer: a residual function
• Partial evaluation in distributed query processing:
  • at each site, take Gi as the known input and the other fragments Gj as the yet unavailable input
  • evaluate Q(Gi) in parallel; the partial matches are functions
  • collect partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
• Partial evaluation: a promising approach (a TDD topic)

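A reachability query is a convenient way to illustrate this scheme, since its partial answers are easy to write down. In the sketch below each site sees only the edges leaving its own nodes and reports, for every entry node, which exits it can reach locally; the coordinator then stitches these partial answers together without looking at G. The fragment layout, the function names, and the sequential loop standing in for parallel sites are all assumptions for illustration, and pattern matching itself would need richer partial answers.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def local_partial_answer(nodes: Set[str], local_edges: Iterable[Edge],
                         entries: Set[str], target: str) -> Dict[str, Set[str]]:
    """At one site: entry node -> exits reachable using local edges only.

    An exit is a node outside the fragment (reached by a cross edge)
    or the target itself.
    """
    succ: Dict[str, Set[str]] = {}
    for a, b in local_edges:
        succ.setdefault(a, set()).add(b)
    answer: Dict[str, Set[str]] = {}
    for start in entries:
        exits: Set[str] = set()
        seen, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            if v == target:
                exits.add(v)
                continue
            for w in succ.get(v, ()):
                if w not in nodes:
                    exits.add(w)  # depends on another site: leave unresolved
                elif w not in seen:
                    seen.add(w)
                    queue.append(w)
        answer[start] = exits
    return answer


def reachable(fragments: List[Set[str]], edges: List[Edge],
              source: str, target: str) -> bool:
    """Coordinator: assemble the partial answers of all sites."""
    partial: Dict[str, Set[str]] = {}
    for nodes in fragments:  # in a real deployment, one site per fragment
        local_edges = [(a, b) for a, b in edges if a in nodes]
        in_nodes = {b for a, b in edges if b in nodes and a not in nodes}
        entries = in_nodes | ({source} & nodes)
        partial.update(local_partial_answer(nodes, local_edges, entries, target))
    seen, queue = {source}, deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return True
        for w in partial.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False


# Hypothetical graph a -> b -> c -> d -> e split over three sites.
fragments = [{"a", "b"}, {"c", "d"}, {"e"}]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(reachable(fragments, edges, "a", "e"), reachable(fragments, edges, "e", "a"))
```

Each site ships only a map over its entry and exit nodes, so the data sent to the coordinator depends on the number of cross-fragment edges rather than on |G|.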
Open research issues
• Querying large social graphs:
  • Distributed graph pattern matching
  • Query preserving graph compression
  • Graph pattern matching using views
  • Top-k graph pattern matching
  • Approximate and inexact algorithms
  • . . .
• Distributed matching with the same performance guarantees?
• Subgraph isomorphism?
• A combination of all these
• Many issues need a full treatment

More reading
• W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching, VLDB, 2014.
• W. Fan, X. Wang, and Y. Wu. Answering graph pattern queries using views, ICDE, 2014.
• W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, TODS 38(3), 2013 (SIGMOD 2011).
• W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012.
• W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT, 2012 (invited).
• W. Fan, X. Wang, and Y. Wu. Performance Guarantees for Distributed Reachability Queries, VLDB, 2012.
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding regular expressions to graph reachability and pattern queries, ICDE, 2011.
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching: From intractable to polynomial time, VLDB, 2010.