Keyword Search on External Memory Data Graphs Authors: Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan Presented By: Aruna
Outline • Introduction • Modeling • Graph Model • 2-stage Graph Model • Multi-granular Graph Representation • Algorithms • Iterative Expansion Search • Incremental Expansion Search • Experiments • Conclusion
Keyword Search • Keyword search • A very simple and easy-to-use mechanism to get information from databases. • Keyword search queries • A set of keywords. • Allows users to find interconnected tuple structures containing the given keywords in a relational database. • Combines data from multiple data sources.
Keyword Search on Graph Data (1/2) • Query result • Rooted trees that connect nodes matching the keywords. • Keyword searches are ambiguous, so query results may be irrelevant to the user. • Ranking function • Returns the top-k answers to a keyword query.
Keyword Search on Graph Data (2/2) • Keyword search on databases • Information may be split across tables/tuples due to normalization. • Use of artificial documents. • Use of data graphs in the absence of a schema. • Graph Data Model: • Lowest common denominator. • Integrates data from multiple sources with different schemas. • Enables novel systems for heterogeneous data integration and search. • Query result • A subtree where no node or edge can be removed without losing keyword matches. • Most previous work assumes the graph fits in memory.
External memory data graph (1/2) • Problem with in-memory algorithms: they break down when the graph is larger than memory. • Possible solutions and their drawbacks: • Virtual memory • Significant I/O cost • Thrashing • SQL • For relational data only • Not good for top-k query answer generation
External memory data graph (2/2) • Goal of the paper: • Use a compressed graph representation to reduce IO. • Graphs which are larger than memory. • Solution uses • Multi-granular graph • Two approaches • Iterative approach • Incremental approach
Graph Model (1/2) • Nodes: Every node has an associated set of keywords, with weights or prestige. • The node weight influences the rank of answers containing the node. • Edges: directed and weighted. • Keyword query: a set of terms k_i, i = 1…n. • Answer tree: a set-of-paths model, with one path per keyword. • Each path runs from the root to a node containing the keyword.
Graph Model (2/2) • Node score : sum of the leaf/root node weights. • Edge score of an answer: sum of the path lengths. • Answer score : a function of the node score and the edge score of the answer tree.
Keyword Search Steps to generate top-k answers: • Look up an inverted keyword index to get the node-ids of the nodes matching each keyword. • These are the keyword nodes. • Use a graph search algorithm to find trees connecting the keyword nodes found above. • The search finds rooted answer trees, which should be generated in ranked order.
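The index lookup in the first step can be illustrated with a minimal sketch. The function names `build_inverted_index` and `keyword_nodes` and the toy data are assumptions made for illustration; the slides do not describe how the index is actually implemented.

```python
from collections import defaultdict

def build_inverted_index(node_keywords):
    """Map each (lower-cased) keyword to the set of node ids containing it."""
    index = defaultdict(set)
    for node_id, keywords in node_keywords.items():
        for kw in keywords:
            index[kw.lower()].add(node_id)
    return index

def keyword_nodes(index, query):
    """For each query term, return the set of node ids that match it."""
    return {term: set(index.get(term.lower(), ())) for term in query}

# Hypothetical toy graph nodes with their keywords.
node_keywords = {
    1: ["Sudarshan", "author"],
    2: ["keyword", "search"],
    3: ["search", "graph"],
}
index = build_inverted_index(node_keywords)
matches = keyword_nodes(index, ["search", "author"])
```

The resulting sets (one per query term) are the starting points for the graph search in the second step.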
Supernode Graph • Clustering nodes in the full graph into supernodes, with superedges. • Superedge weights: wt(S1 → S2) = min{wt(i → j) : i ∈ S1, j ∈ S2} Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
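The superedge weight rule can be sketched as follows, assuming the clustering is already given as a node-to-supernode map. The name `build_supernode_graph` is hypothetical, and skipping edges that stay inside one supernode is an assumption of this sketch, not something the slide specifies.

```python
def build_supernode_graph(edges, cluster_of):
    """edges: dict (u, v) -> weight for directed edge u -> v.
    cluster_of: dict node -> supernode id.
    Returns dict (S1, S2) -> minimum weight of any edge from S1 into S2."""
    super_edges = {}
    for (u, v), w in edges.items():
        s1, s2 = cluster_of[u], cluster_of[v]
        if s1 == s2:
            continue  # assumption: intra-supernode edges produce no superedge
        key = (s1, s2)
        if key not in super_edges or w < super_edges[key]:
            super_edges[key] = w
    return super_edges

# Hypothetical toy graph: a, b in supernode S1; c, d in supernode S2.
edges = {("a", "c"): 5, ("b", "c"): 2, ("a", "b"): 1, ("c", "d"): 4}
cluster_of = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
super_edges = build_supernode_graph(edges, cluster_of)
```

Taking the minimum means a path cost computed on the supernode graph never exceeds the cost of the corresponding path in the full graph.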
2-Phase Search (1/2) • First-Attempt Algorithm: • Phase 1 : • Search on supernode graph to get top-k results (containing supernodes) • Using any search algorithm • Expand all supernodes from supernode results. • Phase 2 : • Search on this expanded component of graph to get final top-k results. Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
2-Phase Search (2/2) Top-k on expanded component may not be top-k on full graph. Experiments show poor recall.
Multi-granular (MG) graphs (1/3) • An MG graph combines: • A condensed version of the graph (the “supernode graph”), which is always memory resident. • All information about the part of the full graph that is currently available in memory. • Supernode graph: • Formed by clustering nodes in the full graph into supernodes, with superedges.
Multi-granular (MG) graphs (2/3) Node numbering scheme = supernode.innernode
Multi-granular (MG) graphs (3/3) • Edge weights: • S1 → S2: • min{edge-weight n1 → n2 | n1 ∈ S1 and n2 ∈ S2} • S → i: • min{edge-weight s → i | s ∈ S} • I → I: • Edge weight is the same as in the original graph. • Supernode answer: • Answer containing supernodes if we execute search on the MG graph. • Pure answer: • Answer that does not contain any supernodes.
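The three edge-weight cases can be computed with one rule: map each endpoint to itself if its supernode is expanded, otherwise to its supernode, and keep the minimum weight per resulting pair. The sketch below assumes node ids follow the supernode.innernode numbering scheme as strings; the function name `mg_graph_edges` is hypothetical, and treating inner-node-to-supernode edges symmetrically to the S → i case is an assumption.

```python
def mg_graph_edges(edges, expanded):
    """edges: dict (u, v) -> weight, node ids of the form 'supernode.innernode'.
    expanded: set of supernode ids whose inner nodes are currently in memory.
    Returns the edges of the current MG graph with minimum weights."""
    def rep(node):
        supernode = node.split(".")[0]
        return node if supernode in expanded else supernode

    mg = {}
    for (u, v), w in edges.items():
        ru, rv = rep(u), rep(v)
        if ru == rv:
            continue  # both endpoints hidden inside the same unexpanded supernode
        key = (ru, rv)
        mg[key] = min(w, mg.get(key, w))
    return mg

# Hypothetical toy graph with supernodes "1" and "2".
edges = {("1.1", "2.1"): 3, ("1.2", "2.1"): 1, ("1.1", "1.2"): 2, ("2.1", "2.2"): 5}
nothing_expanded = mg_graph_edges(edges, set())
two_expanded = mg_graph_edges(edges, {"2"})
```

With nothing expanded the result is the plain supernode graph; once supernode 2 is expanded, its inner edge keeps its original weight and the edge from supernode 1 into inner node 2.1 takes the minimum over the members of supernode 1.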
Iterative Expansion Search (1/3) • Input: an MG graph. • Output: top-k pure results. • Iterative search on the MG graph • Repeat • Search on the current MG graph, using any search algorithm, to find the top results. • Expand the supernodes in the top results. • Until the top-k answers are all pure.
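The repeat-until loop above can be sketched as a small driver that is independent of the search algorithm. Everything here is illustrative: `toy_search` is a stand-in that merely hides inner nodes behind unexpanded supernodes, not a real graph search, and all names are assumptions.

```python
def iterative_expansion_search(search, is_supernode, expand, k):
    """Restart the search after each round of expansions until the
    top-k answers contain no supernodes. Returns (answers, rounds)."""
    rounds = 0
    while True:
        rounds += 1
        answers = search(k)  # any in-memory search on the current MG graph
        supernodes = {n for answer in answers for n in answer if is_supernode(n)}
        if not supernodes:
            return answers, rounds  # all top-k answers are pure
        for s in supernodes:
            expand(s)  # fetch the inner nodes of s into memory

# Toy stand-in for a search: node ids are 'supernode.innernode' strings.
FULL_ANSWERS = [["1.1", "2.1"], ["1.2", "3.1"]]
expanded = set()

def toy_search(k):
    def rep(node):
        supernode = node.split(".")[0]
        return node if supernode in expanded else supernode
    return [[rep(n) for n in answer] for answer in FULL_ANSWERS[:k]]

def is_supernode(node):
    return "." not in node

answers, rounds = iterative_expansion_search(toy_search, is_supernode, expanded.add, k=2)
```

In this toy run the first round returns supernode answers, the supernodes are expanded, and the second round returns pure answers. Each round reruns the search from scratch, which is the repeated work the iterative approach pays for in CPU time.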
Iterative Expansion Search (2/3) [Flowchart: Explore (generate top-k answers on the current MG graph, using any in-memory search method) → top-k answers pure? → Yes: output; No: expand supernodes in top answers. Additional label: edges in top-k answers.] Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Iterative Expansion Search (3/3) • Guarantees finding the top-k answers. • Very good IO efficiency compared to search using virtual memory. • Nodes expanded above are never evicted from the “virtual memory” cache. • Expanded nodes are retained in the logical MG graph and re-fetched as required. • This can cause thrashing. • But CPU cost is high due to repeated work.
Incremental Expansion Search • Motivation : • Repeated restarts of search in iterative search. • Basic idea: • Search on MG graph • Expand supernode(s) in top answer. • Unlike iterative search • Update the state of the search algorithm when a supernode is expanded, and • Continue search instead of restarting. • Run search algorithm until top k answers are all pure. • State update depends on search algorithm. • Use of backward expanding search. Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Backward Expanding Search • Based on Dijkstra’s single-source shortest path algorithm. • One shortest path search iterator per keyword. • Runs n copies of this algorithm concurrently. • Explored nodes: nodes for which the shortest path has already been found. • Fringe nodes: unexplored nodes adjacent to explored nodes. • SPI tree: shortest path iterator tree • Tree containing explored and fringe nodes. • Edge u → v if the (current) shortest path from u to the keyword passes through v.
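A simplified sketch of the idea: one Dijkstra iterator per keyword walks edges backwards from that keyword's nodes, the iterators are advanced round-robin, and a node becomes an answer root once every iterator has explored it. This is an illustration under assumptions, not the authors' implementation: it scores answers by the sum of path lengths only (node weights are ignored) and sorts at the end instead of emitting answers in ranked order as it goes. All names are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import count

def shortest_path_iterator(reverse_adj, sources):
    """Dijkstra over reversed edges. Yields (node, cost, next_hop) in increasing
    cost order; cost is the shortest distance from node to any source, and
    next_hop is the SPI-tree parent (the next node on that shortest path)."""
    tie = count()  # tie-breaker so heap entries never compare nodes
    dist = {s: 0 for s in sources}
    heap = [(0, next(tie), s, None) for s in sources]
    heapq.heapify(heap)
    explored = set()
    while heap:
        d, _, u, hop = heapq.heappop(heap)
        if u in explored:
            continue  # stale fringe entry
        explored.add(u)
        yield u, d, hop
        for v, w in reverse_adj.get(u, ()):
            nd = d + w
            if v not in explored and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v, u))

def backward_expanding_search(edges, keyword_node_sets, k):
    """edges: dict (u, v) -> weight. keyword_node_sets: one node set per keyword.
    Returns up to k (edge_score, root) pairs, smallest score first."""
    reverse_adj = defaultdict(list)
    for (u, v), w in edges.items():
        reverse_adj[v].append((u, w))
    iterators = [shortest_path_iterator(reverse_adj, set(s)) for s in keyword_node_sets]
    costs = [dict() for _ in iterators]
    active = list(range(len(iterators)))
    answers = []
    while active:
        for i in list(active):  # advance each live iterator by one node
            step = next(iterators[i], None)
            if step is None:
                active.remove(i)
                continue
            node, cost, _ = step
            costs[i][node] = cost
            if all(node in c for c in costs):  # reached by every keyword
                answers.append((sum(c[node] for c in costs), node))
    return sorted(answers)[:k]

# Hypothetical toy graph; keyword 1 matches node a, keyword 2 matches node b.
edges = {("r", "a"): 1, ("r", "b"): 2, ("a", "b"): 1, ("x", "a"): 4, ("x", "b"): 1}
top = backward_expanding_search(edges, [{"a"}, {"b"}], k=2)
```

Here node a is the best root (it matches keyword 1 itself and reaches b at cost 1), followed by r with total path length 3.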
Incremental backward search • Backward search run on multi-granular graph • Algorithm: • Repeat • Find next best answer on current multi-granular graph. • If answer has supernodes • Expand supernode(s) • Update the state of backward search, i.e. all SPI trees, to reflect state change of multi-granular graph due to expansion • Until top-k answers on current multi-granular graph are “pure” answers Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Incremental Search (1/3) SPI tree for k1 Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Incremental Search (2/3) Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Incremental Search (3/3) Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
State Update on Supernode Expansion • Affected nodes get detached. • Inner-nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1. • Affected nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1.
Effects of supernode expansion • Differences from Dijkstra's shortest-path algorithm: For Explored nodes: • Path-costs of explored nodes may increase. • Explored nodes may become fringe nodes. For Fringe nodes: • Incremental Expansion: Path-costs may increase or decrease. • Invariant • SPI trees reflect shortest paths for explored nodes in current multi-granular graph. • Theorem: Incremental backward expanding search generates correct top-k answers.
Heuristics • Thrashing Control : • Stop supernode expansion on cache full. • Use only parts of the graph already expanded for further search. • Intra-supernode edge weight • Heuristics can affect recall • Recall at or close to 100% for relevant answers, with heuristics, in the experiments.
Experimental Setup • Clustering algorithm to create supernodes • Experiments use Edge prioritized BFS. • Ongoing work: develop better clustering techniques • All experiments done on cold cache Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
External memory search: performance • Supernode graph very effective at minimizing IO • Cache misses with incremental often less than no. of nodes matching keywords. • Iterative algorithm • High CPU cost. • VM (backward search with cache as virtual memory) has high IO cost. • Use same clustering as for supernode graph. • Fetch cluster into cache whenever a node is accessed. • Evicting LRU cluster if required. • Search code unaware of clustering/caching. • Gets “Virtual Memory” view. • Incremental combines low IO cost with low CPU cost.
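The VM baseline described above (fetch a whole cluster on access, evict the least recently used cluster when full) can be sketched with an ordered dictionary. The class name `ClusterCache`, the `load_cluster` callback, and the toy "disk" are assumptions for illustration; the slides do not give the actual cache implementation.

```python
from collections import OrderedDict

class ClusterCache:
    """LRU cache of clusters; the search code only ever calls get_node."""

    def __init__(self, capacity, load_cluster):
        self.capacity = capacity          # maximum number of clusters in memory
        self.load_cluster = load_cluster  # hypothetical function reading a cluster from disk
        self.cache = OrderedDict()        # cluster id -> {node id: adjacency list}
        self.misses = 0

    def get_node(self, cluster_id, node_id):
        if cluster_id in self.cache:
            self.cache.move_to_end(cluster_id)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used cluster
            self.cache[cluster_id] = self.load_cluster(cluster_id)
        return self.cache[cluster_id][node_id]

# Toy "disk": three clusters with one node each (adjacency lists are made up).
disk = {"1": {"1.1": ["2.1"]}, "2": {"2.1": ["3.1"]}, "3": {"3.1": []}}
cache = ClusterCache(capacity=2, load_cluster=lambda cid: disk[cid])
for cid in ["1", "2", "1", "3", "2"]:
    cache.get_node(cid, cid + ".1")
```

In this access sequence only the third access is a hit. Counting `misses` this way corresponds to the cache-miss metric reported in the experiments.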
Query Execution Time (top 10 results) Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Cache Misses for Different Cache Sizes [Figure: cache misses at different cache sizes; series: All VM, All Incr.] Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Query Execution Time (Last Relevant Result) Source: http://www.cse.iitb.ac.in/~sudarsha/pubs.html
Conclusions • Graph summarization coupled with a multi-granular graph representation shows promise for external memory graph search. • Ongoing/Future work • Applications in distributed memory graph search. • Improved clustering techniques. • Extending Incremental to bidirectional search and other graph search algorithms. • Testing on really large graphs.