ICDE 2014, Chicago, USA. A General Algorithm for Subtree Similarity-Search. The Setting. Huge Labeled Tree Data. Arises in computational biology, image analysis, automatic theorem proving, compiler optimization, XML databases. Subtree Similarity-Search.
ICDE 2014, Chicago, USA A General Algorithm for Subtree Similarity-Search
The Setting • Huge Labeled Tree Data • Arises in: • computational biology, • image analysis, • automatic theorem proving, • compiler optimization, • XML databases
Subtree Similarity-Search • Goal: Given a (small) query tree Q, a database tree T, and a number k, find the k subtrees S of T most similar to Q • Similarity: defined using a function that takes two trees and returns a real value • T has n nodes, hence n subtrees (one rooted at each node)
The Bottom Line • An algorithm for subtree similarity-search • Compatible with a wide family of tree distance functions • Runtime is linear • (Depending on the distance function used; see paper for exact analysis) • Experimental results show near-invariance to query size and number of results fetched
Defining Distance • We introduce profile distance functions for determining similarity between two given trees • Several previously proposed distance measures can be shown to be profile distance functions: • pq-gram distance (Augsten et al.) • Windowed pq-gram distance (Augsten et al.) • Binary branch distance (Yang et al.) • Other multiset-based distance measures
Profile Distance Functions • Main idea: • Associate each tree T with a multiset of small objects that represent the tree structure and contents • Use a multiset comparison method to determine similarity between two trees
Profile Distance Functions • Step 1: Summarize the interesting features of each tree using a multiset • Step 2: Compare the multisets • Result: a distance value between the two trees
Profile Distance: A Simple Example • Summarize the interesting features of each tree using a multiset. For example: take bags of the tree labels • First tree: “cluck”, “meow”, “meow”, “ribbit”, “woof” • Second tree: “cluck”, “meow”, “purr”, “purr” • Compare the multisets. For example: Dice coefficient
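The label-bag example above can be tried out with a few lines of Python. This is only a sketch: the slide names the Dice coefficient but does not spell out its multiset form, so the formula 2·|A ∩ B| / (|A| + |B|), with the intersection taken as the element-wise minimum of multiplicities, is an assumption here, and the function name `dice_similarity` is hypothetical.

```python
from collections import Counter


def dice_similarity(a: Counter, b: Counter) -> float:
    """Dice coefficient of two multisets (assumed form: 2|A∩B| / (|A|+|B|))."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty bags are treated as identical
    # Counter's & operator keeps the minimum multiplicity of each element.
    overlap = sum((a & b).values())
    return 2 * overlap / total


# The two label bags from the slide.
first_tree = Counter(["cluck", "meow", "meow", "ribbit", "woof"])
second_tree = Counter(["cluck", "meow", "purr", "purr"])
similarity = dice_similarity(first_tree, second_tree)
```

Under these assumptions the two bags share one “cluck” and one “meow”, so the similarity is 2·2 / (5 + 4) = 4/9; a distance could then be taken as one minus this value.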
Profile Distance: pq-grams (Augsten et al.) • Summarize the interesting features of each tree using a multiset. For example: pq-grams • Compare the multisets. For example: Normalized Dice for multisets (Augsten et al.) • This profile function pays respect to the tree’s structure as well as its content! • … and many more • [Figure: the two example trees and some of their pq-grams, with * as filler nodes]
Profile Distance Functions • Main idea: • Associate each tree T with a multiset of small objects that represent the tree structure and contents • Use a multiset comparison method to determine similarity between two trees • Actually, the multiset for a tree is determined by multisets associated with its nodes • Comparison functions will be based on intersection, union and sizes of multisets
Multisets Associated with Trees • Each node u is associated with two multisets, a local one and a global one: • One contains elements that describe the subtree rooted at u • The other contains elements that describe the node u and its surroundings • A tree T, rooted at node r, is then associated with the multiset that combines the local multiset of r with the global multisets of all non-root nodes of T
Example: Subtree Rooted In Node r • Take the local multiset from the root node r • Take the global multiset from each non-root node (v1, v2, v3, …)
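The rule on this slide (local multiset from the root, global multisets from everything else) can be sketched in Python as follows. The `Node` class and the `local_of` / `global_of` callbacks are hypothetical names introduced for illustration, and combining the multisets by adding multiplicities is an assumption, since the slide's formula did not survive.

```python
from collections import Counter


class Node:
    """Minimal labeled tree node (hypothetical helper type)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def tree_profile(root, local_of, global_of):
    """Local multiset of the root plus global multisets of all other nodes."""
    profile = Counter(local_of(root))
    stack = list(root.children)  # every non-root node of the subtree
    while stack:
        v = stack.pop()
        profile.update(global_of(v))  # adds multiplicities
        stack.extend(v.children)
    return profile


def label_bag(v):
    # Simplest possible profile: each node contributes only its own label.
    return Counter([v.label])


tree = Node("woof", [Node("meow", [Node("cluck"), Node("meow")]), Node("ribbit")])
bag = tree_profile(tree, local_of=label_bag, global_of=label_bag)
```

With `label_bag` used for both roles, the result is just the bag of labels; a structural profile such as pq-grams would supply different local and global callbacks.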
Subtree Similarity Search • A friendly reminder of our mission: find the top-k subtrees of a database tree T most similar to a query tree Q • This problem can trivially be solved in polynomial time • The challenge: the huge size of the data, and efficiently computing distances for all subtrees
Subtree Similarity Search • Our algorithm’s basic strategy, given a number k, a query Q, and a tree T: • Go over T in post-order: • Calculate the multiset union and intersection sizes with respect to Q, for the subtree S rooted in the current node of T • Derive a distance value between Q and S • If S is one of the top-k subtrees we’ve seen, keep it in the results set
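The "keep it in the results set" step can be sketched with a bounded heap, which is also where a log(k) cost per subtree would come from. This is an illustrative sketch rather than the paper's code: the function name `top_k_closest` is hypothetical, and the input is assumed to be a stream of (distance, subtree id) pairs produced during the post-order pass.

```python
import heapq


def top_k_closest(scored_subtrees, k):
    """Return the k pairs (distance, subtree_id) with the smallest distance."""
    if k <= 0:
        return []
    heap = []  # max-heap on distance (stored negated): the k best seen so far
    for distance, subtree_id in scored_subtrees:
        if len(heap) < k:
            heapq.heappush(heap, (-distance, subtree_id))
        elif -distance > heap[0][0]:
            # Strictly better than the current worst result: swap it in.
            heapq.heapreplace(heap, (-distance, subtree_id))
    return sorted((-neg, subtree_id) for neg, subtree_id in heap)


best = top_k_closest([(0.5, 0), (0.1, 1), (0.9, 2), (0.3, 3)], k=2)
```

Only k entries are held at any time, so memory stays independent of the size of T.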
Calculating The Multiset Unions • Note: • Using the following formula, calculating the multiset size for each subtree S while iterating over T in post-order is easy:
Calculating The Multiset Intersections • Notation: the multiplicity of x in A is the number of times x appears in A • We sum over each x exactly once, even if it appears several times in the multisets • Suppose we want to calculate the size of the multiset intersection between A={α,α,α,β} and B={α,α,β,γ}
Calculating The Multiset Intersections • We begin by describing a simple algorithm for calculating the intersection sizes • This method is used within the DynamicSearch algorithm in the paper • Later, we will describe an improved algorithm • This improved approach is what we use in the ProfileSimSearch algorithm in the paper
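As a worked version of this example, assume the usual definition of multiset intersection, in which each distinct element contributes the smaller of its two multiplicities. The notation A(x) for the multiplicity of x in A is an assumption, since the slide's own symbol was lost.

```latex
|A \cap B| \;=\; \sum_{x \in \{\alpha,\beta,\gamma\}} \min\bigl(A(x),\, B(x)\bigr)
          \;=\; \min(3,2) + \min(1,1) + \min(0,1)
          \;=\; 2 + 1 + 0 \;=\; 3
```

Each of α, β, γ is visited once, and γ contributes nothing because it does not appear in A.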
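Multiset Intersections; Simple Version • We want to find the size of the intersection with the multiset of Q, for each subtree S of T • Q always stays constant, so we calculate its multiset once • Any element x that does not appear in the multiset of Q contributes 0 to this sum, so we will only calculate for elements x that do appear in it
Multiset Intersections; Simple Version • For each distinct x in the multiset of Q, define a queue • This queue initially contains as many elements as x has occurrences in the multiset of Q, all of which are null placeholders • For example, if the multiset of Q is {a,a,a,b}, we have two queues: one for a, holding three nulls, and one for b, holding one null
Multiset Intersections; Simple Version • We iterate over T in post-order • For each node v, and for each x that appears in both the global multiset of v and the multiset of Q, we perform the following action once per occurrence of x in the global multiset of v: • Pop an element from the front of the queue of x, and • Insert v at the back of the queue of x • The null placeholders are thus gradually replaced by nodes of T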
Multiset Intersections; Simple Version • In the queue of x, during the iteration over the current node of T, we have: • A prefix of nulls and nodes from outside the current subtree • A suffix of the nodes from the current subtree that have x in their global profile
Multiset Intersections; Simple Version • The length of the queue of x always equals the number of occurrences of x in the multiset of Q • We can count the sizes of the suffix and prefix in order to obtain the intersection size (with respect to x) • Note: “local” multiset elements can fit in any slot of the prefix and contribute to the intersection size. We use this fact to account for the local multiset of the current node.
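The simple version described on these slides can be sketched in Python as below. Several details are assumptions made for illustration: the `Node` class, the `label_bag` profile (used for both local and global multisets), storing post-order indices in the queues instead of node objects, and the recursive traversal, which would need to be made iterative for deep trees. The suffix is counted by scanning each queue, which is exactly the cost that a faster variant would have to avoid.

```python
from collections import Counter, deque


class Node:
    """Minimal labeled tree node (hypothetical helper type)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def label_bag(v):
    return Counter([v.label])


def intersection_sizes(root, query_profile, local_of=label_bag, global_of=label_bag):
    """Intersection size with the query multiset for every subtree, in post-order."""
    # One queue per distinct query element, pre-filled with null placeholders.
    queues = {x: deque([None] * m) for x, m in query_profile.items() if m > 0}
    sizes = []
    next_idx = 0  # post-order index of the next node to finish

    def visit(v):
        nonlocal next_idx
        first = next_idx  # smallest post-order index inside v's subtree
        for child in v.children:
            visit(child)
        local = local_of(v)
        total = 0
        for x, q in queues.items():
            # Suffix: entries enqueued by proper descendants of v.
            suffix = sum(1 for e in q if e is not None and e >= first)
            # Local elements may fill prefix slots, up to the queue length.
            total += min(len(q), suffix + local.get(x, 0))
        sizes.append(total)
        idx = next_idx
        next_idx += 1
        for x, m in global_of(v).items():
            q = queues.get(x)
            if q is not None:
                for _ in range(m):
                    q.popleft()    # evict the oldest entry
                    q.append(idx)  # v takes the newest slot
    visit(root)
    return sizes


tree = Node("woof", [Node("meow", [Node("cluck"), Node("meow")]), Node("ribbit")])
sizes = intersection_sizes(tree, Counter({"meow": 2, "cluck": 1}))
```

The returned list follows post-order (cluck, meow, meow, ribbit, woof), so the last entry belongs to the whole tree.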
Is that all? The tree T is huge! Runtime of the simple algorithm is too high.
Making it Scalable • By careful book-keeping, we can avoid the need to count the size of each queue suffix • This reduces the runtime from quadratic to linear • Calculating the intersection with local multiset elements is still needed • But the runtime of this operation is bounded by the local multiset sizes, so it is overall linear in the input size
Calculating the Suffix Size On-The-Fly • 1st attempt: • Each node in T keeps a counter, initialized to 0 (however, we’ll never use more than O(height(T)) memory) • During the post-order iteration over T: • Increase counter(v) whenever v is enqueued in some queue • At the end of the iteration over v, add counter(v) to counter(v.parent) • This is not good enough! What happens when a node is evicted from the queue?
Calculating the Suffix Size On-The-Fly • Fixed: • Each node in T keeps a counter, initialized to 0 (however, we’ll never use more than O(height(T)) memory) • During the post-order iteration over T: • Increase counter(v) whenever v is enqueued in some queue • At the end of the iteration over v, add counter(v) to counter(v.parent) • Whenever a node u is evicted from a queue and node v is inserted instead, decrement counter(LCA(u,v)) • Result: counter(w) contains the size of the suffix during the iteration over w
Calculating the Suffix Size On-The-Fly • The queue of x contains the last nodes that we’ve seen with which x was associated (as many as x has occurrences in the multiset of Q) • Each x can’t contribute more than that number to the intersection size • When u is dequeued and v is enqueued, decrement counter(w), where w is the lowest common ancestor (LCA) of u and v • The queue length always stays the same
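The counter scheme above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's ProfileSimSearch code: `Node` and `label_bag` are hypothetical helpers, queues hold post-order indices, counters live on a stack of currently open nodes (so memory follows the height of T), and the LCA is found by a binary search over that stack, which costs O(log height) rather than the O(1) the slides assume. Local elements are handled by peeking at the front of the queue, so that step is bounded by the local multiset size.

```python
import bisect
from collections import Counter, deque
from itertools import islice


class Node:
    """Minimal labeled tree node (hypothetical helper type)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def label_bag(v):
    return Counter([v.label])


def fast_intersection_sizes(root, query_profile, local_of=label_bag, global_of=label_bag):
    queues = {x: deque([None] * m) for x, m in query_profile.items() if m > 0}
    sizes = []
    open_first = []  # per open node: first post-order index in its subtree
    open_count = []  # per open node: queue slots held by its subtree so far
    next_idx = 0

    def visit(v):
        nonlocal next_idx
        open_first.append(next_idx)
        open_count.append(0)
        for child in v.children:
            visit(child)
        first = open_first[-1]
        total = open_count[-1]  # summed suffix sizes over all queues
        for x, m in local_of(v).items():
            q = queues.get(x)
            if q is not None:
                # Local elements fill prefix slots: count those among the first m.
                total += sum(1 for e in islice(q, m) if e is None or e < first)
        sizes.append(total)
        idx = next_idx
        next_idx += 1
        for x, m in global_of(v).items():
            q = queues.get(x)
            if q is None:
                continue
            for _ in range(m):
                u = q.popleft()
                q.append(idx)
                open_count[-1] += 1  # v was enqueued
                if u is not None:
                    # LCA(u, v): deepest open node whose subtree contains u.
                    lca = bisect.bisect_right(open_first, u) - 1
                    open_count[lca] -= 1
        finished = open_count.pop()
        open_first.pop()
        if open_count:
            open_count[-1] += finished  # pass the counter up to the parent
    visit(root)
    return sizes


tree = Node("woof", [Node("meow", [Node("cluck"), Node("meow")]), Node("ribbit")])
fast = fast_intersection_sizes(tree, Counter({"meow": 2, "cluck": 1}))
```

No queue is ever scanned in full here; the only per-node work beyond the enqueue operations is the peek for local elements.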
The ProfileSimSearch Algorithm • Runtime: • Linear in the multiset sizes for Q and T, plus a factor of |T|log(k) (assuming O(1) calculation time for lowest common ancestors) • Memory use: • Linear in the query’s multiset size, in k, and in height(T) • Runs in a single post-order pass over T • Multisets of T’s nodes can be indexed in advance, for a quick implementation • If all multiset elements can be generated on-the-fly easily, no such preprocessing is necessary
Setup • State of the art for subtree similarity search: • TASM-postorder [Augsten et al.] • StructureSearch [Cohen] • Both algorithms use tree edit distance, and not profile distance functions • We also compare performance with the implementation of tree-to-tree distance using pq-grams by Augsten et al.
Setup • Data sets: • DBLP (17.6 million nodes) • XMark100 – XMark1800 (3.6 to 57.8 million nodes) • Sprot (9.4 million nodes) • Queries: • Random subtrees from the data • Extensive experimentation in paper • In the next slides, all times are in seconds
Varying |Q| (Dataset: 14.5 million nodes) Similar results were observed on all other datasets that were tested
Varying k (Dataset: 14.5 million nodes) Similar results were observed on all other datasets that were tested
Varying Dataset Size Different multiset-generating functions are compared here
Comparison with tree-to-tree pq-gram distance • A MySQL-based implementation of the pq-gram distance calculation routine given by Augsten et al. is compared to ProfileSimSearch • Note: ProfileSimSearch may output top-k results, while the other algorithm is designed to calculate pq-gram distance between two given trees Q,T • Both algorithms use an indexing stage over the database tree T, which is not measured in the following results
Conclusion and Future Work • We presented a definition capable of expressing a large general family of tree distance functions • Efficient and scalable algorithm for subtree search using this definition • Can also be used for tree search with a large set of trees • Future Work: • Use of upper bounds on subtree sizes or other attributes, to prune search space • Use a profile distance function to obtain bounds on tree edit distance, and modify the algorithm to calculate top-k using tree edit distance