Efficient Mining of XML Query Patterns for Caching. L.H. Yang, M.L. Lee, and W. Hsu. Proceedings of the 29th VLDB Conference, 2003. Introduction: presents an efficient algorithm, called FastXminer, to discover frequent XML query patterns.
Efficient Mining of XML Query Patterns for Caching L.H. Yang, M.L. Lee, and W. Hsu Proceedings of 29th VLDB Conference, 2003 Speaker: Chao-Chen Chiu
Introduction • Presents an efficient algorithm, called FastXminer, that • discovers frequent XML query patterns • needs tree containment tests for only a small subset of the generated candidate patterns
Query Pattern Tree • Query pattern tree • A rooted tree QPT = <V, E> • Each edge e = (v1, v2) means v1 is the parent of v2 • Each vertex v has a label in {“*”, “//”} ∪ tagSet • Rooted subtree • A rooted subtree RST = <V’, E’> of QPT satisfies • Root(RST) = Root(QPT) • V’ ⊆ V, E’ ⊆ E • An RST is a k-edge rooted subtree if it has k edges
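The definitions above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `edge_count` helper, and the book/title/author labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One vertex of a query pattern tree (hypothetical representation)."""
    label: str  # a tag name, "*" or "//"
    children: List["Node"] = field(default_factory=list)


def edge_count(root: Node) -> int:
    """Number of edges in the tree rooted at `root`."""
    return sum(1 + edge_count(child) for child in root.children)


# A made-up 3-edge query pattern tree.
qpt = Node("book", [Node("title"), Node("author", [Node("name")])])

# A 2-edge rooted subtree: same root, subset of the vertices and edges.
rst = Node("book", [Node("title"), Node("author")])
```

Here `edge_count(rst)` is 2, so `rst` is a 2-edge rooted subtree of `qpt` in the sense of the slide.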
Tree Inclusion • Partial ordering of labels • Given two labels x and x’, if x = x’, then x ≦ x’ • For any label x ∈ tagSet, define x ≦ * ≦ //
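A minimal sketch of this label ordering, assuming labels are plain strings; the function name `label_leq` is hypothetical. Two different tag names are treated as incomparable.

```python
def label_leq(x: str, y: str) -> bool:
    """Return True if x ≦ y under the ordering: tag ≦ "*" ≦ "//"."""
    if x == y:
        return True

    def rank(label: str) -> int:
        # Tags sit at the bottom, "*" above them, "//" at the top.
        if label == "//":
            return 2
        if label == "*":
            return 1
        return 0

    # Distinct tags share rank 0, so they are never ordered.
    return rank(x) < rank(y)
```

For example, `label_leq("title", "*")` holds, while `label_leq("//", "*")` and `label_leq("title", "author")` do not.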
Tree Inclusion • An RST is contained in a QPT if the following hold: • The root nodes of RST and QPT have the same label • If a node w ∈ RST is matched with a node v ∈ QPT, then (a) w.label ≦ v.label and (b) each subtree of w is contained in some subtree of QPT • From XQPMiner, we know that tree containment tests are expensive!
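The conditions above can be read as a recursive check. The sketch below is a simplified illustration, not the paper's procedure: trees are hypothetical `(label, children)` tuples, each child subtree of w is matched against the child subtrees of v, and "//" is treated only as a label at the top of the ordering, so a "//" node is not allowed to stand for a path that skips levels.

```python
def label_leq(x, y):
    """x ≦ y under the ordering: tag ≦ "*" ≦ "//"."""
    order = {"*": 1, "//": 2}
    return x == y or order.get(x, 0) < order.get(y, 0)


def matches(w, v):
    """w matches v if w.label ≦ v.label and every child subtree of w
    matches some child subtree of v."""
    w_label, w_children = w
    v_label, v_children = v
    return label_leq(w_label, v_label) and all(
        any(matches(cw, cv) for cv in v_children) for cw in w_children
    )


def contained(rst, qpt):
    """Roots must carry the same label; below the root, use `matches`."""
    return rst[0] == qpt[0] and all(
        any(matches(cw, cv) for cv in qpt[1]) for cw in rst[1]
    )


# Made-up trees for illustration.
qpt = ("book", [("title", []), ("*", [("name", [])])])
rst_ok = ("book", [("author", [("name", [])])])
rst_bad = ("book", [("price", [("amount", [])])])
```

With these trees, `contained(rst_ok, qpt)` is true because `author ≦ *`, while `contained(rst_bad, qpt)` is false because `amount` has no match. Even in this simplified form the check walks both trees recursively, which hints at why the slide calls containment tests expensive.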
Mining Query Pattern Trees • Find all frequent 1-edge RSTs by scanning the database once • FastRSTGen generates the candidate set C_{k+1} from the previously found frequent set F_k and prunes unqualified candidates • Contains determines whether RSTk+1 is contained in the pattern tree t • But only single-branch candidate RSTs need this test
Candidate Generation • Schema-guided enumeration • Global query pattern tree (GQPT) • Use a string to represent each QPT, e.g. “1, 2, -1, 3, -1, 8”
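One plausible reading of the string “1, 2, -1, 3, -1, 8” is a preorder listing of GQPT node numbers in which -1 marks a step back to the parent, with trailing -1s dropped. The slide does not spell the scheme out, so this interpretation, the `encode` function, and the `(node_id, children)` tuples are all assumptions.

```python
def encode(tree):
    """Encode a tree of (node_id, children) tuples as a preorder list,
    emitting -1 for each backtrack from a child to its parent
    (assumed scheme; trailing backtracks are omitted)."""

    def walk(node):
        node_id, children = node
        seq = [node_id]
        for child in children:
            seq += walk(child) + [-1]
        return seq

    seq = walk(tree)
    while seq and seq[-1] == -1:
        seq.pop()
    return seq


# Root 1 with children 2, 3 and 8.
flat = (1, [(2, []), (3, []), (8, [])])
# Root 1 with child 2 (which has child 4) and child 3.
nested = (1, [(2, [(4, [])]), (3, [])])
```

Under this assumption, `", ".join(map(str, encode(flat)))` reproduces the string on the slide, and `encode(nested)` gives `[1, 2, 4, -1, -1, 3]`.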
Candidate Generation • Rightmost branch expansion • Given a k-edge RSTk, expand only its rightmost branch • This yields a set of RSTk+1s, all of which have RSTk as a prefix • Two kinds of expansion from RSTk to RSTk+1 • Join of two RSTks • Rightmost leaf node expansion
Candidate Generation • Join of two RSTks, without expanding the rightmost leaf node • The two RSTks must share the same prefix of k nodes, that is, belong to the same equivalence class • Rightmost leaf node expansion • An i-branch RSTk+1 (i > 1) can be obtained by joining two RSTks • A single-branch RSTk+1 needs the tree containment test
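The two kinds of expansion can be illustrated with a simplified sketch. It assumes that GQPT nodes are numbered in preorder and that an RST is identified by the tuple of its node numbers in increasing order, so the last entry is the rightmost leaf. The functions `join` and `expand_rightmost_leaf` and the `children_of` map are hypothetical and are not the paper's FastRSTGen.

```python
def join(a, b):
    """Join two k-edge RSTs from the same equivalence class: they share
    all nodes but the last. Returns the (k+1)-edge RST, or None."""
    if len(a) != len(b) or a[:-1] != b[:-1] or a[-1] >= b[-1]:
        return None
    return a + (b[-1],)


def expand_rightmost_leaf(a, children_of):
    """Attach each GQPT child of the rightmost leaf (the last node)."""
    return [a + (child,) for child in children_of.get(a[-1], [])]


# Hypothetical GQPT with preorder numbering:
# 1 -> 2, 3, 8;  3 -> 4, 6;  4 -> 5;  6 -> 7
children_of = {1: [2, 3, 8], 3: [4, 6], 4: [5], 6: [7]}

joined = join((1, 2), (1, 3))
expanded = expand_rightmost_leaf((1, 3), children_of)
```

Here `joined` is `(1, 2, 3)`, a two-branch RST, while `expanded` is `[(1, 3, 4), (1, 3, 6)]`, both single-branch paths from the root. This matches the slide's point that single-branch candidates arise from rightmost leaf node expansion.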
Candidate Generation • Example
Frequency Counting • Avoid tree inclusion tests • For the join part: if RST_{ij}^{k+1} = RST_i^k ⋈ RST_j^k, then compute RST_{ij}^{k+1}.TIDlist = RST_i^k.TIDlist ∩ RST_j^k.TIDlist • For the rightmost leaf node expansion part: if RST^{k+1} is a multi-branch RST, then it is a join of two k-edge RSTs • Only single-branch RSTs need the tree inclusion test! • Pruning strategy • If a (k+1)-edge RST is frequent, then all of its k-edge RSTs must be frequent
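The TID-list step for joined candidates amounts to a set intersection followed by a count. The sketch below assumes that a TID list is a set of query IDs and that support is an absolute count; the function names and the sample IDs are made up for illustration.

```python
def join_tidlist(tids_i, tids_j):
    """TID list of a joined candidate: the IDs of the queries that
    contain both parent RSTs."""
    return tids_i & tids_j


def is_frequent(tidlist, min_count):
    """A candidate is frequent if enough queries contain it."""
    return len(tidlist) >= min_count


# Hypothetical TID lists of two k-edge RSTs in the same equivalence class.
tids_i = {1, 2, 4, 5}
tids_j = {2, 3, 5}

tids_ij = join_tidlist(tids_i, tids_j)
```

In this example `tids_ij` is `{2, 5}`, so the joined candidate is frequent for a minimum count of 2 but not for 3, and no tree inclusion test was run to find that out.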
Algorithm FastRSTGen Rightmost leaf node expansion
Performance Study • P4 2.4GHz, 1GB RAM, Windows XP • Characteristics of Datasets
Performance Study • Effect of Minimum Support