Efficient Algorithms for Mining Semi-structured Data Joint work with Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Setsuo Arikawa (Kyushu Univ.)
Outline
Efficient Text Data Mining
• Fast and Robust Text Mining Algorithm (ALT'98, ISSAC'98, DS'98)
• Efficient Text Index for Data Mining (CPM'01, CPM'02)
• Text Mining on External Storage (PAKDD'00)
• Applications
  • Interactive document browsing
  • Keyword discovery from the Web
Towards Semi-structured Data Mining
• Efficient Frequent Tree Miner (SDM'02, PKDD'02)
• Mining Semi-structured Data Streams (ICDM'02)
Information Extraction from Web (GI'00, FLAIRS'01)
Conclusion
Semi-structured Data
• Large amounts of semi-structured data on networks
  • XML data [W3C 00], Web/HTML pages
• Demand for mining useful information from a large semi-structured database (Semi-structured Data Mining)
• Tag
• Attribute
• Text
<people>
  <person age="25" id="608">
    <name>John</name>
    <email>john@abc.com</email>
  </person>
  <person id="609">
    <name>Mary</name>
    <tel>555-4567</tel>
  </person>
</people>
[Figure: the same document drawn as a tree with nodes people, person, name, email, tel, attribute nodes @age and @id (values 25, 608, 609) and #text nodes (John, john@abc.com, Mary, 555-4567).]
Goal of this study
• An efficient algorithm for finding frequent substructures in semi-structured data.
  • Labeled ordered trees (as graphs; not a set of paths)
  • Frequent pattern discovery [Agrawal 1994]
  • Efficient for long patterns [Bayardo 1997]
[Figure: an example pattern tree with nodes P, B, FONT, #text, @color (blue) and @face (Times); Minsup = 5%.]
Model of Semi-structured Data
• Labeled ordered trees
  • Each node has a label, which corresponds to:
    • Markup tag
    • Attribute & value
    • Text string
  • The children of a node are ordered from left to right by the sibling relation
  • Each node can have an unbounded number of children (unranked)
• Labeled (unordered) trees
• Labeled graphs
[Figure: the people/person XML document from the Semi-structured Data slide next to its tree representation, with @age and @id attribute nodes and #text nodes.]
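The tree model above can be illustrated with a small sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the function name `to_labeled_tree` and the nested `(label, children)` tuple representation are choices made for this example only, not part of the original work. Attributes become `@name` nodes with the value as a child, and element text becomes a `#text` node, mirroring the figure. Text that follows a child element (mixed content) is ignored here for brevity.

```python
import xml.etree.ElementTree as ET

DOC = """
<people>
  <person age="25" id="608">
    <name>John</name>
    <email>john@abc.com</email>
  </person>
  <person id="609">
    <name>Mary</name>
    <tel>555-4567</tel>
  </person>
</people>
"""

def to_labeled_tree(elem):
    """Convert an XML element into a labeled ordered tree (label, [children])."""
    children = []
    # Attributes first: an "@name" node whose only child holds the value.
    for name, value in elem.attrib.items():
        children.append(("@" + name, [(value, [])]))
    # Leading text of the element: a "#text" node whose child holds the string.
    if elem.text and elem.text.strip():
        children.append(("#text", [(elem.text.strip(), [])]))
    # Child elements, kept in document (left-to-right) order.
    for child in elem:
        children.append(to_labeled_tree(child))
    return (elem.tag, children)

tree = to_labeled_tree(ET.fromstring(DOC))
print(tree[0], [child[0] for child in tree[1]])
```

The children list keeps sibling order, which is what makes the tree "ordered"; the number of children per node is unrestricted, matching the unranked model.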
What is Semi-structured Data Mining?
Finding characteristic subgraphs (patterns) in a given set of labeled trees or graphs
• Characteristic pattern
  • Frequent pattern: a pattern occurring in many graphs
  • Optimized pattern: a pattern distinguishing two different sets of graphs
[Figure: an example pattern tree with nodes P, B, FONT, #text, @color (blue) and @face (Times); Minsup = 5%.]
History of Semi-structured Data Mining
• ~1995: Finding subgraphs by the MDL principle: Subdue [Holder et al. (KDD’94)]
• 1997: Finding frequent paths [Wang and Liu (KDD’97)]
• 1998: Finding semi-structured schema [Nestorov, Abiteboul et al. (SIGMOD’98)]
• 2000: Finding frequent subgraphs: AGM [Inokuchi et al. (PKDD’00)]
• 2001: Finding frequent subgraphs: FSG [Kuramochi et al. (ICDM’01)]
• 2002: Finding frequent ordered trees: FREQT [Asai et al. (SDM’02)], Treeminer [Zaki (KDD’02)]
• 2002: Finding frequent subgraphs: [Vanetik, Gudes, et al. (ICDM’02)], gSpan [Yan and Han (ICDM’02)]
Efficient Algorithms for Discovering Frequent Labeled Ordered Trees
FREQT [Asai et al. (SDM’02, PKDD’02)]
• Efficient enumeration of labeled ordered trees using the rightmost expansion technique.
• Incremental updating of rightmost-leaf occurrences.
TreeminerV [Zaki (SIGKDD’02)]
• The enumeration technique is the same as ours.
• The counting method is different from ours.
• Developed independently of ours.
Tree Matching
A pattern tree T matches a data tree D (T occurs in D) if there is a matching function f from T into D such that:
• f is 1-to-1.
• f preserves the parent-child relation.
• f preserves the (indirect) sibling relation.
• f preserves labels.
[Figure: pattern tree T (root A with children B and C) and data tree D, with two occurrences P1 and P2 of T marked in D.]
The occurrences of a pattern
• A root occurrence of T: the node to which the root of T maps under a matching function
• The root count of T: the number of distinct root occurrences of T in D
[Figure: pattern tree T (root A with children B and C) and data tree D with nodes numbered 1 to 11; T occurs at P1 and P2, giving the root occurrence list OccD(T) = {2, 8}.]
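The matching conditions and the root occurrence list can be checked with a brute-force sketch. This is not the counting method of FREQT; it is a naive matcher written for illustration. The names `build`, `matches_at` and `root_occurrences` are hypothetical, trees are given as preorder lists of (depth, label) pairs, and the data tree is one reading of the figure (root R, nodes numbered 1 to 11 in preorder). Children of a pattern node are matched greedily, left to right, against a subsequence of the children of the data node, which keeps the map 1-to-1 and preserves sibling order.

```python
def build(pairs):
    """Turn preorder (depth, label) pairs into (labels, children) arrays."""
    labels, children, branch = [], [], []
    for node, (depth, label) in enumerate(pairs):
        labels.append(label)
        children.append([])
        del branch[depth:]              # keep only the ancestors of this node
        if branch:
            children[branch[-1]].append(node)
        branch.append(node)
    return labels, children

def matches_at(pat, p, data, v):
    """True if the pattern subtree rooted at p matches with p mapped to v."""
    (pl, pc), (dl, dc) = pat, data
    if pl[p] != dl[v]:
        return False
    remaining = iter(dc[v])             # data children not yet consumed
    # Each pattern child takes the first later data child it matches.
    return all(any(matches_at(pat, c, data, u) for u in remaining)
               for c in pc[p])

def root_occurrences(pattern_pairs, data_pairs):
    """1-based preorder numbers of the data nodes where the pattern root maps."""
    pat, data = build(pattern_pairs), build(data_pairs)
    return [v + 1 for v in range(len(data_pairs)) if matches_at(pat, 0, data, v)]

# Data tree D as read from the figure (an assumption of this sketch).
D = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
     (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]
T = [(0, "A"), (1, "B"), (1, "C")]
print(root_occurrences(T, D))
```

Under this reading of the figure, the root count of T is the length of the returned list.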
Algorithm FREQT
• Stage 1: Compute F1.
• Stage k (k = 2, 3, …): Compute Fk from Fk-1.
  • Compute k-patterns by the rightmost expansion (Ck from Fk-1).
  • Update their rightmost-leaf occurrences.
  • Select the frequent k-patterns in Ck (Fk from Ck).
Fk: the set of all frequent k-patterns. Ck: the candidate set for Fk. (k-patterns = patterns of size k.)
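The level-wise structure of the stages can be sketched as follows. This is a simplified illustration, not the authors' implementation: it recounts every candidate with a naive matcher instead of updating rightmost-leaf occurrences incrementally, and it assumes the frequency of a pattern is its root count divided by the number of data nodes. All function names, and the example data tree, are assumptions of the sketch. Trees are preorder tuples of (depth, label) pairs.

```python
def build(pairs):
    """Turn preorder (depth, label) pairs into (labels, children) arrays."""
    labels, children, branch = [], [], []
    for node, (depth, label) in enumerate(pairs):
        labels.append(label)
        children.append([])
        del branch[depth:]
        if branch:
            children[branch[-1]].append(node)
        branch.append(node)
    return labels, children

def matches_at(pat, p, data, v):
    """Naive ordered matching of pattern node p at data node v."""
    (pl, pc), (dl, dc) = pat, data
    if pl[p] != dl[v]:
        return False
    remaining = iter(dc[v])
    return all(any(matches_at(pat, c, data, u) for u in remaining)
               for c in pc[p])

def freqt(data_pairs, minsup):
    """Return all patterns whose root count / |D| is at least minsup."""
    data, n = build(data_pairs), len(data_pairs)
    alphabet = sorted({label for _, label in data_pairs})

    def is_frequent(pattern):
        pat = build(pattern)
        root_count = sum(matches_at(pat, 0, data, v) for v in range(n))
        return root_count / n >= minsup

    # Stage 1: frequent 1-patterns.
    level = [((0, l),) for l in alphabet if is_frequent(((0, l),))]
    found = []
    while level:                        # Stage k: F_k from F_{k-1}
        found.extend(level)
        candidates = [s + ((d, l),)     # rightmost expansion of each s
                      for s in level
                      for d in range(1, s[-1][0] + 2)
                      for l in alphabet]
        level = [t for t in candidates if is_frequent(t)]
    return found

D = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
     (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]
for pattern in freqt(D, minsup=0.18):
    print(pattern)
```

Because a pattern can only be frequent if the pattern obtained by removing its rightmost leaf is frequent, expanding only the frequent (k-1)-patterns does not miss any frequent k-pattern.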
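Rightmost Expansion
• (d, l)-expansion T of a tree S
  • The tree T obtained by attaching a new node k to the rightmost branch of S.
  • k is the rightmost leaf of T.
  • (d, l): the depth and label of k
• The rightmost expansion of S
[Figure: a tree S with nodes 1 to k-1 and the new node k with label l, attached at depth d below the node of depth d-1 on the rightmost branch.]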
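A minimal sketch of the expansion step, assuming a tree is stored as a preorder tuple of (depth, label) pairs with the root at depth 0. In that encoding the last pair is the rightmost leaf, so attaching a new rightmost leaf at depth d simply appends (d, l), for any d from 1 up to one more than the depth of the current rightmost leaf. The function name `rightmost_expansions` is hypothetical.

```python
def rightmost_expansions(tree, alphabet):
    """All (d, l)-expansions of an ordered tree given as preorder (depth, label) pairs."""
    if not tree:
        # The empty tree expands to every single-node tree.
        return [((0, l),) for l in alphabet]
    deepest = tree[-1][0] + 1           # new node may hang below the rightmost leaf
    return [tree + ((d, l),)
            for d in range(1, deepest + 1)
            for l in alphabet]

S = ((0, "A"), (1, "B"))
for T in rightmost_expansions(S, ["A", "B"]):
    print(T)
```

Each tree is produced from exactly one smaller tree (the one obtained by deleting its rightmost leaf), which is why this expansion enumerates ordered trees without duplicates.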
Ordered tree enumeration tree [Asai et al., SDM’02; Zaki, SIGKDD’02]
A generalization of the set enumeration tree [Bayardo 97] to ordered trees
• The root is the empty tree (⊥).
• Each node is an ordered tree, and has its (d, l)-expansions as its children.
[Figure: the enumeration tree over the labels A and B, rooted at ⊥, with edges labeled by expansions such as (0,A), (0,B), (1,A), (1,B), (2,A) and (2,B).]
Incremental Computation of the Rightmost-Leaf Occurrences
For the (p, A)-expansion of a pattern T:
• Scan the list of old rightmost-leaf occurrences.
• For each old occurrence x:
  • Go upward to the (p-1)-th parent h of x.
  • Starting from h, scan its proper younger siblings.
  • Add the siblings with label A to the new rightmost-leaf occurrence list.
[Figure: pattern T with nodes 1 to k-1 and the new node k with label A, and data tree D showing an old occurrence x, its (p-1)-th parent h, and the proper younger siblings of h that are scanned.]
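The update steps can be sketched as below, under several assumptions: p counts upward from the rightmost leaf (p = 1 attaches the new node as a younger sibling of the leaf), the case p = 0 (new node as a child of the leaf, not shown on the slide) scans the children of x, and repeated hits are dropped with a simple `seen` set rather than the authors' duplicate detection. Node numbers are 1-based preorder numbers; all names and the example data tree are hypothetical.

```python
def build(pairs):
    """Preorder (depth, label) pairs -> labels, parent and children maps (1-based)."""
    labels, parent, children, branch = {}, {}, {}, []
    for node, (depth, label) in enumerate(pairs, start=1):
        labels[node], children[node] = label, []
        del branch[depth:]
        parent[node] = branch[-1] if branch else None
        if branch:
            children[branch[-1]].append(node)
        branch.append(node)
    return labels, parent, children

def update_occurrences(old_occ, p, label, labels, parent, children):
    """New rightmost-leaf occurrences after a (p, label)-expansion."""
    new_occ, seen = [], set()
    for x in old_occ:
        if p == 0:
            candidates = children[x]            # assumed case: new child of x
        else:
            h = x
            for _ in range(p - 1):              # climb to the (p-1)-th parent
                h = parent[h]
            if parent[h] is None:
                continue                        # h is the root: no siblings
            siblings = children[parent[h]]
            candidates = siblings[siblings.index(h) + 1:]   # proper younger siblings
        for y in candidates:
            if labels[y] == label and y not in seen:
                seen.add(y)
                new_occ.append(y)
    return new_occ

D = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
     (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]
labels, parent, children = build(D)
# Pattern A(B) has rightmost-leaf occurrences [3, 9]; add C as a sibling of B.
print(update_occurrences([3, 9], 1, "C", labels, parent, children))
```

Only the old occurrence list is scanned, so the cost of counting an expanded pattern depends on how often its predecessor occurs, not on a full rescan of the data tree.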
Performance Study of FREQT
• Dataset: citeseers
• Minimum support: s = 3.0% (fixed)
• Data size increased from 0.3MB to 5.6MB
[Figure: runtime (sec) vs. # of nodes; annotated point: 178,285 nodes, 1.39 sec.]
Algorithm Comparison
[Figure: run time (sec) vs. minimum support (%). Annotations at s = 2%: 37.1 sec, 3.29 sec and 1.15 sec; "3 times faster (by Pruning)"; "10 times faster (by DD)".]
Experiment: FREQT
Frequent Substructure Discovery from Web
<a href="_"> <font color="#6F6F6F"> #text_1 </font> </a>
<p> #text_2
  <b> #text_3 <!-- CITE--> <font color="green"> #text_4 </font> #text_5 </b>
  #text_6 <br /><br />
  <font color="#999999"> #text_7 <i> #text_8 </i> #text_9 </font>
</p>
• Effective for schema discovery
  • DataGuide [Widom, Garcia-Molina et al. (VLDB’97)]
Optimized Pattern Discovery
Algorithm OPTT [Abe et al. (PKDD’02)]
• Finds patterns that are
  • frequent in positive data, and
  • infrequent in negative data.
• Applicable to classification of trees and graphs
[Figure: a pattern P dividing the positive data and the negative data into matched and unmatched parts.]
Experiment: OPTT
Optimized Pattern Discovery from XML Data
• Pos Data: Action movie ×15
• Neg Data: Family movie ×15
• Algorithm: OPTT
Examples of optimized patterns
• #Occ: Pos 10, Neg 0: <movie> <certification> <certif> sweden:15 </certif> </certification> </movie>
• #Occ: Pos 1, Neg 12: <movie> <title /> <genre> animation </genre> </movie>
• Effective for classification of semi-structured data
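Optimized Rule/Pattern Discovery
• A pattern splits the population of size N into two parts S1 and S0, with populations N1 and N0.
• Evaluation function for a pattern T, where ψ denotes the impurity function applied to the ratios M1/N1 and M0/N0:
$$G_{S,\psi}(T) = \frac{N_1}{N}\,\psi\!\left(\frac{M_1}{N_1}\right) + \frac{N_0}{N}\,\psi\!\left(\frac{M_0}{N_0}\right)$$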
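The evaluation function can be made concrete with a short sketch. It rests on assumptions not stated on the slide: M1 and M0 are taken to be the numbers of positive examples in S1 and S0, the impurity function is taken to be the Gini index 2x(1 - x) (entropy would be another common choice), and a lower value is read as a cleaner split. The function names are hypothetical.

```python
def gini(x):
    """Gini index of a two-class ratio x; an assumed choice of impurity function."""
    return 2.0 * x * (1.0 - x)

def evaluate(n1, m1, n0, m0, impurity=gini):
    """Weighted impurity (N1/N) * psi(M1/N1) + (N0/N) * psi(M0/N0)."""
    n = n1 + n0
    total = 0.0
    for size, positives in ((n1, m1), (n0, m0)):
        if size:                        # an empty part contributes nothing
            total += (size / n) * impurity(positives / size)
    return total

# A pattern matching 10 of 15 positive and 0 of 15 negative documents:
# S1 has N1 = 10 with M1 = 10 positives, S0 has N0 = 20 with M0 = 5 positives.
print(evaluate(n1=10, m1=10, n0=20, m0=5))
# A pattern that separates the two classes perfectly.
print(evaluate(n1=15, m1=15, n0=15, m0=0))
```

With this choice of impurity, a pattern that matches only positive data drives the first term to zero, and the remaining value reflects how mixed the unmatched part still is.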
Theoretical results
• Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time $O(k^k b^k N)$.
  • (Note: a straightforward algorithm has super-linear time complexity when the number of labels grows in N.)
• Theorem: If the maximum size k of subwords is unbounded, then for any ε > 0 there exists no polynomial-time (770/767 − ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size on an unbounded label alphabet, if P ≠ NP.
Proc. SIAM Data Mining 02 (2001), and Proc. PKDD'02 (2002)
Algorithm details
Mining Semi-structured Data Streams
• Emerging applications on the Internet
  • E.g. network monitoring, web management, e-commerce
• Not a static collection but a transient data stream
  • Unbounded, rapid, continuous, time-varying
• Traditional data mining methods cannot be directly applied.
Mining algorithm for semi-structured data streams: StreamT [Asai et al. (ICDM’02)]
[Figure: a SAX event stream over a movie database in XML, beginning <moviedb> <movie> <title> Godfather </title> <year> 1972 </year> <directed_by> <person> <name> Francis Ford Coppola </name> … and continuing with a long <filmography> list of <title> elements.]
Semi-structured Data Stream
• (v1, v2, …, vi, …) ∈ (N × L)^∞
• vi = (di, li): the depth and label of node i
(depth, label) pair representation: the semi-structured data stream w.r.t. a data tree D
(0,R), (1,A), (2,B), (2,A), (2,C), (3,B), (1,C), (2,A), (3,B), (3,C), (2,B)
[Figure: data tree D with root R and nodes numbered 1 to 11.]
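The (depth, label) pair representation is a preorder traversal that records each node's depth. A minimal sketch, assuming the data tree is given as nested `(label, children)` tuples (a representation chosen for this example) and that the tree below is a faithful reading of the figure:

```python
def to_stream(tree, depth=0):
    """Yield (depth, label) pairs of a nested (label, children) tree in preorder."""
    label, children = tree
    yield (depth, label)
    for child in children:
        yield from to_stream(child, depth + 1)

# Data tree D, written out from the pair sequence on the slide.
D = ("R", [
    ("A", [("B", []), ("A", []), ("C", [("B", [])])]),
    ("C", [("A", [("B", []), ("C", [])]), ("B", [])]),
])
print(list(to_stream(D)))
```

Because the function is a generator, the pairs can be consumed one at a time, which is the access pattern an online algorithm has to work with.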
Example
Each (depth, label)-pair in a stream corresponds to an open parenthesis in XML data.
XML data:
<moviedb> <movie> <title> Godfather </title> <year> 1972 </year> <directed_by> <person> <name> Francis Ford Coppola </name> <birth_name> Francis Ford Coppola </birth_name> <date_of_birth> <day> 7 April </day> <year> 1939 </year> . . .
(depth, label) pair representation:
(0, moviedb), (1, movie), (2, title), (3, “Godfather”), (2, year), (3, “1972”), (2, directed_by), (3, person), (4, name), (5, “Francis Ford Coppola”), (4, birth_name), (5, “Francis Ford Coppola”), (4, date_of_birth), (5, day), (6, “7 April”), (5, year), (6, “1939”), . . .
[Figure: the corresponding tree with nodes moviedb, movie, title, year, directed_by, person, name, birth_name and date_of_birth.]
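One way to obtain such pairs from XML is a SAX handler that tracks the current depth. The sketch below uses Python's standard `xml.sax` module; the class name `DepthLabelHandler` is hypothetical, and treating each non-empty text run as a node one level below its element is an assumption that follows the example above. Attributes are ignored for brevity.

```python
import xml.sax

class DepthLabelHandler(xml.sax.ContentHandler):
    """Collect (depth, label) pairs from SAX events."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.pairs = []
        self._text = []                 # text chunks seen since the last tag

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.pairs.append((self.depth, text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.pairs.append((self.depth, name))   # one pair per opening tag
        self.depth += 1

    def characters(self, content):
        self._text.append(content)      # the parser may deliver text in pieces

    def endElement(self, name):
        self._flush_text()
        self.depth -= 1

XML = b"<moviedb><movie><title> Godfather </title><year> 1972 </year></movie></moviedb>"
handler = DepthLabelHandler()
xml.sax.parseString(XML, handler)
print(handler.pairs)
```

The handler never holds more than the current depth and a pending text buffer, so it fits the setting where the document is read once and not stored.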
Offline vs. Online
• FREQT (Offline): horizontal scan (level-wise search)
• StreamT (Online): vertical scan
[Figure: two grids with data positions 1, 2, …, i, …, n on one axis and pattern sizes 1, 2, …, k on the other, showing the scan direction of each algorithm.]
Related Work: Online Data Mining
• Brin et al. [SIGMOD’97]
  • Dynamic Itemset Counting.
  • Mining association rules using fewer scans.
• Hidber [SIGMOD’99]
  • Carma: online mining of association rules from transaction data streams.
• Manku & Motwani [VLDB’02]
  • Approximate online mining algorithms for frequent items and itemsets from data streams.
  • New candidate management policy with a provable space complexity.
Main Result
• StreamT: an online algorithm for finding frequent labeled ordered trees from semi-structured data streams
• Techniques
  • Plain sweeping technique
  • Adaptive candidate management
• Extensions to various online models
• Theoretical and Empirical Analysis
Sweep branch SB
• The unique path from the root to the current node vi
• The algorithm sweeps the sweep branch SB rightwards
  • Records the occurrences of the candidate patterns on SB
  • Uses root and bottom occurrences
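Maintaining the sweep branch itself takes only a stack. The sketch below is a minimal illustration that tracks the branch as the (depth, label) pairs arrive; it does not record pattern occurrences, which is the substantial part of StreamT. The function name `sweep_branches` and the example stream are assumptions of this sketch.

```python
def sweep_branches(stream):
    """Yield the labels on the path from the root to each arriving node."""
    branch = []
    for depth, label in stream:
        del branch[depth:]      # nodes at this depth or deeper leave the branch
        branch.append(label)    # the new node becomes the bottom of the branch
        yield tuple(branch)

stream = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
          (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]
for i, branch in enumerate(sweep_branches(stream), start=1):
    print(i, branch)
```

Each arrival touches only the part of the branch at or below the new node's depth, so the memory used for the branch is bounded by the depth of the tree rather than its size.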
Tree sweeping technique
[Animation: the sweep branch moves across the data tree while the occurrences of a pattern are counted. The successive frames report # Occurrences = 4, 0, 0, 1, 2, 3, 4, 4, 5, 6, 7.]
Bottom occurrences of a pattern
• Pattern T ∈ C
• Property: the intersection of an embedding of a pattern and the SB forms a chain of nodes.
• We record only the pair of the root and the bottom occurrences of a pattern T, instead of the whole occurrence of T.
[Figure: the left-half tree Di and the sweep branch SBi; the intersection of SB and the tree runs from the root occurrence vR to the bottom occurrence vB.]
Sweep Branch Stack B
• Property: the d-th bucket of the sweep branch stack B keeps all candidate patterns that have a bottom occurrence with depth d on the current sweep branch.
[Figure: the i-th sweep branch stack B next to the i-th branch SB(i); a (k-1)-pattern S at time i-1 is kept in the bucket of depth d.]
How to maintain the SB-stack: Summary
• The next node has depth d.
• Classify the candidates by their bottom depth (0, …, d-1, d, …).
[Figure: the buckets at time i-1 and at time i for Cases 1 to 4, marked REMOVE, UNCHANGE, CHANGE and RIGHTMOST-EXPANSION.]
Online candidate management policy [Hidber’99]
Observation (Monotonicity): a pattern is frequent only if its predecessor is frequent.
• Initially insert all patterns of size 1 into C.
• Predecessor of T is frequent => insert T into C
• Predecessor of T is infrequent => delete T from C
[Figure: the set F of frequent patterns and the set C of candidate patterns of stage i, drawn on the enumeration tree rooted at ⊥.]
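The three rules can be sketched as one update of the candidate set. This is an illustration under assumptions, not the StreamT bookkeeping: patterns are preorder tuples of (depth, label) pairs, the predecessor of a pattern is taken to be the pattern without its rightmost leaf, and the set of currently frequent patterns is supplied by the caller. The function name `update_candidates` is hypothetical.

```python
def update_candidates(candidates, frequent, alphabet):
    """Apply the insert/delete rules once and return the new candidate set C."""
    # Delete rule: drop T when its predecessor (T minus its rightmost leaf)
    # is not frequent. Patterns of size 1 always stay in C.
    kept = {t for t in candidates if len(t) == 1 or t[:-1] in frequent}
    # Insert rule: every rightmost expansion of a frequent candidate enters C.
    for s in frequent & candidates:
        for d in range(1, s[-1][0] + 2):
            for l in alphabet:
                kept.add(s + ((d, l),))
    return kept

alphabet = ["A", "B"]
C = {((0, "A"),), ((0, "B"),), ((0, "B"), (1, "A"))}
F = {((0, "A"),)}                       # suppose only A is currently frequent
for t in sorted(update_candidates(C, F, alphabet)):
    print(t)
```

In this example the pattern B(A) leaves C because its predecessor B is not frequent, while the two expansions of A enter it.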
Various online models
• Basic model [Hidber 99]: all data up to time i; unsuitable for tracking rapid trend changes
• Sliding window model [Mannila et al. 95]: window of size w, covering time i-w+1 to time i
• Forgetting model [Yamanishi et al. 00]: forgetting factor g; data from a past time j is weighted by g^(i-j) at the current time i
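The difference between the models can be illustrated with two small counters. This sketch assumes one observation per time step (whether a pattern occurred at that step) and normalizes the forgetting-weighted count by the total weight; the normalization and the class names are assumptions for this example, not taken from the slides. Setting g = 1 reduces the forgetting counter to the basic model.

```python
from collections import deque

class ForgettingCounter:
    """Frequency where an observation at time j has weight g**(i - j) at time i."""

    def __init__(self, g):
        self.g = g
        self.weighted_hits = 0.0
        self.weighted_total = 0.0

    def observe(self, hit):
        # Multiplying by g ages every earlier observation by one more step.
        self.weighted_hits = self.g * self.weighted_hits + (1.0 if hit else 0.0)
        self.weighted_total = self.g * self.weighted_total + 1.0
        return self.weighted_hits / self.weighted_total

class SlidingWindowCounter:
    """Frequency over the last w observations only."""

    def __init__(self, w):
        self.window = deque(maxlen=w)   # the oldest entry drops out automatically

    def observe(self, hit):
        self.window.append(1 if hit else 0)
        return sum(self.window) / len(self.window)

forgetting, window = ForgettingCounter(g=0.5), SlidingWindowCounter(w=2)
for hit in (True, False, True):
    print(forgetting.observe(hit), window.observe(hit))
```

The forgetting counter needs only two numbers per pattern, whereas the sliding window has to remember the last w observations.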
Online Scalability
• Minimum support: s = 1.0%
[Figure: runtime (sec) vs. # of nodes.]
[Figure: results for the memory settings mem = 64MB and mem = 575MB, annotated with the counts 149,521, 307,286 and 455,161, where 1,646 = 149,521 + 307,286 - 455,161.]
Experiments: Performance and effectiveness of forgetting
• Data size: 130MB
• # of nodes: 3,185,138
• # of labels: 72
[Figures: "Performance" and "Effectiveness of forgetting", with curves for Basic and Forgetting; axis annotations 1,348 (sec) and 3,200,000 (nodes).]