Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/ *: corresponding author hfli@csie.nctu.edu.tw
Outline
• Introduction
  • Mining of Data Streams, Tree Mining
• Problem Definition
  • Online Mining of Frequent Query Trees over XML Data Streams
• The Proposed Algorithm
  • FQT-Stream (Frequent Query Trees of Streams)
• Conclusions and Future Work
Mining of Data Streams: Motivations
• Many applications generate data streams
  • Day-to-day business (credit card, ATM transactions, etc.)
  • Hot Web services (XML data, record and click streams)
  • Telecommunication (call records)
  • Financial market (stock exchange)
  • Surveillance (sensor network, audio/video)
  • System management (network events)
• Application characteristics
  • Massive volumes of data (several terabytes)
  • Records arrive at a rapid rate
  • Data distribution changes on the fly
• What do we want to get from data streams?
  • Real-time query answering, statistics, and pattern discovery
Mining of Data Streams: Computation Model
[Figure: computation model with Data Streams, Buffer, Stream Mining Processor, Synopsis in Memory, and (Approximate) Results]
• Requirements of mining data streams
  • Single pass: each record is examined at most once
  • Bounded storage: limited memory for storing the synopsis
  • Real-time: per-record processing time (to maintain the synopsis) must be low
Problem Definition of Frequent Query Tree Mining (1/2)
• XML Query Tree Stream (XQTS)
  • A sequence of query trees (QTs): QT1, QT2, …, QTN
  • N is the tree id of the latest incoming query tree
• Support of a query tree QTi
  • sup(QTi): the number of QTs in the XQTS that contain QTi as a subtree
Problem Definition of Frequent Query Tree Mining (2/2)
• A QTi is a Frequent Query Tree (FQT)
  • if and only if sup(QTi) ≥ sN
  • s is a user-defined minimum support threshold in the range [0, 1]
• Our task
  • To mine the set of all frequent query trees (FQTs) in one scan of the XQTS
  • Using as little memory as possible
Proposed Algorithm FQT-Stream (Frequent Query Trees of Streams)
• FQT-Stream consists of 5 phases
  • 1. Read a QT (Query Tree) from the buffer in main memory
  • 2. Transform the QT into a new NQTS (Normalized Query Tree Sequence) representation
  • 3. Construct an in-memory summary data structure called FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs
  • 4. Prune the infrequent query trees from the FQT-forest
  • 5. Find the set of all FQTs (Frequent Query Trees) from the current FQT-forest
• Since phase 1 is straightforward, we focus on phases 2-5
Phase 2 of FQT-Stream: NQTS Transformation
• NQTS transformation of a QT
  • Using DFS on the QT
  • The result is a sequence of triples (node-id, level, order)
    • level: the level of the node in the QT
    • order: the sequence order of the node in the NQTS
  • For example (5-NQTS in Figure 1)
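The transformation can be sketched in Python as a pre-order DFS that emits one triple per node. This is only an illustration under stated assumptions: the function name `to_nqts`, the child-list dictionary, and the tree shape (A with children B and C, B with children D and E) are hypothetical, the last being inferred from the 5-NQTS used in the Order-Break example rather than taken from Figure 1.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def to_nqts(root: str, children: Dict[str, List[str]]) -> List[Triple]:
    """Pre-order DFS that emits a (node-id, level, order) triple per node."""
    nqts: List[Triple] = []

    def visit(node: str, level: int) -> None:
        # order is the 1-based rank in which the DFS reaches the node
        nqts.append((node, level, len(nqts) + 1))
        for child in children.get(node, []):
            visit(child, level + 1)

    visit(root, 0)  # the root is assumed to sit at level 0
    return nqts


# Hypothetical query tree: A -> (B, C), B -> (D, E)
example_tree = {"A": ["B", "C"], "B": ["D", "E"]}
example_nqts = to_nqts("A", example_tree)
print(example_nqts)
```

For this assumed tree the sketch yields <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>.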
Phase 3 of FQT-Stream: FQT-forest Construction (1/4)
• For each NQTS, 2 steps are performed to construct the FQT-forest
• Step 1: enumerate each NQTS into a set of sub-sequences using the Order-Break (OB) technique
  • OB is a level-wise method
Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4)
• For example, a 5-NQTS = <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
• First, the 5-NQTS is broken into three 4-NQTSs
  • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
  • <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)>
  • <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>
• These sequences are 1-OB (One Order Break)
  • 1-OB sequences have one order break in the sequence order
  • The original 5-NQTS is called 0-OB
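The first break step of this example can be reproduced with a short Python sketch. It rests on two assumptions that the slides do not state explicitly: an order break is counted as a gap between consecutive order values, and the first and last nodes are always kept (as in the three 4-NQTSs listed). The helper names `count_order_breaks` and `drop_one_interior` are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)
NQTS = Tuple[Triple, ...]


def count_order_breaks(seq: NQTS) -> int:
    """Number of adjacent pairs whose order values are not consecutive."""
    return sum(1 for a, b in zip(seq, seq[1:]) if b[2] - a[2] > 1)


def drop_one_interior(seq: NQTS) -> List[NQTS]:
    """All sub-sequences obtained by deleting exactly one interior node.

    Assumption: the first and last nodes are never deleted.
    """
    return [seq[:i] + seq[i + 1:] for i in range(1, len(seq) - 1)]


nqts5: NQTS = (("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5))
four_nqtss = drop_one_interior(nqts5)
for sub in four_nqtss:
    print(sub, "breaks:", count_order_breaks(sub))
```

Under these assumptions the original sequence has zero breaks (0-OB) and each of the three 4-NQTSs has exactly one (1-OB).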
Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4)
• After deleting the duplicates
  • Three 4-NQTSs → two 3-NQTSs with one order break
  • Two 3-NQTSs → one 2-NQTS
  • <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1, 2), (C, 1, 5)> → <(A, 0, 1), (C, 1, 5)>
• Finally, the set of 1-OB contains 8 NQTSs
Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4)
• The set of 2-OB is generated from the set of 1-OB
• For example
  • 2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
• Repeat this process until no candidate k-OB can be generated
• Property 1
  • The maximum size of order break is k-3, i.e., (k-3)-OB, if the query tree has k nodes
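One way to read this level-wise step is sketched below in Python. It is an interpretation, not the authors' procedure: candidates are formed by deleting one interior node from each j-OB sequence, and only those with exactly j+1 gaps in their order values are kept. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)
NQTS = Tuple[Triple, ...]


def count_order_breaks(seq: NQTS) -> int:
    """Number of adjacent pairs whose order values are not consecutive."""
    return sum(1 for a, b in zip(seq, seq[1:]) if b[2] - a[2] > 1)


def next_ob_level(current: Iterable[NQTS], target_breaks: int) -> Set[NQTS]:
    """Generate the next OB level from the current one.

    Assumption: a candidate is made by deleting one interior node and is
    kept only if it has exactly `target_breaks` order breaks.
    """
    result: Set[NQTS] = set()  # a set removes duplicate candidates
    for seq in current:
        for i in range(1, len(seq) - 1):
            candidate = seq[:i] + seq[i + 1:]
            if count_order_breaks(candidate) == target_breaks:
                result.add(candidate)
    return result


one_ob: NQTS = (("A", 0, 1), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5))
two_ob = next_ob_level([one_ob], target_breaks=2)
print(two_ob)
```

Applied to the 1-OB sequence from the example, the sketch returns the single 2-OB sequence <(A, 0, 1), (D, 2, 3), (C, 1, 5)>; deleting D instead of E leaves only one break, so that candidate is not a 2-OB.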
Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3)
• The OBs (0-OB, 1-OB, 2-OB) are projected and inserted into an FQT-forest using the Incremental Projection (IP) technique
• An NQTS, <X1X2…Xi>, with i nodes is projected into i sub-NQTSs (also called node-suffix NQTSs)
  • <Xi>, <Xi-1Xi>, …, <X2…Xi>, <X1X2…Xi>
• We use one field, node-id, to represent the fields (node-id, level, order) for simplicity
Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3)
• Example of IP
  • 1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected into 4 node-suffix NQTSs as follows
    • <(C, 1, 5)>
    • <(E, 2, 4), (C, 1, 5)>
    • <(D, 2, 3), (E, 2, 4), (C, 1, 5)>
    • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
• After projection, a tree structure check is performed
  • If the level of the first node in a node-suffix NQTS is not the smallest level, the node-suffix NQTS is deleted
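The projection and the tree structure check can be sketched in Python as follows. The helper names are hypothetical, and one point is an assumption: the slides do not say how ties are treated, so this sketch keeps a suffix whenever its first node's level equals the minimum level in the suffix.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)
NQTS = Tuple[Triple, ...]


def node_suffixes(seq: NQTS) -> List[NQTS]:
    """Project an i-node NQTS into its i suffixes, shortest first."""
    return [seq[i:] for i in range(len(seq) - 1, -1, -1)]


def passes_tree_check(suffix: NQTS) -> bool:
    """Keep a suffix only if its first node has the smallest level.

    Assumption: a tie with a later node at the same level is accepted.
    """
    return suffix[0][1] == min(level for _, level, _ in suffix)


one_ob: NQTS = (("A", 0, 1), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5))
suffixes = node_suffixes(one_ob)
kept = [s for s in suffixes if passes_tree_check(s)]
print(kept)
```

For the example, the suffixes starting at E and D are deleted because C sits at a smaller level, leaving <(C, 1, 5)> and the full sequence starting at A.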
Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3)
• After the tree structure check
  • The node-suffix NQTSs are inserted into the FQT-forest
  • The supports of the corresponding nodes are updated
• The FQT-forest consists of 2 parts
  • FN-list
    • A list of Frequent Nodes
    • Each node Xi in the FN-list has an NQTS-tree (Xi.NQTS-tree)
  • NQTS-trees (trees of Normalized Query Tree Sequences)
    • A sequence (NQTS) is represented by a path
    • Its appearance frequency is maintained in the last node of the path
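A minimal Python sketch of such a structure is shown below, assuming a prefix-tree layout in which the FN-list maps each head node-id to the root of its NQTS-tree. The class and method names (`PathNode`, `FQTForest`, `insert`, `count_of`) are hypothetical, sequences are reduced to node-ids only (as the slides do for simplicity), and how per-node supports are maintained is left out because the slides do not detail it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class PathNode:
    """One node of an NQTS-tree; count is the frequency of the path ending here."""
    count: int = 0
    children: Dict[str, "PathNode"] = field(default_factory=dict)


class FQTForest:
    """FN-list plus one NQTS-tree per head node (illustrative layout only)."""

    def __init__(self) -> None:
        self.fn_list: Dict[str, PathNode] = {}

    def insert(self, node_ids: Sequence[str]) -> None:
        """Insert a node-suffix NQTS as a path and bump the count at its end."""
        node = self.fn_list.setdefault(node_ids[0], PathNode())
        for node_id in node_ids[1:]:
            node = node.children.setdefault(node_id, PathNode())
        node.count += 1

    def count_of(self, node_ids: Sequence[str]) -> int:
        """Frequency stored at the last node of the path, or 0 if absent."""
        node: Optional[PathNode] = self.fn_list.get(node_ids[0])
        for node_id in node_ids[1:]:
            if node is None:
                return 0
            node = node.children.get(node_id)
        return node.count if node is not None else 0


forest = FQTForest()
forest.insert(["A", "D", "E", "C"])
forest.insert(["A", "D", "E", "C"])
forest.insert(["C"])
print(forest.count_of(["A", "D", "E", "C"]), forest.count_of(["C"]))
```

Sharing prefixes between paths is what keeps the summary compact: sequences that start with the same nodes reuse the same tree nodes and differ only where they diverge.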
Phase 4 of FQT-Stream: Infrequent Information Pruning
• To guarantee the limited space requirement, infrequent information is pruned
• Pruning steps
  • Check each node Xi in the FN-list of the FQT-forest
  • If sup(Xi) < sN, delete Xi and its NQTS-tree
  • Check the other NQTS-trees and prune these infrequent nodes from them
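The pruning steps can be sketched in Python over a simple nested-dictionary forest. Everything about the layout is an assumption made for illustration: `supports` maps node-ids to their supports, `trees` maps each head node-id to a tree of the form `{"count": ..., "children": {...}}`, and removing an infrequent node from another tree is taken to remove the whole sub-path below it.

```python
from typing import Any, Dict, Set

Tree = Dict[str, Any]  # {"count": int, "children": {node_id: Tree}}


def prune(supports: Dict[str, int], trees: Dict[str, Tree], s: float, n: int) -> Set[str]:
    """Delete nodes with support below s*n, their trees, and their occurrences elsewhere."""
    threshold = s * n
    infrequent = {node_id for node_id, sup in supports.items() if sup < threshold}

    # Steps 1-2: drop each infrequent node from the FN-list along with its NQTS-tree.
    for node_id in infrequent:
        del supports[node_id]
        trees.pop(node_id, None)

    # Step 3: remove infrequent nodes (and the paths below them) from the other trees.
    def strip_infrequent(children: Dict[str, Tree]) -> None:
        for node_id in list(children):
            if node_id in infrequent:
                del children[node_id]
            else:
                strip_infrequent(children[node_id]["children"])

    for tree in trees.values():
        strip_infrequent(tree["children"])
    return infrequent


supports = {"A": 5, "B": 1, "C": 4}
trees = {
    "A": {"count": 0, "children": {
        "B": {"count": 1, "children": {"C": {"count": 1, "children": {}}}},
        "C": {"count": 3, "children": {}},
    }},
    "B": {"count": 1, "children": {}},
    "C": {"count": 4, "children": {}},
}
removed = prune(supports, trees, s=0.5, n=8)
print(removed, sorted(trees), sorted(trees["A"]["children"]))
```

With s = 0.5 and N = 8 the threshold is 4, so only B is removed: its entry and tree disappear, and the path through B under A is dropped as well.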
Phase 5 of FQT-Stream: Frequent Query Tree Mining
• Assume that there are k frequent nodes, <X1, X2, …, Xk>, in the FN-list
• FQT-Stream traverses each Xi.NQTS-tree (i = 1, 2, …, k) in a DFS manner to find the sequences with prefix Xi whose estimated support is greater than or equal to sN
• These frequent query trees are stored in a temporary list, called FQT-List
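A minimal Python sketch of this traversal follows. It assumes a nested-dictionary NQTS-tree of the form `{"count": ..., "children": {...}}` and treats the count stored at the end of a path as that sequence's estimated support; both are illustrative assumptions, as is the function name `mine_fqts`.

```python
from typing import Any, Dict, List, Tuple

Tree = Dict[str, Any]  # {"count": int, "children": {node_id: Tree}}


def mine_fqts(trees: Dict[str, Tree], s: float, n: int) -> List[Tuple[Tuple[str, ...], int]]:
    """DFS over every NQTS-tree, collecting paths whose count is at least s*n."""
    threshold = s * n
    fqt_list: List[Tuple[Tuple[str, ...], int]] = []

    def dfs(node: Tree, path: List[str]) -> None:
        if node["count"] >= threshold:
            fqt_list.append((tuple(path), node["count"]))
        # Children are always visited: this sketch does not assume that
        # counts shrink monotonically along a path.
        for node_id, child in node["children"].items():
            dfs(child, path + [node_id])

    for head, tree in trees.items():
        dfs(tree, [head])
    return fqt_list


trees = {
    "A": {"count": 6, "children": {
        "B": {"count": 5, "children": {"C": {"count": 2, "children": {}}}},
    }},
    "C": {"count": 3, "children": {}},
}
fqt_list = mine_fqts(trees, s=0.5, n=8)
print(fqt_list)
```

With s = 0.5 and N = 8 the threshold is 4, so the sketch reports the paths <A> and <A, B> and skips <A, B, C> and <C>.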
Conclusions and Future Work
• We propose an efficient one-pass algorithm, FQT-Stream (Frequent Query Trees of Streams)
  • To find the set of all frequent query trees over the entire history of online XML data streams
• Future work
  • Online mining of frequent query trees over sliding windows