Laboratory Research and Results Briefing (實驗室研究暨成果說明會): Content and Knowledge Management Laboratory (B), Data Mining Part. Director: Anthony J. T. Lee. Presenter: Wan-chuen Lin.
Outline
• Introduction of basic data mining concepts related to our research topics
• Brief description of doctoral research
• Topic 1: Mining frequent itemsets with multi-dimensional constraints
• Topic 2: Mining the inter-transactional association rules of multi-dimensional interval patterns
• Topic 3: Inter-sequence association rules mining
• Topic 4: Mining association rules among time-series data
Introduction of Data Mining
• Data mining is the task of discovering knowledge from large amounts of data.
• Frequent itemset mining, one of the fundamental data mining problems, underlies a broad spectrum of mining topics, including association rules and sequential patterns.
• Frequent itemset mining discovers all itemsets whose support in the database exceeds a user-specified threshold.
Introduction of Association Rules
• An association rule has the form $X \Rightarrow Y$, where X and Y are both frequent itemsets in the given database and $X \cap Y = \emptyset$.
• The support of $X \Rightarrow Y$ is the percentage of transactions in the given database that contain both X and Y, i.e., $P(X \cup Y)$.
• The confidence of $X \Rightarrow Y$ is the percentage of transactions in the given database containing X that also contain Y, i.e., $P(Y \mid X)$.
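The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names and the toy transactions are assumptions made for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical toy database of four transactions.
toy_transactions = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support(toy_transactions, {"diapers", "beer"})
rule_confidence = confidence(toy_transactions, {"diapers"}, {"beer"})
```

Here two of the four transactions contain both items, and two of the three transactions containing diapers also contain beer.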
Introduction of Sequential Patterns
• A sequence is an ordered list of itemsets, denoted by $\langle s_1 s_2 \ldots s_l \rangle$, where $s_j$ is an itemset.
• $s_j$ is also called an element of the sequence and is denoted as $(x_1 x_2 \ldots x_m)$, where $x_k$ is an item.
• The support of a sequence $\alpha$ in a sequence database is the number of tuples containing $\alpha$.
• A sequence $\alpha$ is called a sequential pattern if $\mathrm{support}(\alpha) \geq$ min-support.
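As a rough illustration of these definitions, the sketch below tests whether a tuple contains a sequence and counts support. The representation (a sequence as a list of Python sets) and the names `contains` and `seq_support` are assumptions for the example only.

```python
def contains(sequence, pattern):
    """True if each element of `pattern` is a subset of a distinct element
    of `sequence`, with the element order preserved."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first remaining element that covers it.
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def seq_support(database, pattern):
    """Number of tuples (sequences) in the database that contain `pattern`."""
    return sum(1 for s in database if contains(s, pattern))


# Toy database: <c(ab)d(ad)>, <(bc)cb>, <e(ac)bac>, <b(ab)cc>
toy_sequences = [
    [{"c"}, {"a", "b"}, {"d"}, {"a", "d"}],
    [{"b", "c"}, {"c"}, {"b"}],
    [{"e"}, {"a", "c"}, {"b"}, {"a"}, {"c"}],
    [{"b"}, {"a", "b"}, {"c"}, {"c"}],
]

support_ab_element = seq_support(toy_sequences, [{"a", "b"}])  # pattern <(ab)>
support_b_then_c = seq_support(toy_sequences, [{"b"}, {"c"}])  # pattern <bc>
```

With a min-support of 3, only <bc> would qualify as a sequential pattern in this toy database.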
Algorithm for Mining Frequent Itemsets
• Apriori
  • Candidate set generate-and-test
  • Level-wise: it iteratively generates candidate k-itemsets from the previously found frequent (k-1)-itemsets, and then checks the supports of the candidates to form the frequent k-itemsets.
  • $L_{k-1} \rightarrow$ (join) $\rightarrow C_k \rightarrow$ (support check) $\rightarrow L_k$
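The level-wise loop can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold is an absolute count compared with `>=`, the helper names are invented for the example, and no attempt is made at the hashing or counting optimizations a real implementation would use.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    # L1: frequent 1-itemsets.
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        # Support check: keep the candidates that are frequent.
        level = {c for c in candidates if count(c) >= min_count}
    return frequent


# Hypothetical toy database.
apriori_transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
apriori_result = apriori(apriori_transactions, min_count=3)
```

In this toy database every single item and every pair is frequent, while {a, b, c} occurs only twice and is dropped at the support check.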
Algorithm for Mining Frequent Itemsets (cont’d)
• FP-growth
  • The method constructs a compressed frequent-pattern tree, called an FP-tree.
  • It uses a divide-and-conquer strategy to recursively decompose the mining task into a set of smaller tasks in conditional databases, and concatenates the suffix itemset with the frequent itemsets generated from each conditional FP-tree.
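The divide-and-conquer idea can be illustrated without the tree itself. The sketch below is not FP-growth proper: it deliberately omits the FP-tree compression and works on plain lists as conditional databases, so it only shows how the suffix grows while the task shrinks. The function name and the toy data are assumptions.

```python
from collections import Counter


def pattern_growth(transactions, min_count, suffix=()):
    """Return {frequent itemset: support count} by recursive projection.

    Each recursive call mines the conditional database of `suffix`:
    the transactions that contain every item of the suffix.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = sorted(i for i, c in counts.items() if c >= min_count)
    results = {}
    for idx, item in enumerate(frequent):
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = counts[item]
        # Conditional database of `item`: transactions containing it,
        # restricted to frequent items ordered before it, so that each
        # itemset is generated exactly once.
        allowed = set(frequent[:idx])
        conditional = [
            [i for i in t if i in allowed] for t in transactions if item in t
        ]
        results.update(pattern_growth(conditional, min_count, new_suffix))
    return results


# Hypothetical toy database.
growth_transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
growth_result = pattern_growth(growth_transactions, min_count=3)
```

No candidate itemsets are generated here: an itemset is reported only when its last item is found frequent inside the conditional database of its suffix.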
Algorithm for Mining Sequential Patterns: PrefixSpan
• It first finds the length-1 sequential patterns in the target database, and then partitions the database into smaller projected databases, one for the prefix given by each sequential pattern found.
• The sequential patterns are mined by constructing the corresponding projected databases and mining each one recursively.
• It preserves the element order of each tuple during the mining process.
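A minimal sketch of prefix projection follows. To stay short it assumes every element of a sequence is a single item, so a sequence can be written as a string; handling multi-item elements such as (ab) needs extra bookkeeping that is left out. The function name and the toy data are assumptions.

```python
def prefixspan(database, min_count, prefix=()):
    """Return {pattern tuple: support count} for single-item sequences."""
    # Count each item once per sequence in the (projected) database.
    counts = {}
    for seq in database:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1

    results = {}
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        pattern = prefix + (item,)
        results[pattern] = count
        # Projected database: the part of each sequence that follows the
        # first occurrence of `item`, which keeps the element order intact.
        projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
        results.update(prefixspan(projected, min_count, pattern))
    return results


# Hypothetical toy sequence database.
span_database = ["abcb", "acb", "bca", "ab"]
span_result = prefixspan(span_database, min_count=3)
```

For this toy input the patterns found are <a>, <b>, <c> and <ab>; the projected database of <ab> is too small to yield anything longer.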
Brief Description of Doctoral Research
• Mining calling path patterns in GSM networks
  • Two problems of mining calling path patterns
    • Mining PMFCPs
    • Mining periodic PMFCPs
  • Graph structures [(periodic) frequent calling path graph] and graph-based mining algorithms
    • Based on depth-first search
    • No candidate paths are generated, and the database is scanned only once if the whole graph structure fits in main memory.
Brief Description of Doctoral Research (cont’d)
• Bioinformatics data mining
  • Gene clustering
  • Sequence comparison, alignment, and compression
    • DNA sequences
    • Protein sequences
  • Applications
    • Phylogenetic trees to predict the function of a new protein
    • Relationships between DNA sequences and disease
Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints
• Frequent itemset mining often generates a very large number of frequent itemsets.
• Only a subset of the frequent itemsets and association rules is of interest to users.
• Users need additional post-processing to find the useful ones.
• Constraint-based mining pushes user-specified constraints deep inside the mining process to improve performance.
• With multi-dimensional items, constraints can be imposed on multiple dimensional attributes.
Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints: Multi-dimensional Constraints
• Each item has an itemID and m attributes (dimensions) $a_1, a_2, \ldots, a_m$.
• $i_k = (k_1, k_2, \ldots, k_m)$
• $A = i_A = (A_1, A_2, \ldots, A_m)$, where $A_1 = A.a_1$
Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints
• Multi-dimensional constraints can be categorized according to their constraint properties:
  • anti-monotone, monotone, convertible, and inconvertible
• They can also be classified according to the number of sub-constraints included:
  • A single constraint against multiple dimensions, e.g., $\max(S.\mathrm{cost}) \leq \min(S.\mathrm{price})$
  • A conjunction and/or disjunction of multiple sub-constraints, e.g., ($C_1$: S.cost compared with $v_1$) combined with ($C_2$: S.price compared with $v_2$)
Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints
• We extend constraints to multi-dimensional itemsets and develop algorithms for mining frequent itemsets with multi-dimensional constraints by extending CFG (Constrained Frequent Pattern Growth).
• Overview of our algorithm
  • Phase 1: Frequency check
  • Phase 2: Constraint check
  • Phase 3: Conditional database construction
Example: $C_{am} \equiv \max(S.\mathrm{cost}) \leq \min(S.\mathrm{price})$
[Figure: constraint checks during mining. Frequent items: B, D, E, C, A; then B, D, E, C. C(B)=true, C(D)=true, C(E)=true, C(C)=true, C(A)=true; C(BA)=false, C(DA)=true, C(EA)=true, C(CA)=false; C(BDECA)=false.]
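The constraint check in this example can be sketched as follows, taking the comparison to be ≤, which is what makes the constraint anti-monotone: adding an item can only raise the maximum cost and lower the minimum price. The cost and price values are hypothetical, chosen only so that the checks come out as in the example, and the enumeration below is a plain depth-first search, not the CFG-based algorithm of the presentation (the frequency check is left out).

```python
# Hypothetical (cost, price) per item; not taken from the presentation.
ITEM_ATTRS = {"A": (5, 8), "B": (9, 10), "C": (2, 4), "D": (3, 6), "E": (4, 7)}


def satisfies(itemset):
    """Anti-monotone constraint: max(S.cost) <= min(S.price)."""
    costs = [ITEM_ATTRS[i][0] for i in itemset]
    prices = [ITEM_ATTRS[i][1] for i in itemset]
    return max(costs) <= min(prices)


def valid_itemsets(items):
    """Enumerate all itemsets that satisfy the constraint.

    An itemset is extended only if it satisfies the constraint; because the
    constraint is anti-monotone, no superset of a violating itemset can be
    valid, so whole branches are skipped safely.
    """
    items = sorted(items)
    found = []

    def grow(current, start):
        for idx in range(start, len(items)):
            candidate = current + [items[idx]]
            if satisfies(candidate):
                found.append(frozenset(candidate))
                grow(candidate, idx + 1)

    grow([], 0)
    return found


valid = valid_itemsets(ITEM_ATTRS)
```

With these values only 12 of the 31 non-empty itemsets survive, and the violating pairs BA and CA cut off every larger itemset that contains them.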
Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns
• A transaction could be the items bought by the same customer, the events that happened on the same day, and so on.
• Intra-transactional association rules: associations among items within the same transaction.
  • Ex: buy (X, diapers) => buy (X, beer) [support=80%]
• Inter-transactional association rules: associations among items in different transactions.
  • Ex: If the prices of IBM and SUN go up, Microsoft’s will most likely (80%) increase the next day.
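A next-day rule like the stock example can be evaluated on a toy series as sketched below. The data, the function name, and the choice to measure support over consecutive day pairs are assumptions for illustration; the presentation does not define how support is counted for such rules.

```python
def next_day_rule_stats(days, antecedent, consequent):
    """Support and confidence of 'antecedent today => consequent tomorrow'.

    `days` is a list of sets, one per day, holding the stocks that went up.
    """
    pairs = list(zip(days, days[1:]))  # (today, tomorrow) windows
    fired = [tomorrow for today, tomorrow in pairs if antecedent <= today]
    hits = sum(1 for tomorrow in fired if consequent <= tomorrow)
    support = hits / len(pairs)
    confidence = hits / len(fired) if fired else 0.0
    return support, confidence


# Hypothetical week of price rises.
rises = [
    {"IBM", "SUN"}, {"MSFT"}, {"IBM", "SUN", "MSFT"}, {"MSFT"},
    {"IBM"}, {"IBM", "SUN"}, {"SUN"},
]
rule_sup, rule_conf = next_day_rule_stats(rises, {"IBM", "SUN"}, {"MSFT"})
```

The antecedent holds on three days and Microsoft rises on the following day in two of those cases.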
Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns
• Interval data differ from point data in that they occupy regions of non-zero size.
• Multi-dimensional intervals can be represented as line segments (1-D), rectangles (2-D), hyper-cubes (n-D), etc.
• Extended item: denoted as (Location)<Size>
• Reference point: the smallest (Location) among all (Location)<Size>.
• Maxspan: a sliding window; only associations covered by it are considered.
Example
• There are two cubes in the 3-dimensional space: 0,2,1<1,1,1> and 1,1,0<2,2,1>.
• Reference point: (0,1,0)
• Relative to the reference point, the two items are denoted as 0,1,1<1,1,1> and 1,0,0<2,2,1>.
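The normalization in this example can be reproduced with a short sketch. It assumes that "smallest location" means the component-wise minimum over all locations, which is consistent with the numbers above; the function name and the tuple representation are illustrative.

```python
def normalize(extended_items):
    """Shift every location so that it is relative to the reference point.

    `extended_items` is a list of (location, size) pairs of equal dimension.
    Returns (reference_point, shifted_items).
    """
    dims = len(extended_items[0][0])
    reference = tuple(
        min(location[d] for location, _ in extended_items) for d in range(dims)
    )
    shifted = [
        (tuple(l - r for l, r in zip(location, reference)), size)
        for location, size in extended_items
    ]
    return reference, shifted


cubes = [((0, 2, 1), (1, 1, 1)), ((1, 1, 0), (2, 2, 1))]
reference_point, relative_items = normalize(cubes)
```

Sizes are left untouched; only the locations are re-expressed relative to the reference point.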
Algorithm (Apriori-like) Example
• Support: 10% (10% × 20 = 2)
• Maxspan: 4
• L1: 0,0<1,1>, 0,0<1,2>, 0,0<1,3>, 0,0<2,1>
Algorithm (Apriori-like) Example (cont’d)
• Reminder: the Apriori-like algorithm proceeds as $L_{k-1} \rightarrow$ (join) $\rightarrow C_k \rightarrow$ (support check) $\rightarrow L_k$
• L2: {0,0<1,1>, 1,1<2,1>}, {1,0<1,1>, 0,1<1,2>}, {0,0<1,2>, 2,0<2,1>}, {0,0<1,3>, 3,0<1,2>}
• L3: {3,0<1,1>, 2,1<1,2>, 0,3<1,3>}, {1,0<1,1>, 0,1<1,2>, 2,1<2,1>}, {3,0<1,1>, 0,3<1,3>, 4,1<2,1>}, {2,0<1,2>, 0,2<1,3>, 4,0<2,1>}
• L4: {0,3<1,3>, 4,1<2,1>, 2,1<1,2>, 3,0<1,1>}
Topic 3: Inter-sequence Association Rules Mining
• Inter-sequence model

Transaction ID    Transaction Time    Sequence
1                 1                   <c(ab)d(ad)>
2                 2                   <(bc)cb>
3                 3                   <e(ac)bac>
4                 4                   <b(ab)cc>
5                 5                   < >
6                 6                   <dd(ac)bd>
7                 7                   <bc>
8                 8                   <acc>
9                 9                   <ab>
10                10                  <ceacc(ce)>
Topic 3: Inter-sequence Association Rules Mining (cont’d)
• Extended sequence (denoted as $\Delta t\langle s_1 s_2 \ldots s_l \rangle$): a sequence $s = \langle s_1 s_2 \ldots s_l \rangle$ at time point $\Delta t$.
• Algorithm:
  • Step 1: Use PrefixSpan to find all sequential patterns.
  • Step 2: Use an Apriori-like method to check whether each set of extended sequences is large (frequent).
  • Use an L-bucket (list-bucket) and a C-bucket (candidate-bucket) to improve mining efficiency.
Example
• The database: min_support = 3, maxspan = 2
• PrefixSpan: sequential patterns found
  • <a>, <b>, <c>
  • <ab>, <(ab)>, <ac>, <ba>, <bc>, <cb>, <cc>
  • <acc>
Example (cont’d)
• Apriori-like: $L_{k-1} \rightarrow C_k \rightarrow L_k$
Topic 4: Mining Association Rules among Time-series Data
• A line is an ordered and continuous list of the form {t1, t2, …, tm} describing the property of the subject over time.
• Step 1: Find the frequent lines and points in each line-set (Apriori-like algorithm).
• Step 2: Use combinations of those frequent sets to find the associations among them (inter-transaction association rules).
Time-series Data Approximation
• Approximation is used for the algorithm’s efficiency.
• The fluctuation rate is partitioned equally into several classes.
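One way to picture this approximation is the equal-width binning sketched below. The details are assumptions, since the slide does not give them: the fluctuation rate is taken to be the relative change between consecutive values, the range bounds and number of classes are chosen by the caller, and rates outside the range are clamped to the end classes.

```python
import math


def fluctuation_rates(series):
    """Relative change between consecutive values (assumed definition)."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


def classify(rate, low, high, n_classes):
    """Map a rate to one of `n_classes` equal-width classes over [low, high]."""
    width = (high - low) / n_classes
    index = math.floor((rate - low) / width)
    # Clamp rates that fall outside the chosen range.
    return min(max(index, 0), n_classes - 1)


# Hypothetical price series and three classes: 0 = down, 1 = flat, 2 = up.
prices = [100, 110, 99, 99]
rates = fluctuation_rates(prices)
classes = [classify(r, low=-0.15, high=0.15, n_classes=3) for r in rates]
```

After this step each time series becomes a short sequence of class labels, which is the kind of discrete input the Apriori-like line discovery can work on.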
[Figure: Step 1: Line Discovery (Apriori-like), followed by Step 2: Association Rule Mining]