Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-Champaign June 6, 2014
Frequent Pattern Mining ([Agrawal & Srikant 94] and many others) • Itemsets: diaper, milk; camera, film; … • Sequential Patterns: “… Mining Closed Frequent Graph Patterns …”, “… Mining Graph and Structured Patterns in ... …” • Subgraph Patterns • Database → Frequent Patterns: D, E, F, C, A, B, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF
Toward Understanding the Patterns: Find Canonical Patterns • Database → Frequent Patterns: D, E, F, C, A, B, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF • (Yan et al. ’05), (Xin et al. ’05)
Toward Understanding the Patterns: How to Interpret Patterns? • Do they all make sense? • What do they mean? • How are they useful? • Example patterns: “diaper beer”; “female sterile (2) tekele” • Morphological info. and simple statistics vs. semantic information • Not all frequent patterns are useful, only those with meanings… • Our goal: annotate patterns with semantic information
Challenges • How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?) • How can we infer pattern semantics? (How to annotate?) • How can we do it in a general way? (Do it for all kinds of patterns) • Once such annotations are generated, what can we use them for? (Applications)
A Dictionary Analogy • Word: “pattern” (from Merriam-Webster) • Non-semantic info. • Definitions indicating semantics • Examples of usage • Synonyms • Related words
What about a “Pattern Dictionary”? Semantic Pattern Annotation (SPA) • Dictionary entry: Word: “pattern” • Non-semantic: function; pronunciation; date; etc. • Definitions: A form or model proposed for … • Examples: a dressmaker’s pattern; a pattern of dissent • Synonyms: design, device, motif, motive, … • Related words: original, constellation, … • SPA entry: Pattern: “latent semantic analysis” • Non-semantic: sequential; closed; sup = 0.1% • Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, … • Representative Transactions: “index by latent semantic analysis”; “probabilistic latent semantic analysis” • Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”
How Can We Generate Such an Entry? • Database → Frequent Patterns (P1: AB, P2: CD, P3: …, Pn) → ? → Semantic Annotations • How to infer the semantics of a frequent pattern?
Continue the Analogy… • “You shall know a word by the company it keeps.” (Firth 1957) • Word in context: “data … association … pattern … MINE … algorithm …” vs. “mountain … Africa … diamond … MINE … weight …” • Pattern in context: {A,B}: { … Baby, Milk, Diaper, Toy, Soymilk … }; {C,D}: { … Printer, Film, Camera, Lens, … } • You’ll know the meaning of a pattern by its context
Our Approach: Model the Context • Context units = objects co-occurring with pattern p • Database → Frequent Patterns → Context Units → Semantic Annotations • P1: AB, context units <E, F, …, EF, …, ABE> • P2: CD, context units <E, F, …, EF, …, CDEF> • … Pn
Semantic Analysis with Context Models • Task1: Model the context of a frequent pattern • Based on the context model: • Task2: Extract the strongest context indicators • Task3: Extract representative transactions • Task4: Extract semantically similar patterns
Task1: Context Modeling - A Vector Space Model • Database → Frequent Patterns → Context Units → Semantic Annotations • P1: AB, context units <E, F, …, EF, …, ABE>, context vector <2.0, 2.0, …, 1.0, …, 1.0> • P2: CD, context units <E, F, …, EF, …, CDEF>, context vector <2.0, 2.0, …, 1.0, …, 1.0> • … Pn • Context unit weight: co-occurrence, mutual information, … • Context similarity: cosine similarity, Pearson coefficient, …
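The vector space model on this slide can be illustrated with a minimal sketch. It assumes patterns and context units are plain itemsets, uses raw co-occurrence counts as weights, and runs on a made-up five-transaction database; the names `context_vector` and `cosine` are illustrative and not taken from the talk.

```python
from math import sqrt

def context_vector(pattern, transactions, context_units):
    """Co-occurrence weight per context unit: the number of transactions
    that contain both the pattern and the unit (itemset containment)."""
    pattern = frozenset(pattern)
    return [
        float(sum(1 for t in transactions if pattern <= t and frozenset(u) <= t))
        for u in context_units
    ]

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy database, for illustration only.
transactions = [frozenset(t) for t in (
    {"A", "B", "E"}, {"A", "B", "E", "F"}, {"C", "D", "E"},
    {"C", "D", "E", "F"}, {"A", "B", "F"},
)]
context_units = [("E",), ("F",), ("E", "F")]

vec_ab = context_vector(("A", "B"), transactions, context_units)  # [2.0, 2.0, 1.0]
vec_cd = context_vector(("C", "D"), transactions, context_units)  # [2.0, 1.0, 1.0]
similarity = cosine(vec_ab, vec_cd)
```

Sequential or graph patterns would need a different containment test than the subset check used here.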
Context Unit Selection • Transactions: t1 = diaper, milk, babywear, lotion; t2 = camera, memory stick, printer • Valid context units: single items (e.g. diaper, milk, printer, camera, …), itemsets (e.g. milk, lotion; …), transactions (t1, t2) • In general, context units are frequent patterns
Context Unit Selection: Redundancy Removal • Problem: too many valid context units, most are redundant • { Diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, … • Solution: • use closed patterns • micro-clustering (hierarchical, one-pass) • distance measure: Jaccard distance (γ: threshold to stop clustering)
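The distance formula did not survive in this transcript. A common way to define the Jaccard distance between two patterns is one minus the ratio of the transactions supporting both to the transactions supporting either. The sketch below assumes that definition and pairs it with a simple greedy one-pass clustering that compares each pattern to a cluster representative. The function names, the representative-based rule and the transaction ids are assumptions for illustration, not necessarily the authors' exact procedure.

```python
def jaccard_distance(tids_a, tids_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of supporting transaction ids."""
    union = tids_a | tids_b
    if not union:
        return 0.0
    return 1.0 - len(tids_a & tids_b) / len(union)

def one_pass_microclusters(support_sets, gamma):
    """Greedy one-pass clustering: a pattern joins the first cluster whose
    representative is within distance gamma, otherwise it opens a new cluster."""
    clusters = []  # list of (representative, members)
    for pattern, tids in support_sets.items():
        for rep, members in clusters:
            if jaccard_distance(tids, support_sets[rep]) <= gamma:
                members.append(pattern)
                break
        else:
            clusters.append((pattern, [pattern]))
    return clusters

# Hypothetical supporting-transaction ids per candidate context unit.
support_sets = {
    "diaper": {1, 2, 3, 4},
    "diaper, milk": {1, 2, 3},
    "milk, babywear": {1, 2},
    "camera": {5, 6},
}
clusters = one_pass_microclusters(support_sets, gamma=0.3)
```

With these toy supports, “diaper, milk” merges into the “diaper” cluster (distance 0.25), while the other two units stay separate. One representative per cluster can then serve as the context unit.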
Task2: Extract Context Indicators • Database → Frequent Patterns → Context Units → Semantic Annotations • Frequent patterns: <A, B, AB, C, D, CD, E, F, EF, AE, BF, …, ABE, ABF, …, ABEF> • Context units: <AB, CD, …, EF, …, ABE, …> • Context unit weighting for P1: AB: <3.0, 0, …, 2.0, …, 1.0, …> • Ranked context indicators: AB 3.0, EF 2.0, ABE 1.0, … • P2: CD … Pn
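Mutual information is one of the weighting options named earlier in the talk. The sketch below shows how such a weight could be computed from support counts and used to pick the top-k context indicators. It treats the pattern and the unit as binary occurrence variables, skips zero-count cells instead of smoothing, and uses made-up counts; all function names are hypothetical.

```python
from math import log2

def mutual_information(n, n_p, n_u, n_pu):
    """Mutual information (in bits) between 'pattern occurs' and 'unit occurs'.
    n: total transactions, n_p / n_u: supports, n_pu: co-occurrence count."""
    joint = {
        (1, 1): n_pu,
        (1, 0): n_p - n_pu,
        (0, 1): n_u - n_pu,
        (0, 0): n - n_p - n_u + n_pu,
    }
    marg_p = {1: n_p, 0: n - n_p}
    marg_u = {1: n_u, 0: n - n_u}
    mi = 0.0
    for (x, y), count in joint.items():
        if count > 0:  # 0 * log(0) is taken as 0
            mi += (count / n) * log2(count * n / (marg_p[x] * marg_u[y]))
    return mi

def top_indicators(weights, k):
    """Return the k context units with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical counts: 8 transactions, the pattern occurs in 4 of them.
n, n_p = 8, 4
unit_counts = {"EF": (4, 4), "G": (4, 2), "H": (2, 2)}  # unit -> (n_u, n_pu)
weights = {u: mutual_information(n, n_p, n_u, n_pu)
           for u, (n_u, n_pu) in unit_counts.items()}
indicators = top_indicators(weights, k=2)
```

Mutual information is also high for strongly negative association, so a practical ranking may additionally require that the unit co-occurs with the pattern more often than chance.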
Task3: Extract Representative Transactions • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0> • Transactions in the same space: T1: <1.0, 0, …, 1.0, …, 1.0>; T5: … • Semantic similarity to P1, ranked: T5 0.8, T1 0.6, T3 0.6, …
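One way to read this slide is that each transaction is mapped into the same context-unit space as the pattern and then ranked by similarity to the pattern's context vector. The sketch below assumes binary transaction vectors and cosine similarity. The data and the names `transaction_vector` and `representative_transactions` are illustrative only.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def transaction_vector(transaction, context_units):
    """1.0 for every context unit (itemset) fully contained in the transaction."""
    return [1.0 if set(unit) <= transaction else 0.0 for unit in context_units]

def representative_transactions(pattern_vec, transactions, context_units, k):
    """Rank transactions by cosine similarity to the pattern's context vector."""
    scored = [
        (tid, cosine(pattern_vec, transaction_vector(t, context_units)))
        for tid, t in transactions.items()
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical context units, pattern vector and transactions.
context_units = [("A", "B"), ("C", "D"), ("E", "F")]
pattern_vec = [3.0, 0.0, 2.0]
transactions = {
    "T1": {"A", "B", "E", "F"},
    "T2": {"C", "D"},
    "T3": {"A", "B", "C", "D"},
}
top = representative_transactions(pattern_vec, transactions, context_units, k=2)
```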
Task4: Extract Semantically Similar Patterns • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0> • P2: CD: <0, 3.0, …, 2.0, …, 0.5> • Pk: EF … • Semantic similarity to AB, ranked: CD 0.7, BF 0.5, EF 0.3, …
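Ranking other patterns against a query pattern works the same way on context vectors. The sketch below uses the Pearson coefficient, which the talk lists as an alternative to cosine similarity. The vectors are invented for illustration and do not reproduce the scores on the slide; the function names are hypothetical.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    dev_u = [a - mean_u for a in u]
    dev_v = [b - mean_v for b in v]
    denom = sqrt(sum(a * a for a in dev_u)) * sqrt(sum(b * b for b in dev_v))
    return sum(a * b for a, b in zip(dev_u, dev_v)) / denom if denom else 0.0

def similar_patterns(query, context_vectors, k):
    """Rank all other patterns by Pearson correlation with the query's context vector."""
    scored = [
        (name, pearson(context_vectors[query], vec))
        for name, vec in context_vectors.items() if name != query
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical context vectors over four context units.
context_vectors = {
    "p1": [3.0, 0.0, 2.0, 1.0],
    "p2": [0.0, 3.0, 2.0, 0.5],
    "p3": [2.0, 0.0, 3.0, 1.0],
}
ranked = similar_patterns("p1", context_vectors, k=2)
```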
Experiments • Three different real-world applications: • Annotating DBLP title/author patterns • Motif/Gene Ontology (GO) matching • Gene synonym extraction • Study the effectiveness of the proposed SPA methods • Explore applications of SPA to different real-world tasks
Annotating DBLP Co-authorship and Title Pattern • Database (Authors | Title): X. Yan, P. Yu, J. Han | Substructure Similarity Search in Graph Databases; … • Frequent patterns: P1: { x_yan, j_han } (frequent itemset); P2: “substructure search” (frequent sequential pattern) • Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … > • Semantic Annotations
DBLP Results: Frequent Itemset Pattern= {xifeng_yan, jiawei_han} Annotations:
DBLP Results: Freq. Seq. Pattern Pattern= “Information … retrieval” Annotations:
Motif-GO Matching • Motif: a subsequence pattern in the sequences • Gene Ontology (GO) terms: annotate the functionality of sequences and motifs • Example: Sequence 1 contains motif1, motif2; Sequence 2 contains motif2, motif3; Sequence 3 contains motif2, motif4, motif5; the sequences are labeled with GO terms 1 to 5 • Which GO terms match motif2?
Motif-GO Matching (Cont.) • Database (Protein Sequence | GO terms): a sequence containing Motif 1 | GOTerm1; GOTerm2; GOTerm3; … | GOTerm3; … • Frequent patterns: P1: Motif1 (sequential pattern); P2: GOTerm2 (single-item pattern) • Context units: < Motif1, Motif3, …, GOTerm1, GOTerm2, … > • Semantic Annotations → Motif-GO matching
Motif/GO Matching: Evaluation • Gold standard generated by human experts • Measure: mean reciprocal rank (MRR) • Average of 1/rank of the correct answer (a reciprocal rank of 0.5 means the correct answer is ranked 2nd) • Reflects ranking accuracy (the higher the better) • Results: compared by weights for context units and by ranking strategy
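The MRR measure can be computed as in the following minimal sketch. It assumes each query (for example a motif) has a ranked candidate list and a set of correct answers, and that a query with no correct answer in its list contributes 0. The identifiers are made up.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average over queries of 1/rank of the first correct candidate
    (0 for a query whose ranked list contains no correct candidate)."""
    total = 0.0
    for query, ranked in rankings.items():
        correct = gold[query]
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical rankings and gold standard.
rankings = {
    "motif1": ["go_a", "go_b", "go_c"],  # correct answer at rank 2 -> 0.5
    "motif2": ["go_c", "go_a"],          # correct answer at rank 1 -> 1.0
    "motif3": ["go_b"],                  # correct answer missing   -> 0.0
}
gold = {"motif1": {"go_b"}, "motif2": {"go_c"}, "motif3": {"go_x"}}
mrr = mean_reciprocal_rank(rankings, gold)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```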
Gene Synonym Extraction • Gene Synonyms: • A Sequential Pattern in the textual database • Matching gene synonyms: a challenging and important new problem in mining biology data • Analogy: thesaurus or synonyms in dictionary
Gene Synonym Extraction (Cont.) • Database (biomedical sentences): “… D. melanogaster gene Female sterile (2) Tekele …”; “… Female sterile (2) Tekele, abbreviated as Fs(2)Tek …”; … • Frequent patterns: P1: female sterile (2) tekele (sequential pattern); P2: Fs(2)Tek (sequential pattern) • Context units: < gene, female, …, d. melanogaster gene, … > (context units can be single words or sequential patterns) • Semantic Annotations → Matched synonyms
Gene Synonym Extraction: Results • Plots: MRR (hierarchical, one-pass); running time (hierarchical, one-pass) • Effective: MRR > 0.5 • Frequent patterns >> single words • Micro-clustering is useful
Conclusions • A novel problem: semantic pattern annotation • A structured annotation for frequent patterns • A general method based on context modeling • A general post-processing procedure for frequent pattern mining, applicable to any type of pattern • Applicable to and effective for quite different tasks • Future work: • Tune for specific tasks • Better context unit weights, redundancy removal, etc.