Text Mining
Huaizhong KOU, PhD student of Georges Gardarin, PRiSM Laboratory
0. Contents
1. Introduction: What happens; What is Text Mining; Text Mining vs Data Mining; Applications
2. Feature Extraction: Task; Indexing; Weighting Model; Dimensionality Reduction
3. Document Categorization: Task; Architecture; Categorization Classifiers; Application: Trend Analysis
4. Document Clustering: Task; Algorithms; Application
5. Product
6. Reference
1. Text Mining: Introduction
1.1 What happens
1.2 What is Text Mining
1.3 Text Mining vs Data Mining
1.4 Applications
1.1 Introduction: What happens (1)
• Information explosion on the Internet
• About 80% of information is stored in text documents: journals, web pages, emails...
• It is difficult to extract specific information with current technologies.
1.1 Introduction: What happens (2)
• It is therefore necessary to automatically analyze, organize, and summarize textual data.
• Text Mining turns raw text (e.g. "The value of the shares of XML companies will rise") into knowledge.
1.2 Introduction: What is Text Mining (1)
• Text Mining ::= the process of synthesizing information by analyzing the relations, patterns, and rules among textual data, i.e. semi-structured or unstructured text.
• Techniques ::= data mining, machine learning, information retrieval, statistics, natural-language understanding, case-based reasoning.
Ref: [1]-[4]
1.2 Introduction: What is Text Mining (2)
[Diagram: in the learning phase, sample documents are transformed into representation models using domain-specific templates/models and knowledge; in the working phase, the learned models are applied to new text documents and the results are presented as visualizations.]
1.3 Introduction: TM vs DM
• Data object: DM: numerical & categorical data / TM: textual data
• Data structure: DM: structured / TM: unstructured & semi-structured
• Data representation: DM: straightforward / TM: complex
• Space dimension: DM: < tens of thousands / TM: > tens of thousands
• Methods: DM: data analysis, machine learning, statistics, neural networks / TM: data mining, information retrieval, NLP, ...
• Maturity: DM: broad implementation since 1994 / TM: broad implementation starting 2000
• Market size: DM: ~10^5 analysts at large and mid-size companies / TM: ~10^8 analysts, corporate workers and individual users
Ref: [12][13]
1.4 Introduction: Applications
• The potential applications are countless:
• Customer profile analysis
• Trend analysis
• Information filtering and routing
• Event tracking
• News story classification
• Web search
• ...
Ref: [12][13]
2. Feature Extraction
2.1 Task
2.2 Indexing
2.3 Weighting Model
2.4 Dimensionality Reduction
Ref: [7][11][14][18][20][22]
2.1 Feature Extraction: Task (1)
• Task: extract a good subset of words/phrases to represent the documents.
[Diagram: a document collection passes through feature extraction, which reduces all unique words/phrases to the good words/phrases.]
Ref: [7][11][14][18][20][22]
2.1 Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
[Example: the 16-word sentence above is reduced by feature extraction to 5 index terms (text, information, online, retrieval, index) with frequencies 2, 1, 1, 1, 1.]
Ref: [7][11][14][18][20][22]
2.2 Feature Extraction: Indexing (1)
Training documents are processed in the following steps:
• Identification of all unique words
• Removal of stop words: non-informative words, e.g. {the, and, when, more}
• Word stemming: removal of suffixes to generate word stems, grouping words and increasing the relevance, e.g. {walker, walking} -> walk
• Term weighting: the naive terms are weighted by their importance in the document
Ref: [7][11][14][18][20][22]
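A minimal sketch of the indexing steps above (unique-word identification, stop-word removal, stemming, naive term counting). The stop-word list is an illustrative subset, and the suffix stripper is a toy stand-in for a real stemmer such as Porter's:

import re
from collections import Counter

STOP_WORDS = {"the", "and", "when", "more", "was", "is", "of", "a", "to", "in"}  # illustrative subset

def tokenize(text):
    """Identify all words (lower-cased alphabetic tokens)."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Toy stemmer: strip a few common suffixes (a real system would use Porter stemming)."""
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index(document):
    """Return naive term frequencies after stop-word removal and stemming."""
    terms = [stem(w) for w in tokenize(document) if w not in STOP_WORDS]
    return Counter(terms)

if __name__ == "__main__":
    doc = "The walker was walking when more walkers arrived."
    print(index(doc))  # term frequencies of the stemmed, stop-word-free terms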
2.2 Feature Extraction: Indexing (2)
• Document representation: vector space model
  d = (w1, w2, ..., wt) ∈ Rt
  where wi is the weight of the ith term in document d.
Ref: [7][11][14][18][20][22]
2.3 Feature Extraction: Weighting Model (1)
• tf - Term Frequency weighting
• wij = Freqij
• Freqij ::= the number of times the jth term occurs in document Di.
• Drawback: does not reflect how well a term discriminates between documents.
• Ex. D1 = "ABRTSAQWA XAO", D2 = "RTABBAXA QSAK"
       A  B  K  O  Q  R  S  T  W  X
  D1   3  1  0  1  1  1  1  1  1  1
  D2   3  2  1  0  1  1  1  1  0  1
Ref: [11][22]
2.3 Feature Extraction: Weighting Model (2)
• tfidf - Inverse Document Frequency weighting
• wij = Freqij * log(N / DocFreqj)
• N ::= the number of documents in the training document collection.
• DocFreqj ::= the number of documents in which the jth term occurs.
• Advantage: reflects how well a term discriminates between documents.
• Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
• Ex. (continuing the two-document example above, with a base-10 logarithm; terms occurring in both documents get weight 0 since log(2/2) = 0)
       A  B  K    O    Q  R  S  T  W    X
  D1   0  0  0    0.3  0  0  0  0  0.3  0
  D2   0  0  0.3  0    0  0  0  0  0    0
Ref: [11][13][22]
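A small sketch of the tfidf weighting above, wij = Freqij * log(N / DocFreqj); the base-10 logarithm and the two toy documents are taken from the example, so terms occurring in both documents get weight 0 and terms unique to one document get roughly 0.3 per occurrence:

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    term_counts = [Counter(doc) for doc in docs]                    # Freq_ij
    vocabulary = sorted(set().union(*term_counts))
    n_docs = len(docs)                                              # N
    doc_freq = {t: sum(1 for c in term_counts if t in c) for t in vocabulary}  # DocFreq_j
    return [{t: counts[t] * math.log10(n_docs / doc_freq[t]) for t in vocabulary}
            for counts in term_counts]

if __name__ == "__main__":
    d1, d2 = list("ABRTSAQWAXAO"), list("RTABBAXAQSAK")
    for vec in tfidf_vectors([d1, d2]):
        print({t: round(w, 2) for t, w in vec.items() if w > 0})
    # D1 keeps only its unique terms O and W; D2 keeps only K; shared terms drop to 0.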
2.3 Feature Extraction: Weighting Model (3)
• Entropy weighting (log-entropy scheme):
  wij = log(Freqij + 1) * (1 + (1 / log N) * Σk (Freqkj / nj) * log(Freqkj / nj))
  where nj ::= the total number of occurrences of the jth term in the collection, and the term (1 / log N) * Σk (Freqkj / nj) * log(Freqkj / nj) is the average entropy of the jth term:
  -1: if the word occurs once in every document
  0: if the word occurs in only one document
Ref: [11][13][22]
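A sketch of one common form of entropy weighting, assuming the log-entropy formula above (local weight log(Freq + 1), global weight 1 plus the normalized entropy) is the intended one; it should be checked against [11][13][22]:

import math
from collections import Counter

def entropy_weights(docs):
    """docs: list of token lists (assumes more than one document). Returns {term: weight} per doc."""
    counts = [Counter(d) for d in docs]
    n_docs = len(docs)
    totals = Counter()                                   # n_j: total occurrences of term j
    for c in counts:
        totals.update(c)
    global_w = {}
    for term, n_j in totals.items():
        avg_entropy = sum((c[term] / n_j) * math.log(c[term] / n_j)
                          for c in counts if c[term] > 0) / math.log(n_docs)
        global_w[term] = 1 + avg_entropy                 # 0 if the term occurs once in every document,
                                                         # 1 if it occurs in only one document
    return [{t: math.log(f + 1) * global_w[t] for t, f in c.items()} for c in counts]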
2.4 Feature Extraction: Dimension Reduction
• 2.4.1 Document Frequency Thresholding
• 2.4.2 χ²-statistic
• 2.4.3 Latent Semantic Indexing
Ref: [11][20][21][27]
2.4.1 Dimension Reduction: DocFreq Thresholding
• Document Frequency Thresholding, over the training documents D:
1. Start from the naive terms.
2. Calculate DocFreq(w) for each word w.
3. Set a threshold θ.
4. Remove all words with DocFreq(w) < θ; the remaining words are the feature terms.
Ref: [11][20][21][27]
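A short sketch of document frequency thresholding; the threshold value theta is arbitrary here:

from collections import Counter

def df_threshold(tokenized_docs, theta=2):
    """Keep only the words that occur in at least `theta` training documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))        # count each word at most once per document
    return {w for w, df in doc_freq.items() if df >= theta}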
2.4.2 Dimension Reduction: χ²-statistic
• Assumption: a pre-defined category set C = {c1, c2, ..., cm} for a training collection D.
• Goal: estimate the independence between a term and a category.
• Procedure: for each naive term w, compute a categorical score χ²(w, cj) per category, set a threshold θ, and remove all words with χ²max(w) < θ; the remaining words are the feature terms.
• Contingency counts for term w and category cj:
  A := |{d | d ∈ cj ∧ w ∈ d}|
  B := |{d | d ∉ cj ∧ w ∈ d}|
  C := |{d | d ∈ cj ∧ w ∉ d}|
  D := |{d | d ∉ cj ∧ w ∉ d}|
  N := |{d | d ∈ D}| (the whole training collection D)
Ref: [11][20][21][27]
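A sketch of the term categorical score. The formula used below, chi2(w, cj) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D)), is the standard chi-square independence score implied by the A, B, C, D, N counts; the slide itself does not spell it out:

def chi2_score(docs, labels, word, category):
    """docs: list of token sets; labels: the category of each document."""
    A = sum(1 for d, c in zip(docs, labels) if c == category and word in d)
    B = sum(1 for d, c in zip(docs, labels) if c != category and word in d)
    C = sum(1 for d, c in zip(docs, labels) if c == category and word not in d)
    D = sum(1 for d, c in zip(docs, labels) if c != category and word not in d)
    N = len(docs)
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

def chi2_max(docs, labels, word):
    """χ²max(w): the maximum score over all categories, as on the slide."""
    return max(chi2_score(docs, labels, word, c) for c in set(labels))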
2.4.3 Dimension Reduction: LSI (1)
• LSI = Latent Semantic Indexing.
1. SVD Model: Singular Value Decomposition of the t × d term-document matrix X (terms × documents):
     X    =   T0    S0    D0'
   (t,d)    (t,m) (m,m) (m,d)
   equivalently X' = D0 S0 T0' with dimensions (d,m)(m,m)(m,t), and T0'T0 = D0'D0 = I.
• T0: t × m orthogonal matrix; its columns are eigenvectors of X X'.
• D0: d × m orthogonal matrix; its columns are eigenvectors of X' X.
• S0: m × m diagonal matrix of singular values (square roots of eigenvalues) in decreasing order of importance.
• m: rank of matrix X, m <= min(t,d).
Ref: [11][20][21][27]
2.4.3 Dimension Reduction: LSI (2)
2. Approximate Model: select k <= m (keep only the k largest singular values):
     X  ≈  appr(X)  =   T     S     D'
   (t,d)              (t,k) (k,k) (k,d)
• Each document is approximately represented by one column of appr(X) (one row of appr(X)').
• Given a document row vector xi of appr(X)' and its corresponding row di of D, the following holds:
  xi = di S T'   and   di = xi T S^-1
Ref: [11][20][21][27]
2.4.3 Dimension Reduction: LSI (3)
3. Document Representation Model: a new document d over the t naive terms,
   d = (w1, w2, ..., wt) ∈ Rt,
   is mapped into the k-dimensional LSI space as
   appr(d) = d T S^-1 ∈ Rk
   with dimensions (1,k) = (1,t)(t,k)(k,k).
• There is no good method to determine k; it depends on the application domain.
• Some experiments suggest 100 <= k <= 300.
Ref: [11][20][21][27]
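A sketch of LSI with NumPy: a truncated SVD of the term-document matrix X (t x d), plus the fold-in of a new document via appr(d) = d T S^-1. The matrix contents and the choice of k are illustrative:

import numpy as np

def lsi_fit(X, k):
    """X: t x d term-document matrix. Returns (T, S, D) truncated to rank k."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 diag(s0) D0'
    return T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T       # T (t,k), S (k,k), D (d,k)

def lsi_fold_in(doc_vector, T, S):
    """doc_vector: length-t term weights of a new document; returns its k-dimensional representation."""
    return doc_vector @ T @ np.linalg.inv(S)

if __name__ == "__main__":
    X = np.array([[2., 0., 1.], [1., 1., 0.], [0., 3., 1.], [0., 1., 2.]])  # 4 terms x 3 documents
    T, S, D = lsi_fit(X, k=2)
    new_doc = np.array([1., 0., 2., 0.])      # term weights of an unseen document
    print(lsi_fold_in(new_doc, T, S))         # its 2-dimensional LSI coordinates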
3. Document Categorization
3.1 Task
3.2 Architecture
3.3 Categorization Classifiers
3.4 Application
Ref: [1][2][4][5][11][18][23][24]
3.1 Categorization: Task
• Task: assignment of one or more predefined categories (topics, themes) to a document.
3.2 Categorization: Architecture
[Diagram: the training documents and a new document d pass through preprocessing, weighting, and feature selection into a common feature space; a classifier built from the predefined categories assigns one or more categories to d.]
3.3 Categorization Classifiers
3.3.1 Centroid-Based Classifier
3.3.2 k-Nearest Neighbor Classifier
3.3.3 Naive Bayes Classifier
3.3.1 Model: Centroid-Based Classifier (1)
1. Input: new document d = (w1, w2, ..., wn);
2. Predefined categories: C = {c1, c2, ..., cl};
3. // Compute the centroid vector of each category ci ∈ C:
   centroid(ci) = (1 / |ci|) * Σ doc, over all doc ∈ ci
4. // Similarity model: cosine function
   cos(d, c) = (d · c) / (||d|| ||c||)
5. // Compute the similarity between d and each centroid
6. // Output: assign to document d the category cmax whose centroid has the highest similarity
3.3.1 Model: Centroid-Based Classifier (2)
• If the angle between d1 and d2 is smaller than the angle between d1 and d3, then cos(d1, d2) > cos(d1, d3), so d2 is closer to d1 than d3.
• The cosine-based similarity model can reflect the relations between features.
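A minimal sketch of the centroid-based classifier: one mean vector per category and cosine similarity to pick the closest centroid. The category names and document vectors are illustrative:

import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def train_centroids(vectors, labels):
    """vectors: n x t array of document vectors; labels: the category of each row."""
    return {c: vectors[np.array(labels) == c].mean(axis=0) for c in set(labels)}

def classify(centroids, d):
    """Assign d the category whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda c: cosine(d, centroids[c]))

if __name__ == "__main__":
    X = np.array([[3., 0., 1.], [2., 1., 0.], [0., 2., 3.], [1., 3., 2.]])
    y = ["sports", "sports", "finance", "finance"]
    model = train_centroids(X, y)
    print(classify(model, np.array([0., 1., 4.])))   # -> finance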
3.3.2 Model: K-Nearest Neighbor Classifier
1. Input: new document d;
2. Training collection: D = {d1, d2, ..., dn};
3. Predefined categories: C = {c1, c2, ..., cl};
4. // Compute similarities
   for (di ∈ D) { Simil(d, di) = cos(d, di); }
5. // Select the k nearest neighbors
   Construct the k-document subset Dk so that Simil(d, di) < min(Simil(d, doc) | doc ∈ Dk) for all di ∈ D - Dk.
6. // Compute a score for each category
   for (ci ∈ C) { score(ci) = 0; for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) ? 1 : 0); } }
7. // Output: assign to d the category c with the highest score: score(c) >= score(ci), for all ci ∈ C - {c}
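A minimal sketch of the k-nearest-neighbor classifier above: cosine similarity against every training document, then a majority vote among the k closest:

from collections import Counter
import numpy as np

def knn_classify(train_vectors, train_labels, d, k=3):
    """train_vectors: n x t array; train_labels: category per row; d: new document vector."""
    sims = train_vectors @ d / (np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(d))
    nearest = np.argsort(sims)[-k:]                      # indices of the k most similar documents
    votes = Counter(train_labels[i] for i in nearest)    # score(ci) = number of neighbors in ci
    return votes.most_common(1)[0][0]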
3.3.3 Model: Naive Bayes Classifier
Basic assumption: all terms are distributed in documents independently of one another.
1. Input: new document d;
2. Predefined categories: C = {c1, c2, ..., cl};
3. // Compute the probability that d belongs to each class ci ∈ C
   for (ci ∈ C) {
     // since the terms wj of d are assumed independent of each other:
     P(ci | d) ∝ P(ci) * Πj P(wj | ci)
   }
4. // Output: assign to d the category c with the highest probability:
   P(c | d) >= P(ci | d), for all ci ∈ C - {c}
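A minimal sketch of a multinomial Naive Bayes text classifier. Laplace smoothing is an addition the slide does not mention; it simply keeps the toy code from taking log(0):

import math
from collections import Counter, defaultdict

def train_nb(tokenized_docs, labels):
    prior, term_counts, totals = Counter(labels), defaultdict(Counter), Counter()
    for doc, c in zip(tokenized_docs, labels):
        term_counts[c].update(doc)        # term frequencies per category
        totals[c] += len(doc)
    vocab = {w for doc in tokenized_docs for w in doc}
    return prior, term_counts, totals, vocab, len(labels)

def classify_nb(model, doc):
    prior, term_counts, totals, vocab, n_docs = model
    def log_posterior(c):
        log_p = math.log(prior[c] / n_docs)              # log P(c)
        for w in doc:                                    # terms assumed independent given c
            log_p += math.log((term_counts[c][w] + 1) / (totals[c] + len(vocab)))  # smoothed P(w|c)
        return log_p
    return max(prior, key=log_posterior)                 # category with the highest probability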
3.4 Categorization: Application, Trend Analysis (the EAnalyst system)
• Goal: predict the trends in stock prices based on news stories.
• Trend ::= (slope, confidence)
[Diagram: time-series data is segmented by piecewise linear fitting into trends, which are clustered; the trends are aligned with retrieved news documents to sample relevant docs; a language model is learned for each trend cluster, and a Bayes classifier categorizes new documents to predict a new trend.]
Ref: [28]
4. Document Clustering
4.1 Task
4.2 Algorithms
4.3 Application
Ref: [5][7][8][9][10][15][16][29]
4.1 Document Clustering: Task
• Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.
• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Ref: [5][7][8][9][10][15][16][29]
4.2 Document Clustering:Algorithms • 4.2.1 k-means • 4.2.2 Hierarchic Agglomerative Clustering(HAC) • 4.2.3 Association Rule Hypergraph Partitioning (ARHP) Ref:[5][7][8][9][10][15][16][29]
4.2.1 Document Clustering: k-means
• k-means: distance-based flat clustering
0. Input: D ::= {d1, d2, ..., dn}; k ::= the number of clusters;
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids no longer change
7. Output: k clusters of documents
• Advantages:
  • linear time complexity
  • works relatively well in low-dimensional spaces
• Drawbacks:
  • distance computation is expensive in high-dimensional spaces
  • the centroid vector may not summarize the cluster's documents well
  • the initial k clusters affect the quality of the final clusters
Ref: [5][7][8][9][10][15][16][29]
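A minimal k-means sketch for document vectors. Euclidean distance and random initial centroids are assumptions made for brevity; a text-clustering system would more likely seed carefully and use cosine distance:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: n x t array of document vectors. Returns (cluster assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k documents as initial centroids
    for _ in range(n_iter):
        # assign every document to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):              # stop when the centroids no longer change
            break
        centroids = new_centroids
    return assign, centroids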
4.2.2 Document Clustering: HAC
• Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering
0. Input: D ::= {d1, d2, ..., dn};
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: dendrogram of clusters
• Advantages:
  • produces better-quality clusters
  • works relatively well in low-dimensional spaces
• Drawbacks:
  • distance computation is expensive in high-dimensional spaces
  • quadratic time complexity
Ref: [5][7][8][9][10][15][16][29]
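A minimal HAC sketch. Single-link merging over a cosine similarity matrix is an assumption; the slides do not fix a linkage criterion:

import numpy as np

def hac(X, n_clusters=1):
    """X: n x t array of document vectors. Returns a list of clusters (lists of document indices)."""
    norms = np.linalg.norm(X, axis=1)
    sim = (X @ X.T) / np.outer(norms, norms)             # cosine similarity matrix SIM[i,j]
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):                   # find the two most similar clusters
            for b in range(a + 1, len(clusters)):
                s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])                  # merge K and L into KL
        del clusters[b]
    return clusters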
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)
• Hypergraph H = (V, E), where V ::= a set of vertices and E ::= a set of hyperedges (edges that may connect more than two vertices).
[Figure: an example hypergraph over vertices A-G with weighted hyperedges a-i.]
Ref: [30]-[35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)
• Transactional view of documents and features:
• item ::= document
• transaction ::= feature (the set of documents in which it occurs)
         Doc1  Doc2  Doc3  ...  Docn
   w1     5     5     2    ...   1
   w2     2     4     3    ...   5
   w3     0     0     0    ...   1
   ...
   wt     6     0     0    ...   3
  (Transactional database of documents and features; each row wi is a transaction)
Ref: [30]-[35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)
• Clustering pipeline:
1. Document-feature transaction database: discover association rules (Apriori algorithm)
2. Construct the association-rule hypergraph
3. Partition the hypergraph (hypergraph partitioning algorithm) into k partitions
• Hyperedges ::= frequent item sets
• Hyperedge weight ::= average of the confidences of all rules derived from the item set
• Assumption: documents occurring in the same frequent item set are more similar
Ref: [30]-[35]
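A sketch of the hyperedge-weighting step only: each frequent item set of documents (as Apriori would produce) becomes a hyperedge whose weight is the average confidence of all rules derivable from it. The transactions and the frequent item set are illustrative, and the Apriori and partitioning steps are not implemented here:

from itertools import combinations

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def hyperedge_weight(itemset, transactions):
    """Average confidence of the rules S -> itemset - S over all non-empty proper subsets S."""
    confidences = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            confidences.append(support(itemset, transactions) /
                               support(frozenset(antecedent), transactions))
    return sum(confidences) / len(confidences)

if __name__ == "__main__":
    # one transaction per feature: the set of documents the feature occurs in
    transactions = [frozenset(t) for t in
                    [{"D1", "D2"}, {"D1", "D2", "D3"}, {"D2", "D3"}, {"D1", "D2"}]]
    print(hyperedge_weight(frozenset({"D1", "D2"}), transactions))   # 0.875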
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)
• Advantages:
• No calculation of cluster means is required.
• Linear time complexity.
• The quality of the clusters is not affected by the dimensionality of the space.
• Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime.
Ref: [30]-[35]
4.3 Document Clustering:Application • Summarization of documents • Navigation of large document collections • Organization of Web search results Ref:[10][15]-[17]
5. Product: Intelligent Miner for Text (IMT) (1)
• Text Analysis Tools: feature extraction (name extraction, term extraction, abbreviation extraction, relationship extraction), categorization, summarization, clustering (hierarchical clustering, binary relational clustering)
• Text search engine
• Web Searching Tools: NetQuestion Solution, Web Crawler
Ref: [5][36]
5. Product: Intelligent Miner for Text (IMT) (2)
• 1. Feature extraction tools
• 1.1 Information extraction: extract linguistic items that represent document contents
• 1.2 Feature extraction: assign different categories to the vocabulary in documents and measure its importance to the document content
• 1.3 Name extraction: locate names in text and determine what type of entity each name refers to
• 1.4 Term extraction: discover terms in text, including multiword technical terms, and recognize variants of the same concept
• 1.5 Abbreviation recognition: find abbreviations and match them with their full forms
• 1.6 Relation extraction
5. Product: Intelligent Miner for Text (IMT) (3)
• Feature extraction demo.
5. Product: Intelligent Miner for Text (IMT) (4)
• 2. Clustering tools
• 2.1 Applications
  • Provide an overview of the content of a large document collection
  • Identify hidden structures between groups of objects
  • Improve the browsing process to find similar or related information
  • Find outstanding documents within a collection
• 2.2 Hierarchical clustering: clusters are organized in a clustering tree, and related clusters occur in the same branch of the tree
• 2.3 Binary relational clustering: captures the relationships between topics (document - cluster - topic); NB: used as a preprocessing step for the categorization tool
5. Product: Intelligent Miner for Text (IMT) (5)
• Clustering demo: navigation of a document collection
5. Product: Intelligent Miner for Text (IMT) (6)
• 3. Summarization tools
• 3.1 Steps: compute the relevancy of each sentence to the document, extract the most relevant sentences, and build a summary of the document with a length set by the user
• 3.2 Applications
  • Judge the relevancy of a full text: easily determine whether the document is relevant enough to read
  • Enrich search results: the results of a query to a search engine can be enriched with a short summary of each document
  • Get a fast overview of a document collection: read the summary instead of the full document
5. Product: Intelligent Miner for Text (IMT) (7)
• 4. Categorization tool
• Applications:
  • Organize intranet documents
  • Assign documents to folders
  • Dispatch requests
  • Forward news to subscribers
[Diagram: a news article categorizer routes incoming articles into categories such as sports, cultures, health, politics, economics and vacations, e.g. forwarding health articles to a subscriber who likes health news.]
6. Reference (1)
Bibliography
[1] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL'99, the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/~hearst
[2] Feldman and Dagan, KDT - Knowledge Discovery in Texts, in Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[3] IJCAI-99 Workshop, Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, August 2, 1999. http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html
[4] Taeho C. Jo, Text Categorization Considering Categorical Weights and Substantial Weights of Informative Keywords, 1999. (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)
[5] A White Paper from IBM Technology, Text Mining: Turning Information Into Knowledge, February 17, 1998, editor: Daniel Tkach, IBM Software Solutions. (http://allen.comm.virginia.edu/jtl5t/whiteweb.html)
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.
[8] Michael Steinbach, George Karypis and Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000.
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR '92, pages 318-329.
[10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents, KDD '97, pages 287-290, 1997.