Document Clustering

Document Clustering Content: Document Clustering Essentials. Text Clustering Architecture Preprocessing Different Document Models Probabilistic model Vector space model (VSM) Ontology-based VSM Document Representation Ontology and Semantic Enhancement of Presentation Models Ontology-based VSM

Document Clustering Essentials

Text Clustering Architecture Text Documents Clustering Algorithms Preprocessing VSM Data Modeling Clustering Clusters Ontology Tf-Idf0 BOW (Bag of Visual Words)

Preprocessing • Removing Stopwords • Stopwords • Function words and connectives • Appear in a large number of documents and have little use in • describing the characteristics of documents. • Example • Removing Stopwords • Stopwords: • “of”, “a”, “by”, “and” , “the”, “instead” • Example • “direct prediction continuous output variable method discretizes variable kMeans clustering solves resultant classification problem”

Preprocessing • Stemming • Remove inflections that convey parts of speech, tense. • Techniques • Morphological analysis (e.g., Porter’s algorithm) • Dictionary lookup (e.g., WordNet) • Stems: • “prediction --->predict” • “discretizes --->discretize” • “kMeans ---> kMean” • “clustering --> cluster” • “solves ---> solve” • “classification ---> classify” • Example sentence • “direct predict continuous output variable method discretize • variable kMean cluster solve resultant classify problem”

Different Document Models

Probabilistic and Vector Space Model

Document Representation

Ontology and Semantic Enhancement of Presentation Models • Represent unstructured data (text documents) according to • ontology repository • Each term in a vector is a concept rather than only a word or phrase • Determine the similarity of documents • Methods to Represent Ontology • Terminological ontology • Synonyms: several words for the same concept • employee (HR)=staff (Administration)=researcher (R&D) • car=automobile • Homonyms: one word with several meanings • bank: river bank vs. financial bank • fan: cooling system vs. sports fan

Ontology-based VSM Each element of a document vector considering ontology is represented by: Where Xji1 is the original frequency of ti1 term in the jth document, is the semantic similarity between ti1 term and ti2 term . Advantage: Consider the relationship between terms. Introduce semantic concepts into data models. Combine ontology with the traditional VSM. Terms (dimensions) have semantic relationships, rather than independent

References • S.E. Robertson and K.S. Jones. Relevance weighting of search terms. Journal of the American society for Information Sciences, 27(3): 129-146, 1976. • Joshua Zhexue Huang1 & Michael Ng.2 & Liping Jing1,”Text Clustering: Algorithms, Semantics and Systems”,1 The University of Hong Kong, 2 Hong Kong Baptist University, PAKDD06 Tutorial, April 9, 2006, Singapore. • Hewijin Christine Jiau & Yi-Jen Su & Yeou-Min Lin & Shang-Rong Tsai, “MPM: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data”, J IntellInfSyst (2006) 26: 185–207, DOI 10.1007/s10844-006-0250-2. • Michael W. Berry and MaluCastellanos; urvey of Text Mining: Clustering, Classification, and Retrieval, Second Edition;Springer; 2007; • A. Hotho, S. Staab, and G.Stumme, Text Clustering based on • background knowledge, TR425, AIFB, German, 2003

Document Clustering

Document Clustering

Presentation Transcript

Mixture Models for Document Clustering

Incorporating User Provided Constraints into Document Clustering

Web Document Clustering

Similarity Measures for Text Document Clustering

Document Clustering via Matrix Representation

Recursive Bipartite Spectral Clustering for Document Categorization

Dynamic hierarchical algorithms for document clustering

Web Mining: Phrase-based Document Indexing and Document Clustering

Web Document Clustering

Multitype Features Coselection for Web Document Clustering

Term and Document Clustering

Document Clustering

DOCUMENT CLUSTERING USING HIERARCHICAL ALGORITHM

Learning Element Similarity for XML Document Clustering

Fast Effective Clustering for Graphs and Document Collections

Document Clustering 文件分類

Web Document Clustering

Document Clustering with Prior Knowledge