A Comparison of SOM Based Document Categorization Systems • Advisor: Dr. Hsu • Graduate: Kuo-min Wang • Authors: X. Luo, A. Nur Zincir-Heywood • 2003 IEEE
Outline • Motivation • Objective • Architecture Overview • Performance Evaluation • Conclusions • Personal Opinion
Motivation • Document categorization systems can address two problems • Information overload • Describes the constant influx of new information, which causes users to be overwhelmed by the subject and system knowledge required to access this information • Vocabulary differences • Automatic selection and weighting of keywords in text documents may well bias the nature of the clusters found at a later stage
Objective • This paper describes the development and evaluation of two unsupervised learning mechanisms for solving the automatic document categorization problem • Vector space model • Codebook model
Introduction • A common approach among existing systems is to cluster documents based upon their word distributions, while word clustering is determined by document co-occurrence • Vector Space Model (VSM) • The frequency of occurrence of each word in each document is recorded • Generally weighted using the Term Frequency (TF) multiplied by the Inverse Document Frequency (IDF), as in the sketch below
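A minimal sketch of the TF×IDF weighting described above (plain Python; function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Weight each word in each document by TF * IDF (a common VSM scheme)."""
    n_docs = len(documents)
    df = Counter()                                  # document frequency per word
    for doc in documents:
        df.update(set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)                           # term frequency within the document
        weighted.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weighted

# usage: documents are lists of already pre-processed words
docs = [["som", "document", "cluster"], ["som", "encoder", "prototype"]]
print(tfidf_weights(docs))
```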
Introduction (cont.) • First clustering system • Built on the VSM • Makes use of the topological ordering property of SOMs • Second clustering system • Makes use of the SOM-based architecture as an encoder for data representation • Finds a smaller set of prototypes from a large input space, without using the typical information retrieval pre-processing • Considers the relationships between characters, then words, and finally word co-occurrences
Document Clustering With Self-Organizing Maps (cont.) • The first step is the identification of an encoding of the original information such that pertinent features may be decoded most efficiently • The SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes • [Figure: input vector x → encoder c(x) → noise ν → decoder x′(x) → reconstruction vector x′; BMU, weight vector wj]
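A minimal sketch of this encoder/decoder view of the SOM (numpy; the map size and prototype values are made-up for illustration):

```python
import numpy as np

def encode(x, weights):
    """Encoder c(x): index of the best matching unit (nearest prototype)."""
    distances = np.linalg.norm(weights - x, axis=1)   # Euclidean distance to every neuron
    return int(np.argmin(distances))

def decode(c, weights):
    """Decoder x'(c): reconstruct the input as the BMU's weight vector."""
    return weights[c]

# usage with random prototypes (illustrative values only)
rng = np.random.default_rng(0)
weights = rng.random((25, 10))      # 25 neurons, 10-dimensional input space
x = rng.random(10)
x_reconstructed = decode(encode(x, weights), weights)
```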
Architecture Overview • [Figure: data collection, data pre-processing, data reduction, pattern discovery] • There are two main parts to the vector space model • Parsing (sketched below) • 1) Converts documents into a succession of words • 2) Uses a basic stop-list of common English words • 3) A stemming algorithm is then applied to the remaining words (e.g. story and stories, or First and first) • Indexing • 1) Each document is represented in the Vector Space Model • 2) The frequency of occurrence of each word in each document is recorded • 3) TF multiplied by IDF is used to generate the weights
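A rough sketch of the parsing step, assuming a tiny stop-list and a crude suffix stripper standing in for a real stemming algorithm (e.g. Porter); none of these names come from the paper:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}   # tiny illustrative stop-list

def simple_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse(document_text):
    """Convert a document into a succession of stemmed, lower-cased content words."""
    text = re.sub(r"<[^>]+>", " ", document_text)   # drop tags / non-textual markup
    words = re.findall(r"[a-z]+", text.lower())     # keep alphabetic tokens only
    return [simple_stem(w) for w in words if w not in STOP_WORDS]

print(parse("<p>The first stories and the first story</p>"))
```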
Architecture-1: Emphasizing Density Matching Property • Data pre-processing: original document → tags and non-textual data are removed → stop words are removed → words are stemmed → words occurring less than 5 times are removed → TF/IDF is used to weight each word • Data Reduction • Dimensions are reduced by a random mapping, x′ = Rx (see the sketch below) • x − the original data vector, where x ∈ R^N • R − a matrix of random values where the Euclidean length of each column has been normalized to unity • x′ − the reduced-dimensional (quantized) vector, where x′ ∈ R^d
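A sketch of the random mapping x′ = Rx under the column-normalization constraint stated above; the dimensions N and d below are placeholders, not the values used in the paper:

```python
import numpy as np

def random_projection_matrix(original_dim, reduced_dim, seed=0):
    """Random matrix R whose columns are normalized to unit Euclidean length."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((reduced_dim, original_dim))
    R /= np.linalg.norm(R, axis=0, keepdims=True)    # normalize each column
    return R

# reduce an N-dimensional TF/IDF vector x to d dimensions: x' = R x
N, d = 5000, 300                  # illustrative sizes only
R = random_projection_matrix(N, d)
x = np.random.default_rng(1).random(N)
x_reduced = R @ x                 # x' in R^d
```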
Architecture-1: Emphasizing Density Matching Property (cont.) • Data pre-processing and data reduction are applied as described above • The problem of crowded neurons may be solved by a divide-and-conquer method
Architecture-2: Emphasizing Encoding-Decoding Property • The core of the approach is to automate the identification of typical category characteristics • A document is summarized by its words and their frequencies (TF), in descending order • An SOM is used to identify a suitable character encoding, then a word encoding, and finally a word co-occurrence encoding • Data pre-processing: original document → tags and non-textual data are removed → stop words are removed → words are stemmed → word frequencies are formed → words occurring less than 5 times are removed
Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the first-level SOMs • Employs a three-level hierarchical SOM architecture: characters, words, and word co-occurrences • Characters are represented by their ASCII code • The relationships between characters are represented by a character's position, or time index, in a word • Example — news: n1, e2, w3, s4; codes: n→14, e→5, w→23, s→19 • Pre-processing steps • Convert the word's characters to numerical codes • Find the time indices of the characters • Linearly normalize the indices, so that the first character is one, the second is two, and so on (see the sketch below)
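A sketch of the per-character input described above. Note that the slide's example uses alphabet positions (n→14, e→5, w→23, s→19) rather than raw ASCII values, so the sketch follows the example; treating each character as a (code, time index) pair is an assumption about the exact vector layout:

```python
def character_inputs(word):
    """One (code, time_index) pair per character of an alphabetic word;
    codes follow the slide's example (a=1 ... z=26), indices are 1, 2, 3, ..."""
    word = word.lower()
    return [(ord(ch) - ord('a') + 1, i + 1) for i, ch in enumerate(word)]

print(character_inputs("news"))   # [(14, 1), (5, 2), (23, 3), (19, 4)]
```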
Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the second-level SOMs • For each word k that is input to the first-level SOM of each document • Form a vector of size equal to the number of neurons (r) in the first-level SOM • For each character of k • Observe which neurons n1, n2, …, nr are affected the most (the first 3 BMUs) • Increment the entries in the vector corresponding to the first 3 BMUs by 1/j, 1 ≤ j ≤ 3 (a sketch follows)
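A sketch of building the second-level input for one word, reusing the (code, time index) character encoding from the previous sketch; the 2-column weight layout of the first-level map is an assumption:

```python
import numpy as np

def word_vector(char_inputs, first_level_weights):
    """Second-level input for one word: for every character, find the three
    best matching units on the first-level map and add 1/j for the j-th BMU."""
    r = first_level_weights.shape[0]              # number of first-level neurons
    vec = np.zeros(r)
    for char_input in char_inputs:
        d = np.linalg.norm(first_level_weights - np.asarray(char_input, float), axis=1)
        for j, neuron in enumerate(np.argsort(d)[:3], start=1):
            vec[neuron] += 1.0 / j
    return vec

# usage: (code, time index) pairs for "news", and a random 5x2 first-level map
rng = np.random.default_rng(0)
first_level = rng.random((5, 2)) * 26
print(word_vector([(14, 1), (5, 2), (23, 3), (19, 4)], first_level))
```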
Architecture-2 :Emphasizing Encoding-Decoding Property (cont.) • Input for the Third-level SOMs • The third-level input vectors are built using BMUs resulting from word vectors passed through the second-level SOMs.
Architecture-2 :Emphasizing Encoding-Decoding Property (cont.) • Training the SOMs • Initialization: • Choose random values for the initial weight vectors wj(0), j=1, 2,…,l where l is the number of neurons in the map • Sampling: • Draw a sample x from the input space with a uniform probability • Similarity matching: • Find the best matching neuron i(x) using the Euclidean criterion • Updating: Adjust the weight vectors of all neurons by using the update formula
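A minimal training loop following the four steps above, written for a 1-D map for brevity (a real SOM is usually a 2-D lattice); the learning-rate and neighbourhood schedules are illustrative choices, not the paper's settings:

```python
import numpy as np

def train_som(data, n_neurons, epochs=100, lr0=0.5, sigma0=None, seed=0):
    """Minimal 1-D SOM training loop: random init, uniform sampling,
    Euclidean BMU matching, and neighbourhood-weighted updates."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_neurons, data.shape[1]))           # 1) initialization
    sigma0 = sigma0 or n_neurons / 2.0
    n_steps = epochs * len(data)
    for t in range(n_steps):
        x = data[rng.integers(len(data))]                      # 2) uniform sampling
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # 3) similarity matching
        lr = lr0 * np.exp(-t / n_steps)                        # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_steps)                  # decaying neighbourhood radius
        dist = np.abs(np.arange(n_neurons) - bmu)              # grid distance to the BMU
        h = np.exp(-(dist ** 2) / (2 * sigma ** 2))            # neighbourhood function
        weights += lr * h[:, None] * (x - weights)             # 4) update all neurons
    return weights

# usage: cluster 200 random 10-dimensional "document vectors" onto 16 neurons
data = np.random.default_rng(1).random((200, 10))
trained_map = train_som(data, n_neurons=16, epochs=5)
```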
A Performance Evaluation • The performance measurement used is based on • A − the set of correct class labels (answer key) • B − the baseline clusters, where each document is one cluster • C − the set of clusters ('winning' clustering results) • dist(C, A) − the number of operations required to transform C into A • dist(B, A) − the number of operations required to transform B into A
Conclusion • The first architecture emphasizes a layered approach to lower the computational cost of training the map, and employs a random mapping to decrease the dimension of the input space • The second architecture is based on a new idea where the SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes • Future work • Develop a classifier that will work in conjunction with these clustering systems • Apply the technique to a wider cross-section of benchmark data sets
Personal Opinions • Advantage • Uses the random mapping method to reduce dimensions • Drawback • The architecture description is not clear • Limit • …