A Comparison of SOM Based Document Categorization Systems • Advisor: Dr. Hsu • Graduate: Kuo-min Wang • Authors: X. Luo, A. Nur Zincir-Heywood • 2003 IEEE
Outline • Motivation • Objective • Architecture Overview • Performance Evaluation • Conclusions • Personal Opinion
Motivation • Document categorization systems can address two problems • Information overload • Describes the constant influx of new information, which causes users to be overwhelmed by the subject and system knowledge required to access this information • Vocabulary differences • Automatic selection and weighting of keywords in text documents may well bias the nature of the clusters found at a later stage
Objective • This paper describes the development and evaluation of two unsupervised learning mechanisms for solving the automatic document categorization problem • Vector space model • Codebook model
Introduction • A common approach among existing systems is to cluster documents based upon their word distributions, while word clustering is determined by document co-occurrence • Vector Space Model (VSM) • The frequency of occurrence of each word in each document is recorded • Generally weighted using the Term Frequency (TF) multiplied by the Inverse Document Frequency (IDF), as in the sketch below
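A minimal sketch of the TF×IDF weighting described above (plain Python; function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Weight each word in each document by TF * IDF (a common VSM scheme)."""
    n_docs = len(documents)
    df = Counter()                                  # document frequency per word
    for doc in documents:
        df.update(set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)                           # term frequency within the document
        weighted.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weighted

# usage: documents are lists of already pre-processed words
docs = [["som", "document", "cluster"], ["som", "encoder", "prototype"]]
print(tfidf_weights(docs))
```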
Introduction (cont.) • First clustering system • Built on the VSM • Makes use of the topological ordering property of SOMs • Second clustering system • Makes use of the SOM-based architecture as an encoder for data representation • Finds a smaller set of prototypes from a large input space, without using the typical information retrieval pre-processing • Considers the relationships between characters, then words, and finally word co-occurrences
Document Clustering With Self-Organizing Maps (cont.) • The first step is the identification of an encoding of the original information such that pertinent features may be decoded most efficiently • The SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes • [Figure: input vector x → encoder c(x) → noise ν → decoder x′(x) → reconstruction vector x′; BMU, weight vector wj]
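A minimal sketch of this encoder/decoder view of the SOM (numpy; the map size and prototype values are made-up for illustration):

```python
import numpy as np

def encode(x, weights):
    """Encoder c(x): index of the best matching unit (nearest prototype)."""
    distances = np.linalg.norm(weights - x, axis=1)   # Euclidean distance to every neuron
    return int(np.argmin(distances))

def decode(c, weights):
    """Decoder x'(c): reconstruct the input as the BMU's weight vector."""
    return weights[c]

# usage with random prototypes (illustrative values only)
rng = np.random.default_rng(0)
weights = rng.random((25, 10))      # 25 neurons, 10-dimensional input space
x = rng.random(10)
x_reconstructed = decode(encode(x, weights), weights)
```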
Architecture Overview • [Figure: data collection, data pre-processing, data reduction, pattern discovery] • There are two main parts to the vector space model • Parsing (sketched below) • 1) Converts documents into a succession of words • 2) Uses a basic stop-list of common English words • 3) A stemming algorithm is then applied to the remaining words (e.g. story and stories, or First and first) • Indexing • 1) Each document is represented in the Vector Space Model • 2) The frequency of occurrence of each word in each document is recorded • 3) TF multiplied by IDF is used to generate the weights
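A rough sketch of the parsing step, assuming a tiny stop-list and a crude suffix stripper standing in for a real stemming algorithm (e.g. Porter); none of these names come from the paper:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}   # tiny illustrative stop-list

def simple_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse(document_text):
    """Convert a document into a succession of stemmed, lower-cased content words."""
    text = re.sub(r"<[^>]+>", " ", document_text)   # drop tags / non-textual markup
    words = re.findall(r"[a-z]+", text.lower())     # keep alphabetic tokens only
    return [simple_stem(w) for w in words if w not in STOP_WORDS]

print(parse("<p>The first stories and the first story</p>"))
```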
Architecture-1: Emphasizing Density Matching Property • Data pre-processing: original document → tags and non-textual data are removed → stop words are removed → words are stemmed → words occurring less than 5 times are removed → TF/IDF is used to weight each word • Data Reduction • Dimensions are reduced by a random mapping, x′ = Rx (see the sketch below) • x − the original data vector, where x ∈ R^N • R − a matrix of random values where the Euclidean length of each column has been normalized to unity • x′ − the reduced-dimensional (quantized) vector, where x′ ∈ R^d
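A sketch of the random mapping x′ = Rx under the column-normalization constraint stated above; the dimensions N and d below are placeholders, not the values used in the paper:

```python
import numpy as np

def random_projection_matrix(original_dim, reduced_dim, seed=0):
    """Random matrix R whose columns are normalized to unit Euclidean length."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((reduced_dim, original_dim))
    R /= np.linalg.norm(R, axis=0, keepdims=True)    # normalize each column
    return R

# reduce an N-dimensional TF/IDF vector x to d dimensions: x' = R x
N, d = 5000, 300                  # illustrative sizes only
R = random_projection_matrix(N, d)
x = np.random.default_rng(1).random(N)
x_reduced = R @ x                 # x' in R^d
```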
Architecture-1: Emphasizing Density Matching Property (cont.) • Data pre-processing and data reduction are applied as described above • The problem of crowded neurons may be solved by a divide-and-conquer method
Architecture-2: Emphasizing Encoding-Decoding Property • The core of the approach is to automate the identification of typical category characteristics • A document is summarized by its words and their frequencies (TF), in descending order • An SOM is used to identify a suitable character encoding, then a word encoding, and finally a word co-occurrence encoding • Data pre-processing: original document → tags and non-textual data are removed → stop words are removed → words are stemmed → word frequencies are formed → words occurring less than 5 times are removed
Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the first-level SOMs • Employs a three-level hierarchical SOM architecture: characters, words, and word co-occurrences • Characters are represented by their ASCII code • The relationships between characters are represented by a character's position, or time index, in a word • Example — news: n1, e2, w3, s4; codes: n→14, e→5, w→23, s→19 • Pre-processing steps • Convert the word's characters to numerical codes • Find the time indices of the characters • Linearly normalize the indices, so that the first character is one, the second is two, and so on (see the sketch below)
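A sketch of the per-character input described above. Note that the slide's example uses alphabet positions (n→14, e→5, w→23, s→19) rather than raw ASCII values, so the sketch follows the example; treating each character as a (code, time index) pair is an assumption about the exact vector layout:

```python
def character_inputs(word):
    """One (code, time_index) pair per character of an alphabetic word;
    codes follow the slide's example (a=1 ... z=26), indices are 1, 2, 3, ..."""
    word = word.lower()
    return [(ord(ch) - ord('a') + 1, i + 1) for i, ch in enumerate(word)]

print(character_inputs("news"))   # [(14, 1), (5, 2), (23, 3), (19, 4)]
```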
Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the second-level SOMs • For each word k that is input to the first-level SOM of each document • Form a vector of size equal to the number of neurons (r) in the first-level SOM • For each character of k • Observe which neurons n1, n2, …, nr are affected the most (the first 3 BMUs) • Increment the entries in the vector corresponding to the first 3 BMUs by 1/j, 1 ≤ j ≤ 3 (a sketch follows)
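A sketch of building the second-level input for one word, reusing the (code, time index) character encoding from the previous sketch; the 2-column weight layout of the first-level map is an assumption:

```python
import numpy as np

def word_vector(char_inputs, first_level_weights):
    """Second-level input for one word: for every character, find the three
    best matching units on the first-level map and add 1/j for the j-th BMU."""
    r = first_level_weights.shape[0]              # number of first-level neurons
    vec = np.zeros(r)
    for char_input in char_inputs:
        d = np.linalg.norm(first_level_weights - np.asarray(char_input, float), axis=1)
        for j, neuron in enumerate(np.argsort(d)[:3], start=1):
            vec[neuron] += 1.0 / j
    return vec

# usage: (code, time index) pairs for "news", and a random 5x2 first-level map
rng = np.random.default_rng(0)
first_level = rng.random((5, 2)) * 26
print(word_vector([(14, 1), (5, 2), (23, 3), (19, 4)], first_level))
```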
Architecture-2 :Emphasizing Encoding-Decoding Property (cont.) • Input for the Third-level SOMs • The third-level input vectors are built using BMUs resulting from word vectors passed through the second-level SOMs.
Architecture-2 :Emphasizing Encoding-Decoding Property (cont.) • Training the SOMs • Initialization: • Choose random values for the initial weight vectors wj(0), j=1, 2,…,l where l is the number of neurons in the map • Sampling: • Draw a sample x from the input space with a uniform probability • Similarity matching: • Find the best matching neuron i(x) using the Euclidean criterion • Updating: Adjust the weight vectors of all neurons by using the update formula
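A minimal training loop following the four steps above, written for a 1-D map for brevity (a real SOM is usually a 2-D lattice); the learning-rate and neighbourhood schedules are illustrative choices, not the paper's settings:

```python
import numpy as np

def train_som(data, n_neurons, epochs=100, lr0=0.5, sigma0=None, seed=0):
    """Minimal 1-D SOM training loop: random init, uniform sampling,
    Euclidean BMU matching, and neighbourhood-weighted updates."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_neurons, data.shape[1]))           # 1) initialization
    sigma0 = sigma0 or n_neurons / 2.0
    n_steps = epochs * len(data)
    for t in range(n_steps):
        x = data[rng.integers(len(data))]                      # 2) uniform sampling
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # 3) similarity matching
        lr = lr0 * np.exp(-t / n_steps)                        # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_steps)                  # decaying neighbourhood radius
        dist = np.abs(np.arange(n_neurons) - bmu)              # grid distance to the BMU
        h = np.exp(-(dist ** 2) / (2 * sigma ** 2))            # neighbourhood function
        weights += lr * h[:, None] * (x - weights)             # 4) update all neurons
    return weights

# usage: cluster 200 random 10-dimensional "document vectors" onto 16 neurons
data = np.random.default_rng(1).random((200, 10))
trained_map = train_som(data, n_neurons=16, epochs=5)
```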
A Performance Evaluation • The performance measurement used is based on • A − the set of correct class labels (answer key) • B − the baseline clusters, where each document is one cluster • C − the set of clusters ('winning' clustering results) • dist(C, A) − the number of operations required to transform C into A • dist(B, A) − the number of operations required to transform B into A
Conclusion • The first architecture emphasizes a layered approach to lower the computational cost of training the map, and employs a random mapping to decrease the dimension of the input space • The second architecture is based on a new idea where the SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes • Future work • Develop a classifier that will work in conjunction with these clustering systems • Apply the technique to a wider cross-section of benchmark data sets
Personal Opinions • Advantage • Uses the random mapping method to reduce dimensions • Drawback • The architecture description is not clear • Limit • …