A Semi-supervised Document Clustering Algorithm based on EM Leonardo Rigutini and Marco Maggini Department of Information Engineering University of Siena – Siena – Italy {rigutini,maggini}@dii.unisi.it
Outline • Document clustering and Semi-supervised clustering • EM algorithm and limitations • Using feature selection filtering to improve the EM algorithm • The proposed algorithm • Experimental results • Conclusions L. Rigutini, M. Maggini - A Semi-supervised Document Clustering Algorithm based on EM, WI 2005
Document Clustering • Document clustering is a very hard task in Automatic Text Processing • It requires extracting regular patterns from a document collection without a priori knowledge of the category structure • It is a difficult task even for humans • Many different but valid partitions may exist for the same collection • Lack of information about categories • Difficulty in using effective feature selection techniques to reduce the noise in the representation of texts
Semi-supervised clustering • In between automatic categorization and self-organization of data • The supervisor is not required to specify a set of classes, only to split a set of examples into groups • The initial examples are very few documents (from 1 to at most 10) for each group • The initial examples could also be sets of keywords describing the desired groups
Feature Selection • Document Clustering • It is impossible to use global information to filter words (no information on classes is available): • IG, TS, DotRatio are not usable • Feature selection is a very important issue in text representation • Very high dimensional space representation • Distances between documents are very similar • Semi-supervised Clustering • An initial filtering can be performed using the small amount of initial information
EM Algorithm • A general algorithm to adjust the parameters of the model to the data distribution • E step: the unlabeled data are labeled by the classifier, assuming the current configuration is correct • M step: the parameters of the classifier are re-estimated using the data labeled at the previous E step, assuming the labels are correct • The procedure is iterated until convergence is reached
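The E and M steps above can be sketched as a "hard" EM loop. This is a minimal illustration, not the authors' implementation: the choice of a multinomial Naive Bayes classifier with Laplace smoothing, the function names (`train_nb`, `classify`, `hard_em`) and the `max_iter` cap are all assumptions made for the example. Documents are represented as `Counter` objects of term counts.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """M step: estimate a multinomial Naive Bayes model (assumed classifier)."""
    vocab = {t for d in docs for t in d}
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        members = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = math.log(len(members) / len(docs))
        counts = Counter()
        for d in members:
            counts.update(d)
        total = sum(counts.values()) + len(vocab)  # Laplace smoothing
        cond[c] = {t: math.log((counts[t] + 1) / total) for t in vocab}
    return prior, cond


def classify(model, doc):
    """Return the most probable class; terms unseen in training are ignored."""
    prior, cond = model
    return max(prior, key=lambda c: prior[c]
               + sum(n * cond[c].get(t, 0.0) for t, n in doc.items()))


def hard_em(seed_docs, seed_labels, unlabeled, max_iter=20):
    """Alternate E and M steps until the assigned labels stop changing."""
    model = train_nb(seed_docs, seed_labels)  # initialization from the seeds
    labels = None
    for _ in range(max_iter):
        # E step: label the unlabeled data, trusting the current model.
        new_labels = [classify(model, d) for d in unlabeled]
        if new_labels == labels:
            break
        labels = new_labels
        # M step: re-estimate the parameters, trusting the current labels.
        model = train_nb(seed_docs + unlabeled, seed_labels + labels)
    return labels


seed_docs = [Counter("car engine wheel".split()), Counter("match goal team".split())]
seed_labels = ["auto", "sport"]
unlabeled = [Counter("car wheel road".split()),
             Counter("team goal score".split()),
             Counter("engine car fuel".split())]
result = hard_em(seed_docs, seed_labels, unlabeled)
```

The seed documents are kept in every M step here so that no class can become empty; that is a design choice of this sketch.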
EM algorithm: limitations • The initialization of the classifier is an important issue for the correct final cluster composition • If the initial centroids are not distributed as the final user would like, the algorithm can form clusters whose semantics do not match the user’s criteria • The iterative form of the EM algorithm produces a reinforcement effect on the badly labeled data • If at time t, in the expectation step (E), some documents are badly classified, these data influence the re-estimation step (M), and at time t+1 other documents will be badly classified • This effect increases with successive iterations of the E-M steps
Distribution of distances • The distance between two similar documents is very close to the one between two dissimilar documents • The E step is therefore very likely to mislabel some boundary documents • EM reaches a trivial solution very often: • A large central cluster including the major part of the documents • Various small peripheral clusters including outliers
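One common way to see why distances become nearly indistinguishable in a very high dimensional representation is to measure the relative contrast between the farthest and nearest neighbor of a query point. The sketch below uses uniformly random points rather than real documents, so it only illustrates the general effect; the function name `relative_contrast` and all parameter values are assumptions made for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from a random query point
    to n_points random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(2)     # few dimensions: nearest and farthest differ a lot
high = relative_contrast(500)  # many dimensions: all distances look alike
```

A small contrast means that the nearest and the farthest point are almost equally distant, which is the situation in which boundary documents are easily mislabeled.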
Feature Selection • At each iteration of EM, the badly labeled data influence the re-estimation of the parameters, moving the centroids in a wrong direction • We can reduce the influence of badly labeled documents in the M step by adding a feature selection filter to the EM algorithm • We use the labeled dataset produced by the E step to filter out the non-significant words for each class • In this way, the noisy words introduced by the documents badly classified in the E step will not contribute to the M step
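As an illustration of the filtering idea, the following sketch scores each word by its Information Gain with respect to the labels produced by the E step and keeps the top k words. It assumes the standard presence/absence definition of Information Gain; the function names and the toy documents (sets of words) are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(term) H(C | term) - P(not term) H(C | not term)."""
    classes = sorted(set(labels))
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_present = sum(present.values())
    h_classes = entropy([labels.count(c) for c in classes])
    h_present = entropy([present[c] for c in classes])
    h_absent = entropy([absent[c] for c in classes])
    return (h_classes
            - n_present / n * h_present
            - (n - n_present) / n * h_absent)


def top_k_terms(docs, labels, k):
    """Keep the k words with the highest Information Gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


docs = [{"car", "the"}, {"car", "engine", "the"},
        {"goal", "the"}, {"goal", "team", "the"}]
labels = ["auto", "auto", "sport", "sport"]
kept = top_k_terms(docs, labels, 2)
```

In this toy example "the" occurs in every document, so it carries no information about the class and is filtered out, while "car" and "goal" separate the two groups perfectly.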
The proposed algorithm
The algorithm • The small initial labeled dataset is used to initialize the parameters of the classifier in the EM algorithm • An Information Gain filter IG1 is used to extract the most significant words from the training dataset • Once the unlabeled data have been labeled, the Information Gain filter IG2 prevents wrongly labeled documents from influencing the re-estimation step • The algorithm ends when the confusion matrix does not change in two successive iterations
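The steps listed above can be put together in a short sketch. It is written under several assumptions that are not taken from the slides: a nearest-centroid classifier with cosine similarity stands in for the classifier, the stopping test checks that the assigned labels are unchanged (a stand-in for the unchanged confusion matrix), and all function names and the `max_iter` cap are hypothetical. The parameters `k1` and `k2` are the sizes of the IG1 and IG2 filters.

```python
import math
from collections import Counter


def entropy(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def ig_filter(docs, labels, k):
    """Return the set of k words with the highest Information Gain."""
    classes = sorted(set(labels))
    n = len(docs)
    h_classes = entropy([labels.count(c) for c in classes])
    scores = {}
    for t in sorted({t for d in docs for t in d}):
        present = Counter(l for d, l in zip(docs, labels) if t in d)
        absent = Counter(l for d, l in zip(docs, labels) if t not in d)
        n_p = sum(present.values())
        scores[t] = (h_classes
                     - n_p / n * entropy([present[c] for c in classes])
                     - (n - n_p) / n * entropy([absent[c] for c in classes]))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def centroid(docs, vocab):
    """Mean term-count vector, restricted to the selected words."""
    total = Counter()
    for d in docs:
        total.update({t: c for t, c in d.items() if t in vocab})
    return {t: c / len(docs) for t, c in total.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(seeds, unlabeled, k1, k2, max_iter=20):
    """seeds: {group: [Counter, ...]}; unlabeled: [Counter, ...]."""
    seed_docs = [d for ds in seeds.values() for d in ds]
    seed_labels = [c for c, ds in seeds.items() for _ in ds]
    vocab = ig_filter(seed_docs, seed_labels, k1)  # IG1 on the seed documents
    centroids = {c: centroid(ds, vocab) for c, ds in seeds.items()}
    labels = None
    for _ in range(max_iter):
        # E step: assign each document to the most similar centroid.
        new_labels = [max(centroids, key=lambda c: cosine(d, centroids[c]))
                      for d in unlabeled]
        if new_labels == labels:  # assumed stand-in for the stopping rule
            break
        labels = new_labels
        # IG2: re-select the words using the labels just produced.
        vocab = ig_filter(seed_docs + unlabeled, seed_labels + labels, k2)
        # M step: re-estimate the centroids on the selected words only.
        for c in centroids:
            members = seeds[c] + [d for d, l in zip(unlabeled, labels) if l == c]
            centroids[c] = centroid(members, vocab)
    return labels


seeds = {"auto": [Counter("car engine the".split())],
         "sport": [Counter("goal team the".split())]}
unlabeled = [Counter("car the road".split()),
             Counter("team the score".split()),
             Counter("engine car the".split())]
result = cluster(seeds, unlabeled, k1=4, k2=4)
```

In the toy run the word "the" appears in every document and is dropped by both filters, so it never contributes to the centroids.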
Experimental results • Dataset: • We downloaded about 24,000 messages from English newsgroups • Three different groups • Auto • Hardware • Sport • We divided the dataset into 2 subsets • Init repository: the set from which the starting documents are picked • Unlabeled data: the documents to cluster
Experimental results • We tested the algorithm with 4 different initial configurations: 1, 3, 5 and 7 starting documents randomly sampled from the initial dataset • All results are averaged over a ten-fold cross-validation • Baseline: • K-means on the unlabeled data, initialized with the initial dataset • Proposed algorithm • To speed up the clustering task, we ran the algorithm on a subset of the unlabeled data and then used the trained classifier to categorize the remaining unlabeled data • Two sizes for the small unlabeled dataset: 100 and 300 documents
Baseline experiment • K-means on the unlabeled dataset initialized with 1, 3, 5 and 7 documents • The poor performance is due to the fact that no regularization can be applied to the k-means algorithm: assigning a document to a wrong cluster moves the centroids of the two clusters involved, which reinforces the wrong assignment
Proposed algorithm: test 1 • Proposed algorithm • 1, 3, 5 and 7 documents to initialize the classifier • k1=100 and k2=1000 for the IG filters • 100 documents in the unlabeled dataset
Proposed algorithm: test 2 • Proposed algorithm • 1, 3, 5 and 7 documents to initialize the classifier • k1=100 and k2=1000 for the IG filters • 300 documents in the unlabeled dataset
Conclusions • We presented a semi-supervised version of the EM algorithm for document clustering • It uses a small initial amount of knowledge to guide the EM algorithm in forming the clusters • The system partitions a large collection of documents given a small initial amount of information about the clusters (for example, some keywords describing each cluster), and it shows quite good results • The main novelty is the use of a regularization step that exploits a feature selection technique within the EM algorithm • With a different initialization technique that does not require the supervision of a human expert, the algorithm could be made completely unsupervised