Matjaž Juršič, Vid Podpečan, Nada Lavrač

Fuzzy Clustering of Documents
Matjaž Juršič, Vid Podpečan, Nada Lavrač
http://kt.jis.si

1/13 Overview
• Basic Concepts
  - Clustering
  - Fuzzy Clustering
  - Clustering of Documents
• Problem Domain
  - Conference Papers Clustering (Phase 1)
  - Combining Constraint-Based & Fuzzy Clustering
  - Conference Papers Clustering (Phase 2)
• Fuzzy Clustering of Documents
  - C-Means Algorithm
  - Distance Measure
  - Comparison of Crisp & Fuzzy Clustering
  - Time Complexity
• Further Work

2/13 Clustering
• Important unsupervised learning problem that deals with finding a structure in a collection of unlabeled data.
• Dividing data into groups (clusters) such that:
  - “similar” objects are in the same cluster,
  - “dissimilar” objects are in different clusters.
• Problems:
  - correct similarity/distance function between objects,
  - evaluating clustering results.

3/13 Fuzzy Clustering
• No sharp boundaries between clusters.
• Each data object can belong to more than one cluster (with a certain degree of membership).
• E.g. membership of the “red square” data object:
  - 70% in the “red” cluster
  - 30% in the “green” cluster

4/13 Clustering of Documents
• Bag of Words & Vector Space Model
  - text represented as an unordered collection of words
  - using tf-idf (term frequency–inverse document frequency)
  - document = one vector in a high-dimensional space
  - similarity = cosine similarity between vectors
• Text-Garden Software Library (www.textmining.net)
  - collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
  - C++ library
  - developed at JSI
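
The bag-of-words pipeline on this slide can be illustrated with a minimal sketch. It assumes one common tf-idf variant (term frequency normalised by document length, idf = log(N / df)) and whitespace tokenisation; Text-Garden may weight terms differently, and the function names and sample documents here are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (dict: term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * math.log(n_docs / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative documents (not from the presentation).
docs = [
    "fuzzy clustering of documents".split(),
    "crisp clustering of documents".split(),
    "web crawling tools".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # share three terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # share no terms
```

Documents that share no terms get similarity 0, while the two clustering-related documents score higher because they overlap in three weighted terms.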

5/13 Conference Papers Clustering (Phase 1)
• Problem: grouping conference papers, with regard to their contents, into a predefined session schedule.
[Figure: example session schedule. Constraint-based clustering assigns the papers to Session A (3 papers), Session B (4 papers), Session C (4 papers) and Session D (3 papers), separated by coffee and lunch breaks.]

6/13 Combining Constraint-Based & Fuzzy Clustering
• Phase 1 Solution
  - constraint-based clustering (CBC)
• Difficulties
  - CBC can get stuck in a local minimum
  - often a low-quality result (created schedule)
  - user interaction needed to repair the schedule
• Phase 2 Needed
  - run fuzzy clustering (FC) with initial clusters from CBC
  - if the output clusters of FC differ from those of CBC, repeat everything
  - if the clusters of FC equal those of CBC, show the new information to the user
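
One possible reading of the Phase 2 loop is sketched below. The slide does not say how a repeated round is started, so seeding the next CBC run with the hardened FC result is an assumption. `run_cbc`, `run_fc` and the toy stand-ins are hypothetical callables, not part of any existing library.

```python
def harden(memberships):
    """Crisp labels: the index of the largest membership in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def combine_cbc_and_fc(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until FC agrees with the crisp schedule.

    run_cbc(papers, seed_labels) -> list of session indices   (hypothetical)
    run_fc(papers, labels)       -> list of membership rows   (hypothetical)
    """
    seed_labels = None
    crisp, fuzzy = None, None
    for _ in range(max_rounds):
        crisp = run_cbc(papers, seed_labels)
        fuzzy = run_fc(papers, crisp)
        if harden(fuzzy) == crisp:
            break  # FC confirms the schedule: memberships can be shown to the user
        seed_labels = harden(fuzzy)  # assumption: the next round starts from the FC result
    return crisp, fuzzy

# Toy stand-ins so the sketch runs; real CBC and FC routines would replace them.
def toy_cbc(papers, seed_labels):
    return seed_labels if seed_labels is not None else [0, 0, 1]

def toy_fc(papers, labels):
    return [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

labels, memberships = combine_cbc_and_fc(["p1", "p2", "p3"], toy_cbc, toy_fc)
```

With these stand-ins the first round disagrees on the second paper, so a second round is run and the loop stops once both clusterings match.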

7/13 Conference Papers Clustering (Phase 2)
• Run Fuzzy Clustering on Phase 1 Results
  - insight into result quality
  - identify problematic papers
[Figure: example session schedule (Sessions A to D with coffee and lunch breaks), annotated with percentages: 25%, 10%, 42%, 13%, 37%.]
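
Identifying problematic papers from fuzzy memberships could look roughly like the sketch below. The 0.5 threshold, the function name and the sample membership values are illustrative assumptions, not values from the presentation.

```python
def problematic_papers(memberships, assigned, threshold=0.5):
    """Flag papers that fit their assigned session poorly.

    Returns tuples (paper index, assigned session, membership in it, best session)
    for papers whose membership in the assigned session is below `threshold`
    or whose strongest membership points to a different session.
    """
    flagged = []
    for i, (row, session) in enumerate(zip(memberships, assigned)):
        best = max(range(len(row)), key=row.__getitem__)
        if row[session] < threshold or best != session:
            flagged.append((i, session, row[session], best))
    return flagged

# Illustrative data: three papers, three sessions.
memberships = [
    [0.80, 0.10, 0.10],
    [0.30, 0.45, 0.25],
    [0.05, 0.15, 0.80],
]
assigned = [0, 0, 2]  # crisp schedule from Phase 1
flagged = problematic_papers(memberships, assigned)
```

Here only the second paper is flagged: it was scheduled in session 0 but has its highest membership in session 1.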

8/13 C-Means Algorithm
• generate initial (random) cluster centres
• repeat
  - for each example, calculate membership weights
  - for each cluster, recompute the new centre
• until the difference between the clusters of two consecutive iterations drops below some threshold
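
The steps above can be sketched in plain Python using the standard fuzzy c-means update rules. Several choices are assumptions rather than details from the slides: the fuzzifier `m = 2`, Euclidean distance on dense vectors (the presentation uses a cosine-based distance), initial centres drawn from the data points, and a stopping test on how far the centres moved.

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two dense vectors (an assumed stand-in measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def membership_weights(points, centres, m=2.0):
    """Return u[i][j], the membership of point i in cluster j; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [euclidean(x, c) for c in centres]
        if min(d) == 0.0:
            # The point sits exactly on one or more centres: share membership among them.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            u.append([h / sum(hits) for h in hits])
        else:
            inv = [dj ** -exponent for dj in d]
            u.append([w / sum(inv) for w in inv])
    return u

def recompute_centres(points, u, m=2.0):
    """New centre of cluster j = mean of all points weighted by u[i][j] ** m."""
    centres = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        centres.append([sum(wi * x[k] for wi, x in zip(w, points)) / sum(w)
                        for k in range(len(points[0]))])
    return centres

def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-6, max_iter=100,
                  init_centres=None, seed=0):
    # Initial centres: caller-supplied, otherwise randomly chosen data points.
    if init_centres is None:
        init_centres = random.Random(seed).sample(points, n_clusters)
    centres = [list(c) for c in init_centres]
    for _ in range(max_iter):
        u = membership_weights(points, centres, m)
        new_centres = recompute_centres(points, u, m)
        shift = max(euclidean(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:  # centres barely moved between two iterations
            break
    return centres, membership_weights(points, centres, m)

# Two well-separated groups of illustrative 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centres, u = fuzzy_c_means(points, n_clusters=2, init_centres=[[0.0, 0.0], [10.0, 10.0]])
```

The `init_centres` parameter is included so that the run can start from an existing crisp clustering, as Phase 2 does with the CBC result, instead of a random start.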

9/13 Distance Measure
• Vector Space
  - usual similarity measure: cosine similarity
• C-Means explicitly needs a distance (dissimilarity), not a similarity:
  - There are many possibilities.
  - None has ideal properties.
  - Experimental evaluation shows no significant difference.
  - We used: [formula not recoverable]
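
Two common ways of turning cosine similarity into a dissimilarity are sketched below, purely as examples of the "many possibilities". Which measure the authors actually used is not recoverable from the transcript, so neither function should be read as their choice.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cosine_distance(u, v):
    """1 - cos: zero for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle between the vectors in radians (clamped against rounding error)."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos)

a, b = [1.0, 0.0], [0.0, 1.0]
d_cos = cosine_distance(a, b)   # orthogonal vectors
d_ang = angular_distance(a, b)  # right angle
```

The trade-off hinted at on the slide shows up here: `1 - cos` is cheap but does not satisfy the triangle inequality, while the angular distance is a proper metric at the cost of an extra `acos` call.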

10/13 Comparison of Crisp & Fuzzy Clustering

11/13 Time Complexity
• If the dimensionality of the vectors is much higher than the number of clusters, the time complexity is comparable to that of k-means (this holds for document clustering).
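
One common per-iteration accounting that is consistent with this claim is given below; it is not the formula from the original slide. The symbols are assumptions: n documents, c clusters, d dimensions, distances computed once per iteration, and the textbook membership formula evaluated directly (a sum over c clusters for each of the n·c document-cluster pairs).

```latex
T_{k\text{-means}} = O(n \, c \, d), \qquad
T_{\text{fuzzy } c\text{-means}} = O(n \, c \, d + n \, c^{2}) = O\bigl(n \, c \, (d + c)\bigr)
```

When d is much larger than c, as with tf-idf vectors over a large vocabulary, d + c is approximately d and both costs reduce to O(n c d).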

12/13 Further Work
• Evaluation
  - Test scenarios
  - Benchmarks
  - Using data from past conferences
• User Interface
  - Web interface for semi-automatic conference schedule creation
• Algorithms Fine-Tuning
• …

Discussion
Contacts: matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si
Thank you for your attention
