Advanced Multimedia

Advanced Multimedia: Text Clustering, by Tamara Berg. Reminder - Classification: given some labeled training documents, determine the best label for a test (query) document. What if we don’t have labeled data? We can’t do classification.

Presentation Transcript


  1. Advanced Multimedia: Text Clustering. Tamara Berg

  2. Reminder - Classification • Given some labeled training documents • Determine the best label for a test (query) document

  6. What if we don’t have labeled data? • We can’t do classification. • What can we do? • Clustering - the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. • Often similarity is assessed according to a distance measure. • Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.

  7. Distance measures • Any of the similarity metrics we talked about before (SSD, angle between vectors) can be used.

  8. Document Clustering Clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar.

  9. Source: Hinrich Schutze

  10. Source: Hinrich Schutze

  11. Source: Hinrich Schutze

  12. Source: Hinrich Schutze

  13. Source: Hinrich Schutze

  14. Examples: Google News, Flickr Clusters. Source: Hinrich Schutze

  15. Source: Hinrich Schutze

  16. How to cluster Documents

  17. Reminder - Vector Space Model • Documents are represented as vectors in term space • A vector distance/similarity measure between two documents is used to compare documents Slide from Mitch Marcus

  18. Document Vectors: One location for each word. Document ids: A, B, C, D, E, F, G, H, I. Terms: nova, galaxy, heat, h’wood, film, role, diet, fur. Nonzero counts, in reading order: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3. “Nova” occurs 10 times in text A. “Galaxy” occurs 5 times in text A. “Heat” occurs 3 times in text A. (Blank means 0 occurrences.) Slide from Mitch Marcus

  19. Document Vectors. Document ids: A, B, C, D, E, F, G, H, I. Terms: nova, galaxy, heat, h’wood, film, role, diet, fur. Nonzero counts, in reading order: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3. Slide from Mitch Marcus
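
As a small illustration of "one location for each word", the sketch below turns a text into a term-count vector. The vocabulary, the sample text, and the helper name `count_vector` are made up for illustration and do not come from the slides; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Return term counts for `text`, one position per vocabulary word."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur (the "blank" cells).
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and text, chosen only for illustration.
vocabulary = ["nova", "galaxy", "heat", "film", "role", "diet"]
text = "nova nova galaxy heat nova film"

vector = count_vector(text, vocabulary)
print(vector)  # [3, 1, 1, 1, 0, 0]
```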

  20. TF x IDF Calculation A Slide from Mitch Marcus
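
The formula on this slide is not preserved in the transcript. The sketch below uses one common TF x IDF variant, weight = tf * log(N / df), which may differ from the one shown in the lecture. The function name `tf_idf` is hypothetical; the three count vectors are documents A, G and E from the similarity slide later in the deck.

```python
import math


def tf_idf(count_vectors):
    """Weight each term count by log(N / document frequency) of its term."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents in which each term occurs.
    doc_freq = [
        sum(1 for vec in count_vectors if vec[t] > 0) for t in range(n_terms)
    ]
    # Terms that occur nowhere get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in count_vectors]


# Count vectors for documents A, G and E.
docs = [
    [10, 5, 3, 0, 0, 0, 0, 0],
    [5, 0, 7, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 10, 10, 0],
]
weighted = tf_idf(docs)
for row in weighted:
    print([round(w, 2) for w in row])
```

Terms that occur in fewer documents receive a larger weight, so a term found in only one of the three documents counts for more than one found in two.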

  21. Features Define whatever features you like: Length of longest string of CAP’s Number of $’s Useful words for the task … A

  22. Similarity between documents A = [10 5 3 0 0 0 0 0]; G = [5 0 7 0 0 9 0 0]; E = [0 0 0 0 0 10 10 0]; Sum of Squared Distances: SSD(x, y) = Σ_i (x_i − y_i)^2. SSD(A,G) = ? SSD(A,E) = ? SSD(G,E) = ? Which pair of documents is the most similar?
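
The questions on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names `ssd` and `cosine_similarity` are chosen here, and the cosine is included only because the angle between vectors was mentioned earlier as an alternative measure.

```python
import math


def ssd(x, y):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175

print(round(cosine_similarity(A, G), 3))  # 0.493
print(round(cosine_similarity(A, E), 3))  # 0.0
print(round(cosine_similarity(G, E), 3))  # 0.511
```

Under SSD, A and G are the closest pair. By angle, G and E score slightly higher than A and G, because the angle ignores vector length, so the two measures need not agree.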

  23. Source: Hinrich Schutze

  24. source: Dan Klein

  25. K-means clustering • Want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k: D(X, M) = Σ_k Σ_{i in cluster k} (x_i − m_k)^2 source: Svetlana Lazebnik
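
A minimal sketch of the usual iterative procedure for this objective: alternate between assigning each point to its nearest center and moving each center to the mean of its points. The initialization (k randomly chosen data points), the `seed` and `max_iters` parameters, and the handling of empty clusters are assumptions made here, not details taken from the slides.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # fixed point: no point changed cluster
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean(members)
    return centers, assignment


# Two well-separated groups of toy 2-D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = k_means(points, k=2)
print(assignment)
print(centers)
```

Which cluster receives label 0 or 1 depends on the initialization, so only the grouping is meaningful, not the label values.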

  27. source: Dan Klein

  28. source: Dan Klein

  29. source: Dan Klein

  36. Convergence of K Means • K-means converges to a fixed point in a finite number of iterations. Proof: • The residual sum of squares (RSS), the sum of squared distances from each vector to its centroid, never increases during reassignment • (because each vector is moved to its closest centroid). • RSS never increases during recomputation (the mean minimizes the sum of squared distances within a cluster). • There are only finitely many ways to assign the documents to K clusters, so, with consistent tie-breaking, we must reach a fixed point. • But we don’t know how long convergence will take! • If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations). Source: Hinrich Schutze
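
The argument can be checked empirically by recording RSS after every reassignment and every recomputation. The sketch below does this on toy data. The function name `k_means_rss_trace`, the choice of the first k points as initial centers, and the data are assumptions for illustration only.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rss(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        squared_distance(p, centers[a]) for p, a in zip(points, assignment)
    )


def k_means_rss_trace(points, k, max_iters=100):
    # Assumption: deterministic start from the first k points.
    centers = [list(p) for p in points[:k]]
    assignment = None
    trace = []
    for _ in range(max_iters):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # fixed point reached
        assignment = new_assignment
        trace.append(rss(points, centers, assignment))  # after reassignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        trace.append(rss(points, centers, assignment))  # after recomputation
    return trace


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
trace = k_means_rss_trace(points, k=2)
print(trace)

# RSS should never go up from one step to the next.
assert all(later <= earlier for earlier, later in zip(trace, trace[1:]))
```

On this data the loop stops after two full iterations, and each recorded value is no larger than the one before it.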

  37. source: Dan Klein
