
Clustering of Web pages

Learn about clustering algorithms, link-based vs. text-based approaches, feature extraction, similarity measures, and issues in web page organization. Explore examples and techniques in web page clustering.



Presentation Transcript


  1. Clustering of Web pages. Najlah Gali, 21.3.2017

  2. Web page clustering: Organizing web pages into cohesive groups such that pages in the same cluster are more similar to each other than to those in other clusters. Example clusters: Entertainment, Fitness.

  3. Motivation

  4. Web search engines Finding similar or related web pages.

  5. Web page classification

  6. Queries’ similarity: Two queries that lead to different web pages within the same cluster can be recognized as similar. Example: Q1: "ravintola" (Finnish for "restaurant") and Q2: "lounas" (Finnish for "lunch") lead to pages in the same cluster, so Q1 ≈ Q2.

  7. How to cluster?

  8. Clustering components • Web page features: words, phrases, links • Similarity measure: semantic similarity, syntactic similarity • Clustering algorithm: partitional, hierarchical, graph based

  9. Approaches to cluster web pages: Three approaches exist: • Link based: depends on the link structure between the pages (common neighbor, co-citation) • Text based: depends on the content of the web page • Hyper based: depends on both text and link structure

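  10. Link-based clustering: common neighbor. Two web pages are similar if they have neighbors in common. Similarity(a, b) = |O(a) ∩ O(b)| = |{c, d}| = 2, where O(p) is the set of pages that p links to. [Figure: link graph over pages a to f with in-links and out-links marked; a and b both link to c and d.]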
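
The common-neighbor measure on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the dictionary `out_links` and the extra targets `e` and `f` are assumptions made up for the example; only the shared neighbors `c` and `d` come from the slide.

```python
# Hypothetical out-link sets: page -> set of pages it links to.
# Only the shared neighbors c and d are taken from the slide;
# the remaining targets are invented for illustration.
out_links = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
}


def common_neighbor_similarity(graph, p, q):
    """Return |O(p) ∩ O(q)|, the number of out-link neighbors p and q share."""
    return len(graph.get(p, set()) & graph.get(q, set()))


print(common_neighbor_similarity(out_links, "a", "b"))  # 2 (shared: c and d)
```

Pages with no recorded out-links get an empty set, so their similarity to every other page is 0.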

  11. Link-based clustering: co-citation. Two web pages are similar if they are referenced (cited) by similar pages. [Figure: two example link graphs over pages a to g.]

  12. Co-citation analysis [Larson 1996]: Start → Create a collection P1, P2, P3, P4… → Construct co-citation frequency matrix → Convert raw frequencies into a correlation matrix → Apply multidimensional scaling → Apply agglomerative clustering
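
The first two matrix steps of this pipeline might look roughly as follows. This is a sketch under stated assumptions: the citation data in `cites` is invented, Pearson correlation between matrix rows is assumed as the conversion step, and the diagonal is simply left at zero, which is a simplification rather than the treatment used in the cited work.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data: citing page -> set of collection pages it cites.
cites = {
    "x1": {"P1", "P2"},
    "x2": {"P1", "P2", "P3"},
    "x3": {"P3", "P4"},
    "x4": {"P3", "P4"},
}
pages = ["P1", "P2", "P3", "P4"]


def cocitation_matrix(cites, pages):
    """freq[i][j] = number of pages that cite both pages[i] and pages[j]."""
    index = {p: i for i, p in enumerate(pages)}
    freq = [[0] * len(pages) for _ in pages]
    for cited in cites.values():
        for p, q in combinations(sorted(cited & set(pages)), 2):
            freq[index[p]][index[q]] += 1
            freq[index[q]][index[p]] += 1
    return freq


def pearson(u, v):
    """Pearson correlation of two sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def correlation_matrix(freq):
    """Correlate the co-citation profiles (rows) of every pair of pages."""
    return [[pearson(r, s) for s in freq] for r in freq]


freq = cocitation_matrix(cites, pages)
corr = correlation_matrix(freq)
print(freq[0][1])  # 2: x1 and x2 both cite P1 and P2
```

The resulting `corr` matrix is what the later steps (multidimensional scaling, agglomerative clustering) would consume; those steps are not shown here.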

  13. Co-citation example, part 1. [Figure: retrieval strategy → collection P1 to P6 → co-citation matrix, whose entries count e.g. |pages citing both P1 and P2| and |pages citing both P1 and P3| → correlation matrix.]

  14. Co-citation example, part 2. [Figure: correlation matrix with low- and high-correlation entries marked, and the resulting clusters.]

  15. Issues (link-based clustering): Link-based clustering is useful when a web page lacks text content. However: • Web pages with insufficient in-links or out-links cannot be clustered; • Two web pages might be linked only because they share a minor topic; • Links can be noisy (e.g. adverts); • No common links → similarity = 0.

  16. Text-based clustering • Content source: entire text, main content, snippet, keywords • Feature extraction: binary, term frequency (TF), term frequency-inverse document frequency (TF-IDF) • Similarity measure: character-based, token-based • Clustering algorithm: partitional (K-means), hierarchical (agglomerative and divisive)

  17. Content source. [Figure: an example web page annotated with its keywords (office, equipment, supplies, shredder, laminators), main content, snippet, and entire text.]

  18. Feature extraction: tokenization and stemming. “Keep your office running smoothly with our wide…” • Tokenize into words: keep, your, office, running, smoothly, with, our, wide • Stem: keep, your, offic, run, smoothli, with, our, wide

  19. Feature extraction: stop-word removal. “Keep your office running smoothly with our wide…” Remove stop words (e.g. in, on, your, our, with, at): keep, offic, run, smoothli, wide

  20. Feature extraction: creation of the feature vector. Page 1: “Keep your office running smoothly with our wide…” Page 2: “..staffed office, keeping your office clean and staffed” Bag-of-words: [keep, offic, run, smoothli, wide, staf, clean] • Binary vector: 1 if the word occurs, 0 otherwise: P1 [1, 1, 1, 1, 1, 0, 0], P2 [1, 1, 0, 0, 0, 1, 1] • TF vector: counts the number of occurrences of a word w in page p: P1 [1, 1, 1, 1, 1, 0, 0], P2 [1, 2, 0, 0, 0, 2, 1]
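
The binary and TF vectors above can be reproduced with a short sketch. It assumes the tokens have already been stemmed and stripped of stop words (the stemmer itself is not implemented here), and the helper names `vocabulary`, `binary_vector`, and `tf_vector` are hypothetical.

```python
from collections import Counter

# Tokens after stemming and stop-word removal, following the slides.
# Treating "and" as a stop word on page 2 is an assumption.
page1 = ["keep", "offic", "run", "smoothli", "wide"]
page2 = ["staf", "offic", "keep", "offic", "clean", "staf"]


def vocabulary(*pages):
    """Collect the bag-of-words vocabulary in first-seen order."""
    vocab = []
    for tokens in pages:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
    return vocab


def binary_vector(tokens, vocab):
    """1 if the word occurs in the page, 0 otherwise."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


def tf_vector(tokens, vocab):
    """Number of occurrences of each vocabulary word in the page."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


vocab = vocabulary(page1, page2)
print(vocab)                        # ['keep', 'offic', 'run', 'smoothli', 'wide', 'staf', 'clean']
print(binary_vector(page2, vocab))  # [1, 1, 0, 0, 0, 1, 1]
print(tf_vector(page2, vocab))      # [1, 2, 0, 0, 0, 2, 1]
```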

  21. Term frequency-Inverse document frequency
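
The formula from this slide did not survive in the transcript. As a hedged illustration only, the sketch below assumes one common variant, weight(w, p) = tf(w, p) · log(N / df(w)), where N is the number of pages and df(w) the number of pages containing w; the slide may have used a different normalization.

```python
from collections import Counter
from math import log

# Stemmed, stop-word-free tokens of the two example pages.
docs = [
    ["keep", "offic", "run", "smoothli", "wide"],
    ["staf", "offic", "keep", "offic", "clean", "staf"],
]


def tf_idf(docs):
    """Return (vocab, vectors) using the assumed weight tf * log(N / df)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many pages does each word occur?
    df = {word: sum(word in doc for doc in docs) for word in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[word] * log(n_docs / df[word]) for word in vocab])
    return vocab, vectors


vocab, vectors = tf_idf(docs)
for word, weight in zip(vocab, vectors[1]):
    print(f"{word}: {weight:.3f}")
```

Under this variant, words that occur in every page (here "keep" and "offic") get weight 0, while words unique to one page are weighted up.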

  22. Similarity measures • Character-based: treats strings as sequences of characters. A single edit (insertion, deletion, or substitution) is performed at a time to transform one string into another (edit distance) • Q-gram: divides strings into substrings of length q, e.g. "Machine Learning" → mac, ach, chi, hin, ine, nel, ele, lea, ear, arn ... • Token-based: treats strings as sequences of tokens and compares them token by token (1 if match, 0 otherwise), e.g. "Machine Learning" vs. "Machine Learned" • Hybrid: combines character- and token-based measures
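
The three building blocks on this slide can be sketched as follows. The function names are hypothetical, and dropping spaces and lowercasing before cutting q-grams is an assumption inferred from the slide's example output.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t (Levenshtein distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def q_grams(s, q=3):
    """Overlapping substrings of length q (lowercased, spaces removed)."""
    s = s.lower().replace(" ", "")
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def token_matches(s, t):
    """For each token of s: 1 if it also occurs in t, 0 otherwise."""
    other = set(t.lower().split())
    return [1 if token in other else 0 for token in s.lower().split()]


print(edit_distance("Learning", "Learned"))                 # 3
print(q_grams("Machine Learning")[:4])                      # ['mac', 'ach', 'chi', 'hin']
print(token_matches("Machine Learning", "Machine Learned")) # [1, 0]
```

A hybrid measure would combine these, for example by scoring token pairs with a character-based measure instead of an exact match.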

  23. Token-based measures

  24. Results. [Figure: results rated on a scale of excellent, good, poor.]

  25. K-means: Start → Select K random pages as centroids → Assign other pages to nearest centroid → Calculate new centroids → Converged? (N: reassign pages to the nearest centroid; Y: stop)
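
The K-means loop on this slide can be sketched in plain Python. This is a minimal illustration, not a production implementation: pages are assumed to be numeric feature vectors, squared Euclidean distance is assumed as the dissimilarity, and the optional `initial_centroids` parameter is a hypothetical addition that makes the example reproducible.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def k_means(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Return (labels, centroids) once the assignments stop changing."""
    rng = random.Random(seed)
    # Select K random pages as centroids unless some are supplied.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every page to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no page changed cluster
            break
        labels = new_labels
        # Calculate new centroids; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical 2-D page vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, 2, initial_centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With random initialization the cluster numbering, and on harder data the clustering itself, can differ between runs.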

  26. Clustering algorithms: hierarchical. [Figure: dendrogram over pages a, b, c, d, e with merge steps numbered 1 to 4.]
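
An agglomerative (bottom-up) hierarchical clustering can be sketched as repeated merging of the two closest clusters. The coordinates below are invented, and single linkage with Euclidean distance is an assumption; the slide does not say which linkage its dendrogram uses, so the merge order here need not match the figure.

```python
from itertools import combinations
from math import dist

# Hypothetical page vectors; names follow the slide, positions are made up.
points = {
    "a": (0.0, 0.0),
    "b": (1.0, 0.0),
    "c": (4.0, 0.0),
    "d": (5.5, 0.0),
    "e": (10.0, 0.0),
}


def single_link(c1, c2, points):
    """Distance between the two closest members of clusters c1 and c2."""
    return min(dist(points[p], points[q]) for p in c1 for q in c2)


def agglomerative(points):
    """Merge the two closest clusters until one remains.

    Returns the clusters created by each merge, in merge order."""
    clusters = [frozenset([name]) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(a | b))
    return merges


for step, merged in enumerate(agglomerative(points), 1):
    print(step, merged)
```

Cutting the merge sequence early (for example after step 2) yields a flat clustering with more than one cluster.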

  27. Issues (text-based clustering) • The techniques were developed for small, static, and homogeneous pages; • Web pages that lack text cannot be clustered.

  28. Hyper-based clustering [Modha and Spangler 2000]: Represent the page as a triple of unit vectors (D, F, B) • D: word frequencies in a page • F: out-links • B: in-links [Figure: example link graph around the page collection Q with nodes a, c, e, g, h, i, j, k, l, m, n.]

  29. Out-links vector. Bag-of-nodes: pages that are pointed to by at least two pages in Q, here [g, i, j, m]. [Figure: example link graph around Q with nodes a, c, e, g, h, i, j, k, l, m, n.]

  30. In-links vector. Bag-of-nodes: pages that point to at least two pages in Q, here [e, h, k, c]. [Figure: example link graph around Q with nodes a, c, e, g, h, i, j, k, l, m, n.]
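
The two bag-of-nodes definitions can be sketched as below. The link graph from the slides is not recoverable from the transcript, so the graph, the collection `Q`, and all page names here are hypothetical; binary link vectors are also an assumption, and the unit-length normalization mentioned on slide 28 is not applied.

```python
from collections import Counter

# Hypothetical link graph: page -> set of pages it links to.
out_links = {
    "p1": {"x", "y"},
    "p2": {"x", "z"},
    "p3": {"x", "y"},
    "u": {"p1", "p2"},
    "v": {"p3"},
    "w": {"p2", "p3"},
}
Q = ["p1", "p2", "p3"]  # hypothetical page collection


def out_link_nodes(collection, out_links, min_pages=2):
    """Pages pointed to by at least `min_pages` pages of the collection."""
    counts = Counter(t for p in collection for t in out_links.get(p, ()))
    return sorted(node for node, c in counts.items() if c >= min_pages)


def in_link_nodes(collection, out_links, min_pages=2):
    """Pages that point to at least `min_pages` pages of the collection."""
    members = set(collection)
    return sorted(
        src for src, targets in out_links.items()
        if len(members & targets) >= min_pages
    )


def out_vector(page, bag, out_links):
    """Binary F vector: which bag nodes does `page` link to?"""
    return [1 if node in out_links.get(page, set()) else 0 for node in bag]


def in_vector(page, bag, out_links):
    """Binary B vector: which bag nodes link to `page`?"""
    return [1 if page in out_links.get(node, set()) else 0 for node in bag]


f_bag = out_link_nodes(Q, out_links)
b_bag = in_link_nodes(Q, out_links)
print(f_bag, out_vector("p2", f_bag, out_links))  # ['x', 'y'] [1, 0]
print(b_bag, in_vector("p2", b_bag, out_links))   # ['u', 'w'] [1, 1]
```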

  31. Similarity between two pages Cosine similarity
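
Cosine similarity itself is straightforward to sketch. The example reuses the TF vectors of the two pages from slide 20; how the D, F, and B components are combined into a single score is not stated in the transcript and is not shown here.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# TF vectors of the two example pages from slide 20.
p1 = [1, 1, 1, 1, 1, 0, 0]
p2 = [1, 2, 0, 0, 0, 2, 1]
print(round(cosine_similarity(p1, p2), 3))  # 0.424
```

For unit vectors, as in the (D, F, B) representation, the denominator is 1 and the measure reduces to a dot product.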

  32. References
    • Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data Mining and Knowledge Discovery Handbook (pp. 931-948). Springer US.
    • Larson, R. R. (1996, October). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting of the American Society for Information Science (Vol. 33, pp. 71-78).
    • McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433.
    • Modha, D. S., & Spangler, W. S. (2000, May). Clustering hypertext with applications to web searching. In Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia (pp. 143-152). ACM.

  33. Thank you!
