Learn about clustering algorithms, link-based vs. text-based approaches, feature extraction, similarity measures, and issues in web page organization. Explore examples and techniques in web page clustering.
Clustering of Web pages Najlah Gali 21.3.2017
Web page clustering: organizing web pages into cohesive groups such that pages in the same cluster are more similar to each other than to those in other clusters. Example clusters: Entertainment, Fitness.
Web search engines Finding similar or related web pages.
Queries’ similarity: two queries that return different web pages within the same cluster can be recognized as similar. Example: Q1: Ravintola (Finnish for "restaurant") and Q2: lounas (Finnish for "lunch") lead to pages in the same cluster, so Q1 ≈ Q2.
Clustering components
• Web page features: words, phrases, links
• Similarity measure: semantic similarity, syntactic similarity
• Clustering algorithm: partitional, hierarchical, graph based
Approaches to cluster web pages
Three approaches exist:
• Link based: depends on the link structure between the pages (common neighbor, co-citation)
• Text based: depends on the content of the web page
• Hyper based: depends on both text and link structure
Link-based clustering: common neighbor
Two web pages are similar if they have neighbors in common.
Similarity(a, b) = |O(a) ⋂ O(b)| = |{c, d}| = 2, where O(x) is the set of pages that x links to.
[Figure: link graph over pages a to f showing in-links and out-links]
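The common-neighbor count can be sketched in a few lines of Python. The out-link sets below are hypothetical, chosen only so that a and b share the neighbors c and d as in the formula; the function name is likewise an assumption.

```python
# Hypothetical out-link sets: a and b both link to c and d.
out_links = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
}

def common_neighbor_similarity(p, q, out_links):
    """Number of pages that both p and q link to: |O(p) ∩ O(q)|."""
    return len(out_links.get(p, set()) & out_links.get(q, set()))

similarity_ab = common_neighbor_similarity("a", "b", out_links)
print(similarity_ab)  # 2
```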
Link-based clustering: co-citation
Two web pages are similar if they are referenced (cited) by the same pages.
[Figure: example link graphs over pages a to g]
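As a rough illustration, a co-citation count is the number of pages that link to both pages of a pair. The citing pages and their links below are made up for the example.

```python
def cocitation_count(p, q, out_links):
    """Number of pages whose out-links contain both p and q."""
    return sum(1 for targets in out_links.values() if p in targets and q in targets)

# Hypothetical citing pages c..g and the pages they reference.
out_links = {
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a", "b"},
    "f": {"a"},
    "g": {"b"},
}

count_ab = cocitation_count("a", "b", out_links)
print(count_ab)  # 3
```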
Co-citation analysis [Larson 1996]
Start → create a collection P1, P2, P3, P4… → construct the co-citation frequency matrix → convert raw frequencies into a correlation matrix → apply a multidimensional scaling technique → apply agglomerative clustering
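The first steps of this pipeline (frequency matrix, then correlation matrix) might look like the sketch below. It is not Larson's implementation: the citing pages are invented, Pearson correlation between matrix rows is assumed as the correlation measure, and putting the total citation count on the diagonal is one of several possible conventions. Multidimensional scaling and clustering are left out.

```python
from itertools import combinations
from math import sqrt

def cocitation_matrix(pages, citing):
    """citing maps each citing page to the set of collection pages it references."""
    index = {p: i for i, p in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for targets in citing.values():
        cited = [p for p in pages if p in targets]
        for p in cited:
            matrix[index[p]][index[p]] += 1  # diagonal: total citations (assumption)
        for p, q in combinations(cited, 2):
            matrix[index[p]][index[q]] += 1
            matrix[index[q]][index[p]] += 1
    return matrix

def pearson(x, y):
    """Pearson correlation of two equal-length rows; 0.0 if a row is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def correlation_matrix(matrix):
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]

# Hypothetical collection and citing pages.
pages = ["P1", "P2", "P3", "P4"]
citing = {
    "x1": {"P1", "P2"}, "x2": {"P1", "P2"},
    "x3": {"P3", "P4"}, "x4": {"P3", "P4"},
    "x5": {"P1", "P3"},
}
freq = cocitation_matrix(pages, citing)
corr = correlation_matrix(freq)
print(freq)
```

In this toy collection P1 and P2 end up positively correlated and P1 and P4 negatively correlated, which is the signal the later clustering step would use.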
Co-citation example, part 1
Retrieval strategy → collection P1, P2, P3, P4, P5, P6 → co-citation matrix, whose entries are counts such as |pages that cite P1 and P2| and |pages that cite P1 and P3| → correlation matrix
Co-citation example, part 2
[Figure: correlation matrix with low- and high-correlation entries marked, and the resulting clusters]
Issues (link-based clustering)
Link-based clustering is useful when a web page lacks text content. However:
• Web pages with insufficient in-links or out-links cannot be clustered;
• Two web pages might be linked because they share a minor topic;
• Links can be noisy (adverts);
• No common links → similarity = 0!
Text-based clustering • Content source • Entire text • Main content • Snippet • Keywords • Feature extraction • Binary • Term frequency (TF) • Term frequency-Inverse document frequency (TF-IDF) • Similarity measure • Character-based • Token-based • Clustering algorithm • Partitional (K-means) • Hierarchical (Agglomerative and divisive)
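TF-IDF is only named in the list above, so here is a minimal sketch of one common variant: term frequency multiplied by log(N / document frequency). The slides do not say which weighting variant is meant, so the exact formula is an assumption, and the token lists are illustrative.

```python
import math
from collections import Counter

def tf_idf(pages):
    """pages: list of token lists. Returns one {term: weight} dict per page."""
    n_pages = len(pages)
    # Document frequency: in how many pages each term appears.
    doc_freq = Counter(term for tokens in pages for term in set(tokens))
    weighted = []
    for tokens in pages:
        term_freq = Counter(tokens)
        weighted.append(
            {t: term_freq[t] * math.log(n_pages / doc_freq[t]) for t in term_freq}
        )
    return weighted

# Illustrative, already-preprocessed token lists.
pages = [
    ["keep", "offic", "run"],
    ["keep", "offic", "offic", "clean"],
]
weights = tf_idf(pages)
print(weights[0])
```

Terms that occur in every page ("keep", "offic") get weight 0 under this variant, while terms unique to one page keep a positive weight.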
Content source
[Figure: example web page annotated with its keywords (Office, Equipment, Supplies, Shredder, laminators), main content, snippet, and entire text]
Feature extraction: tokenization and stemming
“Keep your office running smoothly with our wide…”
• Tokenize into words: keep, your, office, running, smoothly, with, our, wide
• Stem: keep, your, offic, run, smoothli, with, our, wide
Feature extraction: stop words removal
“Keep your office running smoothly with our wide…”
Remove stop words (e.g. in, on, your, our, with, at): keep, offic, run, smoothli, wide
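The three preprocessing steps (tokenize, stem, remove stop words) can be sketched as follows. The stop-word set is just the handful of words used in this example, and the `STEMS` lookup table is a stand-in for a real stemmer (the stems shown on the slides resemble Porter-style output); both are assumptions for illustration.

```python
import re

# Tiny stop-word list covering this example only.
STOP_WORDS = {"in", "on", "your", "our", "with", "at"}

# Stand-in for a real stemming algorithm: a lookup table for this sentence.
STEMS = {"office": "offic", "running": "run", "smoothly": "smoothli"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenize into words
    stemmed = [STEMS.get(t, t) for t in tokens]         # stem each token
    return [t for t in stemmed if t not in STOP_WORDS]  # drop stop words

features = preprocess("Keep your office running smoothly with our wide")
print(features)  # ['keep', 'offic', 'run', 'smoothli', 'wide']
```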
Feature extraction: creation of feature vector
Page 1: “Keep your office running smoothly with our wide…”
Page 2: “..staffed office, keeping your office clean and staffed”
Bag-of-words: [keep, offic, run, smoothli, wide, staf, clean]
• Binary vector: 1 if the word occurs; 0 otherwise
P1 [1, 1, 1, 1, 1, 0, 0]
P2 [1, 1, 0, 0, 0, 1, 1]
• TF vector: counts the number of occurrences of a word w in page p
P1 [1, 1, 1, 1, 1, 0, 0]
P2 [1, 2, 0, 0, 0, 2, 1]
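A short sketch that reproduces the binary and TF vectors for the two example pages. The token lists are the already stemmed, stop-word-free forms of the two pages; the function names are illustrative.

```python
from collections import Counter

vocabulary = ["keep", "offic", "run", "smoothli", "wide", "staf", "clean"]

# Preprocessed tokens of the two example pages.
page1 = ["keep", "offic", "run", "smoothli", "wide"]
page2 = ["staf", "offic", "keep", "offic", "clean", "staf"]

def binary_vector(tokens, vocabulary):
    """1 if the vocabulary word occurs in the page, 0 otherwise."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def tf_vector(tokens, vocabulary):
    """Number of occurrences of each vocabulary word in the page."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

print(binary_vector(page2, vocabulary))  # [1, 1, 0, 0, 0, 1, 1]
print(tf_vector(page2, vocabulary))      # [1, 2, 0, 0, 0, 2, 1]
```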
Similarity measures
• Character-based: treats strings as sequences of characters
  • Edit distance: single edits (insertion, deletion, substitution) are performed one at a time to transform one string into another
  • Q-gram: divides strings into substrings of length q, e.g. Machine Learning → mac, ach, chi, hin, ine, nel, ele, lea, ear, arn ...
• Token-based: treats strings as sequences of tokens; each token pair scores 1 if it matches, 0 otherwise, e.g. Machine Learning vs. Machine Learned
• Hybrid: combines character- and token-based measures
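The three families of measures can be illustrated with small functions. The edit distance below is the standard Levenshtein dynamic program; using Jaccard overlap to turn exact token matches into a single score is an assumption, since the slides only state that a token scores 1 on a match and 0 otherwise.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def q_grams(s, q=3):
    """Overlapping substrings of length q, ignoring case and spaces."""
    s = s.lower().replace(" ", "")
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def token_overlap(s, t):
    """Jaccard overlap of the token sets (exact token match counts as 1)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(edit_distance("Machine Learning", "Machine Learned"))  # 3
print(q_grams("Machine Learning")[:4])  # ['mac', 'ach', 'chi', 'hin']
print(token_overlap("Machine Learning", "Machine Learned"))
```

Only "machine" matches exactly at the token level, while the character-based measures still register that "Learning" and "Learned" are close.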
Results
[Figure: results rated on a scale of excellent, good, poor]
K-means
Start → select K random pages as centroids → assign the other pages to the nearest centroid → converged? If no, calculate new centroids and reassign; if yes, stop.
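The loop in the flowchart can be sketched in plain Python. Squared Euclidean distance, the fixed random seed, the iteration cap, and "no label changed" as the convergence test are all assumptions made to keep the example small and repeatable.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(vectors, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Select K random pages as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every page to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # converged: assignments did not change
            break
        labels = new_labels
        # Calculate new centroids from the current assignments.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical 2-D feature vectors forming two obvious groups.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = k_means(vectors, k=2)
print(labels)
```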
Clustering algorithms: hierarchical
[Figure: points a to e and the corresponding dendrogram, with merges numbered 1 to 4]
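A minimal sketch of agglomerative (bottom-up) clustering that records the merge order a dendrogram would show. Single linkage and the 1-D coordinates of a to e are assumptions; the slides do not specify the linkage criterion or the data.

```python
def agglomerative(points, distance):
    """Single-linkage agglomerative clustering; returns the list of merges."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 1-D positions for the five points.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 6.5, "e": 9.0}
merges = agglomerative(points, lambda x, y: abs(x - y))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

With these positions the merges happen in four steps: a with b, c with d, then e joins {c, d}, and finally the two remaining clusters are joined.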
Issues (text-based clustering)
• Developed for use with small, static and homogeneous pages;
• Web pages that lack text cannot be clustered.
Hyper-based clustering [Modha and Spangler 2000]
Represent the page as a triple of unit vectors (D, F, B)
• D: word frequencies in a page
• F: out-links
• B: in-links
[Figure: page set Q and the linked pages a, c, e, g, h, i, j, k, l, m, n]
Out-links vector
Bag-of-nodes: pages that are pointed to by at least two pages in Q, here [g, i, j, m]
In-links vector
Bag-of-nodes: pages that point to at least two pages in Q, here [e, h, k, c]
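The two bags of nodes can be computed directly from a link table. The sketch below uses a small hypothetical graph (not the one in the figure), and the function names are illustrative.

```python
from collections import Counter

def bag_of_out_nodes(Q, links):
    """Nodes pointed to by at least two pages in Q."""
    counts = Counter(t for page in Q for t in links.get(page, set()))
    return sorted(node for node, c in counts.items() if c >= 2)

def bag_of_in_nodes(Q, links):
    """Nodes that point to at least two pages in Q."""
    q = set(Q)
    return sorted(src for src, targets in links.items() if len(targets & q) >= 2)

# Hypothetical graph: links[x] is the set of pages that x points to.
Q = ["p1", "p2", "p3"]
links = {
    "p1": {"g", "i"},
    "p2": {"g", "j"},
    "p3": {"i", "j", "m"},
    "e": {"p1", "p2"},
    "h": {"p2", "p3"},
    "k": {"p1"},
}

out_nodes = bag_of_out_nodes(Q, links)  # ['g', 'i', 'j']
in_nodes = bag_of_in_nodes(Q, links)    # ['e', 'h']

# Binary out-link and in-link vectors for page p1 over those bags.
f_p1 = [1 if node in links["p1"] else 0 for node in out_nodes]
b_p1 = [1 if "p1" in links[src] else 0 for src in in_nodes]
print(out_nodes, in_nodes, f_p1, b_p1)
```

Here m and k are dropped because each is connected to only one page in Q.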
Similarity between two pages Cosine similarity
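Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the TF vectors of the two example pages. Combining the three components (D, F, B) as a weighted sum is shown as one possible choice, and the weights are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(page_x, page_y, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of cosine similarities over the (D, F, B) components.

    The weights are hypothetical and would need tuning in practice.
    """
    return sum(w * cosine(x, y) for w, x, y in zip(weights, page_x, page_y))

# TF vectors of the two example pages.
p1 = [1, 1, 1, 1, 1, 0, 0]
p2 = [1, 2, 0, 0, 0, 2, 1]
print(round(cosine(p1, p2), 4))  # 0.4243
```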
References
• Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data mining and knowledge discovery handbook (pp. 931-948). Springer US.
• Larson, R. R. (1996, October). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting-American Society for Information Science (Vol. 33, pp. 71-78).
• McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433.
• Modha, D. S., & Spangler, W. S. (2000, May). Clustering hypertext with applications to web searching. In Proceedings of the eleventh ACM on Hypertext and hypermedia (pp. 143-152). ACM.