Website Clustering: Combining Website Lexical Data and Query Semantic Data
Nana Huang, Ray Li
Traditional Lexical Features
• Traditional website clustering uses lexical data parsed from each webpage to group websites into categories:
  • Regular text
  • <TITLE> tags
  • <META> tags (description, keywords, author)
• What if the webpage consists mainly of content generated automatically by scripts?
• What if the webpage is an empty frameset page with two or more frames?
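To make the three lexical sources concrete, here is a minimal sketch using only Python's standard library. The class name `LexicalExtractor`, the function `lexical_features`, and the `field:word` feature naming are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class LexicalExtractor(HTMLParser):
    """Collects regular text, <title> text, and selected <meta> content."""

    META_NAMES = {"description", "keywords", "author"}
    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.fields = {"text": [], "title": [], "meta": []}
        self._stack = []  # currently open <title>/<script>/<style> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            if (attrs.get("name") or "").lower() in self.META_NAMES:
                self.fields["meta"].append(attrs.get("content") or "")
        elif tag == "title" or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in self.SKIPPED:
            return
        self.fields["title" if current == "title" else "text"].append(data)


def lexical_features(html):
    """Return word counts keyed as 'field:word' (a hypothetical naming scheme)."""
    parser = LexicalExtractor()
    parser.feed(html)
    counts = Counter()
    for field, chunks in parser.fields.items():
        for word in re.findall(r"[a-z0-9]+", " ".join(chunks).lower()):
            counts[f"{field}:{word}"] += 1
    return counts
```

Feeding this a page that is only a frameset, or whose content is produced by scripts at load time, yields few or no features, which is the weakness the two questions above point at.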
AOL Clickthrough Data
• In August 2006, AOL released 2.2 GB of search logs, which include queries, clicked websites, and the rank of each clicked website in the search results.
• Sample entries (query, rank, clicked website):
  brochures for business    5   http://www.hp.com
  brochures for business    6   http://www.hansonmarketing.com
  brochures for business    8   http://www.smallbusinessbrief.com
  brochures for business   10   http://www.quickbrochures.com
  brochures for business    9   http://www.smallbusinessbrief.com
  brochures for business    7   http://www.printingforless.com
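A minimal sketch of reading click records from such a log. It assumes tab-separated rows with five columns (anonymous user ID, query, query time, rank, clicked URL) and an optional header row; that layout is an assumption here, as the slide shows only the query, rank, and URL. The function name `read_clicks` is hypothetical.

```python
import csv


def read_clicks(lines):
    """Yield (query, rank, url) for every row that records a click.

    Assumed column layout: user ID, query, query time, rank, clicked URL.
    Rows without a click leave the last two columns empty and are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5 or not row[4]:
            continue  # no click recorded for this query
        if not row[3].isdigit():
            continue  # header row or malformed rank
        yield row[1], int(row[3]), row[4]


# The user ID and timestamp below are placeholders, not real log values.
sample = [
    "0\tbrochures for business\t2006-01-01 00:00:00\t5\thttp://www.hp.com",
    "0\tbrochures for business\t2006-01-01 00:00:00\t\t",
]
clicks = list(read_clicks(sample))
```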
Query-Website Graph
• We parsed a subset of this data to generate a query-document bipartite graph, where the weight of each edge is the number of times a query led to a click on a website.
[Figure: bipartite graph linking queries Q1 to Q5 with documents D1 to D5]
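The weighted bipartite graph can be sketched as two nested counters, one per direction. The function name `build_click_graph` and the dictionary-of-counters representation are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_click_graph(clicks):
    """Build both directions of the query-document graph from (query, url) pairs.

    Each edge weight is the number of times the query led to a click on the URL.
    """
    query_to_docs = defaultdict(Counter)
    doc_to_queries = defaultdict(Counter)
    for query, url in clicks:
        query_to_docs[query][url] += 1
        doc_to_queries[url][query] += 1
    return query_to_docs, doc_to_queries


# Click pairs taken from the sample log entries in the slides.
clicks = [
    ("brochures for business", "http://www.hp.com"),
    ("brochures for business", "http://www.smallbusinessbrief.com"),
    ("brochures for business", "http://www.smallbusinessbrief.com"),
]
query_to_docs, doc_to_queries = build_click_graph(clicks)
```

Keeping the document-to-query direction as well makes it cheap to read off the queries attached to any one website.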
Query-Website Graph
• A graph like this is most likely too sparse to be useful.
• There are many unobserved ‘clicks’ between queries and other related webpages.
• We use an iterative process to ‘smooth’ the bipartite relationship between queries and websites, based on two observations:
  • Documents are ‘similar’ to some extent if they have been clicked for the same query.
  • Queries are ‘similar’ to some extent if they lead to clicks on the same document.
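The slides do not give the update rule, so the following is only one possible smoothing scheme built on the two observations above, not the authors' method. It assumes NumPy, a dense query-by-document click matrix `C`, cosine similarity between rows and between columns, and a hypothetical mixing weight `alpha`.

```python
import numpy as np


def cosine_rows(M):
    """Cosine similarity between all pairs of rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros
    U = M / norms
    return U @ U.T


def smooth_clicks(C, alpha=0.5, iterations=5):
    """Spread click weight to similar queries and similar documents.

    C[q, d] is the observed click count for query q and document d.
    Returns a matrix of the same shape whose entries sum to 1.
    """
    C = np.asarray(C, dtype=float)
    base = C / C.sum()
    W = base.copy()
    for _ in range(iterations):
        Sq = cosine_rows(W)      # query-query similarity (shared documents)
        Sd = cosine_rows(W.T)    # document-document similarity (shared queries)
        spread = Sq @ W @ Sd     # propagate weight along both similarities
        W = (1 - alpha) * base + alpha * spread / spread.sum()
    return W
```

In this sketch a query that never clicked a document still receives some weight for it when a similar query did, while queries and documents with nothing in common stay disconnected.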
Query-Website Graph
• This will produce a more realistic query-website bipartite relationship.
• We can then use a list of queries associated with each website as a semantic feature vector.
[Figure: query-document graphs over documents D1 to D3 and queries Q1 and Q2]
Combined Feature Vectors
• We have three sets of feature vectors for each document:
  • Lexical features (text and various HTML tags from the webpage itself)
  • Semantic features (query information related to each webpage)
  • A combination of both
• There are 10,000 words and 2,000 queries, or 12,000 features in total, which is too many.
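One simple way to form the combined vectors is to concatenate the two feature blocks per document. The sketch below assumes NumPy and dense matrices with one row per website; scaling each block to unit length before joining is an added assumption so that neither block dominates, and it is not stated in the slides.

```python
import numpy as np


def l2_normalize_rows(M):
    """Scale every row to unit Euclidean length (all-zero rows stay zero)."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return M / norms


def combine_features(lexical, semantic):
    """Concatenate lexical and query features for each website.

    lexical:  (n_sites, n_words) matrix
    semantic: (n_sites, n_queries) matrix
    """
    return np.hstack([l2_normalize_rows(lexical), l2_normalize_rows(semantic)])
```

With the sizes quoted above, the result would have 10,000 + 2,000 = 12,000 columns per website.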
Latent Semantic Analysis
• We then apply Latent Semantic Analysis to reduce the 12,000 features to a rank-30 approximation made up of 30 ‘virtual concepts’.
• {Chicken, Beef, Apple, Oranges} -> {Meat, Fruits}
• Each website is transformed from its original feature vector into a new vector of ‘virtual concepts’.
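Latent Semantic Analysis is commonly computed with a truncated singular value decomposition. The sketch below assumes NumPy and a small dense matrix; the function name `lsa` and the toy counts are illustrative, and a matrix with 12,000 columns would more likely call for a sparse solver.

```python
import numpy as np


def lsa(X, k):
    """Project the rows of X onto the top-k latent concepts.

    Returns (Z, components): Z has one k-dimensional row per website and
    components has one row per concept, expressed over the original features.
    """
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]


# Toy counts over the features (chicken, beef, apple, oranges).
X = [
    [2, 1, 0, 0],  # a meat-related site
    [1, 2, 0, 0],  # another meat-related site
    [0, 0, 1, 2],  # a fruit-related site
    [0, 0, 2, 1],  # another fruit-related site
]
Z, components = lsa(X, k=2)
```

With two concepts, the two meat sites end up with the same reduced vector and so do the two fruit sites, mirroring the {Meat, Fruits} example.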
K-Means + Results
• We then apply K-means in this new vector space to cluster the websites into categories.
• Results show that the semantic query vector alone performs worse than the lexical feature vector alone, but combining the two gives slightly better clustering performance than either:
  • Lexical + Semantic Query F1: 0.50
  • Lexical only F1: 0.47
  • Queries only F1: 0.30
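A minimal sketch of the clustering and scoring step, assuming NumPy. The K-means below is the standard Lloyd iteration with a seeded random start. The slides do not say how F1 was computed for clusters, so the pairwise F1 shown here is one common choice and may differ from the measure behind the reported numbers.

```python
from itertools import combinations

import numpy as np


def kmeans(X, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Squared distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def pairwise_f1(true_labels, predicted):
    """F1 over pairs of items: a pair is positive if both share a label."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted[i] == predicted[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Pairwise scoring does not depend on how cluster IDs are numbered, so the predicted labels need no alignment with the true categories.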