Clustering of Web Documents Jinfeng Chen
Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001. Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results, 2004.
Correlation-based Document Clustering using Web Logs Introduction • Uses web log data to construct clusters. • Frequent simultaneous visits to two seemingly unrelated documents indicate that they are in fact closely related. • The basic algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.
DBSCAN • Does not require the user to pre-specify the number of clusters. • Needs only one scan through the database. • Takes two parameters, a radius ε and a value MinPts. ε: the distance measure (neighborhood radius). MinPts: the minimum number of points that must occur within radius ε of an object for it to count as dense (a core object).
DBSCAN algorithm (cont'd) • Algorithm DBSCAN(DB, ε, MinPts)
for each o in DB do
  if o is not yet assigned to a cluster then
    if o is a core object then
      collect all objects density-reachable from o according to ε and MinPts
      assign them to a new cluster;
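The pseudocode compresses the whole procedure; a minimal runnable sketch in Python is given below. The Euclidean distance, the noise label -1 and the toy data are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster id per point, -1 for noise."""
    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster_id = 0

    def neighbors(i):
        # Indices of all points within radius eps of point i (including i).
        d = np.linalg.norm(points - points[i], axis=1)
        return list(np.where(d <= eps)[0])

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:         # not a core object
            labels[i] = -1
            continue
        labels[i] = cluster_id           # core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                     # collect everything density-reachable
            j = queue.pop()
            if labels[j] == -1:          # reachable noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:  # j is itself a core object: keep growing
                queue.extend(j_seeds)
        cluster_id += 1
    return labels

# Two dense groups and one outlier.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]])
print(dbscan(pts, eps=2.0, min_pts=2))  # -> [0, 0, 0, 1, 1, 1, -1]
```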
Limitations of DBSCAN in clustering of web documents • Performs clustering using a fixed threshold value to determine "dense" regions in the document space. • The algorithm therefore often cannot distinguish between dense and loose points, and the entire document space is frequently lumped into a single cluster.
RDBC algorithm (Recursive Density-Based Clustering) • The key difference between RDBC and DBSCAN is that in RDBC, the identification of core points is performed separately from the clustering of the individual data points. • Different values of ε and MinPts are used in RDBC to identify this core point set, Cset.
RDBC algorithm (cont'd) • To avoid connecting too many clusters through "bridge" points:
Set initial values ε = ε1 and MinPts = MinPts1; WebPageSet = web_log
RDBC(ε, MinPts, WebPageSet) {
  use ε, MinPts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2 {
    DBSCAN(ε, MinPts, WebPageSet)
  } else {
    ε = ε/2; MinPts = MinPts/4;
    RDBC(ε, MinPts, Cset);
    collect all other points in (WebPageSet - Cset) around the clusters found in the last step according to ε2
  }
}
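A sketch of this recursion in Python, reusing the dbscan function from the previous sketch. The slide's halving/quartering schedule is kept; the core-point test and the final attachment step (which uses ε in place of the unspecified ε2) are simplified assumptions.

```python
import numpy as np

def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbors within eps (core objects)."""
    idx = []
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        if np.sum(d <= eps) >= min_pts:
            idx.append(i)
    return idx

def rdbc(points, eps, min_pts):
    """RDBC sketch: tighten eps and min_pts recursively until core points form a
    majority, cluster the core set, then attach the remaining points to it."""
    core = core_points(points, eps, min_pts)
    if len(core) > len(points) / 2:
        return dbscan(points, eps, min_pts)      # dbscan from the previous sketch
    # Recurse on the core set with the slide's schedule: eps/2, min_pts/4.
    core_labels = rdbc(points[core], eps / 2, max(1, min_pts // 4))
    labels = [-1] * len(points)
    for c, lab in zip(core, core_labels):
        labels[c] = lab
    # Attach each remaining point to the cluster of its nearest core point,
    # using eps in place of the unspecified eps2 (an assumption).
    for i in range(len(points)):
        if labels[i] == -1 and core:
            dists = [float(np.linalg.norm(points[i] - points[c])) for c in core]
            nearest = int(np.argmin(dists))
            if dists[nearest] <= eps:
                labels[i] = labels[core[nearest]]
    return labels
```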
Construct WebPageSet from web logs • Step 1 • Step 2: Delete visits to image files. • Step 3: Extract sessions from the data.
Construct WebPageSet (cont'd) • Step 4: Create a distance matrix. 1) Determine the size of a moving window, within which URL requests are regarded as co-occurring. 2) Calculate the co-occurrence count Ni,j of each pair of URLs, and the individual counts Ni and Nj.
Construct WebPageSet (cont'd) • Step 4: Create a distance matrix (cont'd). 3) Estimate the conditional probability P(pi | pj) = Ni,j / Nj. 4) Define three distance functions from these probabilities.
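A sketch of steps 1-3 in Python follows. The session format and window semantics are assumptions, and since the slide does not spell out the three distance functions, a single plausible symmetric distance is shown purely as an illustration.

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Count N_i (occurrences of each URL) and N_ij (co-occurrences of a pair of
    URLs within `window` positions of each other in a session)."""
    n_single = Counter()
    n_pair = Counter()
    for session in sessions:
        n_single.update(session)
        for i, a in enumerate(session):
            for b in session[i + 1 : i + window]:
                if a != b:
                    n_pair[tuple(sorted((a, b)))] += 1
    return n_single, n_pair

def cond_prob(n_single, n_pair, pi, pj):
    """P(pi | pj) = N_ij / N_j, as on the slide."""
    nij = n_pair.get(tuple(sorted((pi, pj))), 0)
    return nij / n_single[pj] if n_single[pj] else 0.0

def distance(n_single, n_pair, pi, pj):
    """One plausible symmetric distance; the slide's three functions are not shown."""
    p = min(cond_prob(n_single, n_pair, pi, pj),
            cond_prob(n_single, n_pair, pj, pi))
    return 1.0 - p

sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/c", "/d"]]
ns, nij = cooccurrence_counts(sessions, window=2)
print(cond_prob(ns, nij, "/a", "/b"))  # N_ab / N_b = 2 / 2 = 1.0
```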
Conclusions • A new algorithm for clustering web documents based only on web log data. • By changing its parameters intelligently during the recursive process, RDBC gives clustering results superior to those of DBSCAN.
Learning to Cluster Web Search Results Introduction • The algorithm is based on salient phrases drawn from document contents. • It is fast enough to be used for online calculation in a search engine.
Characteristics of clustering web search results • Existing search engines such as Google, Yahoo and MSN often return long lists of search results. • Clustering similar search results helps users find the relevant ones.
Procedure of the algorithm • Step 1: Search result fetching • Step 2: Document parsing and phrase property calculation • Step 3: Salient phrase ranking
Search result fetching • Input a query to a conventional web search engine. • Get the result pages returned by the engine. • Extract the title and snippet of each result.
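A hedged sketch of this step; the endpoint URL, query parameter and CSS selectors below are hypothetical placeholders, since every engine's markup (or API) differs.

```python
import requests
from bs4 import BeautifulSoup

def fetch_results(query, endpoint="https://search.example.test/search"):
    """Fetch one result page and extract (title, snippet) pairs."""
    html = requests.get(endpoint, params={"q": query}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.result"):        # hypothetical result container
        title = item.select_one("a.title")        # hypothetical title element
        snippet = item.select_one("p.snippet")    # hypothetical snippet element
        if title and snippet:
            results.append((title.get_text(strip=True),
                            snippet.get_text(strip=True)))
    return results
```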
Document parsing • Step 1: Cleaning • Stemming (using Porter's algorithm) • Sentence boundary identification • Step 2: Post-processing • Punctuation elimination • Filter out stop-words, e.g. 'too', 'are' • Filter out query words • Ex: Microsoft software is available to students.
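A compact sketch of this pipeline in Python, using NLTK's Porter stemmer; the tiny stop-word list and the regex tokenizer are illustrative stand-ins for the real ones.

```python
import re
from nltk.stem.porter import PorterStemmer  # Porter's stemming algorithm

STOP_WORDS = {"too", "are", "is", "to", "the", "a"}   # tiny illustrative list

def parse_snippet(text, query_words):
    """Tokenize a snippet, stem with Porter's algorithm, and drop punctuation,
    stop-words and query words, mirroring the slide's two parsing steps."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())      # punctuation elimination
    query_stems = {stemmer.stem(w.lower()) for w in query_words}
    kept = []
    for tok in tokens:
        stem = stemmer.stem(tok)
        if tok in STOP_WORDS or stem in query_stems:  # stop-word / query filter
            continue
        kept.append(stem)
    return kept

print(parse_snippet("Microsoft software is available to students.", ["Microsoft"]))
# e.g. -> ['softwar', 'avail', 'student']
```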
Phrase property calculation • Five properties: 1. Phrase Frequency / Inverted Document Frequency (TFIDF) 2. Phrase Length: LEN = n, the number of words in the phrase; ex: LEN("big") = 1
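A sketch of these first two properties; the slide does not reproduce the paper's exact TFIDF weighting, so the standard tf*idf form below is an assumption.

```python
import math

def tfidf(phrase, doc_phrases, corpus):
    """Standard tf*idf: frequency of the phrase in this document times
    log(N / document frequency). The paper's exact weighting may differ."""
    tf = doc_phrases.count(phrase)
    df = sum(1 for doc in corpus if phrase in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def phrase_length(phrase):
    """LEN: number of words in the phrase, e.g. LEN('big') = 1."""
    return len(phrase.split())
```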
Phrase property calculation (cont'd) 3. Intra-Cluster Similarity: ICS = (1/k) Σi cos(di, o), where o is the centroid of the k documents containing the phrase. • Here di = {TFIDF1, TFIDF2, …} • Each component of the vectors represents the TFIDF of a phrase.
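A sketch of the ICS computation under the reconstruction above (average cosine similarity of the cluster's document vectors to their centroid).

```python
import numpy as np

def intra_cluster_similarity(doc_vectors):
    """ICS: mean cosine similarity between each document's TFIDF vector and the
    centroid o of the k documents containing the phrase."""
    d = np.asarray(doc_vectors, dtype=float)
    o = d.mean(axis=0)                                # centroid
    norms = np.linalg.norm(d, axis=1) * np.linalg.norm(o)
    cos = (d @ o) / np.where(norms == 0, 1.0, norms)  # guard all-zero vectors
    return float(cos.mean())

print(intra_cluster_similarity([[1.0, 0.0], [0.9, 0.1]]))  # near 1: tight cluster
```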
Phrase property calculation (cont'd) 4. Cluster Entropy: measures how much the phrase's document set overlaps with those of other phrases. 5. Phrase Independence: a phrase is independent if the words adjacent to its occurrences are nearly random. Ex: contexts such as three "vectors" has… and with some "vectors" be… suggest that "vectors" is an independent phrase.
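A sketch of one plausible reading of these two properties: cluster entropy over the overlap with other phrases' document sets, and independence as the entropy of words adjacent to the phrase. Both formulas are assumptions; the slide's original equations are not shown.

```python
import math
from collections import Counter

def cluster_entropy(docs_with_phrase, docs_by_other_phrase):
    """CE sketch: entropy of the overlap between this phrase's document set and
    each other phrase's document set (assumed form)."""
    k = len(docs_with_phrase)
    ce = 0.0
    for other in docs_by_other_phrase:
        p = len(docs_with_phrase & other) / k
        if p > 0:
            ce -= p * math.log(p)
    return ce

def independence(context_words):
    """IND sketch: entropy of the words seen next to the phrase; near-random
    contexts (high entropy) suggest a self-contained phrase like 'vectors'."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```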
Learning to rank key phrases • Use a regression model to combine the five properties above, calculating a single salience score for each phrase. • Regression is an algorithm that tries to determine the relationship between random variables x = (x1, x2, …, xn) and y. • Here x = (TFIDF, LEN, ICS, CE, IND).
Learning to rank key phrases (cont'd) • Three regression models are compared: Linear Regression, Logistic Regression and Support Vector Regression; the linear variant is sketched below.
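A minimal sketch of the linear variant: fit weights on labeled (property vector, salience) pairs with least squares, then score new phrases. The training data here is fabricated purely for illustration.

```python
import numpy as np

# Each row holds (TFIDF, LEN, ICS, CE, IND) for one training phrase; y holds its
# human-judged salience. All values are fabricated purely for illustration.
X = np.array([[0.8, 2, 0.6, 0.3, 0.9],
              [0.2, 1, 0.1, 1.2, 0.4],
              [0.5, 3, 0.4, 0.6, 0.7]])
y = np.array([1.0, 0.1, 0.6])

# Least-squares fit of y ~ w.x + b (the linear-regression variant).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def salience(props):
    """Combine a phrase's five properties into a single salience score."""
    return float(np.dot(w[:-1], props) + w[-1])

print(salience([0.7, 2, 0.5, 0.4, 0.8]))
```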
Conclusions • Recasts the search result clustering problem as a supervised salient phrase ranking problem. • Generates correct clusters with short names, which can improve users' browsing efficiency over the search results.