
Clustering of Web Documents Jinfeng Chen




  1. Clustering of Web Documents Jinfeng Chen

  2. Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001. Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results.

  3. Correlation-based Document Clustering using Web Logs: Introduction • Uses web log data to construct clusters. • Frequent simultaneous visits to two seemingly unrelated documents suggest that they are in fact closely related. • The base algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.

  4. DBSCAN • Does not require the user to pre-specify the number of clusters. • Needs only one scan through the database. • Takes two parameters, a radius ε and a count Mpts: ε is the neighborhood distance (radius); Mpts is the minimum number of points that must lie within ε of an object for it to count as dense.

  5. DBSCAN algorithm (cont'd)
     Algorithm DBSCAN(DB, ε, MinPts)
       for each o in DB do
         if o is not yet assigned to a cluster then
           if o is a core object then
             collect all objects density-reachable from o according to ε and MinPts;
             assign them to a new cluster;
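
The DBSCAN loop above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: it assumes a precomputed distance matrix `dist`, counts an object as its own neighbor when comparing against `min_pts`, and marks noise with -1. The names `region_query` and `dbscan` are chosen here for illustration.

```python
def region_query(dist, i, eps):
    """Indices of all objects within eps of object i (i itself included)."""
    return [j for j in range(len(dist)) if dist[i][j] <= eps]


def dbscan(dist, eps, min_pts):
    """Return one cluster id per object; -1 marks noise."""
    labels = [None] * len(dist)  # None = not yet visited
    cluster_id = -1
    for o in range(len(dist)):
        if labels[o] is not None:
            continue
        neighbors = region_query(dist, o, eps)
        if len(neighbors) < min_pts:
            labels[o] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1  # o is a core object: start a new cluster
        labels[o] = cluster_id
        stack = list(neighbors)
        while stack:
            q = stack.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # former noise becomes a border point
            if labels[q] is not None:
                continue  # already assigned; do not expand again
            labels[q] = cluster_id
            q_neighbors = region_query(dist, q, eps)
            if len(q_neighbors) >= min_pts:  # only core objects expand the cluster
                stack.extend(q_neighbors)
    return labels


# Toy 1-D data (invented for illustration): two tight groups and one outlier.
positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
print(dbscan(dist, eps=1.5, min_pts=2))  # prints [0, 0, 0, 1, 1, 1, -1]
```

Each object is expanded at most once, which matches the "one scan" property mentioned on slide 4; the quadratic cost here comes only from the naive `region_query`.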

  6. Limitations of DBSCAN in clustering web documents • DBSCAN performs clustering using a fixed threshold value to determine “dense” regions in the document space. • As a result, the algorithm often cannot distinguish between dense and loose points, and the entire document space is often lumped into a single cluster.

  7. RDBC algorithm (Recursive Density-Based Clustering) • The key difference between RDBC and DBSCAN is that in RDBC the identification of core points is performed separately from the clustering of the individual data points. • Different values of ε and Mpts are used in RDBC to identify this core point set, Cset.

  8. RDBC algorithm (cont'd) To avoid connecting too many clusters through “bridges”:
     Set initial values ε = ε1 and Mpts = Mpts1; WebPageSet = web_log
     RDBC(ε, Mpts, WebPageSet) {
       use ε and Mpts to get the core point set Cset
       if size(Cset) > size(WebPageSet)/2 {
         DBSCAN(ε, Mpts, WebPageSet)
       } else {
         ε = ε/2; Mpts = Mpts/4;
         RDBC(ε, Mpts, WebPageSet);
         collect all other points in (WebPageSet - Cset) around the clusters found in the last step according to ε2
       }
     }
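
A rough Python sketch of this recursion is given below. It rests on several assumptions that the slide does not confirm: the recursive call is made on the core set Cset rather than on WebPageSet (so that the input shrinks and the recursion terminates), Mpts is floored at 1 after division, an empty core set also stops the recursion, and a leftover point joins the cluster whose nearest member lies within ε2. All function names are hypothetical.

```python
def neighbors(dist, points, p, eps):
    """Members of `points` within eps of p (p itself included)."""
    return [q for q in points if dist[p][q] <= eps]


def core_set(dist, points, eps, min_pts):
    """Points that have at least min_pts neighbors inside `points`."""
    return [p for p in points if len(neighbors(dist, points, p, eps)) >= min_pts]


def dbscan(dist, points, eps, min_pts):
    """Plain DBSCAN restricted to `points`; returns a list of sets of ids."""
    unassigned = set(points)
    clusters = []
    for o in points:
        if o not in unassigned:
            continue
        if len(neighbors(dist, points, o, eps)) < min_pts:
            continue  # not a core object; may still be absorbed as a border point
        cluster, stack = set(), [o]
        while stack:
            q = stack.pop()
            if q not in unassigned:
                continue
            unassigned.discard(q)
            cluster.add(q)
            q_neighbors = neighbors(dist, points, q, eps)
            if len(q_neighbors) >= min_pts:  # only core objects expand the cluster
                stack.extend(q_neighbors)
        clusters.append(cluster)
    return clusters


def rdbc(dist, points, eps, min_pts, eps2):
    """Recursive density-based clustering over the ids in `points`."""
    cset = core_set(dist, points, eps, min_pts)
    # Stop when core points are the majority (or, as an added guard, absent).
    if not cset or len(cset) > len(points) / 2:
        return dbscan(dist, points, eps, min_pts)
    # ASSUMPTION: recurse on the core set with tightened parameters.
    clusters = rdbc(dist, cset, eps / 2, max(1, min_pts // 4), eps2)
    core_ids = set(cset)
    attachments = []
    for p in points:
        if p in core_ids:
            continue
        best = None  # (distance, cluster index) of the closest cluster within eps2
        for idx, cluster in enumerate(clusters):
            d = min(dist[p][q] for q in cluster)
            if d <= eps2 and (best is None or d < best[0]):
                best = (d, idx)
        if best is not None:
            attachments.append((p, best[1]))
    # Apply attachments afterwards so newly added points cannot act as bridges.
    for p, idx in attachments:
        clusters[idx].add(p)
    return clusters
```

Deferring the attachments until every leftover point has been examined keeps the result independent of the order in which points are visited.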

  9. Construct WebPageSet from web logs • Step 1 • Step 2: Delete visits to image files. • Step 3: Extract sessions from the data.

  10. Construct WebPageSet (cont'd) • Step 4: Create a distance matrix 1) Determine the size of a moving window, within which URL requests are regarded as co-occurring. 2) For each pair of URLs, calculate the co-occurrence count Ni,j and the individual counts Ni and Nj.

  11. Construct WebPageSet (cont'd) • Step 4: Create a distance matrix 3) Compute the conditional probability P(pi | pj) = Ni,j / Nj. 4) Three distance functions
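
The window-based counting and the conditional probability P(pi | pj) = Ni,j / Nj can be sketched as follows. The counting convention is an assumption, since the slides do not say how occurrences are tallied: here Ni is the number of window positions that contain pi, and Ni,j the number that contain both pi and pj. The three distance functions are not reproduced in the transcript, so they are left out. `window_counts` and `cond_prob` are illustrative names.

```python
from collections import Counter
from itertools import combinations


def window_counts(sessions, window):
    """Slide a fixed-size window over each session and count URLs and URL pairs."""
    n_single = Counter()  # N_i: windows containing URL i
    n_pair = Counter()    # N_ij: windows containing both i and j, keyed by sorted pair
    for session in sessions:
        last_start = max(len(session) - window, 0)
        for start in range(last_start + 1):
            urls = sorted(set(session[start:start + window]))
            n_single.update(urls)
            n_pair.update(combinations(urls, 2))
    return n_single, n_pair


def cond_prob(n_single, n_pair, pi, pj):
    """P(pi | pj) = N_ij / N_j, or 0.0 when pj was never seen."""
    if n_single[pj] == 0:
        return 0.0
    key = (pi, pj) if pi <= pj else (pj, pi)
    return n_pair[key] / n_single[pj]


# Toy sessions, invented for illustration.
sessions = [["a", "b", "c"], ["a", "c"]]
n_single, n_pair = window_counts(sessions, window=2)
print(cond_prob(n_single, n_pair, "a", "b"))  # prints 0.5
```

Because every window that contains both URLs also contains pj, this convention keeps the estimate between 0 and 1.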

  12. Experimental Validation

  13. Conclusions • A new algorithm for clustering web documents based only on log data. • Because it changes the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.

  14. Learning to Cluster Web Search Results: Introduction • The algorithm is based on salient phrases extracted from document contents. • Fast enough to be used in an online calculation engine.

  15. Characteristics of clustered web search results • Existing search engines such as Google, Yahoo and MSN often return a long list of search results. • Clustering similar search results helps users find relevant results.

  16. Clustered Search results

  17. Conventional Search results

  18. Procedure of the algorithm • Step 1: Search result fetching • Step 2: Document parsing and phrase property calculation • Step 3: Salient phrase ranking

  19. Search result fetching • Input a query to a conventional web search engine. • Get the results page returned by the engine. • Extract the titles and snippets.

  20. Document parsing • Step 1: Cleaning • Stemming (using Porter's algorithm) • Sentence boundary identification • Step 2: Post-processing • Punctuation elimination • Filter out stop words, e.g. ‘too’, ‘are’ • Filter out query words • Ex: Microsoft software is available to students.
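
A small sketch of the post-processing steps on the example sentence follows. It assumes the query was "microsoft" and uses a tiny, made-up stop-word list; neither comes from the slides. Stemming is deliberately left out so the block needs only the standard library, and `parse_snippet` is an illustrative name.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"too", "are", "is", "to", "the", "a", "an", "of"}


def parse_snippet(text, query):
    """Split text into sentences, then drop punctuation, stop words and query words."""
    query_words = set(query.lower().split())
    # Sentence boundary identification: split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    parsed = []
    for sentence in sentences:
        # Punctuation elimination: keep only alphanumeric tokens.
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        parsed.append([w for w in words
                       if w not in STOP_WORDS and w not in query_words])
    return parsed


print(parse_snippet("Microsoft software is available to students.", "microsoft"))
# prints [['software', 'available', 'students']]
```

Keeping sentences separate means that candidate phrases built later from adjacent words never span a sentence boundary.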

  21. Phrase property calculation • Five properties 1. Phrase Frequency / Inverted Document Frequency (TFIDF) 2. Phrase Length: LEN = n, the number of words in the phrase, e.g. LEN(“big”) = 1
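
The first two properties might be computed as in the sketch below. The transcript does not preserve the slide's formula, so the TFIDF here is the common form (total phrase frequency multiplied by the log of the number of documents over the number of documents containing the phrase), which is an assumption. Documents are assumed to be token lists, and the function names are illustrative.

```python
import math


def phrase_len(phrase):
    """LEN: number of words in the phrase."""
    return len(phrase.split())


def count_occurrences(tokens, phrase_tokens):
    """How many times the phrase occurs as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == phrase_tokens)


def tfidf(phrase, documents):
    """ASSUMED form: frequency * log(N / document frequency)."""
    phrase_tokens = phrase.split()
    if not phrase_tokens:
        return 0.0
    counts = [count_occurrences(doc, phrase_tokens) for doc in documents]
    frequency = sum(counts)
    doc_frequency = sum(1 for c in counts if c > 0)
    if doc_frequency == 0:
        return 0.0
    return frequency * math.log(len(documents) / doc_frequency)


# Toy documents, invented for illustration.
documents = [["big", "apple"], ["big", "data"], ["small", "data"]]
print(phrase_len("big"), tfidf("big data", documents))
```

A phrase that occurs in every document scores zero under this form, which is the usual behavior of an inverse document frequency term.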

  22. Phrase property calculation (cont'd) 3. Intra-Cluster Similarity (ICS), where o is the centroid • Here di = {TFIDF1, TFIDF2, …} • Each component of the vectors represents the TFIDF of a phrase
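
One plausible reading of this property, sketched below, is the average cosine similarity between each document vector di and the centroid o of the documents that contain the phrase. The slide's formula is not preserved in the transcript, so this definition is an assumption, and the function names are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_cluster_similarity(vectors):
    """ASSUMED definition: mean cosine similarity of each d_i to the centroid o."""
    if not vectors:
        return 0.0
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)


# Toy TFIDF vectors, invented for illustration.
print(intra_cluster_similarity([[1.0, 0.0], [1.0, 0.0]]))  # identical documents
print(intra_cluster_similarity([[1.0, 0.0], [0.0, 1.0]]))  # unrelated documents
```

Under this reading, a phrase whose documents are all alike scores close to 1, and a phrase spread over unrelated documents scores lower.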

  23. Phrase property calculation (cont'd) 4. Cluster Entropy (CE) 5. Phrase Independence (IND) Ex: three “vectors” has… with some “vectors” be…

  24. Learning to rank key phrases • Use a regression model to combine the five properties above into a single salience score for each phrase. • Regression tries to determine the relationship between two random variables x = (x1, x2, …, xn) and y. • Here x = (TFIDF, LEN, ICS, CE, IND).

  25. Learning to rank key phrases • Three regression models • Linear regression • Logistic regression • Support vector regression
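
Once a regression model has been fitted, ranking reduces to scoring each candidate phrase from its five properties and sorting. The sketch below shows the linear and logistic forms only; the weights, bias and feature values are placeholders invented for illustration, not learned values from the paper, and the function names are hypothetical.

```python
import math

FEATURES = ("TFIDF", "LEN", "ICS", "CE", "IND")


def linear_score(x, weights, bias):
    """y = bias + sum of weight * property, over the five properties."""
    return bias + sum(weights[f] * x[f] for f in FEATURES)


def logistic_score(x, weights, bias):
    """Same linear combination squashed into (0, 1) by the logistic function."""
    return 1.0 / (1.0 + math.exp(-linear_score(x, weights, bias)))


def rank_phrases(candidates, weights, bias, score=linear_score):
    """Return phrase names sorted by descending salience score."""
    return sorted(candidates,
                  key=lambda name: score(candidates[name], weights, bias),
                  reverse=True)


# Placeholder weights and property values, invented for illustration.
weights = {f: 1.0 for f in FEATURES}
candidates = {
    "phrase a": {f: 0.1 for f in FEATURES},
    "phrase b": {f: 0.2 for f in FEATURES},
}
print(rank_phrases(candidates, weights, bias=0.0))  # prints ['phrase b', 'phrase a']
```

Because the logistic function is monotonic, both forms produce the same ordering for a given set of weights; they differ in how the weights are fitted to the training labels.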

  26. Evaluation

  27. Conclusions • Recasts the search result clustering problem as a supervised salient phrase ranking problem. • Generates correct clusters with short names, which could improve users' efficiency in browsing search results.

  28. Thanks!
