Clustering of Web Documents Jinfeng Chen
Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001. Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results, 2004.
Correlation-based Document Clustering using Web Logs Introduction • Uses web log data to construct clusters. • Frequent simultaneous visits to two seemingly unrelated documents indicate that they are in fact closely related. • The basic algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.
DBSCAN • Does not require the user to pre-specify the number of clusters. • Needs only one scan through the database. • Takes two parameters, a radius ε and a value MinPts. ε: the distance measure (neighborhood radius). MinPts: the minimum number of points that must occur within radius ε of an object for it to count as dense (a core object).
DBSCAN algorithm (cont'd) • Algorithm DBSCAN(DB, ε, MinPts)
for each o in DB do
  if o is not yet assigned to a cluster then
    if o is a core object then
      collect all objects density-reachable from o according to ε and MinPts
      assign them to a new cluster;
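The pseudocode compresses the whole procedure; a minimal runnable sketch in Python is given below. The Euclidean distance, the noise label -1 and the toy data are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster id per point, -1 for noise."""
    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster_id = 0

    def neighbors(i):
        # Indices of all points within radius eps of point i (including i).
        d = np.linalg.norm(points - points[i], axis=1)
        return list(np.where(d <= eps)[0])

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:         # not a core object
            labels[i] = -1
            continue
        labels[i] = cluster_id           # core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                     # collect everything density-reachable
            j = queue.pop()
            if labels[j] == -1:          # reachable noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:  # j is itself a core object: keep growing
                queue.extend(j_seeds)
        cluster_id += 1
    return labels

# Two dense groups and one outlier.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]])
print(dbscan(pts, eps=2.0, min_pts=2))  # -> [0, 0, 0, 1, 1, 1, -1]
```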
Limitations of DBSCAN in clustering of web documents • Performs clustering using a fixed threshold value to determine "dense" regions in the document space. • The algorithm therefore often cannot distinguish between dense and loose points, and the entire document space is frequently lumped into a single cluster.
RDBC algorithm (Recursive Density-Based Clustering) • The key difference between RDBC and DBSCAN is that in RDBC, the identification of core points is performed separately from the clustering of the individual data points. • Different values of ε and MinPts are used in RDBC to identify this core point set, Cset.
RDBC algorithm (cont'd) • To avoid connecting too many clusters through "bridge" points:
Set initial values ε = ε1 and MinPts = MinPts1; WebPageSet = web_log
RDBC(ε, MinPts, WebPageSet) {
  use ε, MinPts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2 {
    DBSCAN(ε, MinPts, WebPageSet)
  } else {
    ε = ε/2; MinPts = MinPts/4;
    RDBC(ε, MinPts, Cset);
    collect all other points in (WebPageSet - Cset) around the clusters found in the last step according to ε2
  }
}
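A sketch of this recursion in Python, reusing the dbscan function from the previous sketch. The slide's halving/quartering schedule is kept; the core-point test and the final attachment step (which uses ε in place of the unspecified ε2) are simplified assumptions.

```python
import numpy as np

def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbors within eps (core objects)."""
    idx = []
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        if np.sum(d <= eps) >= min_pts:
            idx.append(i)
    return idx

def rdbc(points, eps, min_pts):
    """RDBC sketch: tighten eps and min_pts recursively until core points form a
    majority, cluster the core set, then attach the remaining points to it."""
    core = core_points(points, eps, min_pts)
    if len(core) > len(points) / 2:
        return dbscan(points, eps, min_pts)      # dbscan from the previous sketch
    # Recurse on the core set with the slide's schedule: eps/2, min_pts/4.
    core_labels = rdbc(points[core], eps / 2, max(1, min_pts // 4))
    labels = [-1] * len(points)
    for c, lab in zip(core, core_labels):
        labels[c] = lab
    # Attach each remaining point to the cluster of its nearest core point,
    # using eps in place of the unspecified eps2 (an assumption).
    for i in range(len(points)):
        if labels[i] == -1 and core:
            dists = [float(np.linalg.norm(points[i] - points[c])) for c in core]
            nearest = int(np.argmin(dists))
            if dists[nearest] <= eps:
                labels[i] = labels[core[nearest]]
    return labels
```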
Construct WebPageSet from web logs • Step 1 • Step 2: Delete visits to image files. • Step 3: Extract sessions from the data.
Construct WebPageSet (cont'd) • Step 4: Create a distance matrix. 1) Determine the size of a moving window, within which URL requests are regarded as co-occurring. 2) Calculate the co-occurrence count Ni,j of each pair of URLs, and the individual counts Ni and Nj.
Construct WebPageSet (cont'd) • Step 4: Create a distance matrix (cont'd). 3) Estimate the conditional probability P(pi | pj) = Ni,j / Nj. 4) Define three distance functions from these probabilities.
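A sketch of steps 1-3 in Python follows. The session format and window semantics are assumptions, and since the slide does not spell out the three distance functions, a single plausible symmetric distance is shown purely as an illustration.

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Count N_i (occurrences of each URL) and N_ij (co-occurrences of a pair of
    URLs within `window` positions of each other in a session)."""
    n_single = Counter()
    n_pair = Counter()
    for session in sessions:
        n_single.update(session)
        for i, a in enumerate(session):
            for b in session[i + 1 : i + window]:
                if a != b:
                    n_pair[tuple(sorted((a, b)))] += 1
    return n_single, n_pair

def cond_prob(n_single, n_pair, pi, pj):
    """P(pi | pj) = N_ij / N_j, as on the slide."""
    nij = n_pair.get(tuple(sorted((pi, pj))), 0)
    return nij / n_single[pj] if n_single[pj] else 0.0

def distance(n_single, n_pair, pi, pj):
    """One plausible symmetric distance; the slide's three functions are not shown."""
    p = min(cond_prob(n_single, n_pair, pi, pj),
            cond_prob(n_single, n_pair, pj, pi))
    return 1.0 - p

sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/c", "/d"]]
ns, nij = cooccurrence_counts(sessions, window=2)
print(cond_prob(ns, nij, "/a", "/b"))  # N_ab / N_b = 2 / 2 = 1.0
```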
Conclusions • A new algorithm for clustering web documents based only on web log data. • By changing its parameters intelligently during the recursive process, RDBC gives clustering results superior to those of DBSCAN.
Learning to Cluster Web Search Results Introduction • The algorithm is based on salient phrases drawn from document contents. • It is fast enough to be used for online calculation in a search engine.
Characteristics of clustering web search results • Existing search engines such as Google, Yahoo and MSN often return long lists of search results. • Clustering similar search results helps users find the relevant ones.
Procedure of the algorithm • Step 1: Search result fetching • Step 2: Document parsing and phrase property calculation • Step 3: Salient phrase ranking
Search result fetching • Input a query to a conventional web search engine. • Get the result pages returned by the engine. • Extract the title and snippet of each result.
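A hedged sketch of this step; the endpoint URL, query parameter and CSS selectors below are hypothetical placeholders, since every engine's markup (or API) differs.

```python
import requests
from bs4 import BeautifulSoup

def fetch_results(query, endpoint="https://search.example.test/search"):
    """Fetch one result page and extract (title, snippet) pairs."""
    html = requests.get(endpoint, params={"q": query}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.result"):        # hypothetical result container
        title = item.select_one("a.title")        # hypothetical title element
        snippet = item.select_one("p.snippet")    # hypothetical snippet element
        if title and snippet:
            results.append((title.get_text(strip=True),
                            snippet.get_text(strip=True)))
    return results
```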
Document parsing • Step 1: Cleaning • Stemming (using Porter's algorithm) • Sentence boundary identification • Step 2: Post-processing • Punctuation elimination • Filter out stop-words, e.g. 'too', 'are' • Filter out query words • Ex: Microsoft software is available to students.
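A compact sketch of this pipeline in Python, using NLTK's Porter stemmer; the tiny stop-word list and the regex tokenizer are illustrative stand-ins for the real ones.

```python
import re
from nltk.stem.porter import PorterStemmer  # Porter's stemming algorithm

STOP_WORDS = {"too", "are", "is", "to", "the", "a"}   # tiny illustrative list

def parse_snippet(text, query_words):
    """Tokenize a snippet, stem with Porter's algorithm, and drop punctuation,
    stop-words and query words, mirroring the slide's two parsing steps."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())      # punctuation elimination
    query_stems = {stemmer.stem(w.lower()) for w in query_words}
    kept = []
    for tok in tokens:
        stem = stemmer.stem(tok)
        if tok in STOP_WORDS or stem in query_stems:  # stop-word / query filter
            continue
        kept.append(stem)
    return kept

print(parse_snippet("Microsoft software is available to students.", ["Microsoft"]))
# e.g. -> ['softwar', 'avail', 'student']
```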
Phrase property calculation • Five properties: 1. Phrase Frequency / Inverted Document Frequency (TFIDF) 2. Phrase Length: LEN = n, the number of words in the phrase; ex: LEN("big") = 1
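A sketch of these first two properties; the slide does not reproduce the paper's exact TFIDF weighting, so the standard tf*idf form below is an assumption.

```python
import math

def tfidf(phrase, doc_phrases, corpus):
    """Standard tf*idf: frequency of the phrase in this document times
    log(N / document frequency). The paper's exact weighting may differ."""
    tf = doc_phrases.count(phrase)
    df = sum(1 for doc in corpus if phrase in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def phrase_length(phrase):
    """LEN: number of words in the phrase, e.g. LEN('big') = 1."""
    return len(phrase.split())
```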
Phrase property calculation (cont'd) 3. Intra-Cluster Similarity: ICS = (1/k) Σi cos(di, o), where o is the centroid of the k documents containing the phrase. • Here di = {TFIDF1, TFIDF2, …} • Each component of the vectors represents the TFIDF of a phrase.
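A sketch of the ICS computation under the reconstruction above (average cosine similarity of the cluster's document vectors to their centroid).

```python
import numpy as np

def intra_cluster_similarity(doc_vectors):
    """ICS: mean cosine similarity between each document's TFIDF vector and the
    centroid o of the k documents containing the phrase."""
    d = np.asarray(doc_vectors, dtype=float)
    o = d.mean(axis=0)                                # centroid
    norms = np.linalg.norm(d, axis=1) * np.linalg.norm(o)
    cos = (d @ o) / np.where(norms == 0, 1.0, norms)  # guard all-zero vectors
    return float(cos.mean())

print(intra_cluster_similarity([[1.0, 0.0], [0.9, 0.1]]))  # near 1: tight cluster
```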
Phrase property calculation (cont'd) 4. Cluster Entropy: measures how much the phrase's document set overlaps with those of other phrases. 5. Phrase Independence: a phrase is independent if the words adjacent to its occurrences are nearly random. Ex: contexts such as three "vectors" has… and with some "vectors" be… suggest that "vectors" is an independent phrase.
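A sketch of one plausible reading of these two properties: cluster entropy over the overlap with other phrases' document sets, and independence as the entropy of words adjacent to the phrase. Both formulas are assumptions; the slide's original equations are not shown.

```python
import math
from collections import Counter

def cluster_entropy(docs_with_phrase, docs_by_other_phrase):
    """CE sketch: entropy of the overlap between this phrase's document set and
    each other phrase's document set (assumed form)."""
    k = len(docs_with_phrase)
    ce = 0.0
    for other in docs_by_other_phrase:
        p = len(docs_with_phrase & other) / k
        if p > 0:
            ce -= p * math.log(p)
    return ce

def independence(context_words):
    """IND sketch: entropy of the words seen next to the phrase; near-random
    contexts (high entropy) suggest a self-contained phrase like 'vectors'."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```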
Learning to rank key phrases • Use a regression model to combine the five properties above, calculating a single salience score for each phrase. • Regression is an algorithm that tries to determine the relationship between random variables x = (x1, x2, …, xn) and y. • Here x = (TFIDF, LEN, ICS, CE, IND).
Learning to rank key phrases (cont'd) • Three regression models are compared: Linear Regression, Logistic Regression and Support Vector Regression; the linear variant is sketched below.
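A minimal sketch of the linear variant: fit weights on labeled (property vector, salience) pairs with least squares, then score new phrases. The training data here is fabricated purely for illustration.

```python
import numpy as np

# Each row holds (TFIDF, LEN, ICS, CE, IND) for one training phrase; y holds its
# human-judged salience. All values are fabricated purely for illustration.
X = np.array([[0.8, 2, 0.6, 0.3, 0.9],
              [0.2, 1, 0.1, 1.2, 0.4],
              [0.5, 3, 0.4, 0.6, 0.7]])
y = np.array([1.0, 0.1, 0.6])

# Least-squares fit of y ~ w.x + b (the linear-regression variant).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def salience(props):
    """Combine a phrase's five properties into a single salience score."""
    return float(np.dot(w[:-1], props) + w[-1])

print(salience([0.7, 2, 0.5, 0.4, 0.8]))
```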
Conclusions • Recasts the search result clustering problem as a supervised salient phrase ranking problem. • Generates correct clusters with short names, which can improve users' browsing efficiency over the search results.