Density link-based methods for clustering web pages. Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani. DSS, 2009. Presented by Jun-Yi Wu, 2010/09/08.
Density link-based methods for clustering web pages • Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani • DSS, 2009 • Presented by Jun-Yi Wu, 2010/09/08
Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments
Motivation • Web information is very useful for supporting decision making, but the information explosion on the web makes it hard to obtain the required knowledge. • Effective web clustering facilitates retrieval of relevant documents, which in turn facilitates decision making. • High-quality clustering helps users access relevant information more conveniently.
Objectives • Use both content and link information on top of density-based algorithms. • Density-based methods have the advantages of creating clusters of various shapes and removing noisy data. • Propose a method that uses the web hyperlink structure to find the dense units and to improve the joining process for creating hierarchical clusters.
Methodology • The paper proposes two methods: • New density-based method • Density link-based method
Density-based method: DBSCAN • DBSCAN was the first density-based algorithm. • To create a new cluster or expand an existing one, a neighborhood of radius Eps must contain at least a minimum number of points, denoted by MinPts.
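The Eps/MinPts condition can be illustrated with a minimal sketch. This is not the paper's implementation. The function names `region_query` and `is_core_point`, the Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i].

    The point itself is included, as in the usual DBSCAN convention.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def is_core_point(points, i, eps, min_pts):
    """A point can start or expand a cluster if its Eps-neighborhood
    contains at least MinPts points."""
    return len(region_query(points, i, eps)) >= min_pts


# Hypothetical data: three points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense = is_core_point(points, 0, eps=0.5, min_pts=3)
isolated = is_core_point(points, 3, eps=0.5, min_pts=3)
```

With these sample values the first point has three points in its neighborhood and satisfies the condition, while the isolated point does not and would be treated as noise unless it falls inside another point's neighborhood.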
New density-based method • Clusters web data using only the textual contents of documents. • The basic algorithm is then extended to use the hyperlinks between web documents.
New density-based method • The method has some limitations, including: • A constant value for mutation to a higher level is not appropriate. A smaller value may be appropriate for smaller clusters, but larger clusters need larger values. • It is developed for web data clustering, but it does not use the hyperlink structure of the web. • Setting accurate values for the parameters of the proposed method may be difficult.
Link-based algorithm • The hyperlink structure suggests some interesting ideas: • In combination with text content, it can help to construct hierarchical clusters with the link-based clusters as the base clusters. • The link structure can be a good hint for finding dense units.
Link-based algorithm • First step: finding dense units • A sub-hyperlink structure is an LD_Unit if, for each core node N inside the unit, there is a subset of N's neighbors such that: • the subset has at most MaxN members • the sum of the similarities between N and the nodes of the subset is at least W • Second step: joining dense units • Node a is said to be an external node of b if a and b are not in the same LD_Unit. • Example: W = 2.5 and MaxN = 4
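The core-node condition and the external-node definition above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code. The names `is_core_node`, `is_external_node`, `links`, `sim` and `unit_of`, and all similarity values, are hypothetical. The sketch assumes similarities are non-negative, so the best subset of at most MaxN neighbors is simply the MaxN most similar ones.

```python
def is_core_node(node, links, similarity, w, max_n):
    """Check whether some subset of at most max_n neighbors of `node`
    has a total similarity to `node` of at least w.

    Assumption: similarities are non-negative, so taking the max_n
    largest values gives the largest achievable sum.
    """
    sims = sorted(
        (similarity(node, nb) for nb in links.get(node, ())),
        reverse=True,
    )
    return sum(s for s in sims[:max_n] if s > 0) >= w


def is_external_node(a, b, unit_of):
    """a is an external node of b if they are not in the same LD_Unit.

    `unit_of` maps each node to the identifier of its LD_Unit.
    """
    return unit_of[a] != unit_of[b]


# Hypothetical link graph and pairwise content similarities.
links = {"a": ["b", "c", "d", "e"], "f": ["a"]}
sim = {
    frozenset("ab"): 0.9, frozenset("ac"): 0.8,
    frozenset("ad"): 0.7, frozenset("ae"): 0.6,
    frozenset("fa"): 0.5,
}


def similarity(x, y):
    return sim.get(frozenset((x, y)), 0.0)


# Parameter values taken from the slide: W = 2.5 and MaxN = 4.
a_is_core = is_core_node("a", links, similarity, w=2.5, max_n=4)
f_is_core = is_core_node("f", links, similarity, w=2.5, max_n=4)
a_external_to_f = is_external_node("a", "f", {"a": 0, "f": 1})
```

Sorting the neighbors by similarity keeps the check at O(d log d) per node with d neighbors. If similarities could be negative, the subset choice would need to skip them, which the `s > 0` filter already does.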
Experiments • Comparison of density-based algorithms from different aspects.
Experiments • Use of the density-based method for clustering web pages.
Experiments • Examination of the link-based method for clustering web pages.
Conclusions • The proposed method has the advantage of low complexity, O(n log n), and the resulting clusters are of high quality. • The experiments show that the link-based method has some advantages over the density-based method.
Comments • Advantages • Low Complexity • High quality • Applications • Data Clustering