V. Clustering 2007.2.10. Artificial Intelligence Laboratory, 이승희 Text: Text Mining, pp. 82-93
Outline • V.1 Clustering tasks in text analysis • V.2 The general clustering problem • V.3 Clustering algorithm • V.4 Clustering of textual data
Clustering • Clustering • An unsupervised process through which objects are classified into groups called clusters (cf. categorization, which is a supervised process). • Applications: data mining, document retrieval, image segmentation, pattern classification.
V.1 Clustering tasks in text analysis(1/2) • Cluster hypothesis: “Relevant documents tend to be more similar to each other than to nonrelevant ones.” • If the cluster hypothesis holds for a particular document collection, then clustering the documents may help to improve search effectiveness. • Improving search recall • When a query matches a document, its whole cluster can be returned. • Improving search precision • By grouping the documents into a much smaller number of groups of related documents
V.1 Clustering tasks in text analysis(2/2) • Scatter/gather browsing method • Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated. • Session 1: the document collection is scattered into a set of clusters. • Session 2: the selected clusters are then gathered into a new subcollection, with which the process may be repeated. • Reference site • http://www2.parc.com/istl/projects/ia/sg-background.html • Query-specific clustering is also possible; hierarchical clustering is appealing in this setting.
V.2 Clustering problem(1/2) • Clustering tasks • problem representation • definition of proximity measures • actual clustering of objects • data abstraction • evaluation • Problem representation • Basically, an optimization problem. • Goal: select the best among all possible groupings of objects • Similarity function: serves as the clustering quality function. • Feature extraction / feature selection • In a vector space model, • objects: vectors in the high-dimensional feature space. • the similarity function: the distance between the vectors in some metric
V.2 Clustering problem(2/2) • Similarity measures • Euclidean distance • Cosine similarity measure (the most common)
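The two measures named on this slide can be sketched as follows. This is a minimal illustration, not the formulas from the original slide: the function names and the example vectors `doc_a` and `doc_b` are assumptions made for the example.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors for two documents over a 4-word vocabulary.
doc_a = [2, 0, 1, 0]
doc_b = [4, 0, 2, 0]  # same direction as doc_a, twice the length

distance = euclidean_distance(doc_a, doc_b)    # nonzero: the lengths differ
similarity = cosine_similarity(doc_a, doc_b)   # close to 1.0: same direction
```

The example hints at why cosine similarity is popular for text: it ignores vector length, so a long and a short document with the same word proportions still count as similar.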
V.3 Clustering algorithm (1/9) • flat clustering: a single partition of a set of objects into disjoint groups. • hierarchical clustering: a nested series of partitions. • hard clustering: every object belongs to exactly one cluster. • soft clustering: objects may belong to several clusters, with a fractional degree of membership in each.
V.3 Clustering algorithm (2/9) • Agglomerative algorithm: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied. • Divisive algorithm: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied. • Shuffling algorithm: iteratively redistribute objects among clusters
V.3 Clustering algorithm (3/9) • k-means algorithm(1/2) • hard, flat, shuffling algorithm
V.3 Clustering algorithm (4/9) • example of K-means algorithm
V.3 Clustering algorithm (5/9) • K-means algorithm(2/2) • Simple, efficient • Complexity O(kn) per iteration (k clusters, n objects) • A bad initial selection of seeds can lead to a local optimum. • Remedies for k-means suboptimality exist: the Buckshot algorithm, the ISO-DATA algorithm • Maximizes the quality function Q:
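The k-means loop described on these slides (hard, flat, shuffling) might look roughly like the sketch below. It assumes squared Euclidean distance to the centroid as the criterion, which may differ from the similarity used in the slide's quality function Q. The names `k_means`, `squared_distance`, and `mean_vector` are illustrative only.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_vector(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def k_means(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids); assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    # Seeds: k distinct objects picked at random (a bad pick may give a local optimum).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Shuffling step: move every object to its nearest centroid (k * n distances).
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the partition is stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignment, centroids

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = k_means(points, k=2)
```

Each pass computes k * n distances, which is where the O(kn) cost per iteration comes from. Rerunning with different `seed` values is a simple way to observe the sensitivity to initial seeds on harder data.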
V.3 Clustering algorithm (6/9) • EM-based probabilistic clustering algorithm(1/2) • Soft, flat, probabilistic
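To make "soft, flat, probabilistic" concrete, here is a small EM sketch. It assumes a mixture of one-dimensional Gaussians, chosen only because it is the shortest model to write down; the slide does not say which mixture model is used. All names and the data are illustrative, and the sketch assumes no component ever loses all of its probability mass.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(xs, initial_means, n_iter=50, min_var=1e-6):
    """Return (memberships, means, variances, weights) after n_iter EM rounds."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    memberships = []
    for _ in range(n_iter):
        # E-step: fractional membership of each object in each cluster.
        memberships = []
        for x in xs:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            memberships.append([j / total for j in joint])
        # M-step: re-estimate each cluster's parameters from the soft counts.
        for c in range(k):
            mass = sum(m[c] for m in memberships)
            weights[c] = mass / len(xs)
            means[c] = sum(m[c] * x for m, x in zip(memberships, xs)) / mass
            var = sum(m[c] * (x - means[c]) ** 2
                      for m, x in zip(memberships, xs)) / mass
            variances[c] = max(var, min_var)  # floor avoids a collapsed component
    return memberships, means, variances, weights

# Hypothetical data: two groups, around 1.0 and around 5.0.
xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
memberships, means, variances, weights = em_gaussian_mixture(xs, [0.0, 6.0])
```

Unlike k-means, every object receives a membership vector that sums to 1 rather than a single cluster index; taking the largest entry of each vector turns the result into a hard clustering.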
V.3 Clustering algorithm (8/9) • Hierarchical agglomerative Clustering • single-link method • Complete-link method • Average-link method
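The three linkage methods differ only in how the distance between two clusters is derived from the pairwise object distances. A naive sketch, assuming a user-supplied distance function `dist` and a fixed target number of clusters as the stopping criterion (both are assumptions for illustration):

```python
def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    pair_dists = [dist(x, y) for x in a for y in b]
    if linkage == "single":      # closest pair of objects
        return min(pair_dists)
    if linkage == "complete":    # farthest pair of objects
        return max(pair_dists)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError("unknown linkage: " + linkage)

def agglomerative(points, num_clusters, dist, linkage="single"):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with one object per cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D data with absolute difference as the distance.
values = [0.0, 1.0, 2.5, 6.0, 7.0]
result = agglomerative(values, 2, dist=lambda x, y: abs(x - y), linkage="single")
```

Recording the sequence of merges instead of stopping at a fixed count would give the full nested hierarchy. The sketch recomputes all cluster distances on every pass, so it is only suitable for small inputs.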
Other clustering algorithms • minimal spanning tree • nearest neighbor clustering • Buckshot algorithm
V.4 clustering of textual data(1/6) • representation of the text clustering problem • Objects are very complex and have a rich internal structure. • Documents must be converted into vectors in the feature space. • Bag-of-words document representation. • Reducing the dimensionality • Local method: delete unimportant components from individual document vectors. • Global method: latent semantic indexing (LSI)
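A bag-of-words conversion and one possible local reduction can be sketched as below. The tokenizer (lowercase alphabetic runs) and the rule "keep the n largest components" are assumptions chosen for brevity; the slide does not specify either.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

def keep_top_components(vector, n):
    """Local reduction: zero out all but the n largest components of one vector."""
    top = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)[:n]
    keep = set(top)
    return [v if i in keep else 0 for i, v in enumerate(vector)]

# Hypothetical two-document collection.
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
# vocabulary is ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vectors[0] is [1, 0, 1, 1, 1, 2] and vectors[1] is [0, 1, 0, 0, 1, 1]
reduced = keep_top_components(vectors[0], 1)  # only the count for 'the' survives
```

Word order is discarded, which is what makes this a "bag". In practice the raw counts are often reweighted (for example with tf-idf) before clustering.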
V.4 clustering of textual data(2/6) • latent semantic indexing • map N-dimensional feature space F onto a lower dimensional subspace V. • LSI is based upon applying the SVD to the term-document matrix.
V.4 clustering of textual data(3/6) • Singular value decomposition (SVD) • A = U D V^T • U: column-orthonormal m×r matrix • D: diagonal r×r matrix whose diagonal elements are the singular values of A • V: column-orthonormal n×r matrix • U^T U = V^T V = I • Dimension reduction
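The dimension-reduction step can be written out as follows. This is a sketch under two assumptions: A is the m×n term-document matrix with terms as rows and documents as columns, and the singular values in D are sorted in decreasing order. The symbols k, U_k, D_k, V_k, a_j, e_j, and \hat{d}_j are introduced here for illustration.

```latex
% Full decomposition, with r = rank(A):
A = U D V^{T}, \qquad U^{T} U = V^{T} V = I_r .

% Keep only the k < r largest singular values and the matching columns:
A_k = U_k D_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
D_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}.

% Document j (column a_j of A) is mapped into the k-dimensional subspace,
% where e_j is the j-th standard basis vector of R^n:
\hat{d}_j = U_k^{T} a_j = D_k V_k^{T} e_j .
```

Clustering then operates on the k-dimensional vectors \hat{d}_j instead of the original m-dimensional columns. Some formulations additionally rescale by D_k^{-1}; which variant the original slides intend is not stated.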
V.4 clustering of textual data(4/6) • Medoids: actual documents that are most similar to the centroids • Using Naïve Bayes mixture models with the EM clustering algorithm
V.4 clustering of textual data(5/6) • Data abstraction in text clustering • generating a meaningful and concise description of each cluster. • methods of generating the label automatically • the title of the medoid document • several words common to the cluster documents • a distinctive noun phrase.
V.4 clustering of textual data(6/6) • Evaluation of text clustering - the quality of the result? • purity • assume {L1,L2,...,Ln} are the manually labeled classes of documents, {C1,C2,...,Cm} the clusters returned by the clustering process • entropy, mutual information between classes and clusters
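Purity and entropy can both be computed from the counts of manual labels inside each cluster. The sketch below assumes the common size-weighted definitions (overall purity is the fraction of documents carrying the majority label of their cluster, and overall entropy is the size-weighted mean of per-cluster label entropies). The exact formulas on the original slide are not shown, so treat these as one reasonable reading.

```python
import math
from collections import Counter

def label_counts_per_cluster(labels, clusters):
    """Map each cluster id to a Counter of the manual labels it contains."""
    table = {}
    for label, cluster in zip(labels, clusters):
        table.setdefault(cluster, Counter())[label] += 1
    return table

def purity(labels, clusters):
    """Fraction of documents that carry the majority label of their cluster."""
    table = label_counts_per_cluster(labels, clusters)
    return sum(max(c.values()) for c in table.values()) / len(labels)

def entropy(labels, clusters):
    """Size-weighted average of the label entropy (in bits) within each cluster."""
    table = label_counts_per_cluster(labels, clusters)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((n / size) * math.log2(n / size) for n in counts.values())
        total += (size / len(labels)) * h
    return total

# Hypothetical evaluation: six documents, two manual classes, two clusters.
labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
p = purity(labels, clusters)   # (2 + 3) / 6
h = entropy(labels, clusters)  # 0 for cluster 0, about 0.811 bits for cluster 1
```

Higher purity and lower entropy both indicate clusters that agree better with the manual classes. Both measures trivially favor many tiny clusters, which is one reason mutual information is also listed.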