Clustering in Concept Extraction. Technical Report of Web Mining Group Presented by: Mohsen Kamyar Ferdowsi University of Mashhad, WTLab. Outline. Main Approach in Concept Extraction Problems Clustering Methods and LSI Ideas and Our Works Experimental Results.
Clustering in Concept Extraction Technical Report of Web Mining Group Presented by: Mohsen Kamyar Ferdowsi University of Mashhad, WTLab
Outline • Main Approach in Concept Extraction • Problems • Clustering Methods and LSI • Ideas and Our Works • Experimental Results
Main Approach in Concept Extraction • The main approach to Concept Extraction (abbreviated CE from here on) is LSI (Latent Semantic Indexing). • LSI combines one matrix algorithm with some probabilistic analyses of its output, applied to a term-document matrix. • First we build the term-document matrix (using a measure such as TF-IDF to indicate the importance of a term in a particular document), then pass it to the SVD (Singular Value Decomposition) algorithm, and finally keep the first K columns as concepts.
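The first step of this pipeline, building the term-document matrix, can be sketched as below. This is a minimal illustration, not the implementation used in the report: the function name `tfidf_matrix`, whitespace tokenization, and the plain `tf * log(N / df)` weighting are assumptions chosen for brevity, and the three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF weight of terms[i] in docs[j].

    Assumes every document is non-empty and tokens are separated by whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    matrix = []
    for term in terms:
        idf = math.log(n_docs / df[term])
        # Term frequency is the share of the document's tokens equal to `term`.
        row = [tokens.count(term) / len(tokens) * idf for tokens in tokenized]
        matrix.append(row)
    return terms, matrix

# Hypothetical toy corpus: rows of M are terms, columns are documents.
docs = ["club football club", "printer ink", "football drug"]
terms, M = tfidf_matrix(docs)
```

The resulting matrix has one row per term and one column per document, which is the orientation the SVD step expects.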
Main Approach in Concept Extraction • Singular Value Decomposition factors a matrix $M$ (assumed to be $m \times n$) into three matrices $U$, $S$ and $V$ with $M = U S V^T$, such that $S$ is a diagonal matrix of singular values, the columns of $U$ are eigenvectors of $M M^T$ (the term correlation matrix) and the columns of $V$ are eigenvectors of $M^T M$ (the document correlation matrix). The singular values in $S$ are sorted in descending order, so the first $k$ diagonal elements of $S$, the first $k$ columns of $U$ and the first $k$ columns of $V$ (rows of $V^T$) carry the most important information.
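Written out, the decomposition and its rank-$k$ truncation look as follows. The symbols $\sigma_i$, $u_i$, $v_i$ and $r = \min(m, n)$ are notation introduced here for illustration: $\sigma_i$ is the $i$-th singular value and $u_i$, $v_i$ are the $i$-th columns of $U$ and $V$.

```latex
\begin{align*}
M &= U S V^{T}, \qquad S = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \\
M_k &= U_k S_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}.
\end{align*}
```

Keeping only the $k$ largest singular values gives the best rank-$k$ approximation of $M$ in the least-squares (Frobenius) sense, which is why the leading $k$ components are read as the concepts.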
Main Approach in Concept Extraction • The steps of the SVD computation (Householder reflections) can be described as follows: • 1- Select the first column of matrix $M_1$ and call it $u_1$. • 2- Calculate the length of $u_1$ and add it to its first element. • 3- Set $B_1 = \|u_1\|^2 / 2$. • 4- Set $U_1 = I - B_1^{-1} u_1 u_1^T$. • 5- Set $M_2 = U_1 M_1$. • 6- Do the same for the first row, then repeat for the remaining columns and rows. • In general, for the $i$-th column or row, step 2 first sets all elements before the $i$-th element to zero, then calculates the length and adds it to the $i$-th element.
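Steps 1 to 5 for a single column can be sketched as below. This is an illustrative sketch only: the function name `householder_column_step` is hypothetical, the variable `beta` plays the role of B1, and adding the length with the sign of the pivot element is an assumption (a common choice for numerical stability) that the slide does not state.

```python
import math

def householder_column_step(M, i=0):
    """Zero the entries below row i in column i of M with one Householder reflection.

    Returns a new matrix (list of rows); the input M is left unchanged.
    """
    m, n = len(M), len(M[0])
    # Steps 1-2: take column i, with the entries before position i set to zero.
    u = [M[r][i] if r >= i else 0.0 for r in range(m)]
    length = math.sqrt(sum(x * x for x in u))
    if length == 0.0:
        return [row[:] for row in M]  # nothing to eliminate
    # Assumption: add the length with the sign of u[i] to avoid cancellation.
    u[i] += math.copysign(length, u[i])
    # Step 3: beta = |u|^2 / 2 (B1 in the steps above).
    beta = sum(x * x for x in u) / 2.0
    # Steps 4-5: (I - u u^T / beta) M = M - u (u^T M) / beta, column by column.
    result = [row[:] for row in M]
    for c in range(n):
        dot = sum(u[r] * M[r][c] for r in range(m))
        for r in range(m):
            result[r][c] -= u[r] * dot / beta
    return result

M1 = [[3.0, 1.0], [4.0, 2.0]]
M2 = householder_column_step(M1)  # first column becomes (-5, 0)
```

Applying the same step alternately to columns and rows, as in step 6, leaves a bidiagonal matrix from which the singular values are then computed.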
Problems • The main problems of LSI are: • The method minimizes a sum of squared distances, $\sum_i (s_i - t_i)^2$, so it suits data with a Gaussian (normal) distribution, whereas the term-document matrix has a Poisson distribution. • The method is very slow (its computational complexity is $O(n^3 m)$, with $n \ll m$). • A Poisson process is memoryless: the next occurrence of the random variable $X$ does not depend on previous occurrences.
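The link between the squared-distance objective and the Gaussian assumption can be made explicit. As an illustration, assume each observed entry $s_i$ equals the model value $t_i$ plus independent Gaussian noise with variance $\sigma^2$, with $N$ entries in total ($\sigma$ and $N$ are notation introduced here):

```latex
\begin{align*}
p(s \mid t) &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(s_i - t_i)^2}{2\sigma^2}\right), \\
-\log p(s \mid t) &= \frac{1}{2\sigma^2} \sum_{i=1}^{N} (s_i - t_i)^2 + \frac{N}{2} \log\bigl(2\pi\sigma^2\bigr).
\end{align*}
```

Under this assumption, minimizing the sum of squares is the same as maximizing the Gaussian likelihood; for Poisson-distributed counts the two no longer coincide.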
Clustering Methods and LSI • There is a wide variety of clustering methods, but they can be grouped as follows: • Discrete methods • Linear approaches: PCA, K-Means, K-Medians, K-Centers, LSH • Non-linear approaches: KPCA, embeddings • Artificial-intelligence-based approaches
Clustering Methods and LSI • PCA stands for Principal Component Analysis and is a collection of methods that use eigenvector and eigenvalue properties for clustering. • SVD is therefore one of the main approaches in the PCA collection. • It has recently been proved that K-Means and other members of its family can also be placed in the PCA family. • The PCA family consists of linear approaches and cannot cluster data whose dependence is nonlinear. • The PCA family is suited to Gaussian-distributed data.
Clustering Methods and LSI • [Figure: an example of nonlinear dependence.]
Clustering Methods and LSI • K-Means, however, has computational complexity $O(nm)$, which is better than SVD ($O(n^3 m)$). • LSH is also a linear method and has good computational complexity.
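For reference, one pass of K-Means only visits every point once per centroid, which is where its low cost comes from. The sketch below is a plain Lloyd-style K-Means, not the core-set variant mentioned later in the report; the function name `kmeans`, the fixed iteration count, and seeding with the first `k` points are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-Means on a list of equal-length tuples.

    Assumption for reproducibility: the first k points seed the centroids.
    Returns (centroids, labels).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```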
Clustering Methods and LSI • KPCA (Kernel PCA) is a collection of methods for nonlinear clustering. • KPCA methods fall into two groups: • Kernel functions: here we must design a function that converts a nonlinear dependence into a linear one. For an example using the Gaussian function, see below.
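A Gaussian kernel of the kind referred to here can be sketched as follows. The function names `gaussian_kernel` and `kernel_matrix` and the bandwidth parameter `sigma` are illustrative assumptions, and the three points are made up; the formula itself is the standard Gaussian (RBF) kernel.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(points, sigma=1.0):
    """Pairwise kernel values; entry [i][j] compares points[i] with points[j]."""
    return [[gaussian_kernel(p, q, sigma) for q in points] for p in points]

# Hypothetical toy points: the first two are close, the third is far away.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
K = kernel_matrix(points)
```

Nearby points get values close to 1 and distant points values close to 0. A kernel PCA method would then work with the eigenvectors of this (centered) matrix instead of the original coordinates.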
Clustering Methods and LSI • Kernel tricks: here we map the original space to a higher-dimensional space with specific properties (some methods map the data to a Hilbert space, which is a special kind of Banach space) such that the nonlinear dependence becomes linear, after which PCA methods can be applied. This approach relies on embedding methods. • Artificial-intelligence-based clustering methods are too slow for our purpose.
Ideas and Our Works • Our work covers both finding an appropriate kernel function and finding an appropriate embedding. • In this phase, however, we focus on kernel functions. • Our idea differs slightly from the main approach: to reach linearity, we change the distance function instead of the points. • Statistics and probability theory offer a technique called the "copula". A copula is a framework for constructing a bivariate distribution function for two random variables.
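In symbols, the copula idea can be stated as below (Sklar's theorem for two variables). The notation is introduced here for illustration: $F$ and $G$ are the marginal distribution functions of $X$ and $Y$, assumed continuous, $H$ is their joint distribution function, $C$ is the copula, and $\Pi$ is the independence copula.

```latex
\begin{align*}
H(x, y) &= C\bigl(F(x),\, G(y)\bigr), \\
X \text{ and } Y \text{ independent} &\iff C(u, v) = \Pi(u, v) = u v .
\end{align*}
```

The copula $C$ therefore carries the dependence between the two variables separately from the shape of their individual distributions.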
Ideas and Our Works • The main idea is as follows: two variables are independent if their joint probability equals the product of their individual probabilities. So we first find an appropriate copula function and then calculate the volume enclosed between the copula surface and the surface of the product of the variables' probabilities. This volume can serve as a measure of how far the variables are from independence, which gives us a good kernel function. • A wide variety of general-purpose copula functions exists; they have been used in various studies with good results.
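The volume-based measure can be sketched as below. Note the substitution: for brevity this sketch uses the empirical (rank-based) copula rather than a fitted copula family, and it approximates the volume by averaging the gap over a regular grid. The function names `ranks` and `copula_dependence`, the `grid` parameter, and the toy data are all illustrative assumptions.

```python
def ranks(values):
    """Rank of each value in 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def copula_dependence(xs, ys, grid=20):
    """Approximate the volume between the empirical copula and the surface u * v."""
    n = len(xs)
    # Pseudo-observations: ranks rescaled to (0, 1].
    u = [r / n for r in ranks(xs)]
    v = [r / n for r in ranks(ys)]
    total = 0.0
    for a_idx in range(1, grid + 1):
        for b_idx in range(1, grid + 1):
            a, b = a_idx / grid, b_idx / grid
            # Empirical copula: share of points with both ranks below (a, b).
            c_emp = sum(1 for ui, vi in zip(u, v) if ui <= a and vi <= b) / n
            total += abs(c_emp - a * b)
    return total / (grid * grid)  # mean gap over the unit square

xs = list(range(1, 21))
dep_same = copula_dependence(xs, xs)                  # y moves exactly with x
ys_scrambled = [(7 * x) % 20 for x in xs]             # a permutation with little structure
dep_scrambled = copula_dependence(xs, ys_scrambled)   # much closer to zero
```

A value near zero means the empirical copula stays close to the independence surface; larger values indicate stronger dependence, linear or not.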
Ideas and Our Works • A sample copula function fitted to sample data using the Bernstein polynomial copula.
Ideas and Our Works • The main advantages of our idea are as follows: • The total computational complexity of the preprocessing is about $O(nm^2)$. If we then use K-Means ($O(nm)$), we obtain an algorithm with computational complexity $O(nm^2)$ for detecting clusters with nonlinear dependence (SVD needs $O(n^3 m)$ for linear dependence, with $n \gg m$). • Copula functions make no assumption about the data distribution. Notably, they can even be used for two variables with different distributions. SVD, on the other hand, is suited to Gaussian-distributed data.
Experimental Results • To test our ideas we did the following: • First, we obtained popular datasets, all from the Machine Learning Group of the School of Computer Science and Informatics, University College Dublin (UCD). • Next, we studied the structure of the SVD and K-Means algorithms (with K-Means obtained using core-sets). • We used MATLAB to implement the algorithms. • We tested SVD and K-Means on the datasets. For example, one concept group obtained for the BBC dataset consists of the following terms: juventu, cocain, romanian, alessandro, luciano, adrian, chelsea, ruin, bayern, drug, fifa, club, ...; another concept group is about printers, and so on.
Experimental Results • The next step is to implement the copula approach in MATLAB and compare its results with standard SVD and K-Means.