Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics www.crysys.hu
Introduction • Generic model • Document preprocessing • Text mining methods
Text Mining Tasks • Classification (supervised learning) • Binary classification • Single-label (multi-class) classification • Multi-label classification • Multi-level (hierarchical) classification • Clustering (unsupervised learning) • Summarization • Extraction: selects only parts of the original text • Abstraction: generates text that is not present in the original text
Solutions • Classification • Decision tree • Neural network • Bayesian network • Clustering • k-means
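As a minimal sketch of both task types, assuming scikit-learn is available; the toy corpus and the spam/ham labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy corpus with invented binary labels: 1 = spam, 0 = ham
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status report"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Classification (supervised): learn from the labels, then classify a new document
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict(vec.transform(["buy cheap pills now"])))

# Clustering (unsupervised): group the documents without using the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)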
Document preprocessing • Goal: represent any text compactly, by a fixed number of parameters • Representation: vector space model
Vector space model • The text is tokenized into words • The words are canonicalized to base words; we refer to base words as terms • A dictionary is built, that is, the set of the terms occurring in the document collection • The document is represented as a vector: the ith element of the vector is the number of times the ith term of the dictionary occurs in the document • The collection of documents is represented in the term-document matrix • Problem: the number of dimensions is too large. Solution: feature selection
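A minimal sketch of this pipeline in plain Python; the canonize function below is a crude suffix-stripping stand-in for a real stemmer or lemmatizer:

from collections import Counter

docs = ["cats chase mice", "a mouse chases the cat"]

def tokenize(text):
    return text.lower().split()

def canonize(word):
    # crude stand-in for stemming/lemmatization: strip a trailing 's'
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

tokenized = [[canonize(w) for w in tokenize(d)] for d in docs]

# dictionary: the set of terms occurring in the collection
dictionary = sorted({t for doc in tokenized for t in doc})

# term-document matrix: rows = terms, columns = documents,
# entry (i, j) = number of times term i occurs in document j
matrix = [[Counter(doc)[term] for doc in tokenized] for term in dictionary]
for term, row in zip(dictionary, matrix):
    print(f"{term:8s} {row}")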
Dimension Reduction • Feature Selection: find a subset of the original variables • Document Frequency Thresholding • Omit the words whose occurrence count is greater than a threshold value, because these words are not discriminative • Omit the words whose occurrence count is less than a threshold value, because these words do not carry much information • Information-gain-based feature selection (information theory) • Chi-square-based feature selection (statistics) • Feature Extraction: transform the data to fewer dimensions • Latent Semantic Indexing (LSI) • Principal Component Analysis (PCA) • Nonlinear methods
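A minimal sketch of two of these selectors, assuming scikit-learn; the thresholds, toy corpus, and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["spam spam offer", "meeting notes", "offer ends today", "notes on meeting"]
labels = [1, 0, 1, 0]

# Document frequency thresholding: drop terms appearing in more than 90%
# of documents (not discriminative) or in fewer than 2 documents (too rare)
vec = CountVectorizer(max_df=0.9, min_df=2)
X = vec.fit_transform(docs)

# Chi-square based selection: keep the k terms most dependent on the label
X_sel = SelectKBest(chi2, k=2).fit_transform(X, labels)
print(vec.get_feature_names_out(), X_sel.shape)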
Latent Semantic Indexing (LSI) • SVD is applied to the term-document matrix • The features belonging to the k largest singular values represent the term-document matrix well; only these features are used • LSI regards documents with many common words as being semantically near
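A minimal sketch, assuming scikit-learn, whose TruncatedSVD applied to a document-term matrix is the standard way to compute LSI; the corpus is invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "dogs chase cats",
        "stocks fell sharply", "markets fell again"]

X = CountVectorizer().fit_transform(docs)   # document-term matrix
lsi = TruncatedSVD(n_components=2)          # keep the 2 largest singular values
Z = lsi.fit_transform(X)                    # documents in the latent space
print(Z)  # documents sharing many words land near each other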
Principal Component Analysis (PCA) • Also called the Karhunen-Loève transform (KLT) • A linear technique • Maps the data to a lower-dimensional space such that the variance of the low-dimensional representation is maximized • The algorithm • The covariance (or correlation) matrix of the data is constructed • The eigenvectors and eigenvalues of this matrix are calculated • The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
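A minimal sketch of exactly these steps in NumPy, on invented random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 samples, 5 features

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # covariance matrix
vals, vecs = np.linalg.eigh(C)         # eigenvalues ascending, eigenvectors in columns

k = 2
W = vecs[:, -k:]                       # eigenvectors of the k largest eigenvalues
Z = Xc @ W                             # project into the reduced space
print(Z.shape)                         # (100, 2)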
Kernel PCA • A nonlinear method • PCA + the kernel trick • Kernel trick (in general) • we map observations from a general set S into a higher-dimensional space V • we hope that a general classification problem in S reduces to a linear classification problem in V • the trick lets us avoid explicitly computing the mapping from S to V • we use a learning algorithm that needs only the dot product operation in V • we use a mapping for which the dot product in V can be computed by a kernel function K evaluated in S (the original space)
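A minimal sketch, assuming scikit-learn; the concentric-circles data is a standard example of a problem that is not linearly separable in the original space:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel plays the role of the dot product in the high-dimensional space V
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
Z = kpca.fit_transform(X)
print(Z[:3])  # in the kernel space the two circles become linearly separable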
Manifold learning techniques • They minimize a cost function that retains local properties of the data • Methods • Locally Linear Embedding (LLE) • Hessian LLE • Laplacian Eigenmaps • Local Tangent Space Alignment (LTSA) • Maximum Variance Unfolding (MVU)
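A minimal sketch of LLE, the first method in the list, assuming scikit-learn; the same estimator also implements the Hessian and LTSA variants via its method parameter:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D swiss roll is a 2-D manifold rolled up in 3-D space
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# LLE reconstructs each point from its neighbors and preserves those
# local reconstruction weights in the low-dimensional embedding
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="standard")
Z = lle.fit_transform(X)
print(Z.shape)  # (1000, 2): the roll is "unrolled" into the plane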
Maximum Variance Unfolding (MVU) • Instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming • Exactly preserves all pairwise distances between nearest neighbors • Maximizes the distances between points that are not nearest neighbors
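MVU is not in scikit-learn, so the sketch below solves the semidefinite program directly with cvxpy on a small invented data set; the neighbor count and problem size are arbitrary:

import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # small toy set: SDPs scale badly
n = X.shape[0]

N = kneighbors_graph(X, n_neighbors=4).toarray()
N = np.maximum(N, N.T)                             # symmetric neighbor relation

K = cp.Variable((n, n), PSD=True)                  # the kernel matrix to learn
cons = [cp.sum(K) == 0]                            # center the embedding
for i in range(n):
    for j in range(i + 1, n):
        if N[i, j]:                                # preserve neighbor distances exactly
            d2 = float(np.sum((X[i] - X[j]) ** 2))
            cons.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2)

# Maximizing trace(K) maximizes the total variance, i.e. it pushes
# non-neighbor points as far apart as the constraints allow
cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()

# Embed with the top eigenvectors of the learned kernel, as in kernel PCA
vals, vecs = np.linalg.eigh(K.value)
Y = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0.0))
print(Y.shape)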