Special topics on text mining [Part I: text classification] Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor
Classification algorithms and evaluation Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor
Text classification • Machine learning approach to TC: • Recipe • Gather labeled documents • Construction of a classifier • Document representation • Preprocessing • Dimensionality reduction • Classification methods • Evaluation of a TC method
Machine learning approach to TC • Develop automated methods able to classify documents with a certain degree of success • [Diagram: training documents (labeled) → learning machine (an algorithm) → trained machine; an unseen (test, query) document is given to the trained machine, which outputs a labeled document]
Conventions • [Notation figure: data matrix X = {xij} with m examples (rows xi) and n features, label vector y = {yj}, and weight vectors w and α] Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Ré, France, 2008.
What is a learning algorithm? • A function that, given a set of labeled training examples, returns a prediction function mapping inputs to class labels
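A rough sketch (not from the slides): a learning algorithm can be viewed as a function that receives a labeled training set and returns a prediction function. The trivial learner below only makes that signature concrete; all names are illustrative.

```python
from collections import Counter
from typing import Callable, List, Tuple

Document = List[float]                        # a document encoded as a feature vector
TrainingSet = List[Tuple[Document, str]]      # (feature vector, class label) pairs

def majority_class_learner(data: TrainingSet) -> Callable[[Document], str]:
    """A trivial learning algorithm: it ignores the features and returns a
    classifier that always predicts the most frequent training class."""
    most_common = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda x: most_common

# Usage: train on labeled documents, then classify an unseen one.
classify = majority_class_learner([([1.0, 0.0], "spam"),
                                   ([0.0, 1.0], "ham"),
                                   ([1.0, 1.0], "spam")])
print(classify([0.5, 0.5]))                   # -> "spam"
```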
Classification algorithms • Popular classification algorithms for TC are: • Naïve Bayes • Probabilistic approach • K-Nearest Neighbors • Example-based approach • Centroid-based classification • Prototype-based approach • Support Vector Machines • Kernel-based approach
Other popular classification algorithms • Linear classifiers (including SVMs) • Decision trees • Boosting, bagging and ensembles in general • Random forest • Neural networks
Naïve Bayes • It is the simplest probabilistic classifier used to classify documents • Based on the application of Bayes' theorem • Builds a generative model that approximates how data is produced • Uses the prior probability of each category given no information about an item • Categorization produces a posterior probability distribution over the possible categories given a description of an item. A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence 2004: 488-499
Naïve Bayes • Bayes' theorem: P(c|d) = P(d|c) P(c) / P(d) • Why? We know that P(c,d) = P(c|d) P(d) and P(c,d) = P(d|c) P(c) • Then P(c|d) P(d) = P(d|c) P(c) • Then P(c|d) = P(d|c) P(c) / P(d)
Naïve Bayes • For a document d and a class cj: [Graphical model: a class node C whose children are the term nodes t1, t2, …, t|V|] • Assuming terms are independent of each other given the class (naïve assumption): P(d|cj) = Πi P(ti|cj) • Assuming each document is equally probable: P(cj|d) ∝ P(cj) Πi P(ti|cj)
Bayes' Rule for text classification • For a document d and a class cj: P(cj|d) = P(d|cj) P(cj) / P(d)
Bayes' Rule for text classification • For a document d and a class cj: P(cj|d) ∝ P(cj) Πi P(ti|cj) • Estimation of probabilities: the prior probability of class cj is P(cj) = Nj / N (the fraction of training documents belonging to cj); the probability of occurrence of word ti in class cj is P(ti|cj) = (1 + Nij) / (|V| + Σk Nkj), where Nij is the number of occurrences of ti in documents of class cj • The add-one counts are smoothing to avoid overfitting
Naïve Bayes classifier • Assignment of the class: c* = argmax cj P(cj) Πi P(ti|cj) • Assignment using underflow prevention: c* = argmax cj [log P(cj) + Σi log P(ti|cj)] • Multiplying lots of probabilities can result in floating-point underflow • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
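A minimal sketch of this log-space assignment, assuming tokenized documents and the Laplace-smoothed estimates shown above; the function names are illustrative, not from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab):
    """Estimate log P(cj) and Laplace-smoothed log P(ti|cj) from tokenized docs."""
    n = len(docs)
    log_prior, log_like = {}, defaultdict(dict)
    for c in set(labels):
        in_class = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(in_class) / n)
        counts = Counter(t for d in in_class for t in d)
        total = sum(counts[t] for t in vocab)
        for t in vocab:
            log_like[c][t] = math.log((counts[t] + 1) / (total + len(vocab)))
    return log_prior, log_like

def classify_nb(doc, log_prior, log_like, vocab):
    """argmax_cj [ log P(cj) + sum_i log P(ti|cj) ], summing logs to avoid underflow."""
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in doc if t in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```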
Comments on NB classifier • Very simple classifier which works very well on numerical and textual data. • Very easy to implement and computationally cheap when compared to other classification algorithms. • One of its major limitations is that it performs very poorly when features are highly correlated. • Concerning text classification, it fails to consider the frequency of word occurrences in the feature vector.
Naïve Bayes revisited • For a document d and a class cj: P(cj|d) ∝ P(cj) Πi P(ti|cj) • Estimation of probabilities: P(cj) is the prior probability of class cj and P(ti|cj) is the probability of occurrence of word ti in class cj • What is the assumed probability distribution for P(ti|cj)?
Bernoulli event model • A document is a binary vector over the space of words: P(d|cj) = Πi=1..|V| [Bi P(ti|cj) + (1 − Bi)(1 − P(ti|cj))] • where B = (B1, …, B|V|) is a multivariate Bernoulli random variable of length |V| associated to the document, with Bi = 1 iff term ti occurs in it A. McCallum, K. Nigam. A comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41-48, 1998
Bernoulli event model • Estimation of probabilities: P(ti|cj) = (1 + number of documents of class cj that contain ti) / (2 + number of documents of class cj) • Problems with this formulation? Word frequency of occurrence is not taken into account A. McCallum, K. Nigam. A comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41-48, 1998
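A small sketch of the Bernoulli likelihood under the assumptions above (binary term occurrence, add-one style smoothing of document frequencies); note that every vocabulary term contributes to the score, whether it appears in the document or not.

```python
import math

def bernoulli_log_likelihood(doc_terms, class_docs, vocab):
    """log P(d|cj) under the multivariate Bernoulli event model.
    doc_terms: set of terms in the document; class_docs: list of term sets
    for the training documents of class cj; vocab: the vocabulary."""
    n_c = len(class_docs)
    logp = 0.0
    for t in vocab:
        df = sum(1 for d in class_docs if t in d)   # documents of cj containing t
        p_t = (df + 1) / (n_c + 2)                  # smoothed P(ti|cj)
        logp += math.log(p_t) if t in doc_terms else math.log(1.0 - p_t)
    return logp
```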
Multinomial event model • The multinomial model captures word frequency information in documents • A document is an ordered sequence of word events drawn from the same vocabulary • Each document is drawn from a multinomial distribution of words with as many independent trials as the length of the document A. McCallum, K. Nigam. A comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41—48, 1998
Multinomial event model • What is a multinomial distribution? If a given trial can result in the k outcomes E1, …, Ek with probabilities p1, …, pk, then the probability distribution of the RVs X1, …, Xk, representing the number of occurrences of E1, …, Ek in n independent trials, is: f(x1, …, xk; p1, …, pk, n) = [n! / (x1! ⋯ xk!)] p1^x1 ⋯ pk^xk, with Σi xi = n and Σi pi = 1 • Here xk is the number of times event Ek occurs, pk is the probability that event Ek occurs, and the coefficient n! / (x1! ⋯ xk!) is the number of ways in which the sequence of outcomes can occur R. E. Walpole, et al. Probability and Statistics for Engineers and Scientists. 8th Edition, Prentice Hall, 2007.
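A quick numeric check of the multinomial pmf above, with made-up toy numbers:

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """n! / (x1! ... xk!) * p1^x1 * ... * pk^xk, with n = sum(x)."""
    n = sum(x)
    coeff = factorial(n) / prod(factorial(xi) for xi in x)
    return coeff * prod(pi ** xi for pi, xi in zip(p, x))

# 5 word draws over a 3-word vocabulary with probabilities (0.5, 0.3, 0.2)
print(multinomial_pmf([2, 2, 1], [0.5, 0.3, 0.2]))   # 30 * 0.0045 = 0.135
```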
Multinomial event model • A document is a multinomial experiment with |d| independent trials: P(d|cj) = P(|d|) |d|! Πi P(ti|cj)^Nit / Nit! • where Nit is the number of occurrences of term ti in document d A. McCallum, K. Nigam. A comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41-48, 1998
Multinomial event model • Estimation of probabilities: P(ti|cj) = (1 + number of occurrences of ti in documents of class cj) / (|V| + total number of word occurrences in documents of class cj) • Then, what to do with real-valued data? Assume a probability density function (e.g., a Gaussian pdf) I. Guyon. Naïve Bayes Algorithm in CLOP. CLOP documentation, 2005.
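For illustration only (scikit-learn is not part of these slides): the two event models correspond roughly to BernoulliNB and MultinomialNB, and GaussianNB covers the Gaussian-pdf case for real-valued features; the toy data below is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["cheap pills buy now", "meeting agenda attached", "buy cheap now"]
train_labels = ["spam", "ham", "spam"]

vectorizer = CountVectorizer()              # term-frequency document vectors
X = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=1.0)              # alpha=1.0 gives Laplace smoothing
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["cheap meeting pills"])))
```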
KNN: K-nearest neighbors classifier • Does not build explicit declarative representations of categories. • Methods of this kind are called lazy learners • “Training” for such classifiers consists of simply storing the representations of the training documents together with their category labels. • To decide whether a document d belongs to the category c, kNN checks whether the k training documents most similar to d belong to c. • Key element: a definition of “similarity” between documents
KNN: K-nearest neighbors classifier • [Figures: a query document plotted among positive and negative training examples; the predicted class changes as the neighborhood used for voting grows]
KNN – the algorithm • Given a new document d: • Find the k most similar documents from the training set. • Common similarity measures are the cosine similarity and the Dice coefficient. • Assign the class to d by considering the classes of its k nearest neighbors • Majority voting scheme • Weighted-sum voting scheme
Common similarity measures • Dice coefficient: sim(di, dj) = 2 Σk wki wkj / (Σk wki² + Σk wkj²) • Cosine measure: sim(di, dj) = Σk wki wkj / (√(Σk wki²) · √(Σk wkj²)) • wki indicates the weight of word k in document i
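A sketch of the two measures as defined above, assuming dense term-weight vectors of equal length:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dice(u, v):
    """Dice coefficient: 2 * <u, v> / (||u||^2 + ||v||^2)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v)
    return 2.0 * dot / denom if denom else 0.0
```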
Selection of K • Should k be even or odd?
Decision surface • [Figures from http://clopinet.com/CLOP: k-NN decision surfaces on a 2-D toy problem for K = 1, 2, 5 and 10]
Selection of K How to select a good value for K?
The weighted-sum voting scheme: score(d, cj) = Σ di∈kNN(d) sim(d, di) · I(di ∈ cj), i.e., each of the k nearest neighbors votes for its own class with a weight equal to its similarity to d • Other alternatives for computing the weights?
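One possible implementation of weighted-sum voting, assuming any similarity function such as the cosine sketch above; the names are illustrative.

```python
def knn_weighted_vote(query, train_docs, train_labels, k, sim):
    """Rank training documents by similarity to the query, keep the k nearest,
    and assign the class that accumulates the largest total similarity."""
    nearest = sorted(range(len(train_docs)),
                     key=lambda i: sim(query, train_docs[i]),
                     reverse=True)[:k]
    scores = {}
    for i in nearest:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + sim(query, train_docs[i])
    return max(scores, key=scores.get)

# e.g. knn_weighted_vote(q, docs, labels, k=5, sim=cosine)
```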
KNN - comments • One of the best-performing text classifiers. • It is robust in the sense of not requiring the categories to be linearly separable. • The major drawback is the computational effort during classification. • Another limitation is that its performance is primarily determined by the choice of k as well as the distance metric applied.
Centroid-based classification • This method has two main phases: • Training phase: the construction of one single representative instance, called a prototype, for each class. • Test phase: each unlabeled document is compared against all prototypes and is assigned to the class having the greatest similarity score. • Different from k-NN, which represents each document in the training set individually. How to compute the prototypes? H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results. Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 424-431, 2000.
Centroid-based classification • [Figure: class centroids in a 2-D feature space and the decision boundaries they induce] T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. Springer, 2009.
Calculating the centroids • Centroid as average: cj = (1/|Dj|) Σd∈Dj d • Centroid as sum: cj = Σd∈Dj d • Centroid as normalized sum: cj = (Σd∈Dj d) / ||Σd∈Dj d|| • Centroid computation using the Rocchio formula: cj = (β/|Dj|) Σd∈Dj d − (γ/|D∖Dj|) Σd∈D∖Dj d
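A small sketch of the sum, average and normalized-sum prototypes (the Rocchio variant is left out); it assumes dense document vectors already grouped by class.

```python
import math

def centroids(docs_by_class, mode="normalized"):
    """Compute one prototype vector per class: sum, average, or normalized sum
    of the class's document vectors (docs_by_class: dict class -> list of vectors)."""
    protos = {}
    for c, docs in docs_by_class.items():
        dim = len(docs[0])
        s = [sum(d[j] for d in docs) for j in range(dim)]
        if mode == "average":
            protos[c] = [v / len(docs) for v in s]
        elif mode == "normalized":
            norm = math.sqrt(sum(v * v for v in s)) or 1.0
            protos[c] = [v / norm for v in s]
        else:                                   # "sum"
            protos[c] = s
    return protos
```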
Comments on Centroid-Based Classification • Computationally simple and fast model • Short training and testing time • Good results in text classification • Amenable to changes in the training set • Can handle imbalanced document sets • Disadvantages: • Inadequate for non-linear classification problems • Problem of inductive bias or model misfit • Classifiers are tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories
Linear models • Idea: learn a linear function (in the parameters) that allows us to separate the data • f(x) = w·x + b = Σj=1..n wj xj + b (linear discriminant) • f(x) = w·Φ(x) + b = Σj wj φj(x) + b (the perceptron) • f(x) = Σi=1..m αi k(xi, x) + b (kernel-based methods) I. Guyon, D. Stork. Linear Discriminants and Support Vector Machines. In Smola et al. (Eds.), Advances in Large Margin Classifiers, pp. 147-169, MIT Press, 2000.
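The first and third functional forms written out as a sketch (dense vectors, illustrative names; not the reference implementation from the cited chapter):

```python
def linear_discriminant(w, b, x):
    """f(x) = <w, x> + b; the sign of f(x) gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def kernel_discriminant(alphas, b, train_x, x, k):
    """f(x) = sum_i alpha_i * k(xi, x) + b (kernel-based form)."""
    return sum(a * k(xi, x) for a, xi in zip(alphas, train_x)) + b
```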
Linear models • Example: classification of DNA micro-arrays • [Figure: 2-D feature space (x1, x2) with "Cancer" and "No cancer" samples and unlabeled points marked "?", separated by a linear decision boundary]
Linear models • [Figures from http://clopinet.com/CLOP: decision surfaces on the same 2-D data for a linear support vector machine, a non-linear support vector machine, kernel ridge regression, the Zarbi classifier, and a naïve Bayesian classifier]
Support vector machines (SVM) • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive instances from those that represent the negative ones. • SVMs select the hyperplane that maximizes the margin around it. • Hyperplanes are fully determined by a small subset of the training instances, called the support vectors. • [Figure: maximum-margin hyperplane; the points lying on the margin are the support vectors]
Support vector machines (SVM) • When data are linearly separable, the maximum-margin hyperplane is found by solving: minimize (1/2)||w||² subject to: yi (w·xi + b) ≥ 1, for i = 1, …, m
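Illustrative only: the slides use CLOP, but an equivalent linear maximum-margin classifier can be trained with scikit-learn's LinearSVC; the toy texts below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["buy cheap pills now", "project meeting at noon",
         "cheap offer buy now", "agenda for the meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
svm = LinearSVC(C=1.0)          # C controls the softness of the margin
svm.fit(X, labels)

print(svm.predict(vec.transform(["cheap meeting now"])))
```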