
Special topics on text mining [ Part I: text classification ]


Presentation Transcript


  1. Special topics on text mining [Part I: text classification] Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

  2. Classification algorithms and evaluation Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

  3. Text classification • Machine learning approach to TC: • Recipe • Gather labeled documents • Construction of a classifier • Document representation • Preprocessing • Dimensionality reduction • Classification methods • Evaluation of a TC method

  4. Machine learning approach to TC • Develop automated methods able to classify documents with a certain degree of success • [Diagram: training documents (labeled) → learning machine (an algorithm) → trained machine; an unseen (test, query) document is given to the trained machine, which outputs its label]

  5. Conventions • m: number of training examples; n: number of features • X = {xij}: the m×n data matrix, with one example xi per row • y = {yj}: the vector of class labels • w (feature weights) and a (example weights): model parameters Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Re, France, 2008.

  6. What is a learning algorithm? • A function that, given a training set of labeled examples D = {(x1, y1), …, (xm, ym)}, returns a prediction function f: X → Y mapping new inputs to class labels.
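
A minimal Python sketch of this view of a learning algorithm (illustrative only; the function and variable names are not from the slides): it takes a labeled training set and returns a prediction function f: X → Y.

    # A learning algorithm viewed as a function: given a labeled training set
    # {(x_i, y_i)}, it returns a prediction function f: X -> Y.
    # Toy sketch: the "learned" classifier is a trivial majority-class rule.
    from collections import Counter
    from typing import Callable, List, Tuple

    def learn(training_set: List[Tuple[list, str]]) -> Callable[[list], str]:
        # "Training": here we only memorize the most frequent class label.
        majority_class = Counter(y for _, y in training_set).most_common(1)[0][0]

        def predict(x: list) -> str:
            # The returned function maps any unseen document x to a class label.
            return majority_class

        return predict

    f = learn([([1, 0], "spam"), ([0, 1], "ham"), ([1, 1], "spam")])
    print(f([0, 0]))  # -> "spam"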

  7. Classification algorithms • Popular classification algorithms for TC are: • Naïve Bayes • Probabilistic approach • K-Nearest Neighbors • Example-based approach • Centroid-based classification • Prototype-based approach • Support Vector Machines • Kernel-based approach

  8. Other popular classification algorithms • Linear classifiers (including SVMs) • Decision trees • Boosting, bagging and ensembles in general • Random forest • Neural networks

  9. Sec.13.2 Naïve Bayes • It is the simplest probabilistic classifier used to classify documents • Based on the application of Bayes' theorem • Builds a generative model that approximates how data is produced • Uses the prior probability of each category given no information about an item • Categorization produces a posterior probability distribution over the possible categories given a description of an item. A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence 2004: 488-499

  10. Naïve Bayes • Bayes' theorem: P(cj|d) = P(d|cj) P(cj) / P(d) • Why? • We know that: P(cj, d) = P(cj|d) P(d) and P(cj, d) = P(d|cj) P(cj) • Then P(cj|d) P(d) = P(d|cj) P(cj) • Then P(cj|d) = P(d|cj) P(cj) / P(d)

  11. Sec.13.2 Naïve Bayes • For a document d and a class cj: P(cj|d) = P(cj) P(d|cj) / P(d) [Graphical model: class node C generating term nodes t1, t2, …, t|V|] • Assuming terms are independent of each other given the class (naïve assumption) • Assuming each document is equally probable, so that P(cj|d) ∝ P(cj) Πi P(ti|cj)

  12. Sec.13.2 Bayes’ Rule for text classification • For a document d and a class cj: P(cj|d) = P(d|cj) P(cj) / P(d)

  13. Sec.13.2 Bayes’ Rule for text classification • For a document d and a class cj: P(cj|d) ∝ P(cj) Πi P(ti|cj) • Estimation of probabilities: • Prior probability of class cj: P(cj) = Nj / N, where Nj is the number of training documents in cj and N the total number of training documents • Probability of occurrence of word ti in class cj: P(ti|cj) = (1 + Nij) / (|V| + Σk Nkj), where Nij is the number of occurrences of ti in documents of class cj; the added counts are smoothing to avoid overfitting

  14. Naïve Bayes classifier • Assignment of the class: cNB = argmaxcj P(cj) Πi P(ti|cj) • Assignment using underflow prevention: cNB = argmaxcj [ log P(cj) + Σi log P(ti|cj) ] • Multiplying lots of probabilities can result in floating-point underflow • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
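
A minimal Python sketch of the log-space assignment described on this slide, assuming the class priors P(cj) and conditional word probabilities P(ti|cj) have already been estimated (the toy numbers below are made up):

    import math

    # Assumed to be pre-estimated (e.g., with Laplace smoothing):
    # priors[c] = P(c),  cond[c][t] = P(t | c)
    priors = {"sports": 0.5, "politics": 0.5}
    cond = {
        "sports":   {"game": 0.20, "election": 0.01, "team": 0.15},
        "politics": {"game": 0.02, "election": 0.25, "team": 0.03},
    }

    def classify(doc_terms, priors, cond):
        # Underflow prevention: sum logs of probabilities instead of
        # multiplying the probabilities themselves.
        best_class, best_score = None, float("-inf")
        for c in priors:
            score = math.log(priors[c])
            for t in doc_terms:
                if t in cond[c]:              # ignore out-of-vocabulary terms
                    score += math.log(cond[c][t])
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    print(classify(["game", "team"], priors, cond))  # -> "sports"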

  15. Comments on NB classifier • Very simple classifier which works very well on numerical and textual data. • Very easy to implement and computationally cheap when compared to other classification algorithms. • One of its major limitations is that it performs very poorly when features are highly correlated. • Concerning text classification, it fails to consider the frequency of word occurrences in the feature vector.

  16. Sec.13.2 Naïve Bayes revisited • For a document d and a class cj: P(cj|d) ∝ P(cj) Πi P(ti|cj) • Estimation of probabilities: the prior probability of class cj, P(cj), and the probability of occurrence of word ti in class cj, P(ti|cj) • What is the assumed probability distribution?

  17. Bernoulli event model • A document is a binary vector over the space of words: P(d|cj) = Πi=1..|V| [ Bi P(ti|cj) + (1 − Bi)(1 − P(ti|cj)) ] • where B is a multivariate Bernoulli random variable of length |V| associated with the document (Bi = 1 if term ti occurs in the document, 0 otherwise) A. McCallum, K. Nigam. A Comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41–48, 1998

  18. Bernoulli event model • Estimation of probabilities: P(ti|cj) = (1 + Σd∈cj Bi(d)) / (2 + |cj|), i.e., a smoothed fraction of the documents of class cj that contain term ti • Problems with this formulation? • Word occurrence frequency is not taken into account A. McCallum, K. Nigam. A Comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41–48, 1998
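
A small Python sketch of the Bernoulli event model under the Laplace-smoothed estimate given above (the vocabulary and documents are illustrative): documents are binary presence/absence vectors, and absent words also contribute to the likelihood through 1 − P(ti|cj).

    import math

    # Bernoulli event model sketch: documents are binary vectors over the
    # vocabulary (1 = word present, 0 = absent).
    vocab = ["goal", "match", "vote", "senate"]
    train = [  # (binary vector, class)
        ([1, 1, 0, 0], "sports"),
        ([1, 0, 0, 0], "sports"),
        ([0, 0, 1, 1], "politics"),
    ]

    def estimate(train, vocab):
        classes = {c for _, c in train}
        prior, cond = {}, {}
        for c in classes:
            docs_c = [b for b, y in train if y == c]
            prior[c] = len(docs_c) / len(train)
            # P(t_i | c) = (1 + #docs of c containing t_i) / (2 + #docs of c)
            cond[c] = [(1 + sum(b[i] for b in docs_c)) / (2 + len(docs_c))
                       for i in range(len(vocab))]
        return prior, cond

    def log_likelihood(binary_doc, c, prior, cond):
        # Absent words also contribute, via (1 - P(t_i | c)).
        s = math.log(prior[c])
        for i, present in enumerate(binary_doc):
            p = cond[c][i]
            s += math.log(p if present else 1.0 - p)
        return s

    prior, cond = estimate(train, vocab)
    print(max(prior, key=lambda c: log_likelihood([1, 0, 0, 0], c, prior, cond)))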

  19. Multinomial event model • The multinomial model captures word frequency information in documents • A document is an ordered sequence of word events drawn from the same vocabulary • Each document is drawn from a multinomial distribution of words with as many independent trials as the length of the document A. McCallum, K. Nigam. A comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41—48, 1998

  20. Multinomial event model • What is a multinomial distribution? If a given trial can result in the k outcomes E1, …, Ek with probabilities p1, …, pk, then the probability distribution of the RVs X1, …, Xk, representing the number of occurrences of E1, …, Ek in n independent trials, is: f(x1, …, xk; p1, …, pk, n) = [ n! / (x1! x2! ⋯ xk!) ] p1^x1 p2^x2 ⋯ pk^xk where xi is the number of times event Ei occurs, pi is the probability that event Ei occurs, and n! / (x1! ⋯ xk!) is the number of ways in which the sequence of outcomes can occur. R. E. Walpole, et al. Probability and Statistics for Engineers and Scientists. 8th Edition, Prentice Hall, 2007.
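
A short numeric illustration of the multinomial probability mass function reconstructed above (the counts and probabilities are chosen only for illustration):

    from math import factorial, prod

    def multinomial_pmf(counts, probs):
        # P(X1=x1, ..., Xk=xk) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk
        n = sum(counts)
        coeff = factorial(n) / prod(factorial(x) for x in counts)
        return coeff * prod(p ** x for x, p in zip(counts, probs))

    # Example: 5 independent trials over 3 outcomes with p = (0.5, 0.3, 0.2);
    # probability of observing the counts (2, 2, 1):
    print(multinomial_pmf([2, 2, 1], [0.5, 0.3, 0.2]))  # 30 * 0.25 * 0.09 * 0.2 = 0.135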

  21. Multinomial event model • A document is a multinomial experiment with |d| independent trials: P(d|cj) = P(|d|) |d|! Πi P(ti|cj)^Nit / Nit! where Nit is the number of occurrences of term ti in document d A. McCallum, K. Nigam. A Comparison of Event Models for Naïve Bayes Text Classification. Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, pp. 41–48, 1998

  22. Multinomial event model • Estimation of probabilities: P(ti|cj) = (1 + Σd∈cj Nit) / (|V| + Σk Σd∈cj Nkt), a Laplace-smoothed fraction of the word occurrences in class cj that correspond to term ti • Then, what to do with real-valued data? Assume a probability density function (e.g., a Gaussian pdf) I. Guyon. Naïve Bayes Algorithm in CLOP. CLOP documentation, 2005.
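
A compact Python sketch of multinomial Naïve Bayes with the Laplace-smoothed estimates above (toy data; real systems would add preprocessing, feature selection, tf-idf variants, etc.):

    from collections import Counter, defaultdict
    import math

    # Multinomial event model sketch: documents are bags of word counts.
    train = [  # (list of tokens, class)
        ("goal match goal team".split(), "sports"),
        ("match team team".split(), "sports"),
        ("vote senate vote".split(), "politics"),
    ]
    vocab = sorted({t for doc, _ in train for t in doc})
    classes = {c for _, c in train}

    prior, cond = {}, defaultdict(dict)
    for c in classes:
        docs_c = [doc for doc, y in train if y == c]
        prior[c] = len(docs_c) / len(train)
        counts = Counter(t for doc in docs_c for t in doc)
        total = sum(counts.values())
        for t in vocab:
            # P(t_i | c_j) = (1 + N_ij) / (|V| + sum_k N_kj)
            cond[c][t] = (1 + counts[t]) / (len(vocab) + total)

    def classify(tokens):
        return max(classes, key=lambda c: math.log(prior[c]) +
                   sum(math.log(cond[c][t]) for t in tokens if t in cond[c]))

    print(classify("goal team goal".split()))  # -> "sports"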

  23. KNN: K-nearest neighbors classifier • Does not build explicit declarative representations of categories. • Methods of this kind are called lazy learners • “Training” for such classifiers consists of simply storing the representations of the training documents together with their category labels. • To decide whether a document d belongs to the category c, kNN checks whether the k training documents most similar to d belong to c. • Key element: a definition of “similarity” between documents

  24. KNN: K-nearest neighbors classifier Positive examples Negative examples

  25. KNN: K-nearest neighbors classifier Positive examples Negative examples

  26. KNN: K-nearest neighbors classifier Positive examples Negative examples

  27. KNN: K-nearest neighbors classifier Positive examples Negative examples

  28. KNN – the algorithm • Given a new document d: • Find the k most similar documents from the training set. • Common similarity measures are the cosine similarity and the Dice coefficient. • Assign the class to d by considering the classes of its k nearest neighbors • Majority voting scheme • Weighted-sum voting scheme
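
A minimal Python sketch of this procedure with cosine similarity and majority voting (the document vectors and labels below are illustrative):

    import math
    from collections import Counter

    def cosine(u, v):
        # cos(u, v) = sum_k u_k v_k / (||u|| * ||v||)
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def knn_classify(query, train, k=3):
        # train: list of (weight vector, class label) pairs
        neighbors = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
        # Majority voting over the k most similar training documents
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    train = [([1.0, 0.8, 0.0], "sports"), ([0.9, 0.7, 0.1], "sports"),
             ([0.0, 0.1, 1.0], "politics")]
    print(knn_classify([1.0, 0.6, 0.0], train, k=3))  # -> "sports"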

  29. Common similarity measures • Dice coefficient • Cosine measure wki indicates the weight of word k in document i
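
A Python sketch of both measures over weighted term vectors; the Dice coefficient is written in one common weighted form (2·Σk wki wkj divided by the sums of squared weights), which is assumed here since the slide's exact formula is not reproduced in the transcript:

    import math

    def dice(di, dj):
        # Dice(d_i, d_j) = 2 * sum_k w_ki * w_kj / (sum_k w_ki^2 + sum_k w_kj^2)
        num = 2 * sum(a * b for a, b in zip(di, dj))
        den = sum(a * a for a in di) + sum(b * b for b in dj)
        return num / den if den else 0.0

    def cosine(di, dj):
        # cos(d_i, d_j) = sum_k w_ki * w_kj / sqrt(sum_k w_ki^2 * sum_k w_kj^2)
        num = sum(a * b for a, b in zip(di, dj))
        den = math.sqrt(sum(a * a for a in di) * sum(b * b for b in dj))
        return num / den if den else 0.0

    d1, d2 = [0.5, 0.2, 0.0], [0.4, 0.0, 0.3]
    print(dice(d1, d2), cosine(d1, d2))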

  30. Selection of K • Should k be even or odd (an odd k avoids ties in binary majority voting)?

  31. Decision surface http://clopinet.com/CLOP K=1

  32. Decision surface http://clopinet.com/CLOP K=2

  33. Decision surface http://clopinet.com/CLOP K=5

  34. Decision surface http://clopinet.com/CLOP K=10

  35. Selection of K How to select a good value for K?

  36. The weighted-sum voting scheme • Each of the k nearest neighbors votes for its class with a weight given by its similarity to the test document; the class with the largest accumulated weight is assigned • Other alternatives for computing the weights?
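
A sketch of the weighted-sum voting step, assuming each neighbor's weight is its similarity to the query (here a plain dot product; the cosine measure from slide 29 works equally well):

    from collections import defaultdict

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def weighted_vote(query, neighbors):
        # neighbors: the k training documents most similar to the query,
        # as (weight vector, class label) pairs. Each one votes for its class
        # with a weight equal to its similarity to the query document.
        scores = defaultdict(float)
        for doc, label in neighbors:
            scores[label] += dot(query, doc)
        return max(scores, key=scores.get)

    neighbors = [([1.0, 0.0], "sports"), ([0.9, 0.2], "sports"), ([0.1, 1.0], "politics")]
    print(weighted_vote([0.8, 0.3], neighbors))  # -> "sports"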

  37. KNN - comments • One of the best-performing text classifiers. • It is robust in the sense of not requiring the categories to be linearly separable. • The major drawback is the computational effort required during classification. • Another limitation is that its performance is primarily determined by the choice of k as well as the distance metric applied.

  38. Centroid-based classification • This method has two main phases: • Training phase: the construction of one single representative instance, called a prototype, for each class. • Test phase: each unlabeled document is compared against all prototypes and is assigned to the class having the greatest similarity score. • Unlike k-NN, which represents each document in the training set individually. How to compute the prototypes? H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results. Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 424–431, 2000.

  39. Centroid-based classification • T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2009.

  40. Calculating the centroids • Centroid as average • Centroid as sum • Centroid as normalized sum • Centroid computation using the Rocchio formula
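
The four variants listed above can be sketched as follows in Python (the β and γ values in the Rocchio formula are illustrative, not taken from the slides):

    import math

    def add(vectors):
        return [sum(xs) for xs in zip(*vectors)]

    def scale(v, s):
        return [x * s for x in v]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else v

    def centroid_average(docs):        # average of the class documents
        return scale(add(docs), 1.0 / len(docs))

    def centroid_sum(docs):            # plain sum of the class documents
        return add(docs)

    def centroid_normalized_sum(docs): # sum rescaled to unit length
        return normalize(add(docs))

    def centroid_rocchio(pos, neg, beta=16, gamma=4):
        # Rocchio-style prototype: beta * mean(positive docs) - gamma * mean(negative docs)
        return [beta * p - gamma * n
                for p, n in zip(centroid_average(pos), centroid_average(neg))]

At test time an unlabeled document is then compared, e.g. by cosine similarity, against each class prototype and assigned to the most similar one.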

  41. Comments on Centroid-Based Classification • Computationally simple and fast model • Short training and testing time • Good results in text classification • Amenable to changes in the training set • Can handle imbalanced document sets • Disadvantages: • Inadequate for non-linear classification problems • Problem of inductive bias or model misfit • Classifiers are tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories

  42. Linear models • Idea: learn a function, linear in the parameters, that allows us to separate the data • f(x) = w·x + b = Σj=1..n wj xj + b (linear discriminant) • f(x) = w·Φ(x) + b = Σj wj φj(x) + b (the perceptron) • f(x) = Σi=1..m αi k(xi, x) + b (kernel-based methods) Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork, in Smola et al. (Eds.), Advances in Large Margin Classifiers, pp. 147–169, MIT Press, 2000.
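
The three functional forms can be sketched in Python as follows (the weights, feature map and kernel parameters are made-up examples; the predicted class is the sign of f(x)):

    import math

    w, b = [0.4, -0.7], 0.1

    def linear_discriminant(x):
        # f(x) = w . x + b
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    def phi(x):
        # A fixed (hypothetical) feature map for the perceptron-style form
        return [x[0], x[1], x[0] * x[1]]

    w_phi = [0.4, -0.7, 0.2]
    def perceptron_form(x):
        # f(x) = w . Phi(x) + b
        return sum(wj * fj for wj, fj in zip(w_phi, phi(x))) + b

    def rbf_kernel(xi, x, gamma=1.0):
        return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(xi, x)))

    support, alphas = [[1.0, 0.0], [0.0, 1.0]], [0.8, -0.8]
    def kernel_form(x):
        # f(x) = sum_i alpha_i k(x_i, x) + b
        return sum(a * rbf_kernel(xi, x) for a, xi in zip(alphas, support)) + b

    print(linear_discriminant([1.0, 0.5]), kernel_form([1.0, 0.5]))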

  43. Linear models • Example: classification of DNA micro-arrays [Scatter plot over two features x1, x2: points labeled Cancer vs. No Cancer, with an unlabeled point “?” to be classified by a separating line]

  44. Linear models http://clopinet.com/CLOP Linear support vector machine

  45. Linear models http://clopinet.com/CLOP Non-linear support vector machine

  46. Linear models http://clopinet.com/CLOP Kernel ridge regression

  47. Linear models http://clopinet.com/CLOP Zarbi classifier

  48. Linear models http://clopinet.com/CLOP Naïve Bayesian classifier

  49. Support vector machines (SVM) • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive instances from the negative ones. • SVMs select the hyperplane that maximizes the margin around it. • Hyperplanes are fully determined by a small subset of the training instances, called the support vectors. [Figure: maximum-margin separating hyperplane; the support vectors lie on the margin boundaries]

  50. Support vector machines (SVM) • When data are linearly separable, the maximum-margin hyperplane is found by solving: minimize (1/2) ||w||² subject to: yi (w·xi + b) ≥ 1, for i = 1, …, m
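
As an illustration only (not the solver discussed in the course), a few lines of sub-gradient descent on the equivalent hinge-loss objective show how the margin constraints drive the solution; in practice dedicated QP/SMO solvers are used:

    import random

    # Toy sub-gradient descent on the (soft-margin) SVM objective
    #   min  0.5 * lam * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i (w . x_i + b))
    # which approaches the hard-margin problem above when the data are
    # separable and lam is small.
    X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]]
    y = [1, 1, -1, -1]
    lam, epochs, eta = 0.01, 200, 0.1
    w, b = [0.0, 0.0], 0.0

    random.seed(0)
    for _ in range(epochs):
        for i in random.sample(range(len(X)), len(X)):
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Sub-gradient step: regularization always shrinks w;
            # margin-violating points also pull w toward y_i * x_i.
            w = [wj - eta * lam * wj for wj in w]
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]

    print(w, b)  # hyperplane parameters; predict with sign(w . x + b)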
