
Introduction to Automatic Text Classification

George Ke, 13 Feb 2007


Presentation Transcript


  1. Introduction to Automatic Text Classification George Ke 13 Feb 2007

  2. Overview • What is Text Classification (TC) • Motivation of Automatic TC • How Automatic TC is done • Preprocessing • k-Nearest Neighbour • How we know it works • Example: Email Classification • Summary

  3. What is Text Classification • TC is commonly referred to as “the task of classifying natural language documents into a pre-defined set of semantic categories”. • For example: Entertainment, Health, Business, Technology etc.

  4. Motivation of Automatic TC • Categorised data are easier for users to browse • Organisational view of data provides more effective retrieval • Efficient search is not enough

  5. Motivation of Automatic TC • Manual text classification is time-consuming and expensive • MEDLINE (National Library of Medicine) indexed over 600k citations in 2006 using Medical Subject Headings (MeSH; 23,000 categories) • Yahoo! Directories: over 500k categories

  6. Motivation of Automatic TC • Fatal drug mix killed US R&B star Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs according to a US coroner. The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand. The official cause of death was acute intoxication, and the death was ruled to be accidental, he said. Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova. He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle. --- BBC Sunday, 11 February 2007, 13:03 GMT Category: Music? Health? Entertainment? R&B? USA? Medicine? UK?

  7. How Automatic TC is done: Learning Task • Binary setting • Simplest problem: e.g., spam vs non-spam • Multi-Class setting • E.g. the task of classifying a news story into one of the categories in the BBC directory • Can be treated as n binary tasks • Multi-Label setting • One document can belong to several categories, exactly one category, or no category at all

  8. How Automatic TC is done: Knowledge Engineering • In the late 1980s • Knowledge Engineering • Experts hand-craft classification rules • Rules • Rule 1: (R&B or star or soul) and (singer or artist) → Music • Rule 2: (drug or prescription) and medication → Medicine • Rule 3: (anxiety or pain or allergy) and acute → Health • Rule 4: (play or fame) and award → Entertainment • Rule …
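The hand-crafted rules on this slide can be read as Boolean queries over the words of a document. The following is a minimal Python sketch under that reading, assuming plain lowercase word matching with no stemming; the names `RULES`, `tokenize` and `classify_by_rules` are hypothetical and do not come from the slides.

```python
import re

# Each rule is (list of OR-groups, category). A rule fires when the document
# contains at least one word from EVERY group (groups are AND-ed together).
RULES = [
    ([{"r&b", "star", "soul"}, {"singer", "artist"}], "Music"),
    ([{"drug", "prescription"}, {"medication"}], "Medicine"),
    ([{"anxiety", "pain", "allergy"}, {"acute"}], "Health"),
    ([{"play", "fame"}, {"award"}], "Entertainment"),
]


def tokenize(text):
    # Lowercase; keep letters and '&' so that "R&B" stays a single token.
    return set(re.findall(r"[a-z&]+", text.lower()))


def classify_by_rules(text):
    # Return every category whose rule fires, in rule order.
    words = tokenize(text)
    return [category for groups, category in RULES
            if all(words & group for group in groups)]
```

Applied to a sentence containing "R&B", "singer", "prescription", "medication", "anxiety" and "acute", this sketch fires rules 1 to 3 at once, which is one way the inconsistency issues of large rule sets can show up.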

  9. How Automatic TC is done: Knowledge Engineering • Still inefficient and impractical when • Number of categories is large • Category definitions can change over time • Personalised application where an expert/knowledge engineer is unavailable • Inconsistency issues as rule set gets larger

  10. How Automatic TC is done: Machine Learning • Since the 1990s • The learning algorithm is given a small set of manually classified documents (training documents/dataset) • Documents to be classified are test documents/dataset • Produces a classification rule automatically • A.k.a. a supervised learning problem • But how do we make the learning algorithm learn from the training documents?

  11. How Automatic TC is done: Machine Learning - Preprocessing • Pre-processing • Representing Text • Bag-of-words approach* – Term Frequency (TF) • Feature selection • Stopword removal • Feature construction • Stemming • Term weighting – DF, IDF *bag-of-words approach may not be the best method for other languages
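The preprocessing steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STOPWORDS` set is a tiny made-up list, `crude_stem` is a rough suffix stripper standing in for a real stemmer such as Porter's, and IDF is taken as log(N / DF), which is one common variant. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "by", "in", "was", "with"}


def crude_stem(word):
    # Strip one common suffix if at least 3 letters remain.
    # A real system would use a proper stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Tokenise, remove stopwords (feature selection), stem (feature construction).
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def term_frequencies(text):
    # Bag-of-words representation: term -> number of occurrences (TF).
    return Counter(preprocess(text))


def inverse_document_frequencies(docs):
    # DF = number of documents containing the term; IDF = log(N / DF).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    return {term: math.log(n / count) for term, count in df.items()}
```

A term that occurs in every document gets an IDF of zero under this variant, so it contributes nothing to a TF × IDF weight.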

  12. How Automatic TC is done: Machine Learning - Preprocessing • Fatal drug mix killed US R&B star Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs according to a US coroner. The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand. The official cause of death was acute intoxication, and the death was ruled to be accidental, he said. Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova. He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle. --- BBC Sunday, 11 February 2007, 13:03 GMT

  13. How Automatic TC is done: Machine Learning - Preprocessing • [Slides 13 to 17 repeat the article from slide 12; the highlighting shown on the original slides did not survive extraction.]

  18. How Automatic TC is done: Machine Learning - kNN • k-Nearest Neighbour (kNN) • Documents located close to each other are more likely to belong to the same class • k is a pre-defined parameter that determines how many “neighbouring” training documents are considered when classifying a test document • k is a positive integer, e.g. 1, 3, 5, 7, 10… • Cosine similarity is commonly used to determine the closeness of two documents
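As an illustration of the closeness measure, cosine similarity between two bag-of-words vectors can be computed as below. This sketch assumes each document is a dict mapping a term to its weight (for example a TF or TF × IDF value); the function name and that representation are choices made here, not taken from the slides.

```python
import math


def cosine_similarity(a, b):
    # a, b: sparse vectors as dicts mapping term -> weight.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (same direction), and it does not depend on document length.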

  19. How Automatic TC is done: Machine Learning - kNN

  20. How Automatic TC is done: Machine Learning - kNN • Majority voting scheme

  21. How Automatic TC is done: Machine Learning - kNN • Weighted-sum voting scheme

  22. How Automatic TC is done: Machine Learning - kNN • The score for a category is the sum of the similarity scores between the point to be classified and all of its k-neighbours that belong to the given category. • To restate: $\mathrm{score}(x, c) = \sum_{d \in \mathrm{kNN}(x)} \mathrm{sim}(x, d)\, I(d, c)$, where x is the new point; c is a class (e.g. black or white); d is a classified point among the k-nearest neighbours of x; sim(x,d) is the similarity between x and d; I(d,c) = 1 if point d belongs to class c; I(d,c) = 0 otherwise.
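The two voting schemes can be compared with a short Python sketch. It assumes documents are sparse dicts of term weights and that similarity is cosine similarity; `knn_scores` and `knn_classify` are hypothetical names. With `weighted=True` each neighbour adds its similarity to its class score (the weighted-sum scheme); with `weighted=False` each neighbour adds 1 (majority voting).

```python
import math
from collections import defaultdict


def cosine(a, b):
    # Cosine similarity between two sparse vectors (dicts term -> weight).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(x, training, k, weighted=True):
    # training: list of (vector, label) pairs.
    # Keep the k training documents most similar to x.
    ranked = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)
    scores = defaultdict(float)
    for vector, label in ranked[:k]:
        # Weighted sum: add sim(x, d); majority voting: add 1 per neighbour.
        scores[label] += cosine(x, vector) if weighted else 1.0
    return dict(scores)


def knn_classify(x, training, k, weighted=True):
    # Pick the class with the highest score.
    scores = knn_scores(x, training, k, weighted)
    return max(scores, key=scores.get)
```

The schemes can disagree: one very close "black" neighbour can outweigh two distant "white" neighbours under the weighted sum, while majority voting would pick "white".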

  23. Exercise Imagine a language made up of five English letters, A, B, C, D and E, with B, D and E being stopwords. The kNN system has been “trained” with 3 training documents, which belong to TWO different categories (see below), and the task is to classify a new document (test document) into one of the two categories using the process of automatic text classification with kNN (k=1). Preprocessed Training Documents: Unpreprocessed Test Document:

  24. How we know it works • Given n test documents and m categories under consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each category
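A per-category contingency table can be computed as in the sketch below. It assumes the true and predicted categories of each test document are given as sets, and it uses the usual cell names TP, FP, FN and TN; the function name is hypothetical.

```python
def contingency_table(true_labels, predicted_labels, category):
    # true_labels, predicted_labels: one set of categories per test document,
    # in the same document order.
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        in_truth = category in truth
        in_predicted = category in predicted
        if in_truth and in_predicted:
            table["TP"] += 1      # correctly assigned to the category
        elif in_predicted:
            table["FP"] += 1      # assigned, but does not belong
        elif in_truth:
            table["FN"] += 1      # belongs, but was not assigned
        else:
            table["TN"] += 1      # correctly left out
    return table
```

Each document contributes exactly one binary decision per category, so the four cells of every table sum to n.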

  25. How we know it works • Performance measures • Precision (p) • Recall (r) • F1-measure • Accuracy

  26. How we know it works • Precision = TP/(TP+FP) where TP + FP > 0 (otherwise undefined). • Of the times we predicted it was “in class”, how often were we correct? • Recall = TP/(TP+FN) where TP + FN > 0 (otherwise undefined). • Did we find all of those that belonged in the class?

  27. How we know it works • F1-measure = 2pr/(p + r) • The harmonic mean of precision and recall, with both weighted equally • Single performance measure to compare different learning algorithms • Accuracy = (No. of TP for all categories) / (No. of all test documents)
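The measures on the last two slides translate directly into code. The following Python sketch returns None where a measure is undefined; that return convention and the function names are choices made here, not something stated in the slides.

```python
def precision(tp, fp):
    # TP / (TP + FP); undefined (None) when nothing was predicted in class.
    return tp / (tp + fp) if tp + fp > 0 else None


def recall(tp, fn):
    # TP / (TP + FN); undefined (None) when nothing belongs to the class.
    return tp / (tp + fn) if tp + fn > 0 else None


def f1_measure(p, r):
    # Harmonic mean of precision and recall: 2pr / (p + r).
    if p is None or r is None or p + r == 0:
        return None
    return 2 * p * r / (p + r)
```

For example, TP = 8, FP = 2 and FN = 8 give p = 0.8 and r = 0.5, so F1 = 0.8 / 1.3, roughly 0.615, which sits closer to the lower of the two values than their arithmetic mean of 0.65.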

  28. Example: Email Classification • Emails are classified into folders • Multi-class setting • Emails are constantly being received • kNN is updated weekly, i.e. add received emails that were foldered to the training dataset • Text in email body and sender field is used to represent an email • BOW representation, stemming but no stopword removal • Dataset: Enron Email Corpus

  29. Example: Email Classification • Results • User ID 5: received 87 emails in 18 weeks and keeps them in 7 folders • kNN correctly classified 72 emails • Accuracy = 72 / 87 = 0.8276 = 82.76% • User ID 70: received 881 emails in 114 weeks and keeps them in 69 folders • kNN correctly classified 517 emails • Accuracy = 517 / 881 = 0.5868 = 58.68% • More folders means more complex classification problem
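The accuracy figures on this slide can be checked with a couple of lines of Python. The `accuracy_percent` helper is a hypothetical name; the counts are the ones reported above.

```python
def accuracy_percent(correct, total):
    # Share of test documents classified correctly, as a percentage.
    return round(100 * correct / total, 2)


user_5 = accuracy_percent(72, 87)     # 7 folders
user_70 = accuracy_percent(517, 881)  # 69 folders
print(user_5, user_70)
```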

  30. Summary • Categorised data means more effective retrieval and search • Exponential growth in the number of electronic documents makes automatic TC a must • Simple yet robust techniques can deliver practical solutions to real-world problems • kNN is one of the most effective methods (and arguably the simplest) • Personal Information Management (PIM) is a new direction for TC

  31. Other Resources • Sebastiani, F. Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, 2002. • Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer Academic Publishers, 2002
