Introduction to Automatic Text Classification George Ke 13 Feb 2007
Overview • What is Text Classification (TC) • Motivation of Automatic TC • How Automatic TC is done • Preprocessing • k-Nearest Neighbour • How we know it works • Example: Email Classification • Summary
What is Text Classification • TC is commonly referred to as “the task of classifying natural language documents into a pre-defined set of semantic categories”. • For example: Entertainment, Health, Business, Technology etc.
Motivation of Automatic TC • Categorised data are easier for users to browse • Organisational view of data provides more effective retrieval • Efficient search is not enough
Motivation of Automatic TC • Manual text classification is time-consuming and expensive • MEDLINE (National Library of Medicine) indexed over 600k citations in 2006 using Medical Subject Headings (MeSH; 23,000 categories) • Yahoo! Directories – over 500k categories
Motivation of Automatic TC • Fatal drug mix killed US R&B star Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs according to a US coroner. The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand. The official cause of death was acute intoxication, and the death was ruled to be accidental, he said. Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova. He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle. --- BBC Sunday, 11 February 2007, 13:03 GMT Category: Music? Health? Entertainment? R&B? USA? Medicine? UK?
How Automatic TC is done: Learning Task • Binary setting • Simplest problem: e.g., spam vs non-spam • Multi-Class setting • E.g. the task of classifying a news story into one of the categories in the BBC directory • Can be treated as n binary tasks • Multi-Label setting • A document can belong to several categories, exactly one, or none at all
How Automatic TC is done: Knowledge Engineering • In the late 1980s • Knowledge Engineering • Experts hand-craft classification rules • Rules • Rule 1: (R&B or star or soul) and (singer or artist) → Music • Rule 2: (drug or prescription) and medication → Medicine • Rule 3: (anxiety or pain or allergy) and acute → Health • Rule 4: (play or fame) and award → Entertainment • Rule …
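The hand-crafted rules above can be sketched in a few lines of Python. This is only an illustration: the `RULES` table, `tokenize` and `classify` are hypothetical names, and the exact-word matching is an assumption (the slides do not say how keywords were matched).

```python
import re

# Each category maps to a list of keyword groups (an AND of ORs), mirroring
# Rules 1-4 on the slide. Names and structure are illustrative assumptions.
RULES = {
    "Music": [{"r&b", "star", "soul"}, {"singer", "artist"}],
    "Medicine": [{"drug", "prescription"}, {"medication"}],
    "Health": [{"anxiety", "pain", "allergy"}, {"acute"}],
    "Entertainment": [{"play", "fame"}, {"award"}],
}

def tokenize(text):
    # Lowercase and keep '&' so that "R&B" survives as a single token.
    return set(re.findall(r"[a-z0-9&]+", text.lower()))

def classify(text):
    # A category fires when every keyword group has at least one match.
    words = tokenize(text)
    return [category for category, groups in RULES.items()
            if all(words & group for group in groups)]

if __name__ == "__main__":
    story = ("The singer, an R&B star, had pain killers and anxiety "
             "medication; the cause was acute intoxication from "
             "prescription drugs.")
    print(classify(story))
```

Note that the plural "drugs" does not match the keyword "drug" here, which is one reason rule sets grow quickly and why stemming is introduced later in the deck.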
How Automatic TC is done: Knowledge Engineering • Still inefficient and impractical when • Number of categories is large • Category definitions can change over time • Personalised application where an expert/knowledge engineer is unavailable • Inconsistency issues as rule set gets larger
How Automatic TC is done: Machine Learning • Since the 1990s • The learning algorithm is given a small set of manually classified documents (training documents/dataset) • Documents to be classified are test documents/dataset • Produces a classification rule automatically • Also known as a supervised learning problem • But, how do we make the learning algorithm learn from the training documents?
How Automatic TC is done: Machine Learning - Preprocessing • Pre-processing • Representing Text • Bag-of-words approach* – Term Frequency (TF) • Feature selection • Stopword removal • Feature construction • Stemming • Term weighting – DF, IDF *bag-of-words approach may not be the best method for other languages
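As a rough illustration of the term-weighting step, the sketch below computes document frequency (DF) and a TF-IDF weight for tokenised documents. The formula idf(t) = log(N / df(t)) is one common variant and is an assumption here; the slides do not specify which weighting was used. The function names are hypothetical.

```python
import math
from collections import Counter

def document_frequency(docs):
    # DF: in how many documents does each term occur at least once?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tf_idf(docs):
    # Weight = term frequency * log(N / DF); an assumed, common variant.
    n = len(docs)
    df = document_frequency(docs)
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

if __name__ == "__main__":
    docs = [["drug", "drug", "singer"], ["drug", "coroner"]]
    for weights in tf_idf(docs):
        print(weights)
```

A term that occurs in every document (here "drug") gets weight 0, reflecting that it does not help tell documents apart.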
How Automatic TC is done: Machine Learning - Preprocessing • Fatal drug mix killed US R&B star Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs according to a US coroner. The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand. The official cause of death was acute intoxication, and the death was ruled to be accidental, he said. Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova. He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle. --- BBC Sunday, 11 February 2007, 13:03 GMT
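The preprocessing steps applied to this article (stopword removal, stemming, term frequency counting) can be sketched as follows. The stopword list is a small made-up sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's; both are assumptions for illustration only.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "as", "by", "had", "he", "his", "in",
             "of", "the", "to", "was", "who", "with"}

def crude_stem(word):
    # Toy suffix stripping, NOT the Porter algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. tokenise, 2. drop stopwords, 3. stem, 4. count term frequency (TF).
    tokens = re.findall(r"[a-z0-9&]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(crude_stem(t) for t in kept)

if __name__ == "__main__":
    sentence = ("The singer had pain killers and allergy drugs, "
                "and the drugs killed him.")
    print(preprocess(sentence))
```

The resulting bag of words ignores word order entirely: only the stems and their counts are passed on to the classifier.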
How Automatic TC is done: Machine Learning - kNN • k-Nearest Neighbour (kNN) • Documents located close to each other are more likely to belong to the same class • k is a pre-defined parameter that determines how many “neighbouring” training documents are considered when classifying a test document • k is an integer, e.g. 1, 3, 5, 7, 10… • Cosine similarity is commonly used to measure the closeness of two documents
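Cosine similarity between two bag-of-words vectors can be computed as below. This is a minimal sketch that assumes documents are represented as dictionaries mapping terms to weights; the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    # a, b: dicts mapping term -> weight (e.g. TF or TF-IDF).
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    doc1 = {"drug": 1, "singer": 1}
    doc2 = {"drug": 1}
    print(cosine_similarity(doc1, doc2))
```

Because the measure is normalised by vector length, a long and a short document about the same topic can still score as close neighbours.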
How Automatic TC is done: Machine Learning - kNN • Majority voting scheme
How Automatic TC is done: Machine Learning - kNN • Weighted-sum voting scheme
How Automatic TC is done: Machine Learning - kNN • The score for a category is the sum of the similarity scores between the point to be classified and all of its k nearest neighbours that belong to the given category. • To restate: $\mathrm{score}(x, c) = \sum_{d \in \mathrm{kNN}(x)} \mathrm{sim}(x, d)\, I(d, c)$, where x is the new point; c is a class (e.g. black or white); d is a classified point among the k nearest neighbours of x; sim(x, d) is the similarity between x and d; I(d, c) = 1 if point d belongs to class c, and I(d, c) = 0 otherwise.
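The two voting schemes can be compared with a short sketch. It assumes the similarities between the new point and every training point have already been computed; the function names and the example numbers are made up for illustration.

```python
from collections import Counter, defaultdict

def k_nearest(similarities, k):
    # similarities: list of (sim(x, d), class of d), one per training point d.
    return sorted(similarities, key=lambda pair: pair[0], reverse=True)[:k]

def majority_vote(neighbours):
    # Each neighbour casts one vote for its own class.
    counts = Counter(label for _, label in neighbours)
    return counts.most_common(1)[0][0]

def weighted_sum_vote(neighbours):
    # score(x, c) = sum of sim(x, d) over neighbours d that belong to c.
    scores = defaultdict(float)
    for sim, label in neighbours:
        scores[label] += sim
    return max(scores, key=scores.get)

if __name__ == "__main__":
    sims = [(0.9, "black"), (0.3, "white"), (0.35, "white"),
            (0.1, "black"), (0.05, "white")]
    nearest = k_nearest(sims, k=3)
    print(majority_vote(nearest), weighted_sum_vote(nearest))
```

With these made-up numbers the schemes disagree: two of the three nearest neighbours are white, but the single black neighbour is far more similar, so the weighted sum favours black.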
Exercise Imagine a language that is made up of five English letters, A, B, C, D and E, with B, D and E being stopwords. The kNN system has been “trained” with 3 training documents, which belong to TWO different categories (see below), and the task is to classify a new document (test document) into one of the two categories using the process of automatic text classification with kNN (k=1). Preprocessed Training Documents: Unpreprocessed Test Document:
How we know it works • Given n test documents and m categories under consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each category
How we know it works • Performance measures • Precision (p) • Recall (r) • F1-measure • Accuracy
How we know it works • Precision = TP/(TP+FP) where TP + FP > 0 (otherwise undefined). • Of the times we predicted it was “in class”, how often were we correct? • Recall = TP/(TP+FN) where TP + FN > 0 (otherwise undefined). • Did we find all of those that belonged in the class?
How we know it works • F1-measure = 2pr / (p + r) • The harmonic mean of precision and recall, with both weighted equally • Single performance measure to compare different learning algorithms • Accuracy = (No. of TP for all categories) / (No. of all test documents)
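The measures above translate directly into code. The sketch below returns `None` where a measure is undefined; that convention and the function names are assumptions, not part of the slides.

```python
def precision(tp, fp):
    # TP / (TP + FP); undefined (None) when nothing was predicted "in class".
    return tp / (tp + fp) if tp + fp > 0 else None

def recall(tp, fn):
    # TP / (TP + FN); undefined (None) when the class has no true members.
    return tp / (tp + fn) if tp + fn > 0 else None

def f1(p, r):
    # Harmonic mean of precision and recall.
    if p is None or r is None or p + r == 0:
        return None
    return 2 * p * r / (p + r)

def accuracy(correct, total):
    # Correctly classified documents over all test documents.
    return correct / total

if __name__ == "__main__":
    p = precision(tp=8, fp=2)
    r = recall(tp=8, fn=8)
    print(p, r, f1(p, r), accuracy(72, 87))
```

The contingency counts in the usage lines are invented for illustration; the accuracy call uses the 72 out of 87 figure reported for User ID 5 in the email example.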
Example: Email Classification • Emails are classified into folders • Multi-class setting • Emails are constantly being received • kNN is updated weekly, i.e. add received emails that were foldered to the training dataset • Text in email body and sender field is used to represent an email • BOW representation, stemming but no stopword removal • Dataset: Enron Email Corpus
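The weekly update cycle can be sketched as a loop that classifies each week's emails with the current training set and then adds them, with their folders, to that set. Everything here is a simplified assumption: `nearest_folder` is a hypothetical 1-nearest-neighbour stand-in that uses raw token overlap rather than cosine similarity, and the evaluate-then-add protocol is a guess at how the experiment was run.

```python
def nearest_folder(training, tokens):
    # Hypothetical 1-NN: pick the folder of the training email that shares
    # the most tokens with the new email; None if nothing overlaps.
    best, best_overlap = None, 0
    for train_tokens, folder in training:
        overlap = len(tokens & train_tokens)
        if overlap > best_overlap:
            best, best_overlap = folder, overlap
    return best

def run_weekly(weeks, classify):
    # weeks: list of weeks, each a list of (token set, true folder) pairs.
    training, correct, total = [], 0, 0
    for week in weeks:
        for tokens, folder in week:
            if classify(training, tokens) == folder:
                correct += 1
            total += 1
        training.extend(week)  # weekly update with the foldered emails
    return correct, total

if __name__ == "__main__":
    weeks = [
        [({"meeting", "budget"}, "work"), ({"party", "friday"}, "social")],
        [({"budget", "report"}, "work"), ({"friday", "drinks"}, "social")],
    ]
    correct, total = run_weekly(weeks, nearest_folder)
    print(correct, total, correct / total)
```

In this toy run the first week is always misclassified because the training set starts empty, which hints at why users with few emails per folder are harder to serve.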
Example: Email Classification • Results • User ID 5: received 87 emails in 18 weeks and keeps them in 7 folders • kNN correctly classified 72 emails • Accuracy = 72 / 87 = 0.8276 = 82.76% • User ID 70: received 881 emails in 114 weeks and keeps them in 69 folders • kNN correctly classified 517 emails • Accuracy = 517 / 881 = 0.5868 = 58.68% • More folders mean a more complex classification problem
Summary • Categorised data means more effective retrieval and search • Exponential growth in the number of electronic documents makes automatic TC a must • Simple yet robust techniques can deliver practical solutions to real-world problems • kNN is one of the most effective methods (and arguably the simplest) • Personal Information Management (PIM) is a new direction for TC
Other Resources • Sebastiani, F. Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, 2002. • Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer Academic Publishers, 2002