Learn about the basics of text classification, including supervised learning, classification methods, and the Naïve Bayes model. Explore how to categorize documents efficiently and accurately using machine learning techniques.
Text Classification and Naïve Bayes
Chien Chin Chen
Department of Information Management, National Taiwan University
Basic (1/6) • Search engines usually serve ad hoc retrieval – queries represent transient information needs. • However, many users have ongoing information needs. • For example, you are interested in developments in “multicore computer chips”. • One way of tracking this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. • Many systems support such standing queries, which are periodically executed on new documents added over time. • However, the query would tend to miss many relevant articles that use other terms, such as processors.
Basic (2/6) • To achieve good recall … • You have to refine the query over and over again. • But … the query can become quite complex and error-prone. • A more general notion of the standing query – classification. • Rough definition: • Given a set of classes, we seek to determine which class(es) a given (new) object belongs to. • For instance, to divide new newswire articles into two classes: • Documents about multicore computer chips. • Documents not about multicore computer chips.
Basic (3/6) • Often, a class is a more general subject area. • Such as China, sports, … • General classes are usually referred to as topics. • The classification task is then called topic classification, topic spotting, text classification, or text categorization. • The notion of classification is very general: • Image classification. • Security classification. • …
Basic (4/6) • A computer is not essential for classification. • Many classification tasks have traditionally been solved manually. • For example, books in a library are assigned Library of Congress categories by a librarian. • But manual classification is expensive to scale. • An alternative approach is classification by the use of rules. • Most commonly written by hand. • Usually have good scaling properties. • But creating and maintaining them over time is labor-intensive.
Basic (5/6) • Apart from manual classification and hand-crafted rules, here we focus on machine learning based (text) classification. • In this approach, the decision criterion of the text classifier is learned automatically from training data. • Obviously, the need for manual classification is not eliminated: • The training documents come from a person who has labeled them. • Labeling – the process of annotating each document with its class. • Usually easier than rule writing.
Basic (6/6) • This type of classification is a type of supervised learning. • A supervisor (oracle) serves as a teacher directing the learning process. • That is, defining the classes and labeling training documents.
The Text Classification Problem (1/4) • Definition: • Document space X: • Some type of high-dimensional space. • Dimensions – terms. • Classes C: • C = {c1, c2, …, cJ}. • Defined by human experts for the needs of an application. • E.g., C = {China, UK, …} • Training set D: • D = {<d, c>}, where <d, c> ∈ X × C. • E.g., <d, c> = <Beijing joins …, China>
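As a tiny illustration of what a training set D looks like in code, the sketch below represents D as a list of (document, class) pairs; the documents and class labels are invented examples, not taken from the slides.

    # A toy training set D = {<d, c>}: each entry pairs a document with its class.
    # All documents and labels here are made-up examples.
    training_set = [
        ("Beijing joins the World Trade Organization", "China"),
        ("London prepares for the general election", "UK"),
        ("Chip makers unveil new multicore processors", "technology"),
    ]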
The Text Classification Problem (2/4) • Using a learning method or learning algorithm, we wish to learn a classifier (or classification function) γ that maps documents to classes: γ : X → C. • We denote the supervised learning method by Γ and write Γ(D) = γ. • The learning method Γ takes the training set D as input and returns the learned classifier γ.
The Text Classification Problem (3/4) • Once we have learned γ, we can apply it to the test set (or test data). • The test set is chosen independently of the training data. • A small sketch of learning and then applying a classifier follows.
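As a concrete but hedged illustration of Γ (learning) and γ (applying), the sketch below uses scikit-learn's MultinomialNB; the library choice, documents, and labels are assumptions made for illustration, not part of the slides.

    # Learning a classifier from labeled documents, then applying it to new text.
    # Documents, labels, and the test sentence are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = [
        "Beijing and Shanghai report strong growth",      # class: China
        "London and Manchester prepare for elections",    # class: UK
    ]
    train_labels = ["China", "UK"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)   # documents -> term-count vectors

    classifier = MultinomialNB()
    classifier.fit(X_train, train_labels)            # the learning step: Gamma(D) = gamma

    X_test = vectorizer.transform(["Beijing joins the WTO"])
    print(classifier.predict(X_test))                # applying gamma to unseen text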
The Text Classification Problem (4/4) • The classes in text classification can form a class hierarchy. • This is often more practical and useful. • For simplicity, we assume that the classes form a set with no subset relationships between them. • Documents to be classified can be members of more than one class. • For instance, a document about the 2008 Olympics should be a member of two classes: the China class and the sports class.
Naïve Bayes Text Classification (1/10) • The multinomial Naïve Bayes model (or NB model) is a probabilistic learning method. • In text classification, our goal is to find the “best” class for the document: cmap = argmaxc∈C P(c|d) = argmaxc∈C P(c) P(d|c), where P(c|d) is the probability of a document d being in class c (by Bayes’ rule; P(d) is the same for every class and can be dropped). • We do not know the true values of these distributions, but we can approximate them using the given training data.
Naïve Bayes Text Classification (2/10) • E.g., if d = “Chinese Beijing Chinese”, then P(d|c) = P(Chinese, Beijing, Chinese | c). • Why is this hard to estimate directly? …
Naïve Bayes Text Classification (3/10) • Estimating this joint probability directly is very complex … so, to reduce the number of probability parameters, we make the Naïve Bayes conditional independence assumption. • That is, attributes (terms) are independent of each other given the class: P(ti | t1, …, ti-1, c) = P(ti | c). • So P(d|c) = P(t1|c) × P(t2|c) × … × P(tnd|c), the product of the per-term probabilities over the token positions of d (a small numeric sketch follows).
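To make the independence assumption concrete, the short sketch below computes P(d|c) as a product of per-term probabilities; the probability values are made-up numbers, not estimates from any training data.

    # P(d|c) under the Naive Bayes independence assumption: multiply P(t|c)
    # once per token position. The probabilities below are invented.
    cond_prob_c = {"Chinese": 0.4, "Beijing": 0.1}   # hypothetical P(t|c)

    doc = ["Chinese", "Beijing", "Chinese"]
    p_d_given_c = 1.0
    for term in doc:
        p_d_given_c *= cond_prob_c[term]
    print(p_d_given_c)   # 0.4 * 0.1 * 0.4 = 0.016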
Naïve Bayes Text Classification (4/10) • Then … how do we estimate the probabilities from the training data D? • Prior: P(c) = Nc / N, where Nc is the number of training documents in class c and N is the total number of training documents. • Conditional: P(t|c) = Tct / ∑t' Tct', where Tct is the number of occurrences of t in D from class c and the denominator is the total number of terms in D from class c. • P(c) is often called the prior probability of c. • P(c|d) is often called the posterior probability of c, because it reflects our confidence that c holds after we have seen d.
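The sketch below shows these maximum-likelihood estimates on a toy labeled corpus; the two documents and their labels are invented for illustration.

    # Maximum-likelihood estimates P(c) = Nc / N and P(t|c) = Tct / sum_t' Tct'.
    # The labeled documents are toy examples.
    from collections import Counter

    training_data = [
        (["Chinese", "Beijing", "Chinese"], "c"),
        (["Tokyo", "Japan"], "not-c"),
    ]

    n_docs = len(training_data)
    docs_per_class = Counter(label for _, label in training_data)
    prior = {c: n / n_docs for c, n in docs_per_class.items()}        # P(c)

    token_counts = {}                                                  # Tct per class
    for tokens, label in training_data:
        token_counts.setdefault(label, Counter()).update(tokens)

    cond_prob = {
        c: {t: n / sum(counts.values()) for t, n in counts.items()}   # P(t|c)
        for c, counts in token_counts.items()
    }
    print(prior["c"], cond_prob["c"]["Chinese"])   # 0.5 and 2/3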
Naïve Bayes Text Classification (5/10) • A problem with computing cmap = argmaxc P(c) ∏i P(ti|c) directly: • When many conditional probabilities are multiplied, the result can underflow to zero in floating point!! • Solution – add logarithms of probabilities instead of multiplying probabilities. • log(xy) = log(x) + log(y) • Then cmap = argmaxc∈C [ log P(c) + ∑i log P(ti|c) ].
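A quick numeric sketch of the underflow problem and the log-space fix; the probability value and document length below are arbitrary choices.

    # Multiplying many small probabilities underflows to 0.0, while summing
    # their logarithms stays finite and preserves the argmax ordering.
    import math

    p = 1e-5
    product = 1.0
    log_sum = 0.0
    for _ in range(100):          # e.g., a 100-term document
        product *= p
        log_sum += math.log(p)

    print(product)                # 0.0 (underflow)
    print(log_sum)                # about -1151.3, still usable for comparisons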
Naïve Bayes Text Classification (6/10) • A problem with the maximum-likelihood estimate of P(t|c): • A term–class combination that did not occur in the training data gets a zero probability. • Suppose the term ‘WTO’ occurs only in China documents in the training data. • Then P(‘WTO’ | c) will be zero for the other classes. • Now, the one-sentence document “Britain is a member of the WTO” will get a zero probability for the UK class. • No matter how strong the evidence for the class UK from other terms!!
Naïve Bayes Text Classification (7/10) • The probability is zero because of sparseness: • The training data is never large enough to represent the frequency of rare events adequately. • Solution – add-one smoothing (or Laplace smoothing): P(t|c) = (Tct + 1) / ∑t' (Tct' + 1) = (Tct + 1) / (∑t' Tct' + |V|), where |V| is the size of the vocabulary.
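A small sketch of the smoothed estimate, using made-up per-class counts and a toy vocabulary.

    # Add-one (Laplace) smoothing: every term gets at least a count of 1,
    # so no term-class probability is exactly zero. Counts are invented.
    vocabulary = ["Chinese", "Beijing", "Shanghai", "WTO"]
    Tct = {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "WTO": 0}   # counts in class c

    total = sum(Tct.values())                                      # sum_t' Tct'
    smoothed = {t: (Tct[t] + 1) / (total + len(vocabulary)) for t in vocabulary}
    print(smoothed["WTO"])   # (0 + 1) / (7 + 4) ~= 0.09, no longer zero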
Naïve Bayes Text Classification (8/10) • Naïve Bayes algorithm – training phase.
TrainMultinomialNB(C, D)
  V ← ExtractVocabulary(D)
  N ← CountDocs(D)
  for each c in C do
    Nc ← CountDocsInClass(D, c)
    prior[c] ← Nc / N
    textc ← ConcatenateTextOfAllDocsInClass(D, c)
    for each t in V do
      Tct ← CountTokensOfTerm(textc, t)
    for each t in V do
      condprob[t][c] ← (Tct + 1) / ∑t' (Tct' + 1)
  return V, prior, condprob
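Below is a Python sketch of the training phase, assuming each training document is already tokenized into a list of terms; the function and variable names simply mirror the pseudocode and are not prescribed by the slides.

    # A Python sketch of TrainMultinomialNB with add-one smoothing.
    # Input: classes, and documents as (token_list, class_label) pairs.
    from collections import Counter

    def train_multinomial_nb(classes, documents):
        vocabulary = sorted({t for tokens, _ in documents for t in tokens})
        n = len(documents)
        prior, condprob = {}, {t: {} for t in vocabulary}
        for c in classes:
            docs_c = [tokens for tokens, label in documents if label == c]
            prior[c] = len(docs_c) / n                       # P(c) = Nc / N
            term_counts = Counter(t for tokens in docs_c for t in tokens)
            denom = sum(term_counts.values()) + len(vocabulary)
            for t in vocabulary:                             # add-one smoothing
                condprob[t][c] = (term_counts[t] + 1) / denom
        return vocabulary, prior, condprob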
Naïve Bayes Text Classification (9/10) • Naïve Bayes algorithm – testing phase.
ApplyMultinomialNB(C, V, prior, condprob, d)
  W ← ExtractTokensFromDoc(V, d)
  for each c in C do
    score[c] ← log prior[c]
    for each t in W do
      score[c] += log condprob[t][c]
  return argmaxc score[c]
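A companion Python sketch of the testing phase in log space; the toy prior and conditional probabilities are invented here (in practice they would come from a training step such as the sketch above).

    # Score each class as log P(c) + sum of log P(t|c), then take the argmax.
    # Tokens outside the vocabulary are ignored. Probabilities are made up.
    import math

    def apply_multinomial_nb(classes, vocabulary, prior, condprob, tokens):
        vocab = set(vocabulary)
        scores = {}
        for c in classes:
            scores[c] = math.log(prior[c])                  # log P(c)
            for t in tokens:
                if t in vocab:
                    scores[c] += math.log(condprob[t][c])   # + log P(t|c)
        return max(scores, key=scores.get)                  # argmax_c score[c]

    vocabulary = ["Chinese", "Tokyo"]
    prior = {"c": 0.75, "not-c": 0.25}
    condprob = {"Chinese": {"c": 0.6, "not-c": 0.2},
                "Tokyo": {"c": 0.1, "not-c": 0.4}}
    print(apply_multinomial_nb(["c", "not-c"], vocabulary, prior, condprob,
                               ["Chinese", "Chinese", "Tokyo"]))   # -> c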
Naïve Bayes Text Classification (10/10) • Worked example (the training and test documents are not shown on this slide, but the numbers imply: three of the four training documents belong to class c, the concatenated class-c text contains 8 tokens of which 5 are “Chinese”, the non-c text contains 3 tokens, and the test document d contains three occurrences of “Chinese” plus “Tokyo” and “Japan”). • Training: • Vocabulary V = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan} and |V| = 6. • P(c) = 3/4 and P(~c) = 1/4. • P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7, P(Tokyo|c) = P(Japan|c) = (0+1) / (8+6) = 1/14, … • P(Chinese|~c) = (1+1) / (3+6) = 2/9, P(Tokyo|~c) = P(Japan|~c) = (1+1) / (3+6) = 2/9, … • Testing: • P(c|d) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003 • P(~c|d) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001 • The classifier therefore assigns d to class c.
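The short sketch below re-derives these numbers, assuming (as inferred above) that the test document has three “Chinese” tokens plus “Tokyo” and “Japan”.

    # Recomputing the worked example's smoothed estimates and class scores.
    V = 6   # |{Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}|

    # class c: 8 tokens total, 5 of them "Chinese", none "Tokyo"/"Japan"
    p_chinese_c = (5 + 1) / (8 + V)                 # 3/7
    p_tokyo_c = p_japan_c = (0 + 1) / (8 + V)       # 1/14

    # class ~c: 3 tokens total, one each of "Chinese", "Tokyo", "Japan"
    p_chinese_nc = p_tokyo_nc = p_japan_nc = (1 + 1) / (3 + V)   # 2/9

    score_c = 3/4 * p_chinese_c**3 * p_tokyo_c * p_japan_c
    score_nc = 1/4 * p_chinese_nc**3 * p_tokyo_nc * p_japan_nc
    print(round(score_c, 4), round(score_nc, 4))    # 0.0003 0.0001 -> choose c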