This text discusses the implementation details of a text classification project, including the feature selection step using Information Gain and the classification algorithm.
Implementation Details of the Text Classification Project
Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo
Spring 2001
Feature Selection Step
• Keywords are selected from the text by scoring every word and keeping the highest-scoring ones. The score used here is Information Gain.
• For each unique word, we record the number of documents in each class in which the word occurs.
Feature Selection Step - Algorithm
for each document d in training set
    for each unique word w in d
        if w has not been encountered before
            create a new data record for w
        increment the document count for Category(d) in the record for w
for each word w
    using the record for w, calculate Information Gain
select the NUM_KEYWORDS words with the highest Information Gain
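The counting pass above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the names `count_document_frequencies` and `select_keywords`, the whitespace tokenizer, the toy documents, and the `NUM_KEYWORDS` value are all assumptions. The scoring function is passed in as a parameter, so Information Gain or any other score can be supplied. The last lines rank by total document count purely as a stand-in score.

```python
from collections import defaultdict

NUM_KEYWORDS = 2  # hypothetical value; the slides do not state the real one


def count_document_frequencies(training_set):
    """Return {word: {category: number of documents containing the word}}."""
    records = {}
    for category, text in training_set:
        # A set ensures each word is counted once per document.
        # Lowercasing and whitespace splitting are assumed tokenization.
        for word in set(text.lower().split()):
            if word not in records:
                records[word] = defaultdict(int)  # new data record for word
            records[word][category] += 1
    return records


def select_keywords(records, score, k=NUM_KEYWORDS):
    """Return the k words whose per-category counts score highest."""
    ranked = sorted(records, key=lambda w: score(records[w]), reverse=True)
    return ranked[:k]


# Toy training set of (category, text) pairs, invented for illustration.
docs = [
    ("sports", "the team won the game"),
    ("sports", "the game was close"),
    ("politics", "the vote was close"),
]
records = count_document_frequencies(docs)

# Stand-in score: total number of documents containing the word.
top = select_keywords(records, lambda counts: sum(counts.values()), k=1)
```

Using a set per document matters here: "the" appears twice in the first toy document but contributes only one to the document count for its category.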
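Information Gain

$$G(t) = -\sum_{i=1}^{m} \Pr(c_i)\,\log \Pr(c_i) + \Pr(t)\sum_{i=1}^{m} \Pr(c_i \mid t)\,\log \Pr(c_i \mid t) + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(c_i \mid \bar{t})\,\log \Pr(c_i \mid \bar{t})$$

$$\Pr(c_i) = \frac{1}{20}$$

$$\Pr(t) = \frac{\sum_{i=1}^{m} \mathrm{Cat}_i(t)}{\sum_{i=1}^{m}\sum_{j=1}^{w} \mathrm{Cat}_i(j)}$$

$$\Pr(c_i \mid t) = \frac{\mathrm{Cat}_i(t)}{\sum_{k=1}^{m} \mathrm{Cat}_k(t)}$$

Here $m$ is the number of classes, $w$ is the number of unique words, $\mathrm{Cat}_i(t)$ is the number of documents in class $c_i$ in which word $t$ occurs, and $\bar{t}$ denotes the absence of $t$.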
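A rough Python sketch of the score above, computed from per-word document counts. Several pieces are assumptions rather than the project's method: the function names, the `{word: {category: count}}` record layout, the toy counts, and the convention 0 log 0 = 0. The slide does not define the probability of a category given that the word is absent, so the sketch assumes it is the category's share of all counts that remain after removing the word's own counts.

```python
import math


def plogp(p):
    """Return p * log(p), with the assumed convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def information_gain(word, records, categories):
    """Score one word from {word: {category: document count}} records."""
    m = len(categories)
    # Cat_i(t): documents in category i that contain the word.
    cat_t = [records[word].get(c, 0) for c in categories]
    # Sum over all words j of Cat_i(j), for each category i.
    cat_all = [sum(r.get(c, 0) for r in records.values()) for c in categories]
    n_t = sum(cat_t)
    n_all = sum(cat_all)
    n_rest = n_all - n_t

    p_t = n_t / n_all          # Pr(t) as defined on the slide
    p_not_t = 1.0 - p_t        # probability that the word is absent

    # First term: uniform prior Pr(c_i) = 1/m (1/20 on the slide).
    prior = -sum(plogp(1.0 / m) for _ in categories)
    # Second term: Pr(c_i | t) = Cat_i(t) / sum_k Cat_k(t).
    with_t = sum(plogp(c / n_t) for c in cat_t) if n_t else 0.0
    # Third term: ASSUMED definition of Pr(c_i | word absent).
    without_t = (
        sum(plogp((a - c) / n_rest) for a, c in zip(cat_all, cat_t))
        if n_rest else 0.0
    )
    return prior + p_t * with_t + p_not_t * without_t


# Toy counts, invented for illustration.
records = {
    "game": {"sports": 2},
    "vote": {"politics": 2},
    "the": {"sports": 2, "politics": 2},
}
categories = ["sports", "politics"]
gains = {w: information_gain(w, records, categories) for w in records}
```

In this toy data "the" is spread evenly over both categories and scores zero, while "game" and "vote" each occur in a single category and score higher, which is the behaviour a keyword score should have.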