
Implementation Details of the Text Classification Project

This text discusses the implementation details of a text classification project, including the feature selection step using Information Gain and the classification algorithm.



Presentation Transcript


  1. Implementation Details of the Text Classification Project. Prerak Sanghvi, Computer Science and Engineering Department, State University of New York at Buffalo, Spring 2001.

  2. Feature Selection Step
     • Keywords are selected from the text by scoring each word; the score used here is Information Gain.
     • For each unique word, the number of documents in each class that contain the word is recorded.

  3. Feature Selection Step - Algorithm

       for each document d in training set
           for each word w
               if w has been encountered before
                   increment the document count for Category(d) in record for w
               else
                   create a new data record for w
                   increment the document count for Category(d) in record for w
       for each word w
           using the record for w, calculate Information Gain
       select the NUM_KEYWORDS words with highest Information Gain
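The counting pass of this algorithm can be sketched in Python. This is a minimal illustration, not the project's actual code: the function name `count_class_document_frequencies` and the `(category, words)` input format are hypothetical, and it assumes a word repeated inside one document is counted once, in line with the "number of documents in each class" wording on slide 2.

```python
from collections import defaultdict


def count_class_document_frequencies(documents):
    """Build one record per word: {word: {category: document count}}.

    `documents` is an iterable of (category, words) pairs (assumed format).
    """
    records = {}
    for category, words in documents:
        # set() ensures a word repeated in one document is counted once
        # (an assumption based on the per-document wording of slide 2).
        for word in set(words):
            if word not in records:
                # First encounter: create a new data record for the word.
                records[word] = defaultdict(int)
            # Increment the document count for this document's category.
            records[word][category] += 1
    return records
```

Each record then holds the per-class counts that the scoring step needs.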

  4. Feature Selection

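  5. Information Gain

     $$G(t) = -\sum_{i=1}^{m} \Pr(c_i)\log\Pr(c_i) + \Pr(t)\sum_{i=1}^{m} \Pr(c_i \mid t)\log\Pr(c_i \mid t) + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(c_i \mid \bar{t})\log\Pr(c_i \mid \bar{t})$$

     $$\Pr(c_i) = \frac{1}{20}$$

     $$\Pr(t) = \frac{\sum_{i=1}^{m} \mathrm{Cat}_i(t)}{\sum_{i=1}^{m}\sum_{j=1}^{w} \mathrm{Cat}_i(j)}$$

     $$\Pr(c_i \mid t) = \frac{\mathrm{Cat}_i(t)}{\sum_{k=1}^{m} \mathrm{Cat}_k(t)}$$

     Here $m$ is the number of classes, $w$ is the number of unique words, $\bar{t}$ denotes the absence of word $t$, and $\mathrm{Cat}_i(t)$ is the number of documents of class $c_i$ in which $t$ occurs.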
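The Information Gain score on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation. The third term is taken over the word being absent, as in the standard definition of Information Gain. The slide does not define the probabilities for an absent word, so the sketch assumes they are computed from the counts left after removing the word's own counts. The class prior is taken as uniform, 1/m, matching Pr(ci) = 1/20 for 20 classes. The names `information_gain` and `select_keywords` are hypothetical.

```python
import math


def _plogp(p):
    # p * log(p), with the usual convention that 0 * log(0) = 0.
    return p * math.log(p) if p > 0 else 0.0


def information_gain(word_counts, class_totals):
    """Score one word.

    word_counts[c]  : documents of class c containing the word, Cat_c(t)
    class_totals[c] : sum of Cat_c(j) over all words j
    """
    categories = list(class_totals)
    m = len(categories)
    grand_total = sum(class_totals.values())
    word_total = sum(word_counts.get(c, 0) for c in categories)
    rest_total = grand_total - word_total  # counts not involving the word

    p_t = word_total / grand_total
    # First term: class entropy under the uniform prior Pr(c_i) = 1/m.
    gain = -m * _plogp(1.0 / m)
    if word_total:
        # Second term: Pr(t) * sum_i Pr(c_i|t) log Pr(c_i|t).
        gain += p_t * sum(
            _plogp(word_counts.get(c, 0) / word_total) for c in categories
        )
    if rest_total:
        # Third term (assumed form): the same sum for the word being absent.
        gain += (1.0 - p_t) * sum(
            _plogp((class_totals[c] - word_counts.get(c, 0)) / rest_total)
            for c in categories
        )
    return gain


def select_keywords(records, categories, num_keywords):
    """Return the num_keywords words with the highest Information Gain."""
    class_totals = {
        c: sum(rec.get(c, 0) for rec in records.values()) for c in categories
    }
    scores = {w: information_gain(rec, class_totals) for w, rec in records.items()}
    return sorted(scores, key=scores.get, reverse=True)[:num_keywords]
```

Under these assumptions, a word spread evenly across classes scores close to zero, while a word concentrated in one class scores higher, which is the behavior the selection step relies on.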

  6. Classification Algorithm
