Implementing Neural Networks for Text Classification: Data Sets

Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo

Data Set Selection
• There are two types of data sets that can be used:
  • A compilation of documents gathered manually (from the web, etc.) specifically for this project
  • An existing data set that has already been worked on by other researchers

Advantages of Standard Data Sets
• We don’t have to do the work of obtaining the data.
• The distribution of documents in the corpora is even, and the documents are well classified.
• Results can be compared with those of other researchers, which gives a comparative evaluation of the classification algorithm being used.

Most popular corpora
• The most popular corpora used for text-classification research are:
  • Reuters-21578 data set (a set of 21,578 newswire articles from Reuters – available as SGML documents – 1000 documents in each file)
  • 20-newsgroups data set (a set of 20,000 newsgroup postings from 20 newsgroups – available as text files – one document per file)
  • WebKB database (web pages from 4 universities, grouped into classes)

Reuters-21578 data set
• Data is classified into five groups of classes: TOPICS, PLACES, PEOPLE, ORGS, and EXCHANGES.

Reuters-21578 data set
• Categories are overlapping and non-exhaustive.
  • Overlapping: one document can be classified into more than one category. E.g., a document can be about ‘nasdaq’ (EXCHANGES) and about ‘USA’ (PLACES) in general.
  • Non-exhaustive: there are categories into which no documents fall, and there are documents that do not fall into any category.
• Too few categories have 20 or more occurrences. An ANN approach would probably not work with so few examples.
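The overlapping and non-exhaustive properties, together with the 20-occurrence cutoff, can be illustrated with a small sketch. The function names `frequent_categories` and `multi_hot`, the toy label sets, and the lowered threshold in the usage lines are assumptions made for illustration; they are not taken from the original project.

```python
from collections import Counter

def frequent_categories(doc_labels, min_count=20):
    """Return the sorted categories assigned to at least min_count documents."""
    counts = Counter(label for labels in doc_labels for label in set(labels))
    return sorted(c for c, n in counts.items() if n >= min_count)

def multi_hot(labels, categories):
    """Encode a label set as a 0/1 vector, one position per category.

    Overlapping labels produce several 1s; a document with none of the
    retained categories produces an all-zero vector (non-exhaustive case).
    """
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in categories]

# Hypothetical toy data: one list of category labels per document.
docs = [["earn", "usa"], ["earn"], ["nasdaq", "usa"], []]
# Threshold lowered to 2 only because the toy data is tiny.
cats = frequent_categories(docs, min_count=2)
vectors = [multi_hot(d, cats) for d in docs]
```

With the real corpus, `min_count=20` would drop every category that has fewer than 20 documents, which is the filtering step the slide alludes to.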

Example of a Reuters-21578 document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="13522" NEWID="8001">
<DATE>20-MAR-1987 16:54:10.55</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<TEXT>
<TITLE>GANTOS INC <GTOS> 4TH QTR JAN 31 NET</TITLE>
<DATELINE> GRAND RAPIDS, MICH., March 20 -</DATELINE>
<BODY>
Shr 43 cts vs 37 cts
Net 2,276,000 vs 1,674,000
Revs 32.6 mln vs 24.4 mln
</BODY>
</TEXT>
</REUTERS>
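A minimal sketch of pulling the category labels and text out of one such record is given here. It uses regular expressions rather than an XML parser, on the assumption that the SGML is not well-formed XML (the bare <GTOS> in the title above would trip a strict parser). The function name `parse_reuters` and the returned dictionary keys are hypothetical choices for illustration.

```python
import re

def parse_reuters(sgml):
    """Extract split, category lists, title, and body from one <REUTERS> record."""
    def tag_text(tag):
        # Non-greedy match of the content between <TAG> and </TAG>.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", sgml, re.DOTALL)
        return m.group(1).strip() if m else ""

    def d_items(tag):
        # Each category value is wrapped in its own <D>...</D> element.
        return re.findall(r"<D>(.*?)</D>", tag_text(tag))

    split = re.search(r'LEWISSPLIT="([^"]*)"', sgml)
    return {
        "split": split.group(1) if split else None,
        "topics": d_items("TOPICS"),
        "places": d_items("PLACES"),
        "people": d_items("PEOPLE"),
        "orgs": d_items("ORGS"),
        "exchanges": d_items("EXCHANGES"),
        "title": tag_text("TITLE"),
        "body": tag_text("BODY"),
    }

# The record shown on the slide, used as sample input.
sample = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" '
    'OLDID="13522" NEWID="8001"> <DATE>20-MAR-1987 16:54:10.55</DATE> '
    "<TOPICS><D>earn</D></TOPICS> <PLACES><D>usa</D></PLACES> "
    "<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES> "
    "<COMPANIES></COMPANIES> <TEXT> "
    "<TITLE>GANTOS INC <GTOS> 4TH QTR JAN 31 NET</TITLE> "
    "<DATELINE> GRAND RAPIDS, MICH., March 20 -</DATELINE> "
    "<BODY> Shr 43 cts vs 37 cts Net 2,276,000 vs 1,674,000 "
    "Revs 32.6 mln vs 24.4 mln </BODY> </TEXT> </REUTERS>"
)
record = parse_reuters(sample)
```

Empty elements such as <PEOPLE></PEOPLE> come back as empty lists, so a document with no categories at all is represented naturally.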

20-newsgroups data set
• Each document is in a separate text file.
• There are 1000 documents from each newsgroup.
• Each document has only one source newsgroup, so each document falls into only one category.
• The classification task is to determine the source newsgroup of each document.
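Because each posting has exactly one source newsgroup, the targets can be single class indices (or one-hot vectors). The sketch below assumes a directory layout with one subdirectory per newsgroup and one file per document; that layout, the latin-1 decoding, and the names `load_newsgroups` and `one_hot` are assumptions for illustration, not details stated in the slides.

```python
from pathlib import Path

def load_newsgroups(root):
    """Return (texts, labels, groups) from an assumed root/<newsgroup>/<file> layout."""
    root = Path(root)
    groups = sorted(p.name for p in root.iterdir() if p.is_dir())
    index = {g: i for i, g in enumerate(groups)}
    texts, labels = [], []
    for g in groups:
        for f in sorted((root / g).iterdir()):
            if f.is_file():
                # latin-1 never fails to decode; the real encoding is an assumption.
                texts.append(f.read_text(encoding="latin-1"))
                labels.append(index[g])
    return texts, labels, groups

def one_hot(label, n_classes):
    """Encode a single class index as a 0/1 target vector with exactly one 1."""
    vec = [0] * n_classes
    vec[label] = 1
    return vec

# Usage (hypothetical path):
# texts, labels, groups = load_newsgroups("20_newsgroups")
# targets = [one_hot(y, len(groups)) for y in labels]
```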


Example of a 20-newsgroups document
Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
Organization: Penn State University
Date: Fri, 23 Apr 1993 18:54:23 EDT
From: <SMM125@psuvm.psu.edu>
Message-ID: <93113.185423SMM125@psuvm.psu.edu>
Subject: Re: YOU WILL ALL GO TO HELL!!!
References: <93106.155002JSN104@psuvm.psu.edu> <1qq837$cm6@usenet.INS.CWRU.Edu>
Lines: 1

jsn104 is jeremy scott noonan
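A posting like this can be split into headers and body with Python's standard `email` module. The sketch assumes the usual message layout in which a blank line separates the header block from the body; the function name `split_posting` and the shortened sample are illustrative assumptions.

```python
from email import message_from_string

def split_posting(raw_text):
    """Return (value of the Newsgroups header, body text) for one posting."""
    msg = message_from_string(raw_text)
    return msg["Newsgroups"], msg.get_payload()

# Shortened version of the posting above (some headers omitted).
raw = (
    "Newsgroups: alt.atheism\n"
    "Organization: Penn State University\n"
    "From: <SMM125@psuvm.psu.edu>\n"
    "Subject: Re: YOU WILL ALL GO TO HELL!!!\n"
    "Lines: 1\n"
    "\n"
    "jsn104 is jeremy scott noonan\n"
)
newsgroup, body = split_posting(raw)
```

Since the Newsgroups header names the target class directly, it seems reasonable to feed only the body (and perhaps the subject) to the classifier, although the slides do not say how headers were handled.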
