Implementing Neural Networks for Text Classification: Data Sets

Prerak Sanghvi, Computer Science and Engineering Department, State University of New York at Buffalo

Presentation Transcript


  1. Implementing Neural Networks for Text Classification: Data Sets
     Prerak Sanghvi
     Computer Science and Engineering Department
     State University of New York at Buffalo

  2. Data Set Selection
     • There are two types of data sets that can be used:
       • A collection of documents compiled manually (from the web, etc.) specifically for this project
       • An existing data set that other researchers have already worked on

  3. Advantages of Standard Data Sets
     • No effort is needed to obtain the data.
     • The distribution of documents in the corpora used is even, and the documents are well classified.
     • Results can be compared with those of other researchers, which gives a comparative evaluation of the classification algorithm being used.

  4. Most popular corpora
     • The most popular corpora used for text-classification research are:
       • Reuters-21578 data set (21,578 newswire articles from Reuters – available as SGML documents – up to 1,000 documents per file)
       • 20-newsgroups data (about 20,000 newsgroup postings from 20 newsgroups – available as text files – one document per file)
       • WebKB database (classified web pages from 4 universities)

  5. Reuters-21578 data set
     • Data is classified into five groups of classes: TOPICS, PLACES, PEOPLE, ORGS, and EXCHANGES.

  6. Reuters-21578 data set
     • Categories are overlapping and non-exhaustive.
       • Overlapping: one document can be classified into more than one category. E.g., a document can be about ‘nasdaq’ (EXCHANGES) and about ‘USA’ (PLACES) in general.
       • Non-exhaustive: there are categories into which no documents fall, and there are documents that do not fall into any category.
     • Too few categories have 20 or more occurrences. An ANN approach would probably not work with so few examples.
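The overlapping, non-exhaustive labelling described above can be handled with a multi-hot target vector, where a document may switch on several outputs or none. The following is a minimal sketch, not the presenter's method: the toy label lists, the helper names `category_counts` and `multi_hot`, and the threshold of 2 are all illustrative assumptions (the slide itself talks about 20 occurrences).

```python
from collections import Counter


def category_counts(doc_labels):
    """Count how many documents carry each category label."""
    counts = Counter()
    for labels in doc_labels:
        counts.update(set(labels))  # a label counts once per document
    return counts


def multi_hot(labels, categories):
    """Encode one document's label set as a 0/1 vector over `categories`."""
    label_set = set(labels)
    return [1 if c in label_set else 0 for c in categories]


# Hypothetical toy data: one label list per document.
docs = [
    ["nasdaq", "usa"],  # overlapping: two categories at once
    ["usa"],
    [],                 # non-exhaustive: no category at all
    ["earn", "usa"],
]

counts = category_counts(docs)
min_docs = 2  # illustrative threshold only; the slide mentions 20 occurrences
frequent = sorted(c for c, n in counts.items() if n >= min_docs)
vectors = [multi_hot(labels, frequent) for labels in docs]
```

Filtering by a minimum document count before encoding is one way to drop categories that have too few training examples; documents whose labels are all filtered out simply get an all-zero vector.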

  7. Example of a Reuters-21578 document

     <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="13522" NEWID="8001">
     <DATE>20-MAR-1987 16:54:10.55</DATE>
     <TOPICS><D>earn</D></TOPICS>
     <PLACES><D>usa</D></PLACES>
     <PEOPLE></PEOPLE>
     <ORGS></ORGS>
     <EXCHANGES></EXCHANGES>
     <COMPANIES></COMPANIES>
     <TEXT>
     <TITLE>GANTOS INC &lt;GTOS> 4TH QTR JAN 31 NET</TITLE>
     <DATELINE> GRAND RAPIDS, MICH., March 20 -</DATELINE>
     <BODY>
     Shr 43 cts vs 37 cts
     Net 2,276,000 vs 1,674,000
     Revs 32.6 mln vs 24.4 mln
     </BODY>
     </TEXT>
     </REUTERS>
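A document in this format could be pulled apart with a few regular expressions. The sketch below is only one possible approach, assuming the tag layout shown in the example; the function name `parse_reuters` and the returned dictionary keys are hypothetical, and a real SGML parser may be more robust on the full corpus.

```python
import html
import re

DOC_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
D_RE = re.compile(r"<D>(.*?)</D>")


def _field(tag, text):
    """Return the raw content of the first <tag>...</tag> pair, or ''."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1) if match else ""


def parse_reuters(sgml):
    """Extract attributes, labels, title and body from each <REUTERS> element."""
    docs = []
    for attrs, inner in DOC_RE.findall(sgml):
        docs.append({
            "attrs": dict(ATTR_RE.findall(attrs)),
            "topics": D_RE.findall(_field("TOPICS", inner)),
            "places": D_RE.findall(_field("PLACES", inner)),
            # html.unescape turns entities such as &lt; back into characters
            "title": html.unescape(_field("TITLE", inner)).strip(),
            "body": html.unescape(_field("BODY", inner)).strip(),
        })
    return docs


# Shortened copy of the example document above.
sample = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="13522" NEWID="8001">
<DATE>20-MAR-1987 16:54:10.55</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<TEXT>
<TITLE>GANTOS INC &lt;GTOS> 4TH QTR JAN 31 NET</TITLE>
<BODY>Shr 43 cts vs 37 cts</BODY>
</TEXT>
</REUTERS>"""

docs = parse_reuters(sample)
```

The `LEWISSPLIT` attribute kept in `attrs` marks which split the document belongs to in the example (`TRAIN` here), so it can be used to separate training documents from the rest.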

  8. 20-newsgroups data set
     • Each document is in a separate text file.
     • There are about 1,000 documents from each newsgroup.
     • Each document has only one source newsgroup, so each document falls into only one category.
     • The classification task is to determine the source newsgroup of a document.
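Because each document is a separate file with exactly one source newsgroup, the corpus maps naturally onto a list of texts plus a list of integer class labels. The sketch below assumes a layout with one directory per newsgroup (an assumption, not stated on the slide) and uses a hypothetical helper `load_newsgroups`; the temporary directory only stands in for the real data.

```python
import tempfile
from pathlib import Path


def load_newsgroups(root):
    """Return (texts, labels, label_names) for a one-directory-per-newsgroup layout."""
    root = Path(root)
    label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for index, name in enumerate(label_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                # latin-1 is an assumed encoding; it never fails to decode
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(index)
    return texts, labels, label_names


# Build a tiny stand-in corpus: two newsgroups, one posting each.
with tempfile.TemporaryDirectory() as tmp:
    for group, name, text in [
        ("alt.atheism", "1", "first post"),
        ("sci.space", "2", "second post"),
    ]:
        (Path(tmp) / group).mkdir(exist_ok=True)
        (Path(tmp) / group / name).write_text(text, encoding="latin-1")
    texts, labels, label_names = load_newsgroups(tmp)
```

Each entry in `labels` is an index into `label_names`, which is the single-label target a classifier would be trained to predict.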

  9. 20-newsgroups data set

  10. Example of a 20-newsgroups document

     Newsgroups: alt.atheism
     Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
     Organization: Penn State University
     Date: Fri, 23 Apr 1993 18:54:23 EDT
     From: <SMM125@psuvm.psu.edu>
     Message-ID: <93113.185423SMM125@psuvm.psu.edu>
     Subject: Re: YOU WILL ALL GO TO HELL!!!
     References: <93106.155002JSN104@psuvm.psu.edu> <1qq837$cm6@usenet.INS.CWRU.Edu>
     Lines: 1

     jsn104 is jeremy scott noonan
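A posting like this is a block of headers, a blank line, and then the body, so Python's standard `email` module can split it. This is a minimal sketch under that assumption; the helper name `split_posting` is hypothetical and the sample is a shortened copy of the example above.

```python
from email import message_from_string


def split_posting(raw):
    """Split a raw posting into a header dict and the body text."""
    msg = message_from_string(raw)
    # dict() keeps the last value if a header name repeats
    return dict(msg.items()), msg.get_payload()


raw = (
    "Newsgroups: alt.atheism\n"
    "Organization: Penn State University\n"
    "Subject: Re: YOU WILL ALL GO TO HELL!!!\n"
    "Lines: 1\n"
    "\n"
    "jsn104 is jeremy scott noonan\n"
)

headers, body = split_posting(raw)
label = headers["Newsgroups"]  # the source newsgroup, i.e. the class label
```

Since the `Newsgroups` header states the class outright, one may want to feed only `body` (and perhaps `Subject`) to the classifier so that the label does not leak into the input.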
