
IR Homework #2



  1. IR Homework #2 By J. H. Wang May 9, 2014

  2. Programming Exercise #2: Text Classification • Goal: to classify each document into predefined categories • Input: Reuters-21578 test collection • predefined categories • labeled documents for training • test documents for testing • Output: a classifier for each category

  3. Input: Training and Test Sets • Using Reuters-21578 collection • Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html • 21,578 news articles in 1987 (28.0MB uncompressed) • Distributed in 22 files in SGML format • preprocessing of SGML tags • File format: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
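Because the raw files are SGML rather than well-formed XML, standard XML parsers may choke on them, and a lightweight regex pass is a common workaround. A minimal Python sketch (the reut2-*.sgm file-name pattern and the Latin-1 encoding are assumptions based on the collection's README):

import glob
import re

DOC_RE = re.compile(r"<REUTERS(.*?)>(.*?)</REUTERS>", re.DOTALL)

def iter_articles(path_glob="reut2-*.sgm"):
    """Yield (attribute string, raw SGML body) for each <REUTERS> element."""
    for path in sorted(glob.glob(path_glob)):
        with open(path, encoding="latin-1") as f:  # the 1987 data predates UTF-8
            text = f.read()
        for m in DOC_RE.finditer(text):
            yield m.group(1), m.group(2)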

  4. Predefined Categories in Reuters-21578 • 5 category sets • Exchanges: 39 categories • Orgs: 56 categories • People: 267 categories • Places: 175 categories • Topics: 135 categories • In this homework, ONLY the 135 Topics categories are considered in classification • 10 largest classes: earn, acq (acquisitions), money-fx, grain, crude, trade, interest, ship, wheat, corn

  5. Training and Test Sets • Using Reuters-21578 for text classification • Modified Lewis (ModLewis) Split • Training: 13,625 • Test: 6,188 • Modified Apte (ModApte) Split: used in this homework • Training: 9,603 • Test: 3,299 • Modified Hayes (ModHayes) Split • Training: 20,856 • Test: 722

  6. An Example Reuters Article (training set, ModApte split)

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
…
<TEXT>&#2;
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout the week in … &#3;</BODY></TEXT>
</REUTERS>

(The <TOPICS> element carries the topical category; the <BODY> element carries the text content.)
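Building on the parsing sketch above, the ModApte split can be recovered from the TOPICS and LEWISSPLIT attributes visible in this example, which is how the README describes the split; the helper names below are illustrative:

def get_tag(body, tag):
    """Return the text inside the first <tag>...</tag> pair, or ''."""
    m = re.search(r"<{0}>(.*?)</{0}>".format(tag), body, re.DOTALL)
    return m.group(1) if m else ""

def modapte_split(articles):
    """Split (attrs, body) pairs into ModApte training and test sets."""
    train, test = [], []
    for attrs, body in articles:
        if 'TOPICS="YES"' not in attrs:
            continue  # ModApte keeps only documents marked TOPICS="YES"
        topics = re.findall(r"<D>(.*?)</D>", get_tag(body, "TOPICS"))
        text = get_tag(body, "TITLE") + " " + get_tag(body, "BODY")
        if 'LEWISSPLIT="TRAIN"' in attrs:
            train.append((text, topics))
        elif 'LEWISSPLIT="TEST"' in attrs:
            test.append((text, topics))
    return train, test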

  7. Output: A Classifier • Either write your own programs or use open source tools to implement any one of the following text classification methods: • Naïve Bayes (NB) classification (Ch.13) • Rocchio classification (Ch.14) • kNN classification (Ch.14) • SVM classification (Ch.15) • …
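If you go the open-source route, one possible setup with scikit-learn trains a binary Naïve Bayes classifier per category via a one-vs-rest wrapper; the library and parameter choices here are illustrative assumptions, not requirements of the assignment:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts, topic_lists = zip(*train)       # (text, topics) pairs from the split above
mlb = MultiLabelBinarizer()            # one binary label column per topic category
Y = mlb.fit_transform(topic_lists)

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    OneVsRestClassifier(MultinomialNB()))
clf.fit(texts, Y)                      # effectively one NB classifier per category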

  8. (Sec. 14.1) [Figure: a test document of unknown class, to be assigned to one of the example classes Government, Science, or Arts]

  9. Rocchio Classification • Definition of centroid: μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d), where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d • Assign each test document to the category with the closest prototype vector, based on cosine similarity
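A minimal NumPy sketch of both steps, assuming you have already built a TF-IDF document-term matrix; the function and variable names are illustrative:

import numpy as np

def train_centroids(vectors, labels):
    """vectors: (n_docs, n_terms) TF-IDF matrix; labels: class name per row."""
    centroids = {}
    for c in set(labels):
        rows = [i for i, label in enumerate(labels) if label == c]
        centroids[c] = vectors[rows].mean(axis=0)  # mu(c) = (1/|Dc|) sum of v(d)
    return centroids

def classify(vec, centroids):
    """Assign vec to the class whose centroid has the highest cosine similarity."""
    def cosine(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))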

  10. Tasks and Evaluations • Your system should be able to complete the following tasks using the ModApte split of the Reuters-21578 dataset • Training • Testing • Evaluation of your system • Training: efficiency • Testing: precision/recall/F-measure
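For the testing-phase metrics, precision, recall, and F-measure can be computed from per-document label sets; a small sketch (micro-averaging is shown here as one reasonable choice, macro-averaging is another):

def micro_prf(gold, predicted):
    """gold, predicted: one set of topic labels per test document."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # correctly assigned
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # wrongly assigned
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1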

  11. Example: Rocchio Classification [Pipeline diagram] • Training: training docs → SGML parsing → centroid calculation → centroids • Testing: test doc → cosine similarity against the centroids → class → evaluation → P, R, F1

  12. Example Steps in Rocchio Classification • Parse the SGML documents in the Reuters-21578 dataset. Find the text body and topics, and separate the articles into training and test documents. • Body as content, topics as class • For each document, calculate the TF-IDF weights of the text body as a vector (a sketch of one weighting variant follows this list). • For each topic class, calculate its centroid by averaging the vectors of all training documents in that class. • So you will get 135 centroids, one for each topic class. • For each test document, assign the class whose centroid is most similar under cosine similarity. • Compare the assigned class with the answer (in the topics tag), and evaluate how many test documents are correctly classified.
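For the TF-IDF step, one common variant (log TF times log IDF, cosine-normalized) can be sketched as follows; the exact weighting scheme is up to you:

import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """tokenized_docs: list of token lists -> list of {term: weight} vectors."""
    n = len(tokenized_docs)
    df = Counter(t for doc in tokenized_docs for t in set(doc))  # document freq
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vec = {t: (1 + math.log10(c)) * math.log10(n / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors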

  13. Optional Functionalities • Feature selection: (Sec. 13.5) • mutual information • chi-square • … • User Interface • For selecting test documents • Visualization of classification result • …
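If you attempt the optional mutual-information feature selection, the Sec. 13.5 formula works from a 2x2 contingency table per (term, class) pair; a sketch with illustrative argument names:

import math

def mutual_information(n11, n10, n01, n00):
    """n11: docs with term t in class c; n10: with t, not in c; and so on."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc:  # a zero count contributes 0 to the sum
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

Keeping the k terms with the highest score per class then shrinks the vocabulary before training.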

  14. Submission • Your submission *should* include • The source code (and your executable file) • A complete user manual (or a UI) for testing • A one-page description that includes the following • Major features of your work (e.g., high efficiency, low storage, multiple input formats, huge corpus, …) • Major difficulties encountered • Special requirements for the execution environment (e.g., Java Runtime Environment, special compilers, …) • For teamwork, the name and responsible part of each individual member should be clearly identified • Due: extended to three weeks (May 30, 2014)

  15. Submission Instructions • Programs or homework in electronic files must be submitted directly on the submission site: • Submission site: http://140.124.183.31/net2ftp • FTP server: localhost • User name & password: your student ID • Prepare your submission as one single compressed file • Remember to specify the names and student IDs of your team members in the files and documentation • If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)

  16. Evaluation • Two options: • Your system automatically classifies all test documents, and displays the classification results and the effectiveness (precision, recall, F-measure, accuracy) • The preferred option • Your system can randomly select some test documents (by their IDs) and run your classifier to show the classification result (both your classifier's output and the answer) • Minimum requirement • Training and testing phases can be successfully completed • Optional features will be considered as bonus • E.g., feature selection, UI, visualization, … • You might be required to give a demo if the TA is unable to run the classifier you submitted

  17. Any Questions or Comments?
