
IR Homework #3

IR Homework #3, by J. H. Wang, May 8, 2013. Programming Exercise #3: Text Classification. Goal: to classify each document into predefined categories. Input: the Reuters-21578 test collection, with predefined categories, labeled documents for training, and test documents for testing.



Presentation Transcript


  1. IR Homework #3 By J. H. Wang May 8, 2013

  2. Programming Exercise #3: Text Classification
     • Goal: to classify each document into predefined categories
     • Input: Reuters-21578 test collection
       • predefined categories
       • labeled documents for training
       • test documents for testing
     • Output: a classifier for each category

  3. Input: Training and Test Sets
     • Using Reuters-21578 collection
       • Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
       • 21,578 news articles in 1987 (28.0MB uncompressed)
       • Distributed in 22 files in SGML format
         • preprocessing of SGML tags
       • File format: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
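The SGML preprocessing step might look roughly like the sketch below. It assumes Python and plain regular expressions instead of a full SGML parser. The tag and attribute names (REUTERS, LEWISSPLIT, TOPICS, D, BODY) are the ones described in the README; the function name `parse_reuters` and the sample record are hypothetical and only serve as illustration.

```python
import re
from html import unescape

# Patterns for the parts of a Reuters-21578 record used in this homework.
DOC_RE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters(sgml_text):
    """Return one dict per <REUTERS> element: attributes, topics, body text."""
    docs = []
    for attr_text, content in DOC_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        topics_match = TOPICS_RE.search(content)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(content)
        # Entities such as &lt; are decoded; documents without a body get "".
        body = unescape(body_match.group(1)).strip() if body_match else ""
        docs.append({"attrs": attrs, "topics": topics, "body": body})
    return docs


# Hypothetical record, shaped like the entries in the .sgm files.
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="1" NEWID="1">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><TITLE>SAMPLE TITLE</TITLE>
<BODY>Wheat exports rose &lt;5 pct&gt; this year.</BODY></TEXT>
</REUTERS>"""

docs = parse_reuters(SAMPLE)
print(docs[0]["topics"], docs[0]["body"])
```

When reading the real files, a permissive encoding such as `latin-1` is probably the safer assumption, since the collection predates UTF-8.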

  4. Predefined Categories in Reuters-21578
     • 5 category sets
       • Exchanges: 39 categories
       • Orgs: 56 categories
       • People: 267 categories
       • Places: 175 categories
       • Topics: 135 categories
     • In this homework, we ONLY consider classification in the 135 Topical categories
     • 10 largest classes
       • Earn, acquisitions, money-fx, grain, crude, trade, interest, ship, wheat, corn

  5. Training and Test Sets
     • Using Reuters-21578 for text classification
       • Modified Lewis (ModLewis) Split
         • Training: 13,625
         • Test: 6,188
       • Modified Apte (ModApte) Split: used in this homework
         • Training: 9,603
         • Test: 3,299
       • Modified Hayes (ModHayes) Split
         • Training: 20,856
         • Test: 722
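A small sketch of how the ModApte split could be selected, assuming Python and assuming each document's `<REUTERS>` attributes are already available as a dict. As the README describes it, ModApte takes documents with TOPICS="YES", using LEWISSPLIT="TRAIN" for training and LEWISSPLIT="TEST" for testing. The function name `modapte_split` and the example dicts are hypothetical.

```python
def modapte_split(attrs):
    """Return 'train', 'test', or None (unused) for one document's attributes."""
    if attrs.get("TOPICS") != "YES":
        return None  # documents without TOPICS="YES" are not part of ModApte
    lewis = attrs.get("LEWISSPLIT")
    if lewis == "TRAIN":
        return "train"
    if lewis == "TEST":
        return "test"
    return None  # e.g. LEWISSPLIT="NOT-USED"


# Hypothetical attribute dicts for four documents.
examples = [
    {"TOPICS": "YES", "LEWISSPLIT": "TRAIN"},
    {"TOPICS": "YES", "LEWISSPLIT": "TEST"},
    {"TOPICS": "NO", "LEWISSPLIT": "TRAIN"},
    {"TOPICS": "YES", "LEWISSPLIT": "NOT-USED"},
]
assignments = [modapte_split(a) for a in examples]
print(assignments)
```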

  6. Output: A Classifier
     • Either your own program(s) or open source tools
       • Naïve Bayes (NB) classification (Ch.13)
       • Rocchio classification (Ch.14)
       • kNN classification (Ch.14)
       • SVM classification (Ch.15)
       • …
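As one possible starting point for the "own program" option, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, assuming Python and already tokenized documents. The function names `train_nb` and `classify_nb` and the toy data are hypothetical. The sketch treats one category as a yes/no decision, which matches the "classifier for each category" setup.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns (log_prior, log_cond, vocab)."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(labeled_docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_cond = {}
    for c in class_doc_counts:
        # Add-one smoothing: every vocabulary term gets one extra count.
        total = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_cond, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log posterior; unseen terms are skipped."""
    log_prior, log_cond, vocab = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training data for a single yes/no category.
training = [
    ("Chinese Beijing Chinese".split(), "china"),
    ("Chinese Chinese Shanghai".split(), "china"),
    ("Chinese Macao".split(), "china"),
    ("Tokyo Japan Chinese".split(), "not-china"),
]
model = train_nb(training)
label = classify_nb(model, "Chinese Chinese Chinese Tokyo Japan".split())
print(label)
```

Working in log space avoids floating-point underflow when documents are long, which is why the sketch sums logarithms instead of multiplying probabilities.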

  7. Test Document of what class? (Sec. 14.1) [Figure: a test document to be assigned to one of the classes Government, Science, Arts]

  8. Rocchio Classification
     • Definition of centroid: $\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$
       • where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$
     • Assign test documents to the category with the closest prototype vector based on cosine similarity
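The centroid definition and the cosine-based assignment can be sketched as follows, assuming Python and sparse document vectors stored as dicts mapping a term to its weight. All function names and the toy vectors are hypothetical. Term weighting (for example tf-idf) is left out and would happen before this step.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a non-empty list of sparse vectors (dicts of term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (vector, label). Returns one centroid per class."""
    by_class = defaultdict(list)
    for vec, label in labeled_vectors:
        by_class[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def classify_rocchio(centroids, vec):
    """Assign vec to the class whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda c: cosine(centroids[c], vec))


# Toy vectors with raw term counts as weights.
training = [
    ({"wheat": 1.0, "grain": 1.0}, "grain"),
    ({"wheat": 2.0}, "grain"),
    ({"oil": 1.0, "barrel": 1.0}, "crude"),
    ({"oil": 2.0, "price": 1.0}, "crude"),
]
centroids = train_rocchio(training)
predicted = classify_rocchio(centroids, {"oil": 1.0, "price": 1.0})
print(predicted)
```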

  9. Evaluation of Classification Results
     • Test queries randomly selected from Reuters-21578 test set
     • Training: efficiency
     • Testing: precision/recall/F-measure
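For the testing measures, a minimal sketch of per-class precision, recall, and F-measure, assuming Python and taking F-measure to mean the balanced F1. The function name `precision_recall_f1` and the document IDs are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """predicted / actual: sets of document IDs assigned to / truly in a class."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical result for one class: 4 documents predicted, 3 truly relevant.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)
```

Here two of the four predicted documents are correct (precision 1/2) and two of the three relevant documents are found (recall 2/3), giving an F1 of 4/7.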

  10. Optional Functionalities
      • Feature selection: (Sec. 13.5)
        • mutual information
        • chi-square
        • …
      • User Interface
        • For classifying test queries
        • Visualization of classification result
        • …
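Both feature selection scores can be computed from a 2x2 table of document counts for one term and one class. The sketch below assumes Python and the following hypothetical naming: `n11` is the number of documents that contain the term and are in the class, `n10` contain the term but are not in the class, `n01` lack the term but are in the class, and `n00` are neither.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for term/class independence from a 2x2 count table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # a zero marginal means the term or class never varies
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term occurrence and class membership."""
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for count, term_marginal, class_marginal in cells:
        if count > 0:  # empty cells contribute nothing
            mi += count / n * math.log2(n * count / (term_marginal * class_marginal))
    return mi


# Toy counts: a term that perfectly predicts the class, and one that is independent.
dependent = (chi_square(10, 0, 0, 10), mutual_information(10, 0, 0, 10))
independent = (chi_square(10, 10, 10, 10), mutual_information(10, 10, 10, 10))
print(dependent, independent)
```

A typical use would be to score every vocabulary term against a class and keep only the highest-scoring terms as features; how many to keep is a tuning choice.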

  11. Submission
      • Your submission *should* include
        • The source code (and your executable file)
        • A complete user manual (or a UI) for testing
        • A one-page description that includes the following
          • Major features in your work (e.g., high efficiency, low storage, multiple input formats, huge corpus, …)
          • Major difficulties encountered
          • Special requirements for execution environments (e.g., Java Runtime Environment, special compilers, …)
          • For team work, the name of each member and the parts each was responsible for should be clearly identified
      • Due: two weeks (May 22, 2013)

  12. Submission Instructions
      • Programs or homework in electronic files must be submitted directly on the submission site:
        • Submission site: http://140.124.183.39/IR/
        • Username: your student ID
        • Password: (Please change your default password at your first login)
      • Preparing your submission file: as one single compressed file
        • Remember to specify the names and student IDs of your team members in the files and documentation
      • If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)

  13. Evaluation
      • Randomly selected test queries will be submitted to your classifier and checked for effectiveness (F-measure)
      • Minimum requirement
        • Training and testing phases can be successfully completed
        • Effectiveness for the 10 largest classes can be evaluated
      • Optional features will be considered as bonus
        • Feature selection, UI, visualization, …
      • You might be required to give a demo if the TA is unable to run the submitted classifier

  14. Any Questions or Comments?
