IR Homework #3 By J. H. Wang May 8, 2013
Programming Exercise #3: Text Classification • Goal: to classify each document into predefined categories • Input: Reuters-21578 test collection • predefined categories • labeled documents for training • test documents for testing • Output: a classifier for each category
Input: Training and Test Sets • Using the Reuters-21578 collection • Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html • 21,578 news articles from 1987 (28.0MB uncompressed) • Distributed in 22 files in SGML format • The SGML tags must be parsed or stripped during preprocessing • File format: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
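The SGML preprocessing step could be sketched as follows. This is a minimal illustration, not a full SGML parser: it assumes each record is wrapped in `<REUTERS ...> ... </REUTERS>` with `<TOPICS>`, `<D>`, and `<BODY>` elements as described in the README, and the function name `parse_reuters` and the sample record are made up for the example.

```python
import re

# Regular expressions for the record parts used in this homework.
# The element names follow the collection README; everything else is illustrative.
DOC_RE = re.compile(r"<REUTERS\s+([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters(sgml_text):
    """Return one dict per article: tag attributes, topic labels, body text."""
    docs = []
    for attrs_raw, content in DOC_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attrs_raw))
        topics_match = TOPICS_RE.search(content)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(content)
        body = body_match.group(1).strip() if body_match else ""
        docs.append({"attrs": attrs, "topics": topics, "body": body})
    return docs


# Hand-written sample record for illustration only (not copied from the corpus).
sample = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="1" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>SAMPLE TITLE</TITLE>
<BODY>Showers continued throughout the week.</BODY>
</TEXT>
</REUTERS>"""

docs = parse_reuters(sample)
print(docs[0]["topics"], docs[0]["attrs"]["LEWISSPLIT"])
```

When reading the real files, some articles have no `<BODY>` element, so the empty-string fallback matters; the file encoding is not plain ASCII everywhere, so opening the files with a permissive encoding such as `latin-1` is a reasonable assumption to test.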
Predefined Categories in Reuters-21578 • 5 category sets • Exchanges: 39 categories • Orgs: 56 categories • People: 267 categories • Places: 175 categories • Topics: 135 categories • In this homework, we ONLY consider classification in the 135 Topical categories • 10 largest classes • Earn, acquisitions, money-fx, grain, crude, trade, interest, ship, wheat, corn
Training and Test Sets • Using Reuters-21578 for text classification • Modified Lewis (ModLewis) Split • Training: 13,625 • Test: 6,188 • Modified Apte (ModApte) Split: used in this homework • Training: 9,603 • Test: 3,299 • Modified Hayes (ModHayes) Split • Training: 20,856 • Test: 722
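Selecting the ModApte split can be sketched from the attributes on each `<REUTERS>` tag. According to the collection README, the ModApte training set is the articles with `LEWISSPLIT="TRAIN"` and `TOPICS="YES"`, and the test set is those with `LEWISSPLIT="TEST"` and `TOPICS="YES"`; the function name `modapte_split` and the dict-based input are assumptions of this example.

```python
def modapte_split(attrs):
    """Map a <REUTERS> tag's attributes to 'train', 'test', or 'unused'."""
    # ModApte only uses articles whose TOPICS attribute is "YES".
    if attrs.get("TOPICS") != "YES":
        return "unused"
    lewis = attrs.get("LEWISSPLIT")
    if lewis == "TRAIN":
        return "train"
    if lewis == "TEST":
        return "test"
    return "unused"  # e.g. LEWISSPLIT="NOT-USED"


print(modapte_split({"TOPICS": "YES", "LEWISSPLIT": "TRAIN"}))
```

Counting the articles assigned to each split and comparing against 9,603 and 3,299 is a quick sanity check on the preprocessing.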
Output: A Classifier • Either your own program(s) or open source tools • Naïve Bayes (NB) classification (Ch.13) • Rocchio classification (Ch.14) • kNN classification (Ch.14) • SVM classification (Ch.15) • …
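As one illustration of the options above, a multinomial Naïve Bayes classifier with add-one smoothing might look like the sketch below. It assumes each document is a token list with a single label, which simplifies the multi-label Reuters setting; the names `train_nb` and `classify_nb` and the toy documents are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: iterable of (tokens, label). Returns {label: (log_prior, log_cond)}."""
    vocab = set()
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        log_prior = math.log(class_docs[c] / n_docs)
        # Add-one (Laplace) smoothing over the whole vocabulary.
        log_cond = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[c] = (log_prior, log_cond)
    return model


def classify_nb(model, tokens):
    """Return the label with the highest log posterior; unseen terms are skipped."""
    def score(c):
        log_prior, log_cond = model[c]
        return log_prior + sum(log_cond[t] for t in tokens if t in log_cond)
    return max(model, key=score)


# Toy training data, for illustration only.
train = [
    ("chinese beijing chinese".split(), "china"),
    ("chinese chinese shanghai".split(), "china"),
    ("chinese macao".split(), "china"),
    ("tokyo japan chinese".split(), "other"),
]
model = train_nb(train)
print(classify_nb(model, "chinese chinese chinese tokyo japan".split()))
```

For this homework, one such binary classifier per topic (category vs. not category) would match the "a classifier for each category" output more closely than a single multi-class model.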
Sec. 14.1 • [Figure: a test document plotted among three classes (Government, Science, Arts)] • Which class does the test document belong to?
Rocchio Classification • Definition of centroid: $\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$ • where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$ • Assign each test document to the category with the closest prototype vector (centroid), based on cosine similarity
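The centroid computation and cosine-based assignment can be sketched as below. This is a minimal version under stated assumptions: raw term frequencies with length normalization stand in for whatever weighting (typically tf-idf) is actually used, each training document carries a single label, and the names `train_rocchio` and `classify_rocchio` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train_rocchio(labeled_docs):
    """Compute one centroid per class: the mean of its documents' vectors."""
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, label in labeled_docs:
        for t, w in normalize(Counter(tokens)).items():
            sums[label][t] += w
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def classify_rocchio(centroids, tokens):
    """Assign the document to the class whose centroid is most similar."""
    v = Counter(tokens)
    return max(centroids, key=lambda c: cosine(v, centroids[c]))


# Toy training data, for illustration only.
train = [
    (["wheat", "grain", "harvest"], "grain"),
    (["grain", "corn"], "grain"),
    (["oil", "crude", "barrel"], "crude"),
]
centroids = train_rocchio(train)
print(classify_rocchio(centroids, ["crude", "oil", "price"]))
```

Because Reuters-21578 documents can have several topics, a per-category variant would compare each document against a category centroid and its complement rather than picking a single best class.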
Evaluation of Classification Results • Test queries randomly selected from Reuters-21578 test set • Training: efficiency • Testing: precision/recall/F-measure
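Per-category precision, recall, and F-measure can be computed as in the sketch below. It assumes the balanced F-measure (F1) and represents each document's true and predicted topics as sets; the function name and the toy labels are illustrative.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Score one category over parallel lists of label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        in_truth = category in truth
        in_pred = category in pred
        if in_pred and in_truth:
            tp += 1  # correctly assigned
        elif in_pred:
            fp += 1  # assigned but wrong
        elif in_truth:
            fn += 1  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, for illustration only.
truth = [{"earn"}, {"acq"}, {"earn", "acq"}, {"grain"}]
pred = [{"earn"}, {"earn"}, {"acq"}, {"grain"}]
print(precision_recall_f1(truth, pred, "earn"))  # (0.5, 0.5, 0.5)
```

Running this once for each of the 10 largest classes gives the per-class figures; how to average them (macro vs. micro) is left open by the assignment.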
Optional Functionalities • Feature selection: (Sec. 13.5) • mutual information • chi-square • … • User Interface • For classifying test queries • Visualization of classification result • …
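Chi-square feature selection could be sketched as follows, using the 2×2 contingency counts of term occurrence against class membership: $\chi^2 = \frac{N (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11}+N_{01})(N_{11}+N_{10})(N_{10}+N_{00})(N_{01}+N_{00})}$. The input format (a term set and a label set per document) and the function names are assumptions of this example.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """n11: term & class, n10: term & not class, n01: no term & class, n00: neither."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0


def top_terms_by_chi_square(docs, category, k):
    """docs: list of (set_of_terms, set_of_labels). Returns the k best-scoring terms."""
    n = len(docs)
    n_class = sum(1 for _, labels in docs if category in labels)
    df = Counter()        # document frequency of each term
    df_in = Counter()     # document frequency within the category
    for terms, labels in docs:
        for t in set(terms):
            df[t] += 1
            if category in labels:
                df_in[t] += 1
    scores = {}
    for t in df:
        n11 = df_in[t]
        n10 = df[t] - n11
        n01 = n_class - n11
        n00 = n - n11 - n10 - n01
        scores[t] = chi_square(n11, n10, n01, n00)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy documents, for illustration only.
docs = [
    ({"wheat", "price"}, {"grain"}),
    ({"wheat", "harvest"}, {"grain"}),
    ({"oil", "price"}, {"crude"}),
    ({"oil", "barrel"}, {"crude"}),
]
print(top_terms_by_chi_square(docs, "grain", 2))
```

Note that chi-square scores terms that are strongly absent from a category as highly as terms that are strongly present, which is why both "oil" and "wheat" rank at the top for the toy "grain" category.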
Submission • Your submission *should* include • The source code (and your executable file) • A complete user manual (or a UI) for testing • A one-page description that includes the following • Major features of your work (e.g., high efficiency, low storage, multiple input formats, huge corpus, …) • Major difficulties encountered • Special requirements for the execution environment (e.g., Java Runtime Environment, special compilers, …) • For team work, the name of each member and the parts each was responsible for, clearly identified • Due: two weeks (May 22, 2013)
Submission Instructions • Programs or homework in electronic files must be submitted directly on the submission site: • Submission site: http://140.124.183.39/IR/ • Username: your student ID • Password: (Please change your default password at your first login) • Preparing your submission file: as one single compressed file • Remember to specify the names and student IDs of your team members in the files and documentation • If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)
Evaluation • Randomly selected test queries will be submitted to your classifier, and checked for effectiveness (F-measure) • Minimum requirement • Training and testing phases can be successfully completed • Effectiveness for the 10 largest classes can be evaluated • Optional features will be considered as bonus • Feature selection, UI, visualization, … • You may be required to give a demo if the TA is unable to run your submitted classifier