IR Homework #2. By J. H. Wang May 9, 2014. Programming Exercise #2: Text Classification. Goal: to classify each document into predefined categories Input : Reuters-21578 test collection predefined categories labeled documents for training test documents for testing
IR Homework #2 By J. H. Wang May 9, 2014
Programming Exercise #2: Text Classification • Goal: to classify each document into predefined categories • Input: Reuters-21578 test collection • predefined categories • labeled documents for training • test documents for testing • Output: a classifier for each category
Input: Training and Test Sets • Using the Reuters-21578 collection • Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html • 21,578 news articles from 1987 (28.0MB uncompressed) • Distributed in 22 files in SGML format • Requires preprocessing of SGML tags • File format: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
Predefined Categories in Reuters-21578 • 5 category sets • Exchanges: 39 categories • Orgs: 56 categories • People: 267 categories • Places: 175 categories • Topics: 135 categories • In this homework, ONLY the 135 Topics categories are considered in classification • 10 largest classes • earn, acq (acquisitions), money-fx, grain, crude, trade, interest, ship, wheat, corn
Training and Test Sets • Using Reuters-21578 for text classification • Modified Lewis (ModLewis) Split • Training: 13,625 • Test: 6,188 • Modified Apte (ModApte) Split: used in this homework • Training: 9,603 • Test: 3,299 • Modified Hayes (ModHayes) Split • Training: 20,856 • Test: 722
An Example Reuters Article
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
…
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout the week in … </BODY></TEXT>
</REUTERS>
• LEWISSPLIT="TRAIN": training set in ModApte split • <TOPICS>: topical category • <BODY>: text content
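The fields highlighted in the example article can be pulled out with a few regular expressions. The following is only a minimal sketch: the function names `parse_articles` and `modapte_split` are hypothetical, and regex matching is an assumption that works for well-formed articles like the one above, not a full SGML parser. The split rule (TOPICS="YES" together with LEWISSPLIT="TRAIN" or "TEST") follows the ModApte definition in the collection's README.

```python
import re

# Regexes are an assumption that suits well-formed articles like the example;
# they are not a general SGML parser.
ARTICLE_RE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def modapte_split(attrs):
    """Return 'train', 'test', or None (not used) under the ModApte split."""
    if attrs.get("TOPICS") != "YES":
        return None
    return {"TRAIN": "train", "TEST": "test"}.get(attrs.get("LEWISSPLIT"))


def parse_articles(sgml_text):
    """Yield one dict per <REUTERS> element: id, split, topics, body."""
    for attr_text, inner in ARTICLE_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        topics_match = TOPICS_RE.search(inner)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(inner)
        yield {
            "id": attrs.get("NEWID"),
            "split": modapte_split(attrs),
            "topics": topics,  # class labels
            "body": body_match.group(1).strip() if body_match else "",  # content
        }
```

Articles whose `split` is `None` fall outside the ModApte split and can be skipped; articles without a `<BODY>` element come back with an empty body.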
Output: A Classifier • Either write your own programs or use open source tools to implement any one of the following text classification methods: • Naïve Bayes (NB) classification (Ch.13) • Rocchio classification (Ch.14) • kNN classification (Ch.14) • SVM classification (Ch.15) • …
Sec. 14.1 • [Figure: a test document to be assigned to one of three classes: Government, Science, Arts] • Test document of what class?
Rocchio Classification • Definition of centroid: $\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$ • where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$ • Assign each test document to the category with the closest prototype vector (centroid), based on cosine similarity
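The centroid definition and the cosine-based assignment can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: it assumes documents are already represented as sparse vectors (dicts mapping term to weight), and the function names are hypothetical. A document with several topics would simply appear once per topic in the training pairs.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a non-empty list of sparse vectors (dicts: term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_rocchio(labeled_vectors):
    """Build one centroid per class from (class_label, vector) pairs."""
    by_class = defaultdict(list)
    for label, vec in labeled_vectors:
        by_class[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def classify(centroids, vec):
    """Return the class whose centroid is most cosine-similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))
```

Because cosine similarity ignores vector length, dividing by the number of documents does not change which centroid is closest; the division is kept so the code matches the centroid definition.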
Tasks and Evaluations • Your system should be able to complete the following tasks using the ModApte split of the Reuters-21578 dataset: • Training • Testing • Evaluation of your system: • Training: efficiency • Testing: precision/recall/F-measure
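Precision, recall, and F-measure for one category can be computed from the true and predicted topics of the test documents. The sketch below assumes each document's topics are given as a Python set, uses the balanced F-measure F1 = 2PR / (P + R), and returns 0.0 whenever a denominator is zero; the function name is hypothetical and the zero-handling is a convention chosen for this illustration.

```python
def precision_recall_f1(true_topics, predicted_topics, category):
    """Per-category precision, recall, and F1.

    true_topics, predicted_topics: parallel lists of sets of topic labels,
    one set per test document.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_topics, predicted_topics):
        in_truth = category in truth
        in_predicted = category in predicted
        if in_truth and in_predicted:
            tp += 1  # correctly assigned to the category
        elif in_predicted:
            fp += 1  # assigned, but not in the answer
        elif in_truth:
            fn += 1  # in the answer, but not assigned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Calling this once per topic gives per-category scores, which can then be averaged across categories if a single summary figure is wanted.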
Example: Rocchio Classification • [Figure: pipeline diagram] • Training: training docs → HTML (SGML) parsing → centroid calculation → centroids • Testing: test doc. → cosine similarity against the centroids → class • Evaluation: P, R, F1
Example Steps in Rocchio Classification • Parse the SGML documents in the Reuters-21578 dataset: extract the text body and topics of each document, and separate the documents into training and test sets. • Body as content, topics as class • For each document, calculate a vector of TF-IDF weights from the text body. • For each topic class, calculate the centroid by summing the vectors of all training documents in that class and dividing by the number of those documents. • So you will get 135 centroids, one for each topic class. • For each test document, find the most similar centroid using cosine similarity, and assign the document to that centroid's class. • Compare the assigned class with the answer (in the topics tag), and evaluate how many test documents are correctly classified.
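The TF-IDF step can be sketched as follows. This is one common weighting variant (raw term frequency times log10(N / df)), chosen here as an assumption since the slides do not fix a formula; the tokenizer is deliberately crude and all function names are hypothetical. Document frequencies are collected from the training documents only, and terms never seen in training are dropped from test vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(token_lists):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, num_docs):
    """Sparse vector with weight tf * log10(N / df) for each known term."""
    tf = Counter(tokens)
    return {
        term: count * math.log10(num_docs / df[term])
        for term, count in tf.items()
        if df.get(term, 0) > 0  # skip terms unseen in the training set
    }
```

A term that occurs in every training document gets weight 0 under this variant, so it has no influence on the cosine similarity.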
Optional Functionalities • Feature selection: (Sec. 13.5) • mutual information • chi-square • … • User Interface • For selecting test documents • Visualization of classification result • …
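For the optional feature selection, mutual information between a term and a class can be computed from four document counts. The sketch below is an assumption-laden illustration of the usual expected mutual information over the 2×2 term/class contingency table; the function names and the `top_k_terms` helper are hypothetical.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Expected mutual information (in bits) of a term and a class.

    n11: docs that contain the term and are in the class
    n10: docs that contain the term and are not in the class
    n01: docs that lack the term and are in the class
    n00: docs that lack the term and are not in the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term-side marginal, class-side marginal)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # an empty cell contributes nothing
            ratio = n * joint / (term_marginal * class_marginal)
            total += (joint / n) * math.log2(ratio)
    return total


def top_k_terms(term_counts, k):
    """Pick the k terms with highest MI; term_counts maps term -> (n11, n10, n01, n00)."""
    ranked = sorted(
        term_counts,
        key=lambda term: mutual_information(*term_counts[term]),
        reverse=True,
    )
    return ranked[:k]
```

Keeping only the top-scoring terms per class shrinks the vectors before centroids are computed, which is one way to address the training-efficiency criterion.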
Submission • Your submission *should* include • The source code (and your executable file) • A complete user manual (or a UI) for testing • A one-page description that includes the following • Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …) • Major difficulties encountered • Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …) • For team work, the name of each member and the parts each member was responsible for should be clearly identified • Due: extended to three weeks (May 30, 2014)
Submission Instructions • Programs or homework in electronic files must be submitted directly on the submission site: • Submission site: http://140.124.183.31/net2ftp • FTP server: localhost • User name & password: Your student ID • Preparing your submission file: as one single compressed file • Remember to specify the names and student IDs of your team members in the files and documentation • If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)
Evaluation • Two options: • Your system automatically classifies all test documents, and displays classification results and the effectiveness (precision, recall, F-measure, accuracy) • The preferred option • Your system can randomly select some test documents (by their IDs), and run your classifier to show the classification result (both your classifier output and the answer) • Minimum requirement • Training and testing phases can be successfully completed • Optional features will be considered as bonus • E.g. feature selection, UI, visualization, … • You may be required to give a demo if the TA is unable to run your submitted classifier