CS 548 – Project 5: Text Classification
Skyler Whorton – April 26, 2012
Data Set
• UCI 4-Universities Data Set
  • Cornell, Texas, Washington, Wisconsin plus “misc” sources
  • Computer Science department web pages in 7 categories:
    • Course
    • Department
    • Faculty
    • Project
    • Staff
    • Student
    • “Other” – confounding label
• WPI Computer Science Department
  • Manually picked test set
  • Each class label covered
Pre-Processing, Objectives
• HTML webpage → Text document
  • Strip HTML header, tags, special characters
  • Strip dates, numbers, source-specific words
  • Remove instances with “other” label
• Training and test data
  • Leave-one-out validation on each of the 4 sources
  • Extra test set: WPI web page collection
• Objectives & guiding questions
  • Compare single word vs. N-gram tokenization
  • Find words/N-grams predictive of document type
  • Which pre-processing steps/settings strongly affect classifier performance?
  • Which classifier generalizes best?
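The HTML-to-text step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning script: the function name `clean_page`, the `stop_words` parameter (standing in for "source-specific words"), and the choice to treat "HTML header" as the `<head>` element are all assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <head>, <script>, and <style> content."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html, stop_words=()):
    """Hypothetical helper: HTML page -> lowercase text with only letters.

    stop_words is an assumed stand-in for source-specific words
    (for example, a university name) that should not reach the classifier.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Replace digits, punctuation, and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)
```

Removing source-specific words matters for the leave-one-source-out setup: a token such as a university name would otherwise let a classifier key on where a page came from rather than what type of page it is.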
Experiment Process
[Diagram: Raw HTML pages (Cornell, Texas, Washington, Wisconsin, WPI) → Cleaned text → Training (“Sans-Cornell”) and Testing (Cornell, WPI) sets → Document to word vector (WordTokenizer, N-GramTokenizer) → Training Vector, Test Vectors → Classifiers (J48 Tree, NaïveBayes, SMO) → Evaluation]
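The split-then-vectorize part of this process can be sketched in plain Python. The slides name Weka's WordTokenizer and N-GramTokenizer; the functions below (`word_tokens`, `ngram_tokens`, `leave_one_source_out`, `to_count_vectors`) are hypothetical stand-ins written for illustration, and the n-gram range is an assumed setting, not one reported in the slides.

```python
from collections import Counter


def word_tokens(text):
    """Single-word tokenization on whitespace."""
    return text.split()


def ngram_tokens(text, n_min=1, n_max=2):
    """Word n-grams for every n in [n_min, n_max] (range is an assumption)."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def leave_one_source_out(docs):
    """docs: list of (source, label, text) tuples.

    Yields (held_out_source, train_docs, test_docs), holding out one
    source at a time, e.g. train on "sans-Cornell" and test on Cornell.
    """
    sources = sorted({source for source, _, _ in docs})
    for held_out in sources:
        train = [d for d in docs if d[0] != held_out]
        test = [d for d in docs if d[0] == held_out]
        yield held_out, train, test


def to_count_vectors(train_texts, test_texts, tokenize):
    """Builds the vocabulary from training text only, then counts tokens."""
    vocab = sorted({tok for text in train_texts for tok in tokenize(text)})
    index = {tok: i for i, tok in enumerate(vocab)}

    def vectorize(text):
        vec = [0] * len(vocab)
        for tok, count in Counter(tokenize(text)).items():
            if tok in index:  # tokens unseen in training are dropped
                vec[index[tok]] = count
        return vec

    train_vecs = [vectorize(t) for t in train_texts]
    test_vecs = [vectorize(t) for t in test_texts]
    return vocab, train_vecs, test_vecs
```

Building the vocabulary from the training fold alone keeps the held-out source from leaking into the features, which is the point of testing on a university the classifier has not seen.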
Results
• 4-Universities
  • NaïveBayes: cross-validation
  • J48: test set
  • WPI web pages much different
• All-Universities
  • Removed “other”
  • SMO: high accuracy, but overfit
• Predictive words:
  • professor, computer science, ph d, assignments, syllabus, research, technical reports, computer science, I am, student in, groups, s home
[Chart: classification accuracy]
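The slides do not say how the predictive words were identified. One simple way to surface candidates, shown here purely as an assumed illustration, is to rank tokens by the log-ratio of smoothed document frequencies inside and outside a class. The function name `predictive_tokens` and the add-one smoothing are hypothetical choices, not the project's method.

```python
import math
from collections import Counter


def predictive_tokens(docs, target_label, top_k=5):
    """docs: list of (label, tokens) pairs.

    Ranks tokens by log(P(token in doc | target) / P(token in doc | rest)),
    using add-one smoothing on document frequencies (an assumed choice).
    """
    in_df, out_df = Counter(), Counter()
    n_in = n_out = 0
    for label, tokens in docs:
        if label == target_label:
            n_in += 1
            in_df.update(set(tokens))   # count each token once per document
        else:
            n_out += 1
            out_df.update(set(tokens))

    scores = {}
    for tok in set(in_df) | set(out_df):
        p_in = (in_df[tok] + 1) / (n_in + 2)
        p_out = (out_df[tok] + 1) / (n_out + 2)
        scores[tok] = math.log(p_in / p_out)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

The same function works on n-gram tokens, so phrases such as "technical reports" or "student in" can be scored alongside single words.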