Web Page Classification by Academic Fields
Richard Wang
February 15, 2006
Introduction
• Objective
  • Train a classifier that classifies web pages by academic field using a semi-supervised method
  • Identify interests/affiliations of people
  • Filter web pages for field-specific applications (e.g., a named-entity recognizer trained on computer science web pages)
• Assumptions
  • Academic fields correspond to academic departments
  • All web pages under an academic departmental website are related to the academic field that the department corresponds to
Academic Fields
• We predefine six academic fields, each shown with an example departmental URL:
  • Biological Sciences (e.g., web.mit.edu/biology/www)
  • Computer Science (e.g., www.cs.cmu.edu)
  • Economics (e.g., www.econ.gatech.edu)
  • History (e.g., www.nyu.edu/gsas/dept/history)
  • Law (e.g., www.law.miami.edu)
  • Music (e.g., www.pitt.edu/~musicdpt)
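
Under the assumption stated earlier (every page below a departmental website belongs to that department's field), a page can be labeled from its URL alone. The following is a minimal sketch of that labeling step, not the author's code; the names `DEPARTMENT_SITES`, `strip_scheme`, and `field_of_page` are hypothetical, and matching on a lowercase URL prefix is an assumption.

```python
from typing import Optional

# Example departmental sites from the slide, keyed by academic field.
DEPARTMENT_SITES = {
    "Biological Sciences": "web.mit.edu/biology/www",
    "Computer Science": "www.cs.cmu.edu",
    "Economics": "www.econ.gatech.edu",
    "History": "www.nyu.edu/gsas/dept/history",
    "Law": "www.law.miami.edu",
    "Music": "www.pitt.edu/~musicdpt",
}


def strip_scheme(url: str) -> str:
    """Lowercase the URL and drop a leading http:// or https://."""
    url = url.strip().lower()
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def field_of_page(url: str) -> Optional[str]:
    """Return the field whose departmental site contains this page, if any."""
    bare = strip_scheme(url)
    for field, site in DEPARTMENT_SITES.items():
        # Assumption: a page belongs to a site if its URL starts with the site prefix.
        if bare == site or bare.startswith(site + "/"):
            return field
    return None
```

For example, `field_of_page("https://www.nyu.edu/gsas/dept/history/faculty.html")` would be labeled History, while a page under a different NYU department would get no label.
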
System Architecture
[Diagram: Academic Field Queries; Google; Candidate Dept. URLs (Field?, URLs); Simple URL Classifier (If Match); True Dept. URLs (Field, URLs); Web Crawler (two instances); Candidate Dept. Pages (Field?, URLs, Pages); True Dept. Pages (Field, Pages); Web Page Classifier; External Module (Optional)]
Candidate Dept. URLs
• Manually devised Google queries for extracting candidate departmental URLs:
  allintitle: "Biological Sciences" OR Biology School OR Department OR Institute site:edu
  allintitle: "Computer Science" -Mathematics School OR Department OR Institute site:edu
  allintitle: Economics School OR Department OR Institute site:edu
  allintitle: History -Art School OR Department OR Institute site:edu
  allintitle: Law School OR Department OR Institute site:edu
  allintitle: Music School OR Department OR Institute site:edu
• The extracted URLs are then sent to
  • A simple URL classifier
  • The web crawler for crawling
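
The six queries share one template: the `allintitle:` operator, field-specific title terms, optional excluded terms, the same School/Department/Institute alternatives, and `site:edu`. As a rough sketch of how such query strings could be assembled (the names `FIELD_TERMS`, `UNIT_TERMS`, and `build_query` are hypothetical, and the slide does not say the queries were generated this way):

```python
# Terms shared by every query on the slide.
UNIT_TERMS = "School OR Department OR Institute"

# (field, title terms, excluded terms), copied from the queries on the slide.
FIELD_TERMS = [
    ("Biological Sciences", '"Biological Sciences" OR Biology', []),
    ("Computer Science", '"Computer Science"', ["Mathematics"]),
    ("Economics", "Economics", []),
    ("History", "History", ["Art"]),
    ("Law", "Law", []),
    ("Music", "Music", []),
]


def build_query(title_terms, excluded):
    """Assemble one query string in the same shape as the slide's queries."""
    parts = ["allintitle:", title_terms]
    parts += ["-" + term for term in excluded]  # e.g. -Mathematics
    parts += [UNIT_TERMS, "site:edu"]
    return " ".join(parts)


QUERIES = {
    field: build_query(terms, excluded)
    for field, terms, excluded in FIELD_TERMS
}
```

Keeping the per-field terms in a table makes it straightforward to add a seventh field without retyping the shared parts of the query.
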
Simple URL Classifier
• Learns URL tokens from candidate departmental URLs by counting their term frequencies
• Determines the academic field of a URL by searching it for the most frequent URL tokens of each field
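
A minimal sketch of a token-frequency URL classifier along these lines is shown below. The slide gives no implementation details, so the tokenization, the stop-token list, the `top_k` cutoff, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed list of tokens that occur in most URLs and carry no field signal.
STOP_TOKENS = {"http", "https", "www", "edu", "html", "htm", "index"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", url.lower())


def learn_top_tokens(candidate_urls, top_k=3):
    """Map each field to the top_k most frequent tokens in its candidate URLs."""
    top = {}
    for field, urls in candidate_urls.items():
        counts = Counter(
            tok for url in urls for tok in url_tokens(url)
            if tok not in STOP_TOKENS
        )
        top[field] = [tok for tok, _ in counts.most_common(top_k)]
    return top


def classify_url(url, top_tokens):
    """Return the field with the most top tokens present in the URL, or None."""
    tokens = set(url_tokens(url))
    best_field, best_hits = None, 0
    for field, tops in top_tokens.items():
        hits = sum(1 for tok in tops if tok in tokens)
        if hits > best_hits:
            best_field, best_hits = field, hits
    return best_field
```

With candidate URLs such as www.cs.cmu.edu for Computer Science and www.law.miami.edu for Law, the learned top tokens would include `cs` and `law`, so a URL containing the token `cs` would be assigned to Computer Science and a URL with no known token would be left unclassified.
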
Web Page Classifier
• Since learning is iterative, we need a fast non-binary (multi-class) classifier:
  • KNN is fast during training but extremely slow during testing
  • A One vs. All learner that uses a simple inner learner can be very fast during both training and testing
• We decided to use One vs. All with Naïve Bayes as the inner learner and a simple feature set: bag-of-words
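
To make the One vs. All idea concrete, here is a small self-contained sketch: one binary Naïve Bayes model per field ("this field" vs. "everything else") over bag-of-words counts, with the prediction taken from the model that reports the highest log-odds. It is an illustration under assumptions, not the classifier used in the talk: the multinomial model, add-one smoothing, the tokenizer, and the class names are all choices made here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase alphabetic tokens with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one 'field vs. rest' decision (add-one smoothing)."""

    def fit(self, bags, is_positive):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        for bag, pos in zip(bags, is_positive):
            self.word_counts[pos].update(bag)
            self.doc_counts[pos] += 1
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_joint(self, bag, label):
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        # Smoothed prior, so an empty class never produces log(0).
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for word, n in bag.items():
            if word in self.vocab:  # words unseen in training are ignored
                score += n * math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def log_odds(self, bag):
        """Log-odds that the page belongs to this model's field."""
        return self._log_joint(bag, True) - self._log_joint(bag, False)


class OneVsAll:
    """Train one binary model per field; predict the field with the highest log-odds."""

    def fit(self, texts, labels):
        bags = [bag_of_words(t) for t in texts]
        self.models = {
            field: BinaryNaiveBayes().fit(bags, [lab == field for lab in labels])
            for field in sorted(set(labels))
        }
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        return max(self.models, key=lambda f: self.models[f].log_odds(bag))
```

Training only requires one pass over the word counts per field, and prediction is a sum of logarithms per field, which is consistent with the speed argument on the slide; KNN, by contrast, would compare each test page against every stored training page.
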
Experimental Setting
• Initial training set (seed)
  • One entire website for each academic field
  • Manually verified that those websites are indeed departmental websites
  • A total of 15,880 web pages (18 MB)
• Test set
  • Same setting as the initial training set but with different websites
  • A total of 1,824 web pages (2 MB)
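
The seed set is the starting point for the iterative, semi-supervised learning. One plausible reading of the "If Match" step in the architecture diagram is that a candidate page joins the training data when the Simple URL Classifier and the Web Page Classifier agree on its field; the slides do not spell this out, so the loop below is a sketch under that assumption. The function names, the `train` callback, and the `max_rounds` parameter are hypothetical.

```python
def promote_matches(candidates, url_classifier, page_classifier):
    """Split (url, text) candidates into promoted and remaining.

    A candidate is promoted when both classifiers name the same field.
    """
    promoted, remaining = [], []
    for url, text in candidates:
        url_field = url_classifier(url)
        page_field = page_classifier(text)
        if url_field is not None and url_field == page_field:
            promoted.append((url_field, url, text))
        else:
            remaining.append((url, text))
    return promoted, remaining


def self_train(seed, candidates, train, url_classifier, max_rounds=3):
    """Grow the labeled set from the seed over several rounds.

    seed: list of (field, text) pairs with trusted labels.
    candidates: list of (url, text) pairs with unknown fields.
    train: callable that takes labeled pairs and returns a text -> field function.
    """
    labeled = list(seed)
    pending = list(candidates)
    for _ in range(max_rounds):
        page_classifier = train(labeled)
        promoted, pending = promote_matches(pending, url_classifier, page_classifier)
        if not promoted:
            break  # nothing new was learned this round
        labeled.extend((field, text) for field, _url, text in promoted)
    return train(labeled), labeled
```

Each round retrains on the enlarged labeled set, so pages that could not be matched at first may be promoted later once the page classifier has seen more vocabulary for their field.
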
Classifier Analysis (1) [Figures: Biological Sciences; Computer Science]
Classifier Analysis (2) [Figures: Economics; History]
Classifier Analysis (3) [Figures: Law; Music]
Conclusion & Future Work
• Classification performance can be improved by using unlabeled data
• Try more iterations in the experiments
• Try to learn/classify more academic fields
• Try other multi-class classifiers
Thank You • Questions?