Automatic Syllabus Classification JCDL – Vancouver – 22 June 2007

Automatic Syllabus Classification • JCDL – Vancouver – 22 June 2007 • Edward A. Fox (presenting co-author), Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Perez-Quinones, William Cameron, GuoFang Teng, and Lillian (“Boots”) Cassel

Why Study the Syllabus Genre? • Educational resource • Importance to the educational community • Educators • Students • Self-learners • Thanks to NSF DUE grant 5328255 (personalization support for NSDL)

Where to look for a specific syllabus? • Non-standard publishing mechanisms: • Instructor’s website • CMSs (courseware management systems, e.g., Sakai) • Catalogs • Limited access outside the university • Search on the Web • Many non-relevant links in search results

Syllabus Library • Bootstrapping • Identify true syllabi from search results • Store in a repository • Develop tools & applications • Scaling up • Encourage contributions from educational communities

An Essential Step towards Syllabus Library: Classification • Classification Objects: • Potential syllabi in Computer Science: search on the Web, using syllabus keywords, only in the educational domains • Class Definition • Feature Selection • Model Selection • Training and Testing

Four Classes • Full Syllabus • Partial Syllabus • Entry Page • Noise

Full Syllabus

Partial Syllabus

Entry Page

Noise

Syllabus Components • Course code • Title • Class time & location • Offering institution • Teaching staff • Course description • Objectives • Web site • Prerequisite • Textbook • Grading policy • Schedule • Assignment • Exam • Resources

Features • 84 genre-specific features, based on: • The occurrences of keywords • The positions of keywords • The co-occurrences of keywords and links • A series of keywords for each syllabus component
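The three feature families above can be illustrated with a minimal sketch. This is not the authors' extractor: the keyword lists, the feature names, and the function `extract_features` are all hypothetical, and the 84 features used in the study are not enumerated in these slides.

```python
import re

# Hypothetical keyword lists, one per syllabus component (assumption:
# the actual keyword series used in the study are not given in the slides).
COMPONENT_KEYWORDS = {
    "textbook": ["textbook"],
    "grading": ["grading", "grades"],
    "schedule": ["schedule"],
}

# Very rough anchor-text matcher; a real system would use an HTML parser.
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def extract_features(html: str) -> dict:
    """Return occurrence, position, and keyword-in-link features for a page."""
    text = html.lower()
    link_texts = [t.lower() for t in LINK_RE.findall(html)]
    features = {}
    for component, keywords in COMPONENT_KEYWORDS.items():
        # Character offsets of every keyword match for this component.
        positions = [
            m.start()
            for kw in keywords
            for m in re.finditer(re.escape(kw), text)
        ]
        # Occurrence feature: how often the component's keywords appear.
        features[f"{component}_count"] = len(positions)
        # Position feature: relative offset of the first match (1.0 if absent).
        features[f"{component}_first_pos"] = (
            min(positions) / len(text) if positions else 1.0
        )
        # Co-occurrence feature: number of links whose anchor text has a keyword.
        features[f"{component}_in_link"] = sum(
            1 for lt in link_texts if any(kw in lt for kw in keywords)
        )
    return features


page = (
    '<h1>CS 101</h1><p>Textbook: TBA</p>'
    '<a href="sched.html">Schedule</a><p>Grading policy</p>'
)
features = extract_features(page)
```

A keyword that appears mostly inside links hints at an entry page pointing to syllabus parts, while the same keyword in body text hints at a syllabus proper; that intuition is an interpretation of the slide, not a stated result.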

Classification Models • Discriminative Models • Support Vector Machines (SVM) • SMO-L: Sequential Minimal Optimization, an algorithm that accelerates SVM training • SMO-P: SMO with a polynomial kernel • Generative Models • Naïve Bayes (NB) • NB-K: NB with kernel methods used to estimate the distribution of numeric attributes
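To make the model variants concrete, here is a small sketch of the two ideas that distinguish them: a polynomial kernel for the SVM, and a kernel density estimate in place of a single Gaussian for a numeric attribute in Naïve Bayes. The degree, the constant `coef0`, and the bandwidth are illustrative assumptions; the slides do not state the hyperparameters that were used.

```python
import math


def linear_kernel(x, y):
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree; degree and coef0 are assumed values."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def gaussian_likelihood(x, values):
    """Standard NB: fit one Gaussian to a class's attribute values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = max(math.sqrt(var), 1e-6)  # guard against zero variance
    return gaussian_pdf(x, mean, std)


def kernel_likelihood(x, values, bandwidth=1.0):
    """Kernel-based NB: average of Gaussians centred on each training value.

    The fixed bandwidth is an assumption made for illustration.
    """
    return sum(gaussian_pdf(x, v, bandwidth) for v in values) / len(values)


# A bimodal attribute: a single Gaussian puts most mass where no data lies,
# while the kernel estimate follows the two clusters.
counts = [0, 0, 10, 10]
single_at_gap = gaussian_likelihood(5, counts)
kernel_at_gap = kernel_likelihood(5, counts)
```

The kernel estimate only helps when an attribute is clearly non-Gaussian within a class, so it does not guarantee better classification.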

Evaluation • Training corpus: 1020 out of the 8000+ potential syllabi • All in HTML, PDF, PostScript, or plain text • Manual tagging of the training corpus • Unanimous agreement by three co-authors • Evaluation strategy: ten-fold cross-validation • Metric: F1 (an overall measure of classification performance)
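The evaluation strategy above can be sketched in a few lines: split the tagged documents into ten folds, hold out each fold once, and score each class with F1, the harmonic mean of precision and recall. The helper names, the seed, and the toy labels are assumptions for illustration only; they are not the study's data or tooling.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def f1_score(y_true, y_pred, positive):
    """Per-class F1 = 2PR / (P + R), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 1020 tagged documents, as in the training corpus size stated above.
splits = list(k_fold_indices(1020, k=10))

# Toy labels (made up) to show the metric on a single fold.
y_true = ["full", "full", "noise", "partial"]
y_pred = ["full", "noise", "noise", "full"]
f1_full = f1_score(y_true, y_pred, "full")
f1_partial = f1_score(y_true, y_pred, "partial")
```

Reporting F1 per class rather than overall accuracy keeps a weak class, such as one that is never predicted correctly, from being hidden by the majority classes.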

Results with Random Set • Best items are in purple boxes. • Acctr: classification accuracy on the training set.

Results (Cont’d) • On average, SVM outperforms NB on our syllabus classification task. • All classifiers fail to identify the partial syllabus class. • The kernel settings for NB do not help in the syllabus classification task. • Classification accuracy is limited even on the training data.

Future Work • Feature selection • Add general feature selection methods from text classification • e.g., Document Frequency, Information Gain, and Mutual Information • Hybrid: combine our genre-specific features with the general features
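For reference, the three general feature selection scores named above can be computed from term presence and class labels as in the sketch below. It uses the common text-classification definitions (document frequency as a count, information gain as the reduction in class entropy, and pointwise mutual information between a term and one class); the function names and the toy corpus are assumptions, and nothing here reflects results from this work.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def document_frequency(term, docs):
    """Number of documents (token sets) that contain the term."""
    return sum(1 for d in docs if term in d)


def information_gain(term, docs, labels):
    """Class entropy minus the entropy after splitting on term presence."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) + (
        len(without_t) / n
    ) * entropy(without_t)
    return entropy(labels) - conditional


def mutual_information(term, cls, docs, labels):
    """Pointwise MI: log2( P(term, cls) / (P(term) * P(cls)) )."""
    n = len(docs)
    n_t = document_frequency(term, docs)
    n_c = sum(1 for l in labels if l == cls)
    n_tc = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log2(n_tc * n / (n_t * n_c))


# Toy corpus (made up): each document is a set of tokens.
docs = [
    {"textbook", "grading"},
    {"textbook", "exam"},
    {"news", "sports"},
    {"news", "grading"},
]
labels = ["syllabus", "syllabus", "noise", "noise"]

df_textbook = document_frequency("textbook", docs)
ig_textbook = information_gain("textbook", docs, labels)
ig_grading = information_gain("grading", docs, labels)
mi_textbook = mutual_information("textbook", "syllabus", docs, labels)
```

In this toy corpus "textbook" separates the classes perfectly and "grading" not at all, which is the kind of ranking a hybrid feature set could use to prune uninformative general terms.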

Future Work (Cont’d) • Syllabus Library • Welcome to http://doc.cs.vt.edu • Share your favorite course resources – not limited to the syllabus genre. • Information Extraction • Semantic search • Personalization

Summary • Towards a syllabus library • Starting from search results on the web • Classification of the search results for true syllabi • SVM is a better choice for our syllabus classification task. • Towards an educational on-line community around the syllabus library

Q & A
