Document Quality Judgment with Textual Features. Bing Bai, Computer Science Department, Rutgers University, December 2003.
Document Qualities • Not the same as relevance • Also important in information retrieval systems • Partially dependent on textual features • Document length • “Coward”
Document Qualities (Continued) • Pre-defined Qualities • Accuracy • Credibility • Depth • Grammatical Correctness • Objectivity • Multi-side • Readability • Source Authority • Verbose-Concise
Textual Features • Statistics by GATE • Categories of Features
Data Set and Testing Scheme • More than 2,000 documents from 3 different article sources: CNS, TREC, and XinHua News Agency. • The nine qualities of these documents were judged by faculty, professionals, and students. • 3 qualities (“Depth”, “Multi-side”, “Objectivity”) showed the strongest correlations with the textual features we defined. • 2-fold cross-validation, repeated 5 times. The training and test sets are generated randomly each time.
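The testing scheme above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `repeated_two_fold_splits`, the seed, and the toy sample count are all assumptions, and it assumes that within each repetition both halves take a turn as the test set.

```python
import random

def repeated_two_fold_splits(n_samples, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated 2-fold CV.

    Hypothetical helper: each repetition reshuffles the data, cuts it in
    half, and uses each half once for training and once for testing.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        rng.shuffle(indices)
        half = n_samples // 2
        fold_a, fold_b = indices[:half], indices[half:]
        yield fold_a, fold_b  # train on A, test on B
        yield fold_b, fold_a  # train on B, test on A

# Toy usage with 10 samples: 5 repetitions x 2 folds = 10 train/test pairs.
splits = list(repeated_two_fold_splits(n_samples=10))
```

Averaging a classifier's accuracy over all pairs gives a single estimate that is less sensitive to any one random split.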
Factor Analysis • Purpose: viewing 112 variables at once is hard; data reduction lets us concentrate on the most important factors in the data. • Distributions of two qualities over factor 1 and factor 2: “Depth” on the left, “Multi-side” on the right.
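To make the idea of data reduction concrete, here is a small sketch that extracts a single leading direction from a toy data set. Note the substitution: it uses principal component analysis via power iteration as a simpler stand-in, not factor analysis proper, and the helper names and toy data are assumptions made for illustration.

```python
import math

def top_component(rows, n_iter=200):
    """Return (feature means, leading principal direction) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Score each row along direction v after centering."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Toy data: the second feature is twice the first, the third is constant,
# so one direction captures all of the variation.
data = [[1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0], [4.0, 8.0, 5.0]]
means, direction = top_component(data)
scores = project(data, means, direction)
```

Plotting documents by their scores on the first two such directions gives the kind of two-dimensional view the slide describes.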
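Gaussian-Bayesian Classifier • If P(x|C1)P(C1) > P(x|C2)P(C2), classify x as class C1; otherwise classify x as class C2. • Where P(x|Ci) is the multivariate Gaussian density with class mean $\mu_i$ and covariance $\Sigma_i$ over $d$ features: $P(x|C_i) = (2\pi)^{-d/2} |\Sigma_i|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right)$ • Singularity elimination (discard the trivial, near-zero eigenvalues of the covariance matrix and their eigenvectors)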
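The decision rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses a diagonal covariance (independent features) so that no matrix inversion or eigenvalue pruning is needed, it compares log scores rather than raw products for numerical stability, and the function names and toy data are hypothetical.

```python
import math

def fit_gaussian(rows):
    """Estimate per-feature mean and variance (diagonal covariance assumption)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    var = [sum((r[j] - mean[j]) ** 2 for r in rows) / (n - 1) for j in range(d)]
    return mean, var

def log_density(x, mean, var):
    """Log of the Gaussian density with diagonal covariance."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def classify(x, params1, prior1, params2, prior2):
    """Pick C1 if log P(x|C1) + log P(C1) exceeds the same score for C2."""
    score1 = log_density(x, *params1) + math.log(prior1)
    score2 = log_density(x, *params2) + math.log(prior2)
    return "C1" if score1 > score2 else "C2"

# Toy training data for two well-separated classes.
class1 = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.3], [1.2, 2.1]]
class2 = [[4.0, 5.0], [4.5, 5.2], [3.8, 4.7], [4.2, 5.1]]
p1, p2 = fit_gaussian(class1), fit_gaussian(class2)
n1, n2 = len(class1), len(class2)
prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)

label_a = classify([1.1, 2.0], p1, prior1, p2, prior2)
label_b = classify([4.1, 5.0], p1, prior1, p2, prior2)
```

With a full covariance matrix, the variance terms would be replaced by the inverse and determinant of the covariance, which is where the singularity elimination step becomes necessary.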
GBC (Continued) • The Gaussian boundary does not perform as well as a linear boundary (Logistic Regression and Support Vector Machine). • One reason: the distributions are not Gaussian. • Distributions of the feature NN: (a) documents with low objectivity, (b) documents with high objectivity.
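One simple way to check whether a feature looks Gaussian is to compute its sample skewness and excess kurtosis, both of which are zero for a true Gaussian. The sketch below is illustrative only: the function name and the toy values are assumptions, not data from the study.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments of the sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical values: a symmetric sample and a long-tailed, count-like sample.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
long_tailed = [0.0, 0.0, 1.0, 1.0, 2.0, 3.0, 20.0]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_long, kurt_long = skewness_and_excess_kurtosis(long_tailed)
```

A strongly positive skewness, as in the long-tailed sample, suggests that a Gaussian density is a poor fit for that feature.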