Document Quality Judgment with Textual Features

Document Quality Judgment with Textual Features. Bing Bai, Computer Science Department, Rutgers University, December 2003. Document qualities: distinct from relevance; also important in information retrieval systems; partially dependent on textual features; document length; “Coward”.

Presentation Transcript


  1. Document Quality Judgment with Textual Features. Bing Bai, Computer Science Department, Rutgers University, December 2003

  2. Document Qualities • Distinct from relevance • Also important in information retrieval systems • Partially dependent on textual features • Document length • “Coward”

  3. Document Qualities (Continued) • Pre-defined qualities: • Accuracy • Credibility • Depth • Grammar Correctness • Objectivity • Multi-side • Readability • Source Authority • Verbose-Concise

  4. Textual Features • Statistics by GATE • Categories of Features

  5. Textual Features (Continued)

  6. Data Set and Testing Scheme • More than 2000 documents from 3 different article sources: CNS, TREC, and XinHua News Agency. • The nine qualities of these documents were judged by faculty, professionals, and students. • 3 qualities (“Depth”, “Multi-side”, “Objectivity”) showed the strongest correlations with the textual features we defined. • 2-fold cross-validation, repeated 5 times. The training set and testing set are generated randomly each time.
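The testing scheme on slide 6 (2-fold cross-validation repeated 5 times, with a fresh random split each time) can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names, the seed, the toy labels, and the majority-class stand-in classifier are all assumptions introduced here.

```python
import random


def repeated_two_fold_splits(n_docs, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated 2-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    half = n_docs // 2
    for repeat in range(n_repeats):
        rng.shuffle(indices)  # new random partition for every repeat
        first, second = indices[:half], indices[half:]
        # Each half is used once for training and once for testing.
        yield repeat, 0, sorted(first), sorted(second)
        yield repeat, 1, sorted(second), sorted(first)


def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Hypothetical stand-in classifier: predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in test_idx if labels[i] == majority)
    return hits / len(test_idx)


if __name__ == "__main__":
    labels = [1] * 12 + [0] * 8  # made-up binary quality judgments
    scores = [
        majority_baseline_accuracy(labels, train, test)
        for _, _, train, test in repeated_two_fold_splits(len(labels))
    ]
    print(len(scores), "fold scores, mean accuracy", sum(scores) / len(scores))
```

Five repeats of two folds give ten accuracy figures, and averaging them smooths out the effect of any single random split. In practice the baseline would be replaced by whichever classifier is being evaluated.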

  7. Results

  8. Factor Analysis • Purpose: viewing 112 variables at once is hard; data reduction lets us concentrate on the most important factors in the data. • Distribution of two qualities over factor 1 and factor 2: “Depth” on the left, “Multi-side” on the right.
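For reference, one common formulation of the factor-analysis model behind this kind of data reduction is given below. The notation (x, μ, Λ, f, ε, Ψ, k) is standard textbook notation assumed here, not taken from the slides; only the count of 112 observed variables and the use of the first two factors come from the deck.

```latex
\[
  x = \mu + \Lambda f + \varepsilon, \qquad
  f \sim \mathcal{N}(0, I_k), \quad
  \varepsilon \sim \mathcal{N}(0, \Psi), \quad
  \Psi \ \text{diagonal},
\]
\[
  x \in \mathbb{R}^{112}, \qquad
  \Lambda \in \mathbb{R}^{112 \times k}, \qquad
  k \ll 112 .
\]
```

Plotting documents by their scores on the first two components of f is what yields a two-dimensional view such as the factor 1 versus factor 2 plots described on slide 8.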

  9. Gaussian-Bayesian Classifier • If P(x|C1)P(C1) > P(x|C2)P(C2), classify x as class C1; otherwise classify x as class C2. • Where: (equation not preserved in the transcript) • Singularity elimination (discard trivial eigenvalues)

  10. GBC Results

  11. GBC (Continued) • The Gaussian decision boundary does not perform as well as the linear boundaries (logistic regression and support vector machine). • One reason: the feature distributions are not Gaussian. • Distributions of feature NN: (a) documents with low objectivity, (b) documents with high objectivity.
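The decision rule on slide 9 can be illustrated with a small sketch. The density formula from the slide did not survive in the transcript, so this example makes its own simplifying assumptions. It models each class with an axis-aligned (diagonal-covariance) Gaussian. It uses a variance floor in place of the eigenvalue-based singularity elimination the slide mentions. It compares P(x|C)P(C) in log space to avoid underflow. All function names, the `VAR_FLOOR` value, and the toy data are hypothetical.

```python
import math

VAR_FLOOR = 1e-6  # assumed floor so constant features never give zero variance


def fit_class(rows):
    """Estimate per-feature mean and variance for one class."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [
        max(sum((r[j] - means[j]) ** 2 for r in rows) / n, VAR_FLOOR)
        for j in range(dims)
    ]
    return means, variances


def log_score(x, means, variances, prior):
    """Return log P(x|C) + log P(C) under a diagonal-covariance Gaussian."""
    total = math.log(prior)
    for xj, m, v in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
    return total


def train(rows, labels):
    """Fit one Gaussian per class; the prior is the class frequency."""
    model = {}
    for c in set(labels):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        means, variances = fit_class(class_rows)
        model[c] = (means, variances, len(class_rows) / len(rows))
    return model


def classify(model, x):
    """Choose the class with the largest P(x|C)P(C), compared in log space."""
    return max(model, key=lambda c: log_score(x, *model[c]))


if __name__ == "__main__":
    # Made-up two-feature data for two classes.
    rows = [[0.0, 1.0], [0.2, 1.1], [0.1, 0.9], [3.0, 5.0], [3.2, 5.1], [2.9, 4.8]]
    labels = [1, 1, 1, 2, 2, 2]
    model = train(rows, labels)
    print(classify(model, [0.1, 1.0]), classify(model, [3.0, 5.0]))
```

Because each class gets its own variances, the resulting boundary is quadratic rather than linear. It is only appropriate when the per-class feature distributions are roughly Gaussian, which slide 11 reports was not the case for features such as NN.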
