
A Probabilistic Model for Classification of Multiple-Record Web Documents



Presentation Transcript


  1. A Probabilistic Model for Classification of Multiple-Record Web Documents • June Tang, Yiu-Kai Ng

  2. Overview • Probabilistic model • Bayes decision theory • Document and query representations • Ranking-function construction • Multivariate statistical analysis

  3. Approach • Constructing a ranking function for a probabilistic model based on multivariate statistical analysis • Minimizing the expected cost of misclassification • Deriving a classification rule • Deriving a linear classification rule • Deriving a sample linear classification rule

  4. Application Ontology

  5. Document Representation • Attributes: (Year, Make, Model, Mileage, Price, Feature, PhoneNr) • Total records: 60 • Attribute counts: (Year: 62) (Make: 58) (Model: 48) (Mileage: 12) (Price: 58) (Feature: 49) (PhoneNr: 33) • Count vector: (62, 58, 48, 12, 58, 49, 33) • Counts divided by total records: (1.03, 0.97, 0.80, 0.20, 0.97, 0.82, 0.55)
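The last vector on this slide matches each attribute count divided by the 60 records (for example, 62 / 60 ≈ 1.03). A minimal sketch of that normalization, assuming this reading is what the slide intends; the names `ATTRIBUTES` and `normalize_counts` are hypothetical and do not come from the slides:

```python
# Attribute names as listed on the slide.
ATTRIBUTES = ("Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr")


def normalize_counts(counts, total_records):
    """Divide each attribute count by the number of records (assumed reading)."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return [round(count / total_records, 2) for count in counts]


# Counts and record total taken from the slide.
counts = [62, 58, 48, 12, 58, 49, 33]
vector = normalize_counts(counts, 60)

for name, value in zip(ATTRIBUTES, vector):
    print(f"{name}: {value:.2f}")
```

A value above 1 (Year) simply means the attribute was recognized more often than once per record on average.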

  6. Elementary Concepts • Variables are things that we measure, control, or manipulate in research • Multivariate analysis considers multiple variables together as a single unit • The normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"

  7. Multivariate Statistical Analysis • Let A be an application ontology • D be a set of Web documents • $R$ be the set of relevant documents • $\bar{R}$ be the set of irrelevant documents • $X = (X_1, X_2, \ldots, X_p)$ represent a document • $\Omega$ be the set of all possible values that $X$ can take • $\Omega = \Omega_1 \cup \Omega_2$

  8. Expected Cost of Misclassification (ECM) • Two density functions, f1 and f2
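For reference, the standard textbook form of the ECM for two classes can be written as follows. This is a sketch under assumed notation, not the slide's own formula: class 1 is taken to be the relevant documents with density $f_1$, class 2 the irrelevant documents with density $f_2$; $\Omega_1$ and $\Omega_2$ are the regions of document space assigned to each class; $p_1, p_2$ are prior probabilities; and $c(2 \mid 1)$, $c(1 \mid 2)$ are the costs of the two kinds of error.

```latex
\begin{aligned}
P(2 \mid 1) &= \int_{\Omega_2} f_1(x)\,dx, \qquad
P(1 \mid 2) = \int_{\Omega_1} f_2(x)\,dx, \\
\mathrm{ECM} &= c(2 \mid 1)\,P(2 \mid 1)\,p_1 \;+\; c(1 \mid 2)\,P(1 \mid 2)\,p_2 .
\end{aligned}
```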

  9. Classification Rule
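The rule that minimizes the expected cost of misclassification has a well-known general form, sketched here under assumed notation (densities $f_1$ for relevant and $f_2$ for irrelevant documents, priors $p_1, p_2$, and error costs $c(1 \mid 2)$, $c(2 \mid 1)$, none of which are preserved on the slide). A document $x$ is classified as relevant when

```latex
\frac{f_1(x)}{f_2(x)} \;\ge\; \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1},
```

and as irrelevant otherwise. With equal costs and equal priors the right-hand side is 1, so the rule reduces to picking the class with the larger density at $x$.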

  10. Multivariate Normal Density Functions • Assume that the density functions f1 and f2 are normal
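The $p$-dimensional normal density has the standard form below. The symbols $\mu_i$ (mean vector) and $\Sigma_i$ (covariance matrix) for class $i = 1, 2$ are the usual notation and are assumed here, since the slide's own formula is not preserved.

```latex
f_i(x) \;=\; \frac{1}{(2\pi)^{p/2}\,\lvert \Sigma_i \rvert^{1/2}}
\exp\!\left[ -\tfrac{1}{2}\,(x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right],
\qquad i = 1, 2 .
```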

  11. Linear Classification Rule • Assume that the density functions are normal and that the covariance matrices are equal: $\Sigma_1 = \Sigma_2 = \Sigma$ • Document x is classified as relevant if
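With normal densities and a common covariance matrix, taking logarithms of the density ratio gives the standard linear form of the minimum-ECM rule. The sketch below assumes the usual notation ($\mu_1, \mu_2$ for the class means, $\Sigma$ for the shared covariance, priors $p_1, p_2$, costs $c(1 \mid 2)$, $c(2 \mid 1)$); it is the textbook result, not a transcription of the slide. Document $x$ is classified as relevant if

```latex
(\mu_1 - \mu_2)^{\top} \Sigma^{-1} x
\;-\; \tfrac{1}{2}\,(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 + \mu_2)
\;\ge\;
\ln\!\left[ \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1} \right].
```

The quadratic terms in $x$ cancel because both classes share $\Sigma$, which is what makes the rule linear.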

  12. Linear Discriminant Function • Threshold: ?

  13. Parameter Estimations • Suppose we have n1 relevant documents and n2 irrelevant documents such that n1 + n2 ≥ p, where p is the dimension of the vector x

  14. Parameter Estimations (Cont.)
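The usual sample estimates that replace the unknown population parameters are sketched below. The notation is assumed: $x_{ij}$ is the $j$-th training document of class $i$ (1 = relevant, 2 = irrelevant), $\bar{x}_i$ and $S_i$ are the class sample mean and covariance, and $S_{\text{pooled}}$ estimates the common covariance matrix.

```latex
\begin{aligned}
\bar{x}_i &= \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \qquad
S_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^{\top},
\qquad i = 1, 2, \\
S_{\text{pooled}} &= \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2} .
\end{aligned}
```

Note that $S_{\text{pooled}}$ can be inverted only when $n_1 + n_2 - 2 \ge p$, so in practice the training set needs at least that many documents.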

  15. Sample Classification Rule Document x is classified as relevant if
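A common form of the sample rule substitutes the class sample means and the pooled covariance matrix into the linear rule: classify x as relevant when a·x − m ≥ t, where a solves S_pooled a = x̄1 − x̄2, m = ½ a·(x̄1 + x̄2), and t is the log of the cost-and-prior ratio (0 for equal costs and priors). The sketch below implements that form in plain Python. The function names and the tiny two-attribute training set are hypothetical illustrations, not the authors' code or data.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [row[j] - mean[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_rule(relevant, irrelevant):
    """Return (a, m): discriminant coefficients and midpoint term."""
    n1, n2 = len(relevant), len(irrelevant)
    m1, m2 = mean_vector(relevant), mean_vector(irrelevant)
    s1, s2 = scatter_matrix(relevant, m1), scatter_matrix(irrelevant, m2)
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    a = solve(pooled, [m1[j] - m2[j] for j in range(p)])
    midpoint = 0.5 * sum(a[j] * (m1[j] + m2[j]) for j in range(p))
    return a, midpoint


def is_relevant(x, a, midpoint, log_threshold=0.0):
    """Apply the linear rule; log_threshold=0 assumes equal costs and priors."""
    score = sum(ai * xi for ai, xi in zip(a, x))
    return score - midpoint >= log_threshold


# Hypothetical two-attribute training vectors (not data from the slides).
relevant_docs = [(1.0, 0.9), (0.9, 1.0), (1.1, 0.8), (0.8, 1.0)]
irrelevant_docs = [(0.1, 0.2), (0.2, 0.0), (0.0, 0.1), (0.3, 0.2)]

a, midpoint = fit_linear_rule(relevant_docs, irrelevant_docs)
print(is_relevant((1.0, 0.95), a, midpoint))
print(is_relevant((0.1, 0.1), a, midpoint))
```

Solving the linear system directly avoids forming the inverse of the pooled covariance matrix, which is all the rule needs.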

  16. Misclassification Probabilities • Lachenbruch’s “holdout” procedure
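Lachenbruch's holdout procedure is a leave-one-out estimate: each training document is removed in turn, the rule is rebuilt from the remaining documents, and the held-out document is classified. The fraction misclassified in each class estimates the two misclassification probabilities. The sketch below shows the procedure with a pluggable classifier; `holdout_error_rates`, `train_midpoint`, `classify_midpoint`, and the one-dimensional scores are hypothetical stand-ins chosen to keep the example short, not the classifier or data from the slides.

```python
def holdout_error_rates(relevant, irrelevant, train, classify):
    """Leave-one-out misclassification estimates.

    train(relevant, irrelevant) returns a model;
    classify(model, doc) returns True when doc is judged relevant.
    Returns (rate for relevant docs, rate for irrelevant docs, overall rate).
    """
    missed_relevant = 0
    for i, doc in enumerate(relevant):
        model = train(relevant[:i] + relevant[i + 1:], irrelevant)
        if not classify(model, doc):
            missed_relevant += 1

    missed_irrelevant = 0
    for i, doc in enumerate(irrelevant):
        model = train(relevant, irrelevant[:i] + irrelevant[i + 1:])
        if classify(model, doc):
            missed_irrelevant += 1

    n1, n2 = len(relevant), len(irrelevant)
    return (missed_relevant / n1,
            missed_irrelevant / n2,
            (missed_relevant + missed_irrelevant) / (n1 + n2))


def train_midpoint(relevant, irrelevant):
    """Toy one-dimensional model: midpoint between the two class means."""
    mean1 = sum(relevant) / len(relevant)
    mean2 = sum(irrelevant) / len(irrelevant)
    return (mean1 + mean2) / 2


def classify_midpoint(model, doc):
    """Relevant when the score is at or above the midpoint."""
    return doc >= model


# Hypothetical one-dimensional document scores.
relevant_scores = [0.9, 1.0, 1.1, 0.4]
irrelevant_scores = [0.1, 0.2, 0.3, 0.7]

rates = holdout_error_rates(relevant_scores, irrelevant_scores,
                            train_midpoint, classify_midpoint)
print(rates)
```

Because the held-out document never influences the rule that classifies it, these estimates are less optimistic than error rates measured on the training set itself.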

  17. Precision Measure

  18. Experimental Result (Relevant)

  19. Experimental Result (Irrelevant)

  20. Conclusion • Precision: 85% (VSM: 77.5%) • Multivariate statistical analysis • Extensible to classification into multiple categories
