A Probabilistic Model for Classification of Multiple-Record Web Documents
June Tang, Yiu-Kai Ng
Overview
• Probabilistic Model
• Bayes decision theory
• Document and query representations
• Ranking-function construction
• Multivariate Statistical Analysis
Approach
• Constructing a ranking function for a probabilistic model based on multivariate statistical analysis
• Minimizing the expected cost of misclassification
• Deriving a classification rule
• Deriving a linear classification rule
• Deriving a sample linear classification rule
Document Representation
Attributes: (Year, Make, Model, Mileage, Price, Feature, PhoneNr)
Total records: 60
Attribute counts: (Year: 62) (Make: 58) (Model: 48) (Mileage: 12) (Price: 58) (Feature: 49) (PhoneNr: 33)
Count vector: (62, 58, 48, 12, 58, 49, 33)
Document vector (counts divided by total records): (1.03, 0.97, 0.80, 0.20, 0.97, 0.82, 0.55)
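The step from the count vector to the final vector on this slide is a division of each attribute count by the total number of records. A minimal Python sketch of that arithmetic follows; the function name `document_vector` is hypothetical, while the attribute names and numbers are the ones shown on the slide.

```python
# Attribute names and counts come from the slide; the helper name is illustrative.
FIELDS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]


def document_vector(counts, total_records):
    """Divide each attribute count by the number of records in the document."""
    return [count / total_records for count in counts]


counts = [62, 58, 48, 12, 58, 49, 33]
vector = document_vector(counts, total_records=60)

for name, value in zip(FIELDS, vector):
    print(f"{name}: {value:.2f}")
```

A value above 1 (Year: 62 matches for 60 records) simply means the attribute was recognized more often than there are records.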
Elementary Concepts
• Variables are things that we measure, control, or manipulate in research
• Multivariate analysis considers multiple variables together as a single unit
• The normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"
Multivariate Statistical Analysis
Let
• $A$ be an application ontology
• $D$ be a set of Web documents
• $R$ be the set of relevant documents
• $\bar{R}$ be the set of irrelevant documents
• $X = (X_1, X_2, \ldots, X_p)$ represent a document
• $\Omega$ be the set of all possible values that $X$ can take, with $\Omega = \Omega_1 \cup \Omega_2$
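Expected Cost of Misclassification (ECM)
$$\mathrm{ECM} = c(2|1)\,P(2|1)\,p_1 + c(1|2)\,P(1|2)\,p_2$$
Here,
• $p_1$ and $p_2$ are the prior probabilities of the relevant and irrelevant populations
• $c(2|1)$ is the cost of classifying a relevant document as irrelevant, and $c(1|2)$ is the cost of the opposite error
• $P(2|1) = \int_{\Omega_2} f_1(x)\,dx$ and $P(1|2) = \int_{\Omega_1} f_2(x)\,dx$, where $\Omega_1$ and $\Omega_2$ are the regions of values of $x$ classified as relevant and irrelevant
Two density functions: $f_1$ (relevant documents) and $f_2$ (irrelevant documents)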
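As a rough illustration of the quantity this slide names, the sketch below evaluates the standard two-class form ECM = c(2|1) P(2|1) p1 + c(1|2) P(1|2) p2 and the decision that minimizes it: classify x as relevant when f1(x)/f2(x) >= (c(1|2)/c(2|1)) (p2/p1). The function names and every numeric value are made up for illustration; the priors, costs, and misclassification probabilities are assumed inputs, not figures from the talk.

```python
def expected_cost(p1, p2, c21, c12, P21, P12):
    """ECM for two classes.

    c21 = c(2|1): cost of calling a relevant (class 1) document irrelevant.
    c12 = c(1|2): cost of calling an irrelevant (class 2) document relevant.
    P21, P12: the corresponding misclassification probabilities.
    """
    return c21 * P21 * p1 + c12 * P12 * p2


def classify_min_ecm(f1_x, f2_x, p1, p2, c21, c12):
    """Return True (relevant) when f1(x)/f2(x) >= (c12/c21) * (p2/p1).

    The inequality is cross-multiplied so that f2(x) == 0 causes no division error.
    """
    return f1_x * c21 * p1 >= f2_x * c12 * p2


# Illustrative numbers only: equal priors, false positives twice as costly.
ecm = expected_cost(p1=0.5, p2=0.5, c21=1.0, c12=2.0, P21=0.1, P12=0.2)
print(ecm)
print(classify_min_ecm(f1_x=0.6, f2_x=0.2, p1=0.5, p2=0.5, c21=1.0, c12=2.0))
```

Raising c(1|2) or p2 raises the bar a document must clear before it is called relevant.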
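Multivariate Normal Density Functions
Assume that the density functions are normal:
$$f_i(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right], \quad i = 1, 2$$
where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of population $i$.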
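To make the normality assumption concrete, here is a small pure-Python sketch that evaluates a p-dimensional normal density with mean vector `mu` and covariance matrix `sigma`. It uses a Cholesky factorization to obtain both the determinant and the quadratic form; that is an implementation choice of this sketch, not something stated on the slide, and the function names are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular L with L @ L^T == a, for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def mvn_density(x, mu, sigma):
    """Multivariate normal density f(x) with mean mu and covariance sigma."""
    p = len(x)
    L = cholesky(sigma)
    # Forward substitution: solve L z = x - mu, so z.z = (x-mu)^T sigma^{-1} (x-mu).
    z = []
    for i in range(p):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / L[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(p))
    return math.exp(-0.5 * (p * math.log(2.0 * math.pi) + log_det + quad))


# At the mean of a standard bivariate normal the density is 1 / (2 * pi).
print(mvn_density([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```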
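Linear Classification Rule
Assume that the density functions are normal and that the covariance matrices are equal, $\Sigma_1 = \Sigma_2 = \Sigma$.
Document $x$ is classified as relevant if
$$(\mu_1 - \mu_2)^T \Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) \ge \ln\left[\frac{c(1|2)}{c(2|1)} \cdot \frac{p_2}{p_1}\right]$$
where $\mu_1$ and $\mu_2$ are the mean vectors of the relevant and irrelevant populations, $p_1$ and $p_2$ are their prior probabilities, and $c(i|j)$ is the cost of assigning a class-$j$ document to class $i$.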
Linear Discrimination Function
Threshold: ?
Parameter Estimation
Suppose we have $n_1$ relevant documents and $n_2$ irrelevant documents such that $n_1 + n_2 \ge p$, where $p$ is the dimension of the vector $x$.
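The usual estimates in this setting are the two sample mean vectors and a pooled sample covariance matrix, S_pooled = [(n1 - 1) S1 + (n2 - 1) S2] / (n1 + n2 - 2). The sketch below computes them in plain Python. The function names and the two-dimensional toy vectors are invented for illustration and are not data from the talk.

```python
def mean_vector(rows):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products (x - mean)(x - mean)^T, i.e. (n - 1) times S."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [row[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s


def pooled_covariance(relevant, irrelevant):
    """Pooled covariance: combined scatter divided by n1 + n2 - 2."""
    m1, m2 = mean_vector(relevant), mean_vector(irrelevant)
    s1, s2 = scatter_matrix(relevant, m1), scatter_matrix(irrelevant, m2)
    dof = len(relevant) + len(irrelevant) - 2
    p = len(m1)
    return [[(s1[a][b] + s2[a][b]) / dof for b in range(p)] for a in range(p)]


# Toy data: three "relevant" and three "irrelevant" document vectors (p = 2).
relevant = [[1.0, 0.9], [0.9, 1.0], [1.1, 0.8]]
irrelevant = [[0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]

print(mean_vector(relevant))
print(mean_vector(irrelevant))
print(pooled_covariance(relevant, irrelevant))
```

Pooling is only appropriate under the equal-covariance assumption of the linear rule.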
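Sample Classification Rule
Document $x$ is classified as relevant if
$$(\bar{x}_1 - \bar{x}_2)^T S_{pooled}^{-1}\, x - \frac{1}{2}(\bar{x}_1 - \bar{x}_2)^T S_{pooled}^{-1} (\bar{x}_1 + \bar{x}_2) \ge \ln\left[\frac{c(1|2)}{c(2|1)} \cdot \frac{p_2}{p_1}\right]$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample mean vectors of the relevant and irrelevant documents, $S_{pooled}$ is the pooled sample covariance matrix, $p_1$ and $p_2$ are the prior probabilities, and $c(i|j)$ is the cost of assigning a class-$j$ document to class $i$.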
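A sketch of how such a sample rule can be evaluated: compute a = S_pooled^{-1} (mean1 - mean2), then compare a.x minus the midpoint term against ln[(c(1|2)/c(2|1)) (p2/p1)]. The means and pooled covariance below are toy values, the function names are hypothetical, and the equal priors and costs are default assumptions of this sketch rather than settings reported in the talk.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def is_relevant(x, mean1, mean2, s_pooled, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Sample linear rule: True when x falls on the 'relevant' side."""
    diff = [u - v for u, v in zip(mean1, mean2)]
    a = solve(s_pooled, diff)  # a = S_pooled^{-1} (mean1 - mean2)
    score = sum(ai * xi for ai, xi in zip(a, x))
    midpoint = 0.5 * sum(ai * (u + v) for ai, u, v in zip(a, mean1, mean2))
    threshold = math.log((c12 / c21) * (p2 / p1))
    return score - midpoint >= threshold


# Toy estimates (p = 2), not values from the talk.
mean1, mean2 = [1.0, 0.9], [0.2, 0.2]
s_pooled = [[0.01, -0.0075], [-0.0075, 0.01]]

print(is_relevant([0.95, 0.85], mean1, mean2, s_pooled))
print(is_relevant([0.25, 0.15], mean1, mean2, s_pooled))
```

Solving the linear system avoids forming the inverse of the pooled covariance explicitly.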
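Misclassification Probabilities
Lachenbruch's "holdout" procedure: each document is left out in turn, the classification rule is built from the remaining $n_1 + n_2 - 1$ documents, and the held-out document is then classified. The estimated error rate is
$$\hat{E}(\mathrm{AER}) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2}$$
where $n_{1M}^{(H)}$ and $n_{2M}^{(H)}$ are the numbers of held-out relevant and irrelevant documents that are misclassified.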
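The holdout idea can be sketched as a leave-one-out loop: drop one document, re-estimate the means and pooled covariance from the rest, classify the dropped document, and count the misses per class. The code below does this with an equal-prior, equal-cost sample linear rule. The toy data, the function names, and those prior and cost settings are assumptions of this sketch.

```python
def mean(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]


def pooled_cov(g1, g2):
    """Pooled sample covariance of two groups (divisor n1 + n2 - 2)."""
    p = len(g1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (g1, g2):
        m = mean(rows)
        for row in rows:
            d = [row[j] - m[j] for j in range(p)]
            for a in range(p):
                for b in range(p):
                    s[a][b] += d[a] * d[b]
    dof = len(g1) + len(g2) - 2
    return [[v / dof for v in row] for row in s]


def solve(a, b):
    """Gaussian elimination with partial pivoting for a @ x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def classify(x, g1, g2):
    """Return 1 (relevant) or 2 (irrelevant); equal priors and costs assumed."""
    m1, m2 = mean(g1), mean(g2)
    a = solve(pooled_cov(g1, g2), [u - v for u, v in zip(m1, m2)])
    score = sum(ai * (xi - 0.5 * (u + v)) for ai, xi, u, v in zip(a, x, m1, m2))
    return 1 if score >= 0 else 2


def holdout_error(g1, g2):
    """Leave each document out, refit on the rest, and count misclassifications."""
    miss1 = sum(classify(x, g1[:i] + g1[i + 1:], g2) != 1 for i, x in enumerate(g1))
    miss2 = sum(classify(x, g1, g2[:i] + g2[i + 1:]) != 2 for i, x in enumerate(g2))
    return miss1, miss2, (miss1 + miss2) / (len(g1) + len(g2))


# Well-separated toy groups (p = 2); not data from the talk.
relevant = [[1.0, 1.0], [1.0, 1.4], [1.4, 1.0], [1.4, 1.4]]
irrelevant = [[0.0, 0.0], [0.0, 0.4], [0.4, 0.0], [0.4, 0.4]]
print(holdout_error(relevant, irrelevant))
```

Because every document is classified by a rule that never saw it, the estimate is less optimistic than the error rate measured on the training documents themselves.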
Conclusion
• Precision: 85% (VSM: 77.5%)
• Multivariate Statistical Analysis
• Extensibility to multiple-category classification