Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model Department of Computer Science Brigham Young University Q Wang November, 2000
Multiple-Record Web Documents-1 Relevant document: a chunk of car-sale ads. Acura Integra 1990 $4,000 (1/27/00) ACURA '90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302) 226-5444. Acura Integra 1992 $5,900 (1/27/00) ACURA '92 Integra RS, white, excellent condition. $5,900. 410-548-1353
Multiple-Record Web Documents-2 Irrelevant document: a chunk of motorcycle ads. '97 HONDA ACE SHADOW 1100cc 4k. Customized. $7.5K/obo 410-465-0870 '97 HONDA CR250 Exc. cond. $3300/OBO. (410) 479-4499
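Application Ontology [Figure: car-ads ontology diagram linking the object set Car to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Edges carry participation constraints; the values visible in the diagram are 1:*, 0:0.975:1, 0:0.8:1, 0:0.925:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, and 0:0.45:1.]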
Document Representation • A set of <index term : term frequency> pairs A1:x1, …, An:xn • A density heuristic value • A grouping heuristic value • P(R|d) is estimated from P(R|x1, …, xn), P(R|Density), and P(R|Grouping)
Independence Assumption Under the independence assumption, the joint probability P(R|Year, …, Make) is broken down into per-term probabilities P(R|Year), …, P(R|Make)
Logistic Regression Input: data from a training set. Processing: a logistic regression package. Output: the fitted model (the coefficients C0 and C1 used on the following slides).
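To make the input/output flow concrete, here is a minimal sketch of fitting a one-variable logistic model in plain Python. It is a hand-rolled gradient-ascent stand-in, not the regression package used in the talk; the function names, the learning rate, and the toy training data are all assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=1.0, steps=5000):
    """Fit P(R|x) = sigmoid(c0 + c1*x) by gradient ascent on the mean log-likelihood."""
    c0, c1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # The residuals y - p drive both gradient components.
        residuals = [y - sigmoid(c0 + c1 * x) for x, y in zip(xs, ys)]
        c0 += lr * sum(residuals) / n
        c1 += lr * sum(r * x for r, x in zip(residuals, xs)) / n
    return c0, c1

# Hypothetical training set: x is a term-frequency value, y is 1 for a relevant document.
train_x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]

c0, c1 = fit_logistic(train_x, train_y)
print(c0, c1)
```

In practice a statistics package would also report standard errors and p-values for each coefficient, which this sketch does not compute.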
Probability Estimation For a test document, the term frequency of index term Make is 0.4514. xMake = 0.4514 P(R| Make) = 1/(1+exp(-(C0+C1 xMake))) = 1/(1+exp(-(8.358+(-1.606*0.4514)))) = 0.9995
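The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `relevance_probability` is made up here, while C0 = 8.358, C1 = -1.606, and xMake = 0.4514 are the values given on the slide.

```python
import math

def relevance_probability(c0: float, c1: float, x: float) -> float:
    """Logistic model from the slide: P(R|x) = 1 / (1 + exp(-(C0 + C1*x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Coefficients and term frequency for the index term Make, as given on the slide.
p_make = relevance_probability(8.358, -1.606, 0.4514)
print(round(p_make, 4))  # prints 0.9995
```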
Probability Fitting Curve P(R|x) = 1/(1+exp(-(C0+C1 x))) [Figure: fitted logistic curve of P(R|x) against x, drawn through the scattered training points (xi, P(R|xi)).]
Relevance Probability Calculation For a Car Sale document in a test set, we have
Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr]
C0 = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
C1 = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
X = [.26, .25, .14, .07, .23, .84, .26, .15, .33]
I = [1, 1, 1, 1, 1, 1, 1, 1, 1]
Y = C0 * I^T + C1 * X^T = 134.111
P(R|d) = 1/(1+exp(-Y)) ≈ 1
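A short Python sketch of this calculation follows. It sums the intercepts, adds the dot product of the slopes with the feature values, and applies the same logistic form used on the Probability Estimation slide. Two assumptions are made: the De entry of C1 is read as -10.1 so that the vector has nine entries, and the coefficients are used exactly as rounded on the slide. With these rounded values the sum comes out near 134.02 rather than the slide's 134.111, presumably because the slide's total was computed from unrounded coefficients.

```python
import math

# Rounded coefficients and feature values as listed on the slide, in the order
# Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr. Reading the De slope as -10.1 is an assumption.
c0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
c1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
x = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

# Y = C0 * I^T + C1 * X^T: the sum of the intercepts plus the dot product of C1 and X.
y = sum(c0) + sum(a * b for a, b in zip(c1, x))

# For a Y this large the logistic function is 1 to within floating-point precision.
p_relevant = 1.0 / (1.0 + math.exp(-y))
print(round(y, 3), p_relevant)
```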
Statistical Information : P-Value • A p-value is a significance indicator. • A large p-value indicates either a bad regression model or a statistically insignificant index term. • We should keep only significant index terms.
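Keeping only the significant index terms amounts to a simple filter on the reported p-values. The sketch below is illustrative only: the p-values are invented, and the 0.05 threshold is a conventional choice that the slide does not specify.

```python
# Hypothetical p-values per index term; real values would come from the regression output.
p_values = {"Year": 0.003, "Make": 0.001, "Mileage": 0.41, "PhoneNr": 0.27}

ALPHA = 0.05  # assumed significance threshold, not stated on the slide

# Keep a term only if its p-value falls below the threshold.
significant_terms = [term for term, p in p_values.items() if p < ALPHA]
print(significant_terms)  # prints ['Year', 'Make']
```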
Dependent Relations • Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together. • Performance can be improved by including significant dependent relations in the relevance probability calculation.
Estimation of relevance probability-2 [Figure: P(R|d) is obtained by multiplication of the component probabilities P(R|Year), …, P(R|Feature), P(R|Density), P(R|Grouping), and P(R|Correlation-1), …, P(R|Correlation-n).]
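Reading the diagram as a product of component probabilities, the combination step can be sketched as follows. The component values are invented for illustration, and only one correlation term is shown.

```python
import math

# Hypothetical component probabilities; real values would come from the fitted models.
components = {
    "Year": 0.98,
    "Feature": 0.95,
    "Density": 0.99,
    "Grouping": 0.97,
    "Correlation-1": 0.96,
}

# P(R|d) as the product of all component probabilities.
p_relevant = math.prod(components.values())
print(p_relevant)
```

Because every factor is at most 1, the product can never exceed the smallest component, so a single weak component pulls the whole estimate down.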
Contribution • We propose a probabilistic model which can accurately classify multiple-record Web documents. • We will study the impact of dependent relations on the performance of our model.