Department of Computer Science Brigham Young University Q Wang November, 2000

Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Multiple-Record Web Documents (1)
Relevant document: a chunk of car-sale ads
Acura Integra 1990 $4,000 (1/27/00) ACURA '90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302) 226-5444.
Acura Integra 1992 $5,900 (1/27/00) ACURA '92 Integra RS, white, excellent condition. $5,900. 410-548-1353

Multiple-Record Web Documents (2)
Irrelevant document: a chunk of motorcycle ads
'97 HONDA ACESHADOW 1100cc 4k. Customized. $7.5K/obo 410-465-0870
'97 HONDA CR250 Exc. cond. $3300/OBO. (410) 479-4499

Application Ontology
[Figure: car-ads application ontology. Object sets: Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr. Participation-constraint labels on the relationships: 1:*, 0:0.975:1, 0:0.8:1, 0:0.925:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.45:1.]

Document Representation
• A set of <index term : term frequency> pairs: A1:x1, ..., An:xn
• A density heuristic value
• A grouping heuristic value
P(R|d) is computed from P(R|(x1, ..., xn)), P(R|Density), and P(R|Grouping).

Independence Assumption
Under the independence assumption, P(R|(Year, ..., Make)) is computed from the per-term probabilities P(R|Year), ..., P(R|Make).

Logistic Regression
[Figure: data from a training set is the input to a logistic regression package, which produces the output.]
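To make the fitting step concrete, here is a minimal sketch of fitting the two coefficients of a one-variable logistic model, assuming plain gradient ascent on the log-likelihood. It is not the logistic regression package referred to on the slide; the function names `fit_logistic` and `predict` and the small training set are hypothetical and serve only as an illustration.

```python
import math

def predict(c0, c1, x):
    """Logistic model: P(R|x) = 1 / (1 + exp(-(c0 + c1*x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Estimate (c0, c1) by gradient ascent on the average log-likelihood."""
    c0, c1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - predict(c0, c1, x)  # gradient contribution of one sample
            g0 += err
            g1 += err * x
        c0 += lr * g0 / n
        c1 += lr * g1 / n
    return c0, c1

# Hypothetical training set: term frequency x and relevance label y (1 = relevant).
xs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

c0, c1 = fit_logistic(xs, ys)
print(c0, c1)
```

The labels overlap on purpose: with perfectly separable data the maximum-likelihood coefficients grow without bound, so a real package would be a safer choice than this loop.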

Probability Estimation
For a test document, the term frequency of index term Make is 0.4514, so xMake = 0.4514.
P(R|Make) = 1/(1+exp(-(C0 + C1*xMake))) = 1/(1+exp(-(8.358 + (-1.606*0.4514)))) = 0.9995
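The arithmetic for the Make term can be reproduced directly. The sketch below uses the coefficients and term frequency quoted above; the function name `relevance_probability` is a hypothetical helper, not part of the original work.

```python
import math

def relevance_probability(c0: float, c1: float, x: float) -> float:
    """Logistic estimate P(R|x) = 1 / (1 + exp(-(c0 + c1*x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Values for index term Make, taken from the slide.
p_make = relevance_probability(8.358, -1.606, 0.4514)
print(round(p_make, 4))  # 0.9995
```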

Probability Fitting Curve
P(R|x) = 1/(1+exp(-(C0 + C1*x)))
[Figure: fitted curve of P(R|x) against x, with data points (xi, P(R|xi)) marked by asterisks.]

Relevance Probability Calculation
For a Car Sale document in a test set, we have
Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr]
C0 = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
C1 = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
X = [.26, .25, .14, .07, .23, .84, .26, .15, .33]
I = [1, 1, 1, 1, 1, 1, 1, 1, 1]
Y = C0 * I^T + C1 * X^T = 134.111
P(R|d) = 1/(1+exp(-Y)) ≈ 1
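The calculation above can be checked with a few lines of Python. This is a sketch under two assumptions: the eighth C1 entry is read as -10.1 so that C1 has nine entries like the other vectors, and the probability uses the same logistic form as the single-term estimate, 1/(1+exp(-Y)). With the rounded coefficients shown on the slide the sum comes to about 134.0 rather than 134.111; the gap is presumably due to rounding in the displayed values.

```python
import math

index = ["Ye", "Ma", "Mo", "Mi", "Pr", "Fe", "Ph", "De", "Gr"]
c0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
# Assumption: the eighth entry is -10.1 (the slide text shows "-10,1").
c1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
x = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

assert len(index) == len(c0) == len(c1) == len(x)

# Y = C0 * I^T + C1 * X^T, where I is a vector of ones.
y = sum(c0) + sum(a * b for a, b in zip(c1, x))

# Logistic transform of the combined score.
p = 1.0 / (1.0 + math.exp(-y))
print(y, p)
```

Because Y is large and positive, exp(-Y) is negligible and the probability is 1 to within floating-point precision.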

Statistical Information: P-Value
• A p-value is a significance indicator.
• A large p-value indicates either a bad regression model or a statistically insignificant index term.
• We should keep only significant index terms.

Dependent Relations
• Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together.
• Performance can be improved by including significant dependent relations in the relevance probability calculation.

Comparison

Contribution
• We propose a probabilistic model that can accurately classify multiple-record Web documents.
• We will study the impact of dependent relations on the performance of our model.
