
Recognizing Ontology-Applicable Multiple-Record Web Documents

David W. Embley, Dennis Ng, Li Xu. Brigham Young University.


Presentation Transcript


  1. Recognizing Ontology-Applicable Multiple-Record Web Documents
     David W. Embley, Dennis Ng, Li Xu
     Brigham Young University

  2. Problem: Recognizing Applicable Documents
     • Document 1: Car Ads
     • Document 2: Items for Sale or Rent

  3. A Conceptual Modeling Solution

  4. Car-Ads Ontology
       Car [->object];
       Car [0:0.975:1] has Year [1:*];
       Car [0:0.925:1] has Make [1:*];
       Car [0:0.908:1] has Model [1:*];
       Car [0:0.45:1] has Mileage [1:*];
       Car [0:2.1:*] has Feature [1:*];
       Car [0:0.8:1] has Price [1:*];
       PhoneNr [1:*] is for Car [1:1.15:*];
       Year matches [4] constant
         { extract "\d{2}";
           context "([^\$\d]|^)[4-9]\d,[^\d]";
           substitute "^" -> "19"; },
       …
       End;
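The Year rule above can be read as: find a two-digit number in a suitable context, then prefix it with "19". The sketch below is one possible reading in Python, not the authors' extraction engine. The two regular expressions are copied from the slide; the function name `extract_years` and the interpretation of `substitute "^" -> "19"` as a string prefix are assumptions made for illustration.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text: str) -> list[str]:
    """Return four-digit years for two-digit years found in a matching context.

    Hypothetical helper: the slide does not say how the rule is executed.
    """
    years = []
    for context_match in CONTEXT.finditer(text):
        # Pull the two digits out of the matched context string.
        value = EXTRACT.search(context_match.group(0))
        if value is not None:
            # Assumed meaning of: substitute "^" -> "19"
            years.append("19" + value.group(0))
    return years


if __name__ == "__main__":
    print(extract_years("Mustang 95, red, $4500"))
    print(extract_years("$95, obo"))
```

Under this reading, a price such as `$95,` is not taken as a year because the context pattern excludes a preceding `$`.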

  5. Recognition Heuristics
     • H1: Density
     • H2: Expected Values
     • H3: Grouping

  6. H1: Density
     • Document 1: Car Ads
     • Document 2: Items for Sale or Rent

  7. H1: Density
     • Car Ads
       • Number of Matched Characters: 626
       • Total Number of Characters: 2048
       • Density: 0.306
     • Items for Sale or Rent
       • Number of Matched Characters: 196
       • Total Number of Characters: 2671
       • Density: 0.073
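The density figures above are the ratio of matched characters to total characters. A minimal sketch of that arithmetic follows. The helper `matched_character_count` is hypothetical: it assumes matches come from regular expressions and that overlapping matches are counted once, neither of which the slide states.

```python
import re


def matched_character_count(text, patterns):
    """Count characters covered by at least one pattern match (assumed rule)."""
    covered = [False] * len(text)
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            for i in range(match.start(), match.end()):
                covered[i] = True
    return sum(covered)


def density(matched_chars, total_chars):
    """H1: matched characters divided by total characters."""
    if total_chars == 0:
        return 0.0
    return matched_chars / total_chars


if __name__ == "__main__":
    # Character counts taken from the slide.
    print(round(density(626, 2048), 3))  # Car Ads
    print(round(density(196, 2671), 3))  # Items for Sale or Rent
```

With the slide's counts this reproduces the reported densities of 0.306 and 0.073.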

  8. H2: Expected Values
     Object set   Document 1: Car Ads   Document 2: Items for Sale or Rent
     Year         3                     1
     Make         2                     0
     Model        3                     0
     Mileage      1                     1
     Price        1                     0
     Feature      15                    0
     PhoneNr      3                     4

  9. H2: Expected Values
     Object set   OV     D1   D2
     Year         0.98   16   6
     Make         0.93   10   0
     Model        0.91   12   0
     Mileage      0.45   6    2
     Price        0.80   11   8
     Feature      2.10   29   0
     PhoneNr      1.15   15   11
     D1: 0.996   D2: 0.567
     [Figure labels: D1, OV, D2]
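The slide does not name the measure behind the values 0.996 and 0.567. Treating OV, D1, and D2 as vectors and taking the cosine of the angle between each document vector and OV reproduces both numbers, so the sketch below assumes cosine similarity. The function name is illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr (from the slide).
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]

if __name__ == "__main__":
    print(round(cosine_similarity(OV, D1), 3))
    print(round(cosine_similarity(OV, D2), 3))
```

Because the cosine ignores vector length, a long document with the expected proportions of values scores as well as a short one.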

  10. H3: Grouping (of 1-Max Object Sets)
      • Document 1: Car Ads
        Year Make Model Price Year Model Year Make Model Mileage …
      • Document 2: Items for Sale or Rent
        Year Mileage … Mileage Year Price Price …

  11. H3: Grouping
      • Car Ads
          Year Year Make Model        3
          Price Year Model Year       3
          Make Model Mileage Year     4
          Model Mileage Price Year    4
          …
        $\frac{3+3+4+4}{4 \times 4} = 0.875$
        Grouping: 0.865
      • Sale Items
          Year Year Year Mileage      2
          Mileage Year Price Price    3
          Year Price Price Year       2
          Price Price Price Price     1
          …
        $\frac{2+3+2+1}{4 \times 4} = 0.500$
        Grouping: 0.500
      • Expected Number in a Group $= \left\lfloor \sum_{\text{1-Max}} \text{Ave} \right\rfloor = 4$ (for our example)
      • Grouping $= \dfrac{\text{Sum of Distinct 1-Max in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$
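The grouping measure above can be sketched as follows, using the four groups shown for each document. This is an illustration under stated assumptions: the expected group size is taken as the floor of the summed averages of the 1-max object sets (Year, Make, Model, Mileage, Price in the ontology), and any incomplete trailing group is dropped. Function names are hypothetical.

```python
import math


def expected_group_size(one_max_averages):
    """Assumed rule: floor of the sum of the 1-max averages."""
    return math.floor(sum(one_max_averages))


def grouping(sequence, group_size):
    """Sum of distinct 1-max object sets per group / (groups * group size)."""
    groups = [sequence[i:i + group_size]
              for i in range(0, len(sequence), group_size)]
    # Assumption: an incomplete trailing group is ignored.
    groups = [g for g in groups if len(g) == group_size]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(g)) for g in groups)
    return distinct_total / (len(groups) * group_size)


CAR_ADS = ("Year Year Make Model Price Year Model Year "
           "Make Model Mileage Year Model Mileage Price Year").split()
SALE_ITEMS = ("Year Year Year Mileage Mileage Year Price Price "
              "Year Price Price Year Price Price Price Price").split()

if __name__ == "__main__":
    n = expected_group_size([0.975, 0.925, 0.908, 0.45, 0.8])
    print(n, grouping(CAR_ADS, n), grouping(SALE_ITEMS, n))
```

For the sixteen values listed per document this gives 0.875 and 0.500, matching the fractions on the slide. The slide's overall Car Ads figure of 0.865 presumably covers groups beyond the four shown.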

  12. Combining Heuristics
      • Decision-Tree Learning Algorithm C4.5
        • (H1, H2, H3, Positive)
        • (H1, H2, H3, Negative)
      • Training Set
        • 20 positive examples
        • 30 negative examples (some purposely similar, e.g., classified ads)
      • Test Set
        • 10 positive examples
        • 20 negative examples

  13. Car Ads: Rule & Results
      • Precision: 100%
      • Recall: 91%
      • Accuracy: 97%
      • Harmonic Mean: 2/(1/Precision + 1/Recall)
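The harmonic-mean formula above, together with the standard definitions of precision, recall, and accuracy, can be sketched as below. The slides do not give the underlying counts, so the functions take them as parameters and the argument names (`tp`, `fp`, `tn`, `fn`) are assumptions for illustration.

```python
def precision(tp, fp):
    """Fraction of documents judged applicable that really are applicable."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of applicable documents that were recognized."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all documents classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def harmonic_mean(p, r):
    """2 / (1/Precision + 1/Recall), as written on the slide."""
    if p == 0 or r == 0:
        return 0.0
    return 2 / (1 / p + 1 / r)


if __name__ == "__main__":
    # Precision and recall values reported on the slide.
    print(round(harmonic_mean(1.00, 0.91), 3))
```

With the reported precision of 100% and recall of 91%, the harmonic mean comes to about 0.953.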

  14. False Negative

  15. Obituaries

  16. Obituaries: Rule & Results
      • Precision: 91%
      • Recall: 100%
      • Accuracy: 97%

  17. False Positive: Missing Person Report

  18. Universal Rule
      • Precision: 84%
      • Recall: 100%
      • Accuracy: 93%

  19. Additional and Future Work
      • Other Approaches
        • Naïve Bayes [McCallum96] (accuracy near 90%)
        • Logistic Regression [Wang01] (accuracy near 95%)
        • Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
      • More Extensive Testing
        • Similar documents (motorcycles, wedding announcements, …)
        • Accuracy drops to near 87%
        • Naïve Bayes drops to near 77%
        • Others … ?
      • Other Types of Documents
        • XML Documents
        • Forms and the Hidden Web
        • Tables

  20. Summary
      • Objective: Automatically Recognize Document Applicability
      • Approach:
        • Conceptual Modeling
        • Recognition Heuristics
          • Density
          • Expected Values
          • Grouping
      • Result: Accuracy Near 95%
      www.deg.byu.edu
