Filtering Multiple-Record Web Documents Based on Application Ontologies
Presenter: L. Xu
Advisor: D. W. Embley
Examples
• D1: Car
• D2: Item for Sale or Rent
Car Ontology
Car[->object];
Car[0..0.975..1] has Year;
Car[0..0.925..1] has Make;
Car[0..0.908..1] has Model;
Car[0..0.45..1] has Mileage;
Car[0..2.1..*] has Feature;
Car[0..0.8..1] has Price;
PhoneNr is for Car[1..1.15..*];
Year matches [4] constant
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
. . .
End;
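One plausible reading of the Year rule can be sketched in Python: find each context match, pull the two-digit extract out of it, then apply the substitution that prefixes "19". The function name `extract_years` and the order of the three steps are assumptions made for illustration; the slides do not show how the ontology's matching engine actually evaluates the rule.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text: str) -> list[str]:
    """Return four-digit years found via the two-digit Year rule (hypothetical helper)."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # Assumed semantics: the extract pattern is applied inside the context match.
        found = EXTRACT.search(ctx.group(0))
        if found:
            # substitute "^" -> "19": prefix the century to the extracted digits.
            years.append(re.sub(r"^", "19", found.group(0)))
    return years


print(extract_years("95, Honda Civic, red"))
print(extract_years("Only $95, firm"))
```

Under this reading, "95," at the start of an ad yields a year, while "$95," does not, because the context pattern rejects a preceding dollar sign.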
Filtering Heuristics
• H1: Density
• H2: Expected-values
• H3: Grouping
H1: Density
• D1: Car
  • Total number of characters: 2048
  • Number of matched characters: 626
  • Density: 0.306
• D2: Item for Sale or Rent
  • Total number of characters: 2671
  • Number of matched characters: 196
  • Density: 0.073
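Density here is the ratio of matched characters to total characters, which reproduces both reported values (626 of 2048 and 196 of 2671). A minimal sketch, with `density` as an illustrative helper name:

```python
def density(matched_chars: int, total_chars: int) -> float:
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


d1 = density(626, 2048)   # car ad
d2 = density(196, 2671)   # items for sale or rent
print(f"D1: {d1:.3f}  D2: {d2:.3f}")
```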
H2: Expected-values
Object set   OV     D1   D2
Year         0.98   16    6
Make         0.93   10    0
Model        0.91   12    0
Mileage      0.45    6    2
Price        0.80   11    8
Feature      2.10   29    0
PhoneNr      1.15   15   11
• D1: 0.996
• D2: 0.567
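The slide does not name the measure behind the two H2 values, but they are consistent with the cosine of the angle between the OV vector (the ontology's expected averages, rounded to two decimals) and each document's vector of match counts. Treat the choice of cosine as an inference from the numbers; the helper name `cosine` is illustrative.

```python
import math

# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a document with no matches has no direction to compare
    return dot / (norm_u * norm_v)


print(f"D1: {cosine(OV, D1):.3f}  D2: {cosine(OV, D2):.3f}")
```

Because cosine ignores vector length, a long page and a short page with the same mix of object sets score the same; only the proportions matter.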
H3: Grouping
• D1
  • Year: 2000, Year: 1989, Make: Subaru, Model: SW
    Number of distinct "one max" objects: 3
  • Price: 1900, Year: 1998, Model: Elantra, Year: 1994
    Number of distinct "one max" objects: 3
  • . . .
  • Grouping factor: 0.865
• D2
  • Year: 1999, Year: 1998, Year: 1960, Mileage: 10000
    Number of distinct "one max" objects: 2
  • Mileage: 401000, Year: 1940, Price: 17500, Year: 10971
    Number of distinct "one max" objects: 3
  • . . .
  • Grouping factor: 0.5
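The per-group counts above can be reproduced by chunking the extracted "one max" values into consecutive groups of four and counting the distinct object sets in each. The normalization in `grouping_factor` (sum of distinct counts divided by groups times group size) is an assumption, since the slide gives only the final factors; both function names are hypothetical.

```python
def distinct_counts(labels: list[str], group_size: int) -> list[int]:
    """Count distinct object-set labels in each consecutive full group."""
    groups = [
        labels[i:i + group_size]
        for i in range(0, len(labels) - group_size + 1, group_size)
    ]
    return [len(set(group)) for group in groups]


def grouping_factor(labels: list[str], group_size: int) -> float:
    """Assumed normalization: average share of distinct labels per group."""
    counts = distinct_counts(labels, group_size)
    if not counts:
        return 0.0
    return sum(counts) / (len(counts) * group_size)


# Only the two groups shown for each example; the elided groups are unknown.
first = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year"]
second = ["Year", "Year", "Year", "Mileage", "Mileage", "Year", "Price", "Year"]
print(distinct_counts(first, 4), distinct_counts(second, 4))
```

The distinct counts match the slide for the groups shown. The reported factors (0.865 and 0.5) cover all groups in each document, including the elided ones, so they cannot be recomputed from this excerpt alone.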
Combining Heuristics
• Decision tree learning algorithm: C4.5
• Learning task: suitability
• Performance measure: accuracy
• Training experience: human-classified documents
• Training set
  • 20 positive examples (from 10 geographical regions of the United States)
  • 30 negative examples
• Test set
  • 10 positive examples
  • 20 negative examples
Generated Rules
• Car application
  H2 <= 0.8767: NO
  H2 > 0.8767: YES
• Obituary application
  H2 <= 0.6793: NO
  H2 > 0.6793:
  |   H1 <= 0.2171: NO
  |   H1 > 0.2171: YES
• Universal rule
  H3 <= 0.625:
  |   H1 <= 0.369: NO
  |   H1 > 0.369:
  |   |   H2 <= 0.6263: NO
  |   |   H2 > 0.6263: YES
  H3 > 0.625: YES
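The three generated trees translate directly into threshold checks. The sketch below uses the thresholds from the slide; the function names are illustrative, and `True` stands for YES (the document suits the ontology).

```python
def classify_car(h2: float) -> bool:
    """Car application rule: a single split on H2."""
    return h2 > 0.8767


def classify_obituary(h1: float, h2: float) -> bool:
    """Obituary application rule: H2 first, then H1."""
    if h2 <= 0.6793:
        return False
    return h1 > 0.2171


def classify_universal(h1: float, h2: float, h3: float) -> bool:
    """Universal rule: H3 first, then H1, then H2."""
    if h3 > 0.625:
        return True
    if h1 <= 0.369:
        return False
    return h2 > 0.6263


# Heuristic values from the earlier slides, assuming the first grouping
# factor (0.865) belongs to D1 and the second (0.5) to D2.
d1 = {"h1": 0.306, "h2": 0.996, "h3": 0.865}
d2 = {"h1": 0.073, "h2": 0.567, "h3": 0.5}
print(classify_car(d1["h2"]), classify_universal(**d1))
print(classify_car(d2["h2"]), classify_universal(**d2))
```

With those values, D1 is accepted by both the car rule and the universal rule, and D2 is rejected by both.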
Experiment Results
• Car application
  • Accuracy: 96.7%
  • Precision: 100%
  • Recall: 91%
• Obituary application
  • Accuracy: 96.7%
  • Precision: 91%
  • Recall: 100%
• Universal rule
  • Accuracy: 93.4%
  • Precision: 84%
  • Recall: 100%
Summary
• Objective: Automatically filter multiple-record web documents
• Approach: Filtering heuristics
  • Density
  • Expected-values
  • Grouping
• Result: ~95% accuracy