Data-Extraction Ontology Generation by Example. Yuanqiu (Joe) Zhou, Data Extraction Group, Brigham Young University. Sponsored by NSF.
Motivation • Semi-structured Web data must be extracted before it can be manipulated further. • In contrast to other wrapper-generation techniques, the BYU ontology-based data-extraction technique is resilient. • A by-example approach helps ordinary users generate ontologies easily.
Web-based System GUI (screenshot; sample values shown in the form: Canon PowerShot S40, 4.0, 1600 x 1200, 1024 x 768, 640 x 480)
Architecture (diagram; components shown: Extraction Ontology, Data Frame Library, Sample Pages, Ontology Generator, User Defined Form, System GUI, Populated Database, Extraction Engine, Test Pages)
Extraction Ontology • Object and Relationship Sets and Constraints • Extraction Patterns • Keywords and Context Expressions
Ontology Generation: Object and Relationship Sets and Constraints
Base [0:1] A [1:*]
Base [0:2] B [1:*]
Base [0:*] C [1:*]
Base [0:2] D1 [1:*] D2 [1:*]
Base [0:*] E1 [1:*] E2 [1:*]
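As a rough illustration of how declarations like these could be produced from a form description, here is a minimal Python sketch. It assumes, since the slide does not say so explicitly, that the upper bound on Base comes from how many entries a form field accepts and that a multi-column field becomes a single n-ary relationship set. The names `participation`, `relationship_set`, and `form_fields` are hypothetical and not part of the BYU tool.

```python
from typing import Optional, Sequence


def participation(max_count: Optional[int]) -> str:
    """Render a [0:max] participation constraint; None means unbounded (*)."""
    return f"[0:{'*' if max_count is None else max_count}]"


def relationship_set(base: str, columns: Sequence[str], max_count: Optional[int]) -> str:
    """Build one relationship-set declaration for a form field.

    `columns` holds one name for a single-column field and several names
    for a multi-column field (assumed to map to an n-ary relationship set).
    """
    parts = [f"{base} {participation(max_count)}"]
    parts += [f"{name} [1:*]" for name in columns]
    return " ".join(parts)


# Hypothetical form description: (column names, maximum number of entries).
form_fields = [
    (["A"], 1),
    (["B"], 2),
    (["C"], None),
    (["D1", "D2"], 2),
    (["E1", "E2"], None),
]

declarations = [relationship_set("Base", cols, n) for cols, n in form_fields]
for line in declarations:
    print(line)
```

Under these assumptions the printed lines reproduce the five Base declarations listed on the slide.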
Ontology Generation: Object and Relationship Sets and Constraints
A [0:1] F [1:*]
B1 [0:1] G [1:*]
B1, B2 : B
… …
B2 [0:1] H [1:*] I [1:*]
Ontology Generation: Extraction Patterns
• Data Frame Library
  • Lexicons
  • Synonym dictionaries or thesauri
  • Regular expressions
• Matching extraction patterns:
  • Only one
  • More than one (use extraction pattern filters)
  • None (create one)
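The three matching cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny `DATA_FRAME_LIBRARY`, the "prefer the longest pattern" filter, and the "alternation of the sample values" fallback are all placeholders invented here, not the filters or pattern-creation rules of the actual system.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stand-in for a data frame library: name -> regular expression.
DATA_FRAME_LIBRARY: Dict[str, str] = {
    "Integer": r"\d+",
    "SmallNumber": r"\d(\.\d{1,2})?",
    "Resolution": r"\d{3,4} x \d{3,4}",
}


def matching_frames(samples: List[str], library: Dict[str, str]) -> List[str]:
    """Return the names of the patterns that fully match every sample value."""
    return [
        name
        for name, pattern in library.items()
        if all(re.fullmatch(pattern, value) for value in samples)
    ]


def choose_pattern(samples: List[str], library: Dict[str, str]) -> Tuple[str, str]:
    """Return (case, pattern) for the user's sample values."""
    candidates = matching_frames(samples, library)
    if len(candidates) == 1:
        return "only one", library[candidates[0]]
    if candidates:
        # Placeholder filter (assumption): prefer the longest pattern text.
        best = max(candidates, key=lambda name: len(library[name]))
        return "more than one", library[best]
    # Placeholder creation rule (assumption): alternation of the literal samples.
    created = "|".join(re.escape(value) for value in sorted(set(samples)))
    return "none", created


print(choose_pattern(["4.0"], DATA_FRAME_LIBRARY))
print(choose_pattern(["3", "2"], DATA_FRAME_LIBRARY))
print(choose_pattern(["Canon", "Nikon"], DATA_FRAME_LIBRARY))
```

The sample values play the role of the entries a user fills into the form; a real filter would presumably use more evidence than pattern length.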
Ontology Generation: Keywords and Context Expressions • 3.5x optical zoom (2.5x digital) • a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom • optical 3X/digital 6X zoom
Sample Web Page and User Defined Forms (screenshot; values entered in the form: Canon PowerShot G2, 4.0, 2272 x 1074, 3, 2)
Object and Relationship Sets and Constraints
DigitalCamera [-> object]
DigitalCamera [0:1] Brand [1:*]
DigitalCamera [0:1] Model [1:*]
DigitalCamera [0:1] CCDResolution [1:*]
DigitalCamera [0:1] ImageResolution [1:*]
DigitalCamera [0:1] Zoom [1:*]
Zoom [0:1] DigitalZoom [1:*]
Zoom [0:1] OpticalZoom [1:*]
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] Brand [1:*];
DigitalCamera [0:1] ImageResolution [1:*];
DigitalCamera [0:1] Zoom [1:*];
DigitalCamera [0:1] CCDResolution [1:*];
Zoom [0:1] OpticalZoom [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bCCD Resolution\b";
end;
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
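To make the interplay of extract, context, and keyword expressions more concrete, the following Python sketch applies an OpticalZoom-style rule to the zoom phrases from the keywords slide. It is not the BYU extraction engine. Two things are assumptions made here: the extract pattern treats the decimal part as optional so that whole-number values such as 4x are captured, and a keyword counts as "near" a value when it falls inside a fixed character window (`window=12`, a crude stand-in for whatever proximity rule the real engine uses).

```python
import re
from typing import List

# Context: the value together with its trailing "x"; extract: the value alone.
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)
EXTRACT = re.compile(r"\b\d(\.\d)?")
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)


def optical_zoom_values(text: str, window: int = 12) -> List[str]:
    """Return values whose context matches and that have the keyword nearby.

    `window` is a hypothetical proximity threshold in characters.
    """
    values = []
    for ctx in CONTEXT.finditer(text):
        nearby = text[max(0, ctx.start() - window): ctx.end() + window]
        if KEYWORD.search(nearby):
            # Extract the bare number from inside the matched context.
            values.append(EXTRACT.search(ctx.group()).group())
    return values


phrases = [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X/digital 6X zoom",
]
for phrase in phrases:
    print(phrase, "->", optical_zoom_values(phrase))
```

With this window, each phrase yields only its optical value (3.5, 4, and 3). A fixed window is fragile: widening it slightly would also pick up the digital values, which is one reason a real system needs a more careful notion of keyword proximity.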
Measurements • How much of the ontology was generated with respect to how much could have been generated? • How many components generated should not have been generated? • What comparisons can we make about the precision and recall ratios of extraction data between a system-generated ontology and an expert-generated ontology? • How many sample pages are necessary for acceptable system performance?
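The first two questions correspond to recall and precision over ontology components. A minimal sketch of that calculation, assuming each component can be identified by a name and compared as a set, is given below. The component names and the resulting numbers are made-up illustrations, not measurements from this work.

```python
from typing import Set, Tuple


def recall_precision(generated: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Recall: share of expected components that were generated.
    Precision: share of generated components that were expected."""
    correct = generated & expected
    recall = len(correct) / len(expected) if expected else 0.0
    precision = len(correct) / len(generated) if generated else 0.0
    return recall, precision


# Hypothetical component sets, for illustration only.
expected = {"Brand", "Model", "CCDResolution", "ImageResolution", "Zoom"}
generated = {"Brand", "Model", "Zoom", "Price"}

recall, precision = recall_precision(generated, expected)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The same function could be applied to extracted data values when comparing a system-generated ontology with an expert-generated one.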
Contributions • Proposes a by-example approach to semi-automatically generate data-extraction ontologies • Constructs a Web-based tool to generate data-extraction ontologies