Conceptual-Model-Based Web Data Extraction by Example • Yuanqiu (Joe) Zhou • Data Extraction Group, Brigham Young University • Sponsored by NSF
Motivation • Data-rich Websites in abundance • Conceptual-Model-Based Methodology is resilient • “By Example” approach is user-friendly
“By Example” Approach • Web users specify desired information by creating a form • Users collect sample pages on the Web • An ontology generator learns the task by analyzing the form and the sample pages • Interactions may be needed to improve or complete the ontology
Architecture (diagram; components shown) • User • GUI • User Created Form • Sample Pages • Data Frame Libraries • Ontology Generator • Extraction Ontology • Extraction Engine • Target Pages • Populated Database
Sample Web Page and User Created Form (figure) • Example camera: Canon PowerShot G2 • Form title: Digital Camera • Form fields: Brand • Model • CCD Resolution: 4.0 • Image Resolution: 2272 x 1074 • Optical Zoom: 3 • Digital Zoom: 2
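As a rough illustration of how a form like this could seed an ontology, the sketch below turns a form title and its field labels into declarations written in the notation of the Relationship Set and Constraints slide. The function name `form_to_declarations` and the fixed `[0:1]` / `[1:*]` constraints are assumptions made for illustration; the slides do not describe the ontology generator's actual logic.

```python
def form_to_declarations(title, fields):
    """Hypothetical helper: map a form title and field labels to declarations."""
    # Object names drop the spaces used in the form labels.
    primary = title.replace(" ", "")
    lines = [f"{primary} [-> object];"]
    for field in fields:
        name = field.replace(" ", "")
        # Assumption: every field uses the same participation constraints.
        lines.append(f"{primary} [0:1] has {name} [1:*];")
    return lines


declarations = form_to_declarations(
    "Digital Camera",
    ["Brand", "Model", "CCD Resolution", "Image Resolution",
     "Optical Zoom", "Digital Zoom"],
)
print("\n".join(declarations))
```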
Extraction Ontology • Relationship Set and Constraints • Extraction Patterns • Keywords • Context Expressions
Relationship Set and Constraints • Primary Object Name: DigitalCamera • Other Objects’ Names: Brand, Model, CCDResolution, ImageResolution, OpticalZoom, DigitalZoom • Participation Constraints: [0:1] and [1:*]
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
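To make the pieces of each declaration concrete, here is a minimal sketch that reads the relationship-set lines into structured records. The `Relationship` class, the `DECL` regular expression, and `parse_relationships` are hypothetical names introduced for this example; they are not part of the extraction system described in the slides.

```python
import re
from dataclasses import dataclass


@dataclass
class Relationship:
    subject: str       # primary object, e.g. DigitalCamera
    subject_card: str  # participation constraint on the subject side
    obj: str           # related object, e.g. Brand
    object_card: str   # participation constraint on the object side


# Matches declarations such as: DigitalCamera [0:1] has Brand [1:*];
DECL = re.compile(
    r"(\w+)\s*\[(\d+:[\d*]+)\]\s*has\s*(\w+)\s*\[(\d+:[\d*]+)\]\s*;"
)


def parse_relationships(text):
    """Return one Relationship per 'X [a:b] has Y [c:d];' declaration."""
    return [Relationship(*m.groups()) for m in DECL.finditer(text)]


source = """
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
"""
relationships = parse_relationships(source)
for rel in relationships:
    print(rel)
```

The primary-object line (`[-> object]`) is deliberately skipped by the pattern, since it names the object of interest rather than a relationship.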
Extraction Patterns From Data Frame Libraries • Data Frame Libraries: Lexicons, Synonym Dictionary, Regular Expressions • Extraction Patterns: Lexicons for Brand and Model; Regular Expressions for numbers and image resolution
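A lexicon-based extraction pattern can be sketched as a word list compiled into a single regular expression. The brand names below are the ones listed in the Extraction Ontology slides; the names `BRAND_LEXICON`, `lexicon_pattern`, and `find_brands` are assumptions for illustration, not the data frame library's interface.

```python
import re

# Brand lexicon taken from the Extraction Ontology slides.
BRAND_LEXICON = ["Nikon", "Canon", "Olympus", "Minolta", "Sony"]


def lexicon_pattern(words):
    """Compile a lexicon into one whole-word alternation pattern."""
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternation + r")\b")


BRAND_PATTERN = lexicon_pattern(BRAND_LEXICON)


def find_brands(text):
    """Return every lexicon entry that occurs in the text, in order."""
    return BRAND_PATTERN.findall(text)


print(find_brands("The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD"))
```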
Extraction Patterns From Data Frame Libraries • Features a high-quality 4.0 Megapixel Resolution CCD • The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD • 3 effective megapixel
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
Keywords • Features a high-quality 4.0 Megapixel Resolution CCD • The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD • 3 effective megapixel
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
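The role of the keywords can be sketched in a few lines: the extract pattern proposes numeric candidates, and the keywords decide whether a sentence is about CCD resolution at all. The function name `extract_ccd_resolution`, the sentence-level gating, and the case-insensitive keyword matching (needed for the lowercase "megapixel" in the third sample) are assumptions; the slides do not state how the extraction engine combines the two.

```python
import re

# Patterns copied from the CCDResolution data frame.
VALUE = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [
    re.compile(pattern, re.IGNORECASE)  # assumption: case-insensitive
    for pattern in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")
]


def extract_ccd_resolution(sentence):
    """Return numeric candidates only if a keyword appears in the sentence."""
    if not any(keyword.search(sentence) for keyword in KEYWORDS):
        return []
    return [match.group(0) for match in VALUE.finditer(sentence)]


samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]
for sample in samples:
    print(extract_ccd_resolution(sample))
```

Note that "995" is never a candidate: the pattern allows a single digit before the optional decimal part, and the trailing `\b` cannot match between two digits.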
Context Expressions • 3.5x optical zoom (2.5x digital) • a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom • optical 3X/digital 6X zoom
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
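A context expression narrows where the extract pattern is allowed to match. The sketch below first finds each context match (a number followed by "x") and then pulls the value out of that match only. The function name `zoom_candidates` and the case-insensitive context matching (to accept "3X") are assumptions; choosing between the optical and digital candidates with the `optical` keyword is left out because the slides do not specify that step.

```python
import re

# Patterns copied from the OpticalZoom data frame.
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)  # assumption: ignore case
EXTRACT = re.compile(r"\b\d(\.\d)?")


def zoom_candidates(text):
    """Return the values found inside each context match, in order."""
    values = []
    for context_match in CONTEXT.finditer(text):
        value = EXTRACT.search(context_match.group(0))
        if value:
            values.append(value.group(0))
    return values


text = "Coolpix 995 with 4x zoom"
# Without the context expression, the bare extract pattern also picks up
# the leading digit of the model number.
bare = [match.group(0) for match in EXTRACT.finditer(text)]
print(bare)
print(zoom_candidates(text))
print(zoom_candidates("3.5x optical zoom (2.5x digital)"))
```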
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
           { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d"; context "\b\d(x)\b"; };
  keyword "\boptical\b";
end;
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
           { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
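Putting the data frames together, the following sketch shows one way a generic engine could fill a record from a page. It is only an illustration under stated assumptions: the `DATA_FRAMES` dictionary layout, the `extract_record` function, line-by-line processing, keyword gating per line, case-insensitive keyword and context matching, and "first match wins" are all choices made for this example, not the behavior of the extraction engine in the slides. The sample lines are adapted from the form values and sample sentences shown earlier.

```python
import re

# Hypothetical in-memory form of the data frames from the ontology.
DATA_FRAMES = {
    "Brand": {
        "extract": [r"\bNikon\b", r"\bCanon\b", r"\bOlympus\b",
                    r"\bMinolta\b", r"\bSony\b"],
        "keywords": [],
    },
    "CCDResolution": {
        "extract": [r"\b\d(\.\d{1,2})?\b"],
        "keywords": [r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b"],
    },
    "ImageResolution": {
        "extract": [r"\b\d{4}(\s)?(x)(\s)?\d{4}\b"],
        "keywords": [r"\bResolution\b", r"\bImage\b"],
    },
    "OpticalZoom": {
        "extract": [r"\b\d(\.\d)?"],
        "context": r"\b\d(\.\d)?(x)\b",
        "keywords": [r"\boptical\b"],
    },
}


def first_value(frame, line):
    """Return the first value the frame extracts from one line, or None."""
    keywords = frame["keywords"]
    if keywords and not any(re.search(k, line, re.IGNORECASE) for k in keywords):
        return None  # assumption: a keyword must appear on the same line
    scope = line
    if "context" in frame:
        context = re.search(frame["context"], line, re.IGNORECASE)
        if context is None:
            return None
        scope = context.group(0)  # extract only inside the context match
    for pattern in frame["extract"]:
        match = re.search(pattern, scope)
        if match:
            return match.group(0)
    return None


def extract_record(lines, frames):
    """Fill at most one value per object set (assumption: first match wins)."""
    record = {}
    for name, frame in frames.items():
        for line in lines:
            value = first_value(frame, line)
            if value is not None:
                record[name] = value
                break
    return record


page = [
    "Canon PowerShot G2",
    "Features a high-quality 4.0 Megapixel CCD",
    "Image Resolution 2272 x 1074",
    "3x optical zoom",
]
record = extract_record(page, DATA_FRAMES)
print(record)
```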
Summary and Future Work • The example indicates that the approach is feasible • Some open questions need to be explored