A Synergistic Semantic Annotation Model
December 2007
Yihong Ding, http://www.deg.byu.edu/ding/
Grand challenge: new generation World Wide Web
The current Web
• Enormous amount of content
• Feasible for humans to read/write
• But … there is simply too much content to read
The future Web
• Even more content, but machine-processable
• Feasible for humans and machines to read/write
• Key issue: converting non-machine-processable content to machine-processable content, i.e., semantic annotation
Semantic annotation, the general picture
[Figure labels: AptRental Ontology; Data Extraction/Instance Recognition Engine]
Ontology
• Definition: explicit, formal specifications of conceptualizations
  • Unique identity of each concept
  • Unique identity of each relationship among concepts
  • Logic derivation rules underneath every declared relationship
• Annotation:
  • 533-0293 is-a AptRental:ContactPhone
  • $1250 is-a AptRental:MonthlyRate
  • 533-0293 is-about AptRentalAd-instance-1
  • $1250 is-about AptRentalAd-instance-1
• Ontology:
  • AptRentalAd has ContactPhone
  • AptRentalAd has MonthlyRate
• Logic derivation:
  • To rent the apartment that costs $1250 monthly please call 533-0293. (machine understanding)
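The annotation and ontology statements above can be read as simple triples. The following is a minimal sketch, assuming a hypothetical in-memory representation (the names `ONTOLOGY`, `ANNOTATIONS`, and `derive_ads` are illustrative and do not come from the talk), of how a machine might combine them into the derived sentence:

```python
# Hypothetical encoding of the slide's statements (not the talk's actual data model).
ONTOLOGY = {"AptRentalAd": {"ContactPhone", "MonthlyRate"}}  # "AptRentalAd has ..."

ANNOTATIONS = [
    ("533-0293", "is-a", "AptRental:ContactPhone"),
    ("$1250", "is-a", "AptRental:MonthlyRate"),
    ("533-0293", "is-about", "AptRentalAd-instance-1"),
    ("$1250", "is-about", "AptRentalAd-instance-1"),
]


def derive_ads(annotations, ontology):
    """Group annotated values by ad instance, keyed by the concept they instantiate."""
    concept_of = {}
    for subject, predicate, obj in annotations:
        if predicate == "is-a":
            concept_of[subject] = obj.split(":", 1)[1]  # drop the "AptRental:" prefix
    ads = {}
    for subject, predicate, obj in annotations:
        # Only keep values whose concept the ontology relates to AptRentalAd.
        if predicate == "is-about" and concept_of.get(subject) in ontology["AptRentalAd"]:
            ads.setdefault(obj, {})[concept_of[subject]] = subject
    return ads


ads = derive_ads(ANNOTATIONS, ONTOLOGY)
ad = ads["AptRentalAd-instance-1"]
sentence = (f"To rent the apartment that costs {ad['MonthlyRate']} monthly "
            f"please call {ad['ContactPhone']}.")
```

The point of the sketch is only that the "has" relationships in the ontology tell the machine which annotated values belong together in one ad.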
Automated semantic annotation, methods
• Layout-driven method (e.g. [Mukherjee et al. 03])
• Machine-learning-based method (e.g. [Handschuh et al. 02])
• Rule-based method (e.g. [Dill et al. 03])
• NLP-based method (e.g. [Popov et al. 03])
• Ontology-based method (e.g. [Ding et al. 06])
Data extraction ontology
[Figure: a standard ontology plus an epistemological extension (instance recognizers). The BedroomNr recognizer is shown with an external representation, a context phrase, and an exception phrase, applied to the sample ad below.]
CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, 1700 sq ft. $1250 mo. Call 533-0293
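As a rough illustration of how an instance recognizer built from an external representation, a context phrase, and an exception phrase might work, here is a minimal sketch. The regular expressions and the `recognize` helper are assumptions made for this example; the talk does not give the actual recognizer definitions.

```python
import re

AD = ("CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, "
      "1700 sq ft. $1250 mo. Call 533-0293")

# Hypothetical BedroomNr recognizer; all three patterns are illustrative guesses.
BEDROOM_NR = {
    "external": r"\d+",                         # external representation: what a value looks like
    "context": r"\s*(?:bdrm|bedrooms?|br)\b",   # context phrase: must follow the value
    "exception": r"\s*(?:bath|grg|sq ft)",      # exception phrase: must not follow the value
}


def recognize(text, recognizer):
    """Return candidate values that are followed by a context phrase and no exception phrase."""
    hits = []
    for match in re.finditer(recognizer["external"], text):
        rest = text[match.end():]
        if re.match(recognizer["exception"], rest, re.IGNORECASE):
            continue  # e.g. "2 bath" or "1700 sq ft" is not a bedroom count
        if re.match(recognizer["context"], rest, re.IGNORECASE):
            hits.append(match.group())
    return hits


bedrooms = recognize(AD, BEDROOM_NR)
```

On the sample ad, only the number directly before "bdrm" survives both checks; the other numbers are either rejected by the exception phrase or lack a context phrase.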
Ontology-based annotation
[Figure: the sample ad annotated by four instance recognizers (BedroomNr, BathNr, MonthRate, ContactPhone), each built from an external representation plus a context phrase, context keyword, or feature.]
CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, 1700 sq ft. $1250 mo. Call 533-0293
Ontology-based annotation: strengths and weakness
• Strengths
  • Ignores layout differences
  • Ignores layout changes
  • Less maintenance once built
• Weakness
  • Expensive to build instance recognizers
Layout-driven annotation: strengths and weakness
• Strengths
  • Accurate
  • Simple and straightforward
  • Requires less domain knowledge
• Weakness
  • Expensive layout-pattern maintenance
Problem
• How can we overcome the weaknesses while retaining the strengths at the same time?
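Observation
[Figure: two annotation pipelines. Conceptual annotator (ontology-based annotation): a document and a domain ontology go through extraction to produce an annotated document. Structural annotator (layout-driven annotation): a document, a domain ontology, and layout patterns produce an annotated document. The figure carries the labels "resilient" and "accurate".]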
Synergistic model
[Figure: the conceptual annotator (ontology-based annotation) and the structural annotator (layout-driven annotation) each turn a document into an annotated document. Pattern generation supplies layout patterns to the structural annotator; instance recognizer enrichment feeds back into the domain ontology used for extraction.]
Pattern generation
• Get the annotated outputs from the ontology-based annotator
• Apply HTML-structure analysis and produce a typical layout pattern for each extracted field
• If applicable, produce a sequential dependency between the generated layouts
• If applicable, produce simple heuristic rules such as "if A then B" between the generated layouts
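The first three steps above can be sketched as follows. This is a minimal illustration, assuming that a "layout pattern" is simply the tag path leading to an annotated value; the `PathFinder` class, the sample page, and the path notation are hypothetical and not the talk's actual HTML-structure analysis.

```python
from html.parser import HTMLParser


class PathFinder(HTMLParser):
    """Record the tag path of the first text node containing each annotated value."""

    def __init__(self, values):
        super().__init__()
        self.values = values   # field name -> string annotated by the ontology-based annotator
        self.stack = []        # currently open tags
        self.paths = {}        # field name -> layout pattern, in document order

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags such as <br>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        for field, value in self.values.items():
            if value in data and field not in self.paths:
                self.paths[field] = "/".join(self.stack)


# Made-up page layout for the sample ad.
PAGE = ('<html><body><div class="ad"><h3>CAPITOL HILL</h3>'
        '<p>Luxury 2 bdrm 2 bath</p>'
        '<span class="price">$1250</span>'
        '<span class="phone">533-0293</span></div></body></html>')

annotated = {"MonthlyRate": "$1250", "ContactPhone": "533-0293"}
finder = PathFinder(annotated)
finder.feed(PAGE)
finder.close()

patterns = finder.paths            # one layout pattern per extracted field
field_order = list(finder.paths)   # sequential dependency: order of appearance
```

A real pattern generator would generalize over many annotated pages rather than a single one; the sketch only shows how annotated values can be mapped back to positions in the HTML structure.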
Instance recognizer enrichment
• Get the annotated outputs from the layout-driven annotator
• Apply the results to the current corresponding instance recognizers
  • If recognized, continue
  • Otherwise:
    • for dictionary-type recognizers, insert the new value
    • for regular-expression-type recognizers, try to generate a new regular expression and alert the user to check it
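The decision procedure above might look like the following minimal sketch. The recognizer data structure, the `enrich` function, and especially the character-class generalization used to propose a new regular expression are assumptions made for illustration; the talk does not describe how new expressions are generated.

```python
import re


def enrich(recognizers, field, value):
    """Apply one layout-driven annotation to the instance recognizer of its field."""
    rec = recognizers[field]
    if rec["type"] == "dictionary":
        if value in rec["entries"]:
            return "recognized"
        rec["entries"].add(value)          # dictionary-type: insert the new value
        return "inserted"
    # Regular-expression-type recognizer.
    if any(re.fullmatch(pattern, value) for pattern in rec["patterns"]):
        return "recognized"
    # Naive generalization (an assumption): digits -> \d, letters -> [A-Za-z],
    # everything else kept as an escaped literal.
    candidate = "".join(
        r"\d" if ch.isdigit() else "[A-Za-z]" if ch.isalpha() else re.escape(ch)
        for ch in value
    )
    rec["pending_review"].append(candidate)  # queued so the user can check it
    return "needs_review"


recognizers = {
    "Location": {"type": "dictionary", "entries": {"CAPITOL HILL"}},
    "ContactPhone": {"type": "regex", "patterns": [r"\d{3}-\d{4}"], "pending_review": []},
}

r1 = enrich(recognizers, "Location", "PROVO")
r2 = enrich(recognizers, "ContactPhone", "533-0293")
r3 = enrich(recognizers, "ContactPhone", "(801) 422-7604")
```

Keeping proposed expressions in a review queue rather than activating them directly mirrors the slide's "alert the user to check" step.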
Preliminary results (Apartment Rental domain)
• Ontology-based annotation
  • 90% precision and recall on average for nearly all fields
  • Exceptions: Location and Contact Name
• Layout-driven annotation
  • Nearly 100% precision and recall on Location and Contact Name
  • Lower recall on fields such as BedroomNr
• Pattern generation
  • Works well on well-structured fields such as Location
  • Less successful on semi-structured fields such as BedroomNr
• Instance recognizer enrichment
  • Good results even with poorly constructed initial instance recognizers
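The results above are reported as precision and recall. For reference, a minimal sketch of the standard definitions follows; the sample values are invented purely to exercise the function and are not data from the talk.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold (standard definitions)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up (ad id, BedroomNr) pairs, for illustration only.
gold = {("ad1", "2"), ("ad2", "3"), ("ad3", "1"), ("ad4", "2")}
extracted = {("ad1", "2"), ("ad2", "3"), ("ad3", "4")}

precision, recall = precision_recall(extracted, gold)
```

With these invented pairs, two of the three extracted values are correct and two of the four gold values are found, so an annotator can have high precision while still missing fields, which is the "lower recall" situation the slide describes.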
Summary
• Automatically produce layout patterns using outputs of ontology-based annotation
• Automatically enrich domain-specific instance recognizers using outputs of layout-driven annotation
• A new synergistic annotation model that retains the original strengths and minimizes the original weaknesses
• An annotation system that improves its own performance during execution
Future work
• Dynamically tuning annotation based on user perspectives
• Ensemble of various annotators
• Collaborative annotation
Thank you
• Yihong Ding, ding@cs.byu.edu
• (801) 422-7604
• 2262 TMCB, Brigham Young University
• Provo, UT 84601
• Data Extraction Research Lab at Brigham Young University
  • http://www.deg.byu.edu
• Homepage, my virtual home on Web 1.0
  • http://www.deg.byu.edu/ding/
• Thinking Space, my virtual home on Web 2.0
  • http://yihongs-research.blogspot.com/