This research explores semiautomatic generation of resilient data extraction ontologies by leveraging existing knowledge and specifying application domains. The approach combines wrapper-driven and ontology-driven data extraction techniques to create ontologies that are resilient and scalable. The generated ontologies are evaluated and compared against expert-created ontologies in three testing domains, showing an effective recall of over 70% on average.
Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF
Introduction
• Wrapper-driven data extraction
  • Pros: data-source-specified, high performance
  • Cons: lack of resiliency and scalability
• Ontology-driven data extraction
  • Pros: application-domain-specified, resilient and scalable
  • Cons: hard to create
• Objective
  • Generating data-extraction ontologies
Generation Architecture
[Diagram; labels only]
• Knowledge Preparation: Knowledge Sources, pre-processing, Integrated Knowledge Base
• Application Specification: training documents, pre-processing, clean records
• Domain Allocation: Concept Selection, Relation Retrieval, Constraint Discovery
• Ontology Generation: Data Extraction Ontology (interact if necessary)
• Evaluation path: test documents, Extraction Processing, Results Storage, Result Evaluation
Knowledge Base Construction
• Knowledge Sources
  • Mikrokosmos (K) Ontology
  • Data-Frame Library
  • Additional Lexicons
  • WordNet
• Integration of Knowledge Base
[Diagram: KNOWLEDGE BASE combining the K Ontology, Lexicons, Data-Frame Library, and Synonym Dictionary (WordNet)]
Application Specification
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
Domain Allocation: concept selection
• Select concepts using string-matching with object values
• Resolve conflicts by context or semantic meanings
[Figure: in "02 Buick Century Pwr Seat, Nada Retail 13,695.", the value is matched against <Price> and <Mileage> in the Data Frame Library; the conflict is resolved by keyword identification ("retail")]
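The two steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `DATA_FRAMES` table, its regular expressions, its keyword sets, and the two-word context window are all assumptions made up for the example.

```python
import re

# Hypothetical miniature data-frame library (an assumption for illustration):
# concept name -> (pattern for values, context keywords that support the concept).
DATA_FRAMES = {
    "Price": (re.compile(r"\$?\d{1,3},\d{3}"), {"retail", "price"}),
    "Mileage": (re.compile(r"\d{1,3},\d{3}"), {"miles", "mi", "mileage"}),
}

def select_concept(record, value):
    """Return (candidate concepts, chosen concept) for one value in a record."""
    # Step 1: string-matching of the object value against every data frame.
    candidates = [name for name, (pattern, _) in DATA_FRAMES.items()
                  if pattern.fullmatch(value)]
    start = record.find(value)
    if start < 0 or not candidates:
        return candidates, None
    # Step 2: resolve conflicts by context, here the two words on either side.
    before = re.findall(r"[a-z]+", record[:start].lower())[-2:]
    after = re.findall(r"[a-z]+", record[start + len(value):].lower())[:2]
    context = set(before + after)
    scores = {name: len(DATA_FRAMES[name][1] & context) for name in candidates}
    # Ties fall back to the first candidate in library order.
    return candidates, max(scores, key=scores.get)

record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250"
candidates, chosen = select_concept(record, "13,695")
print(candidates, chosen)
```

Here the value matches both patterns, and the nearby keyword "retail" tips the choice toward the price concept.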
Domain Allocation: relationship retrieval
• Find paths among selected concept nodes
• Retrieve cluster representing application domain
[Figure: selected concept nodes <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>, <YEAR>, <PHONE>, <TEMPORAL-UNIT>, shown with the sample records]
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
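One way to picture path finding and cluster retrieval is a breadth-first search over the knowledge-base graph. The sketch below is only an assumption about how this could work: the edges in `GRAPH` are invented for illustration (the slides show the node names but not the links), and taking the union of pairwise shortest paths is one possible reading of "retrieve cluster".

```python
from collections import deque
from itertools import combinations

# Hypothetical fragment of the knowledge-base graph (edges are assumptions).
GRAPH = {
    "AUTOMOBILE": {"MAKE", "FEATURE", "PRICE", "YEAR", "VEHICLE"},
    "MAKE": {"AUTOMOBILE"},
    "FEATURE": {"AUTOMOBILE"},
    "PRICE": {"AUTOMOBILE"},
    "YEAR": {"AUTOMOBILE", "TEMPORAL-UNIT"},
    "TEMPORAL-UNIT": {"YEAR"},
    "VEHICLE": {"AUTOMOBILE", "BOAT"},
    "BOAT": {"VEHICLE"},
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

def retrieve_cluster(graph, selected):
    """Union of the nodes on shortest paths between every pair of selected concepts."""
    cluster = set(selected)
    for a, b in combinations(selected, 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster

cluster = retrieve_cluster(GRAPH, ["MAKE", "PRICE", "TEMPORAL-UNIT"])
print(sorted(cluster))
```

Nodes that lie on no connecting path (VEHICLE, BOAT, and FEATURE in this toy graph) stay outside the cluster.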
Domain Allocation: constraint discovery
• Discover the participation times for each object value
• Specify the discovered values as participation constraints
[Figure: concept nodes <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>]
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
  AUTOMOBILE [0:1] has MAKE [1:*]
  AUTOMOBILE [0:*] has FEATURE [1:*]
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
  AUTOMOBILE [0:1] has PRICE [1:1]
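A rough sketch of the counting idea, under stated assumptions: the per-record annotations below are hand-made for illustration, and the rule (observed minimum and maximum number of values per record, with any maximum above 1 generalized to `*`) is a guess at the procedure, not the documented one. Bounds observed on a tiny sample can be tighter than the ones on the slide.

```python
# Hand-annotated values per record (hypothetical annotations for illustration).
RECORDS = [
    {"FEATURE": ["Sunfire Red", "CD", "AC", "PW", "PL"], "PRICE": ["$10,800"]},
    {"MAKE": ["Buick"], "FEATURE": ["green", "pwr seat"], "PRICE": ["$11,999"]},
    {"MAKE": ["Buick"], "FEATURE": ["Green"], "PRICE": ["$9,319"]},
]

def participation_constraints(records, concepts):
    """Observed [min:max] number of values each record holds per concept.

    Assumption: a maximum greater than 1 is written as '*' (unbounded).
    """
    constraints = {}
    for concept in concepts:
        counts = [len(record.get(concept, [])) for record in records]
        low, high = min(counts), max(counts)
        constraints[concept] = f"[{low}:{'*' if high > 1 else high}]"
    return constraints

constraints = participation_constraints(RECORDS, ["MAKE", "FEATURE", "PRICE"])
for concept, bounds in constraints.items():
    print(f"AUTOMOBILE {bounds} has {concept}")
```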
Ontology Generation
• Initial ontology: automatically generated
• Updated ontology: user tuning
• Expectation
  • Rejecting existing components is much easier than adding new ones
  • As little modification as possible
Evaluation and Results
• Evaluation
  • Compare: Generated vs. Expert-created
  • POG (Precision of Ontology Generation)
  • PROG (Pseudo-Recall of Ontology Generation)
  • EPROG (Effective-PROG)
• Results
  • Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
  • Average POG less than 0.23
  • Lowest EPROG is around 0.70, highest is almost 1.0
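The slides name these measures without defining them. Assuming POG and PROG behave like ordinary precision and recall over sets of ontology components (an assumption, not the authors' stated definitions), the comparison could be sketched as below. EPROG is left out because nothing here says how it is computed, and the component sets are invented examples.

```python
def precision_recall(generated, expert):
    """Set-based precision and recall of generated components against expert ones.

    Assumption: components are compared by exact name.
    """
    shared = generated & expert
    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(expert) if expert else 0.0
    return precision, recall

# Hypothetical component sets for a used-auto-ads ontology.
generated = {"Make", "Price", "Year", "Phone", "Color", "Engine"}
expert = {"Make", "Price", "Year", "Phone", "Mileage"}

pog_like, prog_like = precision_recall(generated, expert)
print(f"precision-like: {pog_like:.2f}, recall-like: {prog_like:.2f}")
```

A generated ontology that over-includes concepts scores low on the precision-like measure while keeping a high recall-like score, which fits the expectation that rejecting components is easier than adding them.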
Conclusion
• Exploits existing knowledge
• Specifies application domain
• Allocates domain inside the knowledge base
• Generates a data-extraction ontology
• Shows effective recall of more than 70% on average