
Semiautomatic Generation of Resilient Data-Extraction Ontologies



Presentation Transcript


  1. Semiautomatic Generation of Resilient Data-Extraction Ontologies
     Yihong Ding
     Data Extraction Group, Brigham Young University
     Sponsored by NSF

  2. Introduction
     • Wrapper-driven data extraction
       • Pros: data-source-specified, high performance
       • Cons: lack of resiliency and scalability
     • Ontology-driven data extraction
       • Pros: application-domain-specified, resilient and scalable
       • Cons: hard to create
     • Objective
       • Generating data-extraction ontologies

  3. Generation Architecture
     [Diagram; recoverable labels:]
     • Application Specification: training documents, test documents, clean records
     • Knowledge Preparation: Knowledge Sources, pre-processing, Integrated Knowledge Base
     • Ontology Generation: Domain Allocation (Concept Selection, Relation Retrieval, Constraint Discovery), Data Extraction Ontology
     • Extraction Processing, Result Evaluation, Results Storage
     • "interact if necessary"

  4. Knowledge Base Construction
     • Knowledge Sources
       • Mikrokosmos (K) Ontology
       • Data-Frame Library
       • Additional Lexicons
       • WordNet
     • Integration of Knowledge Base
     [Diagram labels: KNOWLEDGE BASE; K Ontology; Lexicons; Data-Frame Library; Synonym Dictionary (WordNet)]

  5. Application Specification
     Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
     Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
     Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
     Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

  6. Domain Allocation: concept selection
     • Select concepts using string-matching with object values
     • Resolve conflict by context or semantic meanings
     [Diagram labels: keyword "retail"; record "02 Buick Century Pwr Seat, Nada Retail 13,695."; "by keyword identification"; Data Frame Library: <Price>, <Mileage>]
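The two steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the group's implementation: the miniature `DATA_FRAMES` table, its patterns and keyword lists, and the `select_concepts` function are all hypothetical. The sketch reads the slide's example as a value such as 13,695 that matches more than one concept and is then resolved by a nearby keyword such as "retail".

```python
import re

# Hypothetical miniature data-frame library (illustrative only): each
# concept has a value pattern and a few context keywords.
DATA_FRAMES = {
    "Price": {"pattern": r"\d{1,3},\d{3}", "keywords": ("retail", "$", "price")},
    "Mileage": {"pattern": r"\d{1,3},\d{3}", "keywords": ("miles", "mileage", "odometer")},
}


def select_concepts(record, window=15):
    """Return (value, concept) pairs found in one record.

    Step 1: string-match every concept's value pattern against the record.
    Step 2: when one value matches several concepts, keep the concept with
    the most context keywords within `window` characters of the value.
    """
    candidates = {}  # (start, end) span -> list of matching concept names
    for concept, frame in DATA_FRAMES.items():
        for match in re.finditer(frame["pattern"], record):
            candidates.setdefault(match.span(), []).append(concept)

    selected = []
    for (start, end), concepts in sorted(candidates.items()):
        if len(concepts) > 1:
            context = record[max(0, start - window):end + window].lower()
            scored = sorted(
                ((sum(k in context for k in DATA_FRAMES[c]["keywords"]), c)
                 for c in concepts),
                reverse=True,
            )
            # Leave the value unresolved if no keyword helps or the top two tie.
            if scored[0][0] == 0 or scored[0][0] == scored[1][0]:
                continue
            concepts = [scored[0][1]]
        selected.append((record[start:end], concepts[0]))
    return selected


if __name__ == "__main__":
    print(select_concepts("02 Buick Century Custom, Pwr Seat, Nada Retail 13,695"))
```

A real data-frame library would carry far richer recognizers; the point here is only the order of operations, matching first and disambiguating by context second.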

  7. Domain Allocation: relationship retrieval
     • Find paths among selected concept nodes
     • Retrieve cluster representing application domain
     [Diagram nodes: <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>, <YEAR>, <PHONE>, <TEMPORAL-UNIT>]
     Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
     Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
     Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
     Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
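One way to picture "find paths among selected concept nodes" is a breadth-first search over the knowledge-base graph, keeping every node that lies on a shortest path between two selected concepts. The sketch below assumes exactly that; the slide does not say which path-finding method is used. The `EDGES` list is a hypothetical fragment built from the node names on the slide plus one unrelated node (`BUILDING`), and the function names are illustrative.

```python
from collections import deque
from itertools import combinations

# Hypothetical fragment of a knowledge-base graph. The edges are assumed
# for illustration; only the node names (except BUILDING) come from the slide.
EDGES = [
    ("AUTOMOBILE", "MAKE"), ("AUTOMOBILE", "FEATURE"), ("AUTOMOBILE", "PRICE"),
    ("AUTOMOBILE", "PHONE"), ("AUTOMOBILE", "TEMPORAL-UNIT"),
    ("TEMPORAL-UNIT", "YEAR"), ("BUILDING", "PRICE"),
]


def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency dictionary."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the node list from src to dst, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def retrieve_cluster(adj, selected):
    """Union of the nodes on a shortest path between every pair of selected concepts."""
    cluster = set(selected)
    for a, b in combinations(selected, 2):
        path = shortest_path(adj, a, b)
        if path:
            cluster.update(path)
    return cluster


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    print(sorted(retrieve_cluster(adj, ["MAKE", "FEATURE", "PRICE", "YEAR", "PHONE"])))
```

In this toy graph the connecting nodes AUTOMOBILE and TEMPORAL-UNIT are pulled into the cluster even though no record value matched them directly, while BUILDING stays out because it lies on no path between selected concepts.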

  8. Domain Allocation: constraint discovery
     • Discover how many times each object's values participate in each relationship
     • Use the discovered counts as participation constraints
     [Diagram nodes: <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>]
     02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
       AUTOMOBILE [0:1] has MAKE [1:*]
       AUTOMOBILE [0:*] has FEATURE [1:*]
     00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
       AUTOMOBILE [0:1] has PRICE [1:1]
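The counting idea can be illustrated with a short sketch. It assumes the values in each record have already been assigned to concepts (the lists below are hand-labeled from the two records on this slide), and it uses a simple assumed rule: the minimum observed count becomes the lower bound, and any concept seen more than once in a record gets an open upper bound `*`. The slide does not spell out the actual rule, and this sketch covers only the AUTOMOBILE side of each relationship, so its output is not meant to reproduce the constraints shown above.

```python
def discover_constraints(records, concepts):
    """Derive a [min:max] participation constraint per concept.

    `records` is a list of dicts mapping concept name -> list of values
    found in that record. Assumed rule: min is the smallest observed
    count; max is "*" if any record has more than one value, else the
    largest observed count.
    """
    constraints = {}
    for concept in concepts:
        counts = [len(record.get(concept, [])) for record in records]
        lo, hi = min(counts), max(counts)
        constraints[concept] = f"[{lo}:{'*' if hi > 1 else hi}]"
    return constraints


# Hand-labeled values from the two records on the slide (labeling is an
# assumption made for this illustration).
RECORDS = [
    {"MAKE": ["Buick"], "FEATURE": ["green", "pwr seat"], "PRICE": ["11,999"]},
    {"MAKE": ["Buick"], "FEATURE": ["Green"], "PRICE": ["9,319"]},
]

if __name__ == "__main__":
    for concept, bounds in discover_constraints(RECORDS, ["MAKE", "FEATURE", "PRICE"]).items():
        print(f"AUTOMOBILE {bounds} has {concept}")
```

With only two training records the bounds are tight; a record that lacks a concept entirely would push that concept's lower bound to 0.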

  9. Ontology Generation
     • Initial ontology: automatically generated
     • Updated ontology: user tuning
     • Expectation
       • Rejecting existing components is much easier than adding new ones
       • As little modification as possible

  10. Evaluation and Results
     • Evaluation
       • Compare: Generated vs. Expert-created
       • POG (Precision of Ontology Generation)
       • PROG (Pseudo-Recall of Ontology Generation)
       • EPROG (Effective-PROG)
     • Results
       • Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
       • Average POG less than 0.23
       • Lowest EPROG is around 0.70, highest is almost 1.0

  11. Conclusion
     • Exploits existing knowledge
     • Specifies application domain
     • Allocates domain inside the knowledge base
     • Generates a data-extraction ontology
     • Shows effective recall of more than 70% on average
