
Semiautomatic Generation of Resilient Data Extraction Ontologies

Semiautomatic Generation of Resilient Data Extraction Ontologies. Yihong Ding, Data Extraction Group, Brigham Young University. Sponsored by NSF. Data extraction ontology: the goal is to extract data from web pages; its components are concepts, relations between the concepts, and participation constraints.


Presentation Transcript


  1. Semiautomatic Generation of Resilient Data Extraction Ontologies. Yihong Ding, Data Extraction Group, Brigham Young University. Sponsored by NSF.

  2. Data Extraction Ontology • Goal: extract data from web pages • Components • concepts • relations between the concepts • participation constraints • Resilient • Difficulty: manual ontology generation is costly

  3. Data-Extraction Ontology Generation Procedure [Diagram not preserved; labels: Train, Test, Knowledge Sources, Selection Processing, Extraction Processing, Database]

  4. Knowledge Collection • Assumptions about the knowledge base • general • contains meaningful relationships • pre-existing • in XML or easily converted to XML • Current input • Mikrokosmos ontology [Mik] • auxiliary data frame library

  5. Selection of Concepts
     PROCEDURE ConceptSelection(Tdoc, Kbase)
       SourceDoc = Parse(Tdoc);
       PrimarySelectedConceptsList = MikroSelection(M-Ontology);
       SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
       ConflictHandling();
       SelectedSubgraphGeneration();
     MANY ISSUES: selection strategies, conflict resolution, …
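
The procedure on slide 5 could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the dictionary-based `ontology_lexicon`, and the predicate-based `data_frames` are all hypothetical stand-ins, and the conflict-handling and subgraph-generation steps are omitted because the slides do not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Selection:
    """Concepts selected for one training document."""
    primary: set[str] = field(default_factory=set)    # from the ontology
    secondary: set[str] = field(default_factory=set)  # from the data frames


def parse(tdoc: str) -> list[str]:
    # Hypothetical parser: whitespace tokens, edge punctuation stripped.
    tokens = (tok.strip(".,:;").lower() for tok in tdoc.split())
    return [tok for tok in tokens if tok]


def mikro_selection(tokens, ontology_lexicon):
    # ontology_lexicon (assumed shape): term -> concepts that term names.
    return {c for tok in tokens for c in ontology_lexicon.get(tok, ())}


def data_frame_selection(tokens, data_frames):
    # data_frames (assumed shape): concept -> predicate recognizing a value.
    return {
        concept
        for concept, recognizes in data_frames.items()
        if any(recognizes(tok) for tok in tokens)
    }


def concept_selection(tdoc, ontology_lexicon, data_frames):
    tokens = parse(tdoc)
    # ConflictHandling() and SelectedSubgraphGeneration() are not sketched.
    return Selection(
        primary=mikro_selection(tokens, ontology_lexicon),
        secondary=data_frame_selection(tokens, data_frames),
    )
```

Called with a toy lexicon such as `{"area": ["GeographicalArea"]}` and a data frame such as `{"Area": lambda t: t.replace(",", "").isdigit()}`, the function returns the two candidate concept sets that the later steps would reconcile.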

  6. Basic Selection Strategy • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. • Select from Mikrokosmos Ontology

  7. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  8. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Afghanistan<Nation> • smaller than Texas<USState>. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
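
The name-and-value matching shown on slides 7 and 8 can be imitated with a dictionary lookup. The sketch below is only an illustration: `CONCEPT_NAMES` and `CONCEPT_VALUES` are hand-made toy lexicons seeded from the slide example, not actual Mikrokosmos content, and synonym handling is assumed to amount to adding more keys to the same dictionaries.

```python
import re

# Toy lexicons (assumptions): lowercase term -> candidate concepts.
CONCEPT_NAMES = {
    "area": ["GeographicalArea"],
    "capital": ["CapitalCity", "FinancialCapital"],
    "population": ["Population"],
}
CONCEPT_VALUES = {
    "afghanistan": ["Nation"],
    "texas": ["USState"],
    "kabul": ["CapitalCity"],
    "wheat": ["FoodStuff", "AgriculturalProduct"],
}


def annotate(text: str) -> str:
    """Append a <Concept> tag after every word found in either lexicon."""

    def tag(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        concepts = CONCEPT_NAMES.get(key, []) + CONCEPT_VALUES.get(key, [])
        return word + "".join(f"<{c}>" for c in concepts)

    # Only alphabetic words are looked up; numbers are left to data frames.
    return re.sub(r"[A-Za-z]+", tag, text)


print(annotate("Capital--Kabul,"))
```

Words that appear in neither lexicon pass through unchanged, so the output keeps the original text and only adds tags.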

  9. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  10. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • extract result based on the data frames • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
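
One way to picture data-frame selection is as a set of value recognizers run over the text. In the sketch below each data frame is reduced to a single regular expression; the three patterns are invented for illustration and are not taken from the actual data frame library. Because several recognizers accept the same string, a value can pick up more than one candidate concept, as "648,000" does on the slide.

```python
import re

# Hypothetical data frames: concept name -> pattern recognizing its values.
DATA_FRAMES = {
    "Area": re.compile(r"\b\d{1,3}(?:,\d{3})+\b"),
    "Mileage": re.compile(r"\b\d{1,3}(?:,\d{3})+\b"),
    "Time": re.compile(r"\b\d{1,2}\.\d\b"),
}


def extract(text: str) -> dict[str, list[str]]:
    """Return {matched text: [concepts]} for every data frame that fires."""
    hits: dict[str, list[str]] = {}
    for concept, pattern in DATA_FRAMES.items():
        for match in pattern.finditer(text):
            hits.setdefault(match.group(0), []).append(concept)
    return hits


print(extract("Area: 648,000 sq. km. Population: 17.7 million."))
```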

  11. Document-Level Conflict • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
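
The slide does not define a document-level conflict or say how it is resolved. One reading of the example is that a piece of text has received several competing candidate concepts. Under that assumption, the sketch below only detects such cases in annotated text; the token pattern and the function names are hypothetical, and no resolution strategy is implied.

```python
import re

# A token (letters/digits, with internal "-", "," or ".") followed by tags.
TAGGED = re.compile(r"(\w+(?:[-,.]\w+)*)((?:<[^<>]+>)+)")


def candidate_concepts(annotated: str) -> dict[str, list[str]]:
    """Map each tagged token to the list of concepts attached to it."""
    return {
        token: re.findall(r"<([^<>]+)>", tags)
        for token, tags in TAGGED.findall(annotated)
    }


def ambiguous(annotated: str) -> dict[str, list[str]]:
    """Keep only tokens that carry more than one candidate concept."""
    return {
        token: concepts
        for token, concepts in candidate_concepts(annotated).items()
        if len(concepts) > 1
    }


line = ("Area: 648,000<Area><Mileage> sq. km. "
        "Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>,")
print(ambiguous(line))
```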

  12. Concept-Level Conflict • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000<Area> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million<Population>. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  13. Relation Retrieval • Theoretical solution • all paths in the subgraph • too expensive: NP-complete • Heuristic solution • find the shortest path between any two nodes • set a threshold distance
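
The heuristic on slide 13 (shortest path between any two nodes, cut off at a threshold distance) can be sketched with a depth-limited breadth-first search. The adjacency-dictionary graph, the function name, and the choice to count distance in edges are assumptions made for this illustration; the slides do not say how distance is measured.

```python
from collections import deque


def relations_within(graph, selected, threshold):
    """Return {(a, b): shortest path} for selected pairs within `threshold` edges.

    graph: undirected adjacency dict, node -> iterable of neighbours.
    selected: list of selected concept names.
    """
    found = {}
    for source in selected:
        paths = {source: [source]}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if len(paths[node]) - 1 == threshold:
                continue  # already at the threshold: do not expand further
            for neighbour in graph.get(node, ()):
                if neighbour not in paths:  # first visit is the shortest path
                    paths[neighbour] = paths[node] + [neighbour]
                    queue.append(neighbour)
        for target in selected:
            if target == source or target not in paths:
                continue
            if (target, source) not in found:  # keep one entry per pair
                found[(source, target)] = paths[target]
    return found


toy_graph = {
    "CapitalCity": ["City"],
    "City": ["CapitalCity", "Nation"],
    "Nation": ["City", "Population"],
    "Population": ["Nation"],
}
print(relations_within(toy_graph, ["CapitalCity", "Nation", "Population"], 2))
```

Each search costs time linear in the size of the subgraph reached, so the total work stays polynomial, in contrast to enumerating every path.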

  14. Participation Constraints • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

  15. Participation Constraints (cont.) • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<City>, • Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City> • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*]
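
The two examples suggest that participation constraints follow from how many instances of each concept co-occur in the training document: one capital per nation gives [1:1] on both sides, while several cities in one nation give Nation [1:*]. The sketch below encodes that reading only. The function name is hypothetical, and fixing the minimum at 1 is an assumption, since optional participation cannot be inferred from observed pairs alone.

```python
from collections import Counter


def participation(pairs):
    """Infer [min:max] for both sides of a relation from observed pairs.

    pairs: list of (left_instance, right_instance) tuples, for example
    (city, nation). The minimum is assumed to be 1; the maximum is 1 if
    no instance appears in more than one pair, and "*" otherwise.
    """

    def bounds(counts):
        upper = "1" if max(counts.values()) == 1 else "*"
        return f"[1:{upper}]"

    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    return bounds(left), bounds(right)


cities = [
    ("Kabul", "Afghanistan"),
    ("Kandahar", "Afghanistan"),
    ("Mazar-e-Sharif", "Afghanistan"),
    ("Konduz", "Afghanistan"),
]
print(participation(cities))
```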

  16. Performance Evaluation • Speed of generation • Precision and recall of the generation process • Precision and recall of the generated ontology
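
Precision and recall here follow the standard definitions: precision is the fraction of generated items that are correct, and recall is the fraction of reference items that were generated. A minimal sketch, assuming both the generated ontology and a hand-built reference are available as sets of concept names (the example sets are invented, not results from the presentation):

```python
def precision_recall(generated, reference):
    """Return (precision, recall) of `generated` measured against `reference`."""
    generated, reference = set(generated), set(reference)
    correct = len(generated & reference)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Illustrative data only.
generated = {"Nation", "CapitalCity", "Population", "Price"}
reference = {"Nation", "CapitalCity", "Population", "Area", "City"}
print(precision_recall(generated, reference))
```

The same function applies to relations or constraints if they are represented as hashable tuples.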

  17. Generation Time with Distance Threshold

  18. P&R of Generation Process

  19. Conclusion • Data extraction ontology generated • Knowledge sources exploited • Many issues addressed • Many more to explore
