
Semiautomatic domain model building from text-data

Petr Šaloun, Petr Klimánek, Zdenek Velart. SMAP 2011, Vigo, Spain, December 1-2, 2011.


Presentation Transcript


  1. Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain, December 1-2, 2011

  2. Introduction and goals • The basic tasks in creating a domain model: • selection of domain and scope • consideration of reusability • finding important terms • defining classes and class hierarchy • defining properties of classes and constraints • creation of instances of classes • Goals • designing a method for semiautomatic domain model creation • different input documents • different languages • design and implementation of a tool

  3. State of the art • Algorithms and tasks work with a domain model • different document formats • different languages • domain model • concepts, relations • domain model creation is time-consuming • manual creation • automatic creation • semiautomatic creation

  4. Tools and methods • natural language processing – NLP • Stanford NLP • Stanford Parser • Stanford POS tagger • Stanford Named Entity Recognizer • multi-language environment – Google Translate • WordNet (synsets) • Tool – Java, SWING, XML, jTidy, JAWS, SNLP, JUNG

  5. Processing of text documents • input: <html><body><p>An integer character constant has type int.</p></body></html> • tagged output: An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
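The extraction step on this slide (HTML in, plain sentence out) can be sketched with the Java standard library alone. The slides list jTidy for this job; the regex-based stripper below is only a simplified stand-in, and the class name `HtmlText` and method `stripTags` are hypothetical. It does not decode HTML entities or skip script and style content.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a naive stand-in for the jTidy-based extraction
// mentioned in the slides. Not the authors' implementation.
final class HtmlText {
    // Anything between '<' and '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlText() {}

    // Replaces tags with spaces, then collapses runs of whitespace.
    public static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }
}
```

Applied to the slide's input, `HtmlText.stripTags` yields the bare sentence that is then passed on to the tagger.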

  6. Processing of text documents - extraction, cleaning, translation • input TXT, HTML, PDF • removal of occurrences of special characters using regular expressions • numeric designation of chapters and references • removal of single-letter prepositions (\\s+[^Aa\\s\\.]{1})+\\s+ • parentheses, dashes, and others • translation into English – the tools work only with English text • Google Translate
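The cleaning step can be sketched as a short chain of regular-expression replacements. Only `SINGLE_CHARS` is the pattern quoted on the slide. The patterns for chapter numbering and for parentheses and dashes are assumptions added for illustration, as are the class name `TextCleaner` and the order of the steps.

```java
import java.util.regex.Pattern;

// Hypothetical cleaning helper; only SINGLE_CHARS comes from the slide.
final class TextCleaner {
    // From the slide: one or more isolated single characters (except A/a,
    // whitespace and '.') together with the surrounding whitespace.
    private static final Pattern SINGLE_CHARS =
            Pattern.compile("(\\s+[^Aa\\s\\.]{1})+\\s+");
    // Assumption: chapter numbering such as "6.5.16 " at the start of a line.
    private static final Pattern SECTION_NUMBER =
            Pattern.compile("(?m)^\\s*\\d+(\\.\\d+)*\\.?\\s+");
    // Assumption: parentheses, square brackets, en and em dashes.
    private static final Pattern BRACKETS_AND_DASHES =
            Pattern.compile("[()\\[\\]\\u2013\\u2014]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TextCleaner() {}

    public static String clean(String text) {
        String s = SECTION_NUMBER.matcher(text).replaceAll("");
        s = BRACKETS_AND_DASHES.matcher(s).replaceAll(" ");
        s = SINGLE_CHARS.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

Note that the slide's pattern removes any isolated single character other than `A` or `a`, not only prepositions, so one-letter identifiers such as `x` are dropped as well.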

  7. Processing of text documents - annotation • Stanford CoreNLP • Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer • machine learning over large data, maximum-entropy statistical model • pre-trained models included • Activities • tokenization • sentence splitting • POS (part-of-speech) tagging • lemmatization • NER - Named Entity Recognition

  8. Example • input: <html><body><p>An integer character constant has type int.</p></body></html> • tagged output: An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

  9. Mining concepts • tokens marked by the POS tagger as nouns are the first concept candidates • single-word or multi-word nouns • identifying a token as a concept by disambiguation against WordNet • assigning a synset – automatic, manual • using the domain term for searching • possible selection of an incorrect synset (one with a different meaning)
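The first step of concept mining, picking noun tokens as candidates, can be sketched on the `word/TAG` output shown in the example slide. The grouping rule used here (each maximal run of consecutive `NN*` tokens becomes one multi-word candidate) is an assumption; the slides do not say how multi-word nouns are assembled. The class name `ConceptCandidates` is hypothetical, and WordNet disambiguation is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: noun runs from "word/TAG" text as concept candidates.
final class ConceptCandidates {
    private ConceptCandidates() {}

    public static List<String> fromTagged(String tagged) {
        List<String> candidates = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {          // not in word/TAG form, ends the run
                flush(run, candidates);
                continue;
            }
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (tag.startsWith("NN")) { // NN, NNS, NNP, NNPS
                if (run.length() > 0) run.append(' ');
                run.append(word);
            } else {
                flush(run, candidates);
            }
        }
        flush(run, candidates);
        return candidates;
    }

    // Emits the current noun run, if any, and resets the buffer.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }
}
```

For the tagged sentence from the example slide this gives two candidates, "integer character constant" and "type int". Under the approach on this slide, each would still need a synset assigned before it counts as a concept.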

  10. Mining relations • unoriented / oriented • unnamed / named • WordNet – concept must have a synset • hypernyms and hyponyms – IsA relations • holonyms and meronyms – partOf relations • relation orientation based on concept order • only direct relations • from text • lexical-syntactic patterns • decomposition of multi-word terms – the right part of the term corresponds to an existing concept, e.g. assignment expression: assignment expression IsA expression • sentence syntax analysis – the parser's amod (adjectival modifier) dependency, an adjective followed by a noun, e.g. integral type: integral type IsA type
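The decomposition of multi-word terms can be sketched as a suffix lookup: if the right part of a term is itself a known concept, an IsA relation is proposed. Choosing the longest matching suffix, so that only the most direct relation is proposed, is an assumption made here. The class name `IsAProposals` is hypothetical, and terms are assumed to be whitespace-normalized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of IsA proposals by multi-word term decomposition.
final class IsAProposals {
    private IsAProposals() {}

    public static List<String> propose(Collection<String> concepts) {
        Set<String> known = new HashSet<>(concepts);
        List<String> relations = new ArrayList<>();
        for (String term : concepts) {
            String[] words = term.trim().split("\\s+");
            // Try proper suffixes from longest to shortest.
            for (int start = 1; start < words.length; start++) {
                String suffix = String.join(" ",
                        Arrays.copyOfRange(words, start, words.length));
                if (known.contains(suffix)) {
                    relations.add(term + " IsA " + suffix);
                    break; // keep only the most direct relation
                }
            }
        }
        return relations;
    }
}
```

With the concepts "assignment expression", "expression", "integral type" and "type", the sketch proposes the two relations given as examples on the slide.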

  11. Tool

  12. Experiment • ANSI/ISO C language • comparison with existing manually created ontology • 2 experiments • all concept candidates • only first 200 candidates • 3 variants of experiment • only candidates • candidates and IsA proposals • candidates and IsA proposals and NER entities

  13. First 30 candidates

  14. Experiment

  15. Experiment • Variant of the experiment without IsA relations, with NER entities only

  16. Conclusions and further work • concepts => lightweight ontology • enables better automatic relations mining

  17. Contacts Petr Šaloun FEECS, VSB–Technical University of Ostrava petr.saloun@vsb.cz Petr Klimánek (was: Faculty of Science, University of Ostrava) p.klimanek@gmail.com Zdenek Velart FEECS, VSB–Technical University of Ostrava zdenek.velart@gmail.com
