Towards the Semantic Web, Ch. 6: Generating Ontologies for the Semantic Web: OntoBuilder. R.H.P. Engels and T.Ch. Lech. 이 은 정, 2005. 2. 17.
1. Overview • OntoBuilder: • Extracts information from texts in order to build knowledge bases. • Consists of two modules, OntoExtract and OntoWrapper.
1.1 The overall architecture • [Figure: overall architecture diagram. Labels recoverable from the diagram: User, Knowledge Engineer, OntoShare, Spectacle, Ferret, OntoEdit, RQL, RDF, OMM, LINRO, Sesame, OIL-Core ontology repository, Annotated Data Repository, OntoWrapper, OntoExtract, Data Repository (external). Sample document text: "This text is about cars even though you can't read it".]
1.2 OntoExtract and OntoWrapper (1/2) • OntoExtract: • Semi-automatic ontology construction from unstructured information (natural language sources). • OntoWrapper: • Semi-automatic ontology construction from semi-structured and structured information sources. • Extracts information from specific places on specific sites (e.g. names, email addresses, telephone numbers).
1.2 OntoExtract and OntoWrapper (2/2) • CORPORUM depends on a linguistic analysis of a given text, comprising normalization, tokenization and part-of-speech tagging. • Relations between concepts are defined (e.g. subClassOf or InstanceOf relations). • Through semantic analysis of a domain, the tool can automatically generate relations between words within that domain. • Visualizations of such semantic structures can then be used for navigating and browsing document sets.
2. OntoExtract (1/3) • OntoExtract supports the analysis of natural language texts and generates lightweight, domain-specific ontologies from these texts (utilizing already existing knowledge from a central data repository). • OntoExtract is able to: • analyse natural language, • provide initial ontologies, • refine existing ontologies, • find relations between key terms in documents, • find instances of concepts within documents, • find classes and sub-class relationships.
2. OntoExtract (2/3) • How OntoExtract currently works: • it parses, tokenizes and analyses the text, • generates nodes and the relations between them, • enhances specific aspects of the discovered knowledge items using a background repository (containing general knowledge of the world, represented in Sesame), • and submits the final analysis results to the RDFS server Sesame.
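The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under loud assumptions, not OntoExtract's actual implementation: every non-stopword is treated as a concept, co-occurrence within a sentence stands in for relation discovery, `BACKGROUND` is a hypothetical in-memory substitute for the Sesame background repository, and the statements are returned as plain tuples rather than submitted to a server.

```python
from itertools import combinations

# Hypothetical stopword list and background knowledge (stand-ins only).
STOPWORDS = {"the", "a", "on", "my", "i", "took"}
BACKGROUND = {"motorcycle": [("rdfs:subClassOf", "vehicle")]}


def extract_statements(sentences, background=BACKGROUND):
    """Turn sentences into (subject, predicate, object) statements."""
    statements = set()
    for sentence in sentences:
        # Step 1: parse/tokenize (here: lowercase and split on whitespace).
        tokens = sentence.lower().split()
        concepts = sorted({t for t in tokens if t not in STOPWORDS})
        for concept in concepts:
            # Step 2a: every concept becomes a node typed as a class.
            statements.add((concept, "rdf:type", "rdfs:Class"))
            # Step 3: enrich the node with background knowledge, if any.
            for predicate, obj in background.get(concept, []):
                statements.add((concept, predicate, obj))
        # Step 2b: co-occurring concepts get a weak relation (assumption).
        for left, right in combinations(concepts, 2):
            statements.add((left, "weaklyRelatedTo", right))
    # Step 4 (submission to a repository) is left out; we just return them.
    return sorted(statements)


if __name__ == "__main__":
    for statement in extract_statements(["I took my motorcycle on holidays"]):
        print(statement)
```

Keeping statements as tuples in a set makes duplicates across sentences collapse naturally, which is convenient before any later serialization to RDF.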
2. OntoExtract (3/3) • [Figure: example RDF graph spanning Sesame domain knowledge and Sesame background knowledge. Nodes: rdf:Class (twice), motorcycle, holidays, MC_001, “long”, “black”. Edge labels: rdf:type, weaklyRelatedTo, hasSize, hasColor.]
3. OntoWrapper • OntoWrapper: • deals with the analysis of structured pages, • allows the user to define XML/RDF templates, variables and rule sets to perform a structured analysis of a specific domain, • generates the merged output and sends it to the Sesame repository as data statements about specific pages.
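As a rough illustration of rule-driven extraction from a page, the sketch below applies a small rule set to page text and emits statements about that page. It is an assumption-laden toy, not OntoWrapper's template language: the `RULES` table, the property names `email` and `telephone`, the regular expressions, and the example URL are all hypothetical.

```python
import re

# Hypothetical rule set: property name -> pattern locating its values.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "telephone": re.compile(r"\+?\d[\d\s-]{6,}\d"),
}


def wrap_page(page_url, text, rules=RULES):
    """Apply each rule to the page text and return statements about the page."""
    statements = []
    for prop, pattern in rules.items():
        for match in pattern.findall(text):
            statements.append((page_url, prop, match.strip()))
    return statements


if __name__ == "__main__":
    page = "Contact: Jane Doe, jane.doe@example.org, tel +47 22 33 44 55."
    for statement in wrap_page("http://example.org/staff", page):
        print(statement)
```

Keeping the rules in a table separate from the extraction loop mirrors the idea of user-defined rule sets: adapting to a new site would mean changing `RULES`, not the code.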
4.1 Generating Semantic Structures (1/2) • The generation of semantic knowledge in information extraction is based on the results of parsing steps, which can be of varying ‘analysis depth’. • Levels of linguistic analysis: • Tokenization • Lexical/Morphological Analysis • POS tagging • Syntactic Analysis • Semantic/Pragmatic Analysis • Discourse Analysis • CORPORUM’s lexical analysis includes text normalization, tokenization and POS tagging.
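The three lexical steps named above (normalization, tokenization, POS tagging) can be illustrated with a minimal standard-library sketch. It makes no claim about how CORPORUM implements them: the dictionary-lookup tagger, the `TOY_LEXICON` entries, and the default tag for unknown words are assumptions chosen only to keep the example self-contained.

```python
import re
import unicodedata

# Hypothetical toy lexicon; a real tagger would use a trained model.
TOY_LEXICON = {
    "the": "DET", "a": "DET", "is": "VERB", "has": "VERB",
    "motorcycle": "NOUN", "black": "ADJ", "long": "ADJ",
}


def normalize(text):
    """Unicode-normalize, collapse runs of whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens, lexicon=TOY_LEXICON):
    """Tag by lookup; unknown words default to NOUN, non-words to PUNCT."""
    tagged = []
    for token in tokens:
        if re.fullmatch(r"\w+", token):
            tagged.append((token, lexicon.get(token, "NOUN")))
        else:
            tagged.append((token, "PUNCT"))
    return tagged


def analyse(text):
    """Run the three lexical steps in order."""
    return pos_tag(tokenize(normalize(text)))


if __name__ == "__main__":
    print(analyse("The  motorcycle is\tblack."))
```

Each step consumes the previous step's output, which is why the deeper levels in the list (syntactic, semantic, discourse analysis) can be layered on top without changing the earlier ones.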
4.1 Generating Semantic Structures (2/2) • In OntoExtract, the initially analysed and annotated text is transformed into an internal representation that draws on a variety of linguistic analysis steps to arrive at an initial interpretation of what is written. • This representation contains the original text and its annotations, but also the resolutions performed on it. • The semantic structures are then translated into a more formal representation.
4.2 Generating Ontologies from Textual Resources • The central question is how to translate properly from linguistic structures into formalisms: • the representation-level problem: which knowledge should be represented at the ontology level and which at the fact level (what counts as a ‘concept’ and what as an ‘instance’), • the problem of handling inheritance, • consistency between extracted ontologies and their truth within specific domains. • Ontologies are extracted from single documents taken from the web: concepts are extracted and created, set into relation with each other and augmented with properties, and the instances found are attached to them.
4.3 Visualization and Navigation • The exported semantic network structures can be run through a graph layout algorithm in order to generate visualizations (with the CCA viewer). • Intercluster relationships are used to navigate from one cluster to another via relevant concepts.
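Navigation by intercluster relationships can be pictured with a small sketch: given concept clusters and concept-to-concept relations, it indexes which clusters are reachable from each cluster and through which concept pairs. The data layout, the function name `intercluster_links`, and the sample clusters are hypothetical; nothing here reflects the CCA viewer's internals or any particular graph layout algorithm.

```python
def intercluster_links(clusters, relations):
    """Map each cluster to the clusters reachable from it.

    clusters:  dict of cluster name -> set of concepts
    relations: iterable of (concept_a, concept_b) pairs
    Returns:   dict source cluster -> {target cluster -> set of (from, to) pairs}
    """
    # Index every concept by the cluster that owns it.
    owner = {
        concept: name
        for name, members in clusters.items()
        for concept in members
    }
    links = {}
    for a, b in relations:
        cluster_a, cluster_b = owner.get(a), owner.get(b)
        # Skip unknown concepts and relations that stay inside one cluster.
        if cluster_a is None or cluster_b is None or cluster_a == cluster_b:
            continue
        # Record the link in both directions so navigation works either way.
        links.setdefault(cluster_a, {}).setdefault(cluster_b, set()).add((a, b))
        links.setdefault(cluster_b, {}).setdefault(cluster_a, set()).add((b, a))
    return links


if __name__ == "__main__":
    clusters = {
        "vehicles": {"motorcycle", "car"},
        "travel": {"holidays", "hotel"},
    }
    relations = [("motorcycle", "holidays"), ("car", "motorcycle")]
    print(intercluster_links(clusters, relations))
```

The concept pair stored on each link is what a viewer could present as the "relevant concept" leading from one cluster to the next.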
5. Issues in Using Automated Text Extraction for Ontology Building with IE on Web Resources • The Internet poses an additional challenge: the multi-cultural background of its authors. • Generated ontologies can be used as ‘seed ontologies’, automatically generated from a variety of user-defined documents.