Ontologies of Linguistic Annotation Towards the application of Semantic Web technologies in corpus linguistics and NLP. Christian Chiarcos chiarcos@uni-potsdam.de. 1. Ontologies of Linguistic Annotation. Background How to deal with the heterogeneity of linguistic annotations ?
Ontologies of Linguistic Annotation
Towards the application of Semantic Web technologies in corpus linguistics and NLP
Christian Chiarcos
chiarcos@uni-potsdam.de

Ontologies of Linguistic Annotation
• Background
  • How to deal with the heterogeneity of linguistic annotations?
• Ontologies of Linguistic Annotation (OLiA)
  • Linking annotations and terminology repositories
• Applications
  • Corpus querying
  • NLP
Background: The task
• Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names.
• The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data.
(Ide & Romary 2004)

Background: The solution I
General Ontology of Linguistic Description (GOLD)
• ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ...
• ... the data and the various encoding schemes in which they are represented need an explicit semantics.
• ... a data model ... which is consistent with ... the Semantic Web ...
(Farrar & Langendoen 2003)

Background: The solution II
ISO TC37/SC4 Data Category Registry (DCR)
• ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ...
• ... to ensure interoperability among these domains ...
• ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ...
(Wright 2004)

Background: The solution III-VIII
Documentation standards in typology
• EUROTYP (Bakker et al. 1993)
• AUTOTYP (Bickel & Nichols 2002)
• Typological Database System ontology (Dimitriadis et al. 2009)
Standardization initiatives and multi-language tagsets
• EAGLES (Leech & Wilson 1996)
• MULTEXT/East (Erjavec 2010)
• Common POS tagset for Indian languages (Baskaran et al. 2008)
Background: Another problem
Imagine you plan to develop a tool that makes use of a terminology repository. Which one would you choose?
Similar goals, but different definitions
Integration efforts have only just begun ... (RELISH*)
* RELISH workshop, Aug 2010, http://www.mpi.nl/research/research-projects/language-archiving-technology/events/relish-workshop

Background: Another problem
Imagine you plan to develop a tool that makes use of a terminology repository. Which one would you choose?
Maybe it's not even our choice ...
... our clients may have their own preferences
... and different clients may have different preferences
Ontologies of Linguistic Annotation
• Background
• Ontologies of Linguistic Annotation (OLiA)
  • Linking annotations and terminology repositories
• Applications
  • Corpus querying
  • NLP

OLiA: Architecture
• OLiA: Ontologies of Linguistic Annotations
• conceptual integration
• represent tagsets and their semantics in a formal and systematic way
• "Reference Model"
• interface between annotations and (multiple) terminology repositories
[Diagram: Terminology Repositories, OLiA Reference Model, Annotation Models]
OLiA: Research Background
• developed at the Collaborative Research Center (SFB) 632 "Information Structure" (Potsdam & Berlin)
• 2006-2008 in cooperation with Collaborative Research Center (SFB) 441 "Linguistic Data Structures" (Tübingen)
• since 2007 within the SFB 632 project "Linguistic Database"

OLiA: Research Background
• part of an infrastructure to integrate and access heterogeneous linguistic corpora
• PAULA format
  • integrate different formats
• ANNIS database
  • access data created by different tools
• OLiA ontologies
  • represent tagsets and their semantics in a formal and systematic way

OLiA: Ontology (Information Technology)
• Ontology
  • Conceptualization of a knowledge domain
  • e.g., taxonomy of linguistic categories
  • hierarchical and relational structure
• OWL (Web Ontology Language)*
  • formal description language
  • XML
  • Semantic Web
* Web Ontology Language, http://www.w3.org/2004/OWL/ (10.10.08)
OLiA: Ontologies of Linguistic Annotation
modular OWL/DL ontologies
• Annotation Models
  • annotation scheme
• OLiA Reference Model
  • common terminology
• External Reference Models
  • existing terminology repositories
OLiA Reference Model
• interface between annotations and (multiple) terminology repositories
[Diagram: Terminology Repositories, OLiA Reference Model, Annotation Models]

OLiA: Reference Model
• harmonization of repositories of annotation terminology
• morphosyntax & morphology
  • 31 schemes
  • 51 languages*
• syntax, discourse structure, anaphora, information structure
* including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), and Erjavec (2010)
[Diagram: Terminology Repositories, OLiA Reference Model, Annotation Models]
OLiA Reference Model
concepts
• PronounOrDeterminer is-a Morphosyntactic Category
• Determiner is-a PronounOrDeterminer
• Demonstrative Determiner is-a Determiner
• Case is-a Morphological Feature
• Accusative Case is-a Case
properties
• hasCase(x, y), with x : MorphosyntacticCategory and y : Case
OLiA: Annotation Models
• OWL/DL formalizations of annotation schemes
• structure similar to the Reference Model
• individuals represent annotation values
• hasTag property
  • string value of annotation
[Diagram: Terminology Repositories, OLiA Reference Model, Annotation Models]

OLiA: The TIGER/STTS Annotation Models
concepts
• Demonstrative Pronoun is-a Pronoun
• Attributive Demonstrative Pronoun is-a Demonstrative Pronoun
• Substitutive Demonstrative Pronoun is-a Demonstrative Pronoun
• Case is-a Feature
individuals
• PDAT instance_of Attributive Demonstrative Pronoun, hasTag "PDAT"
• PDS instance_of Substitutive Demonstrative Pronoun, hasTag "PDS"
• Acc instance_of Case, hasTag "...Acc..."
STTS: German parts of speech (Schiller et al. 1996)
TIGER: German morphology (Brants et al. 2001)
OLiA: The TIGER/STTS Annotation Models (annotation example)
annotation
  Diese        nicht   neue         Erkenntnis
  this         not     new          insight
  PDAT         ADV     ADJA         NN
  Acc.Sg.Fem           Acc.Sg.Fem   Acc.Sg.Fem
STTS: German parts of speech (Schiller et al. 1996)
TIGER: German morphology (Brants et al. 2001)
OLiA: Linking
Annotation Model concepts are defined as subclasses of Reference Model concepts
• properties as sub-properties
• individuals as instances
The linking is physically separated from the models
• one possible interpretation of Annotation Model concepts in terms of the Reference Model

OLiA: Linking (example)
OLiA Reference Model
• PronounOrDeterminer is-a Morphosyntactic Category
• Determiner is-a PronounOrDeterminer
• Pronoun is-a PronounOrDeterminer
• Demonstrative Pronoun is-a Pronoun
• Demonstrative Determiner is-a Determiner
STTS Annotation Model
• Attributive Demonstrative Pronoun is-a Demonstrative Determiner
• PDAT instance_of Attributive Demonstrative Pronoun
OLiA: Linking (morphosyntax)
• English (7 annotation models)
• German (4 annotation models)
• Russian (3 annotation models)
• Multext-East schemes (15 languages)
• Connexor (6 languages)
• Tibetan (4 languages)
• Old High German, Old Norse
• Tagset for typological studies
  • more than 30 languages
  • many, but not exclusively African languages

OLiA: Linking
OLiA Reference Model further linked to terminological repositories
• if they are modelled in OWL/DL
• GOLD (Chiarcos 2008)
• DCR (Chiarcos 2010)
• OntoTag (Buyko et al. 2008)
• TDS
[Diagram: Terminology Repositories, OLiA Reference Model, Annotation Models]

OLiA: Achievements
• Annotations are mapped onto concepts in an Annotation Model
• The Annotation Model is linked with the OLiA Reference Model and further terminology repositories
• Annotations can be described in terms of these models independently of their original string representation
• novel applications
Application
• Background
• Ontologies of Linguistic Annotation (OLiA)
• Application
  • Annotation formalization & documentation
  • Ontology-based corpus querying
    • corpus browsing with ontology-based machine learning
    • concept-based corpus querying
  • NLP
    • interface specifications in NLP pipelines
    • preprocessing for Semantic Web applications
    • ensemble combination

Application
• Background
• Ontologies of Linguistic Annotation (OLiA)
• Application
  • Annotation formalization & documentation*
  • Ontology-based corpus querying
    • corpus browsing**
    • concept-based corpus querying
  • NLP
    • interface specifications in NLP pipelines***
    • preprocessing for Semantic Web applications****
    • ontology-based ensemble combination
* Ch. Chiarcos (2008) An Ontology of Linguistic Annotations. LDV Forum (GLDV-Journal for Computational Linguistics and Language Technology) 23 (2008):1-16.
** S. Hellmann et al. (accepted). The TIGER Corpus Navigator. Accepted at the 9th Int. Workshop on Treebanks and Linguistic Theories (TLT9), Dec 3-4, 2010. Tartu, Estonia.
*** E. Buyko, Ch. Chiarcos, and A. Pareja Lora (2008) Ontology-Based Interface Specifications for an NLP pipeline architecture. In: Proc. LREC. Marrakech, Morocco, May 2008.
**** S. Hellmann (2010), The Semantic Gap of Formalized Meaning. In: The Semantic Web: Research and Applications (LNCS 6089/2010), 462-466.
Application: Corpus querying (Chiarcos & Götze 2007)
[Screenshot: ANNIS1 + OntoClient, showing an ontological description and the generated corpus query]
Chiarcos, Christian and Michael Götze (2007) A Linguistic Database with Ontology-sensitive Corpus Querying. GLDV-Frühjahrstagung, Tübingen, Germany.

Application: Corpus querying
OntoClient
• experimental Java package, based on Pellet
• preprocessor for corpus queries
• for every concept in the Reference Model:
  • retrieve associated individuals
  • generate a set of possible tags
• set operators
  • intersection (&, and)
  • union (|, or)
  • intersection with complement (\, without)
• generates corpus query
• can be adapted to different query languages
Application: Corpus querying
original query: ... pos in { Determiner \ Article } & cat = ...
1. consult the ontology: retrieve tags for every expression that refers to a concept in the ontology
2. apply operators
return modified corpus query: ... pos = PDAT | PWAT | ... & cat = ...
[Diagram: Reference Model hierarchy (Morphosyntactic Category, PronounOrDeterminer, Determiner, Pronoun, Demonstrative Determiner, Demonstrative Pronoun) linked to the STTS Annotation Model (Attributive Demonstrative Pronoun, individual PDAT)]
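The two preprocessing steps can be sketched in a few lines of Python. This is an illustrative toy, not the OntoClient code: the `CONCEPT_TAGS` table stands in for the reasoner lookup, its tag assignments are assumptions made for the example, and `resolve` and `to_query` are hypothetical helper names.

```python
# Toy stand-in for step 1: Reference Model concept -> tags of the linked
# annotation-model individuals. The assignments below are assumed for
# illustration; a real system would retrieve them from the ontology.
CONCEPT_TAGS = {
    "Determiner": {"ART", "PDAT", "PWAT", "PIAT", "PPOSAT"},
    "Article": {"ART"},
}

def resolve(expr):
    """Evaluate a concept name or an (operator, left, right) tuple to a tag set."""
    if isinstance(expr, str):
        return set(CONCEPT_TAGS[expr])
    op, left, right = expr
    a, b = resolve(left), resolve(right)
    if op == "&":       # intersection (and)
        return a & b
    if op == "|":       # union (or)
        return a | b
    if op == "\\":      # intersection with complement (without)
        return a - b
    raise ValueError(f"unknown operator: {op}")

def to_query(attribute, expr):
    """Step 2: render the resulting tag set as a disjunction of tags."""
    return f"{attribute} = " + " | ".join(sorted(resolve(expr)))

query = to_query("pos", ("\\", "Determiner", "Article"))
print(query)  # pos = PDAT | PIAT | PPOSAT | PWAT
```

Because the expansion yields a plain tag set, only `to_query` would need to change to target a different query language.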
Application: Ensemble combination
• Brill & Wu (1998)
  • Classifier Combination for Improved Lexical Disambiguation
  • errors made by three POS taggers are strongly complementary
  • combination => increase in accuracy: 6.9% error reduction by simple voting

Application: Ensemble combination with ontologies
• limitations
  • classifiers have to make use of the same annotation scheme
  • combining different annotation schemes may not only increase accuracy but also the level of detail
• ensemble combination with ontologies
  • abstract from string-based annotations
  • operate on conceptual representations
Ensemble combination with OLiA: Generating ontological descriptions
STTS Annotation Model
• PDAT instance_of Attributive Demonstrative Pronoun
OLiA Reference Model
• Attributive Demonstrative Pronoun is-a Demonstrative Determiner
• Demonstrative Determiner is-a Determiner
• Determiner is-a PronounOrDeterminer
• PronounOrDeterminer is-a Morphosyntactic Category

Ensemble combination with OLiA: Generating ontological descriptions
annotation
  Diese        nicht   neue         Erkenntnis
  this         not     new          insight
  PDAT         ADV     ADJA         NN
  Acc.Sg.Fem           Acc.Sg.Fem   Acc.Sg.Fem
tag-set independent description (for PDAT)
• rdf:type(olia:DemonstrativeDeterminer)
• rdf:type(olia:Determiner)
• rdf:type(olia:PronounOrDeterminer)
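Generating such a description amounts to walking up the is-a links from the tag's individual. The following Python sketch illustrates this under simplifying assumptions: the ontology is reduced to two dictionaries, every concept has a single superclass, the `stts:` prefix is a made-up label for the annotation model, and the top-level Morphosyntactic Category is left out so that the output mirrors the three lines on the slide.

```python
# Assumed toy data: individual -> annotation-model concept, and
# concept -> direct superclass (including the linking into the Reference Model).
INSTANCE_OF = {"PDAT": "stts:AttributiveDemonstrativePronoun"}
SUPERCLASS = {
    "stts:AttributiveDemonstrativePronoun": "olia:DemonstrativeDeterminer",
    "olia:DemonstrativeDeterminer": "olia:Determiner",
    "olia:Determiner": "olia:PronounOrDeterminer",
}

def describe(tag):
    """Collect all Reference Model concepts that the tag's individual falls under."""
    concept = INSTANCE_OF[tag]
    description = []
    while concept in SUPERCLASS:
        concept = SUPERCLASS[concept]
        if concept.startswith("olia:"):  # keep Reference Model concepts only
            description.append(f"rdf:type({concept})")
    return description

for line in describe("PDAT"):
    print(line)
# rdf:type(olia:DemonstrativeDeterminer)
# rdf:type(olia:Determiner)
# rdf:type(olia:PronounOrDeterminer)
```

A real OWL/DL ontology allows multiple superclasses, so an actual implementation would query a reasoner rather than follow a single chain.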
Comparing and combining heterogeneous linguistic analyses
• challenges
  • determiner, not pronoun
    • although preceding an adverb
  • accusative, not nominative case
    • although sentence-initial and
    • ambiguous morphology

Comparing and combining heterogeneous linguistic analyses
• Connexor (Tapanainen & Järvinen 1997): PRON Dem FEM SG NOM
• RFTagger (Schmid & Laws 2008): PRO.Dem.Attr.-3.Acc.Sg.Fem

Comparing and combining heterogeneous linguistic analyses
OLiA Reference Model descriptions
Connexor: PRON Dem FEM SG NOM
• rdf:type(olia:PronounOrDeterminer)
• rdf:type(olia:Pronoun)
• rdf:type(olia:DemonstrativePronoun)
• olia:hasNumber(olia:Singular)
• olia:hasGender(olia:Feminine)
• olia:hasCase(olia:Nominative)
RFTagger: PRO.Dem.Attr.-3.Acc.Sg.Fem
• rdf:type(olia:PronounOrDeterminer)
• rdf:type(olia:Determiner)
• rdf:type(olia:DemonstrativeDeterminer)
• olia:hasNumber(olia:Singular)
• olia:hasGender(olia:Feminine)
• olia:hasCase(olia:Accusative)
Comparing and combining heterogeneous linguistic analyses
confidence ranking (simple voting)
predicted by both tools
• rdf:type(olia:PronounOrDeterminer)
• olia:hasNumber(olia:Singular)
• olia:hasGender(olia:Feminine)
predicted by one tool
• rdf:type(olia:Pronoun)
• rdf:type(olia:Determiner)
• rdf:type(olia:DemonstrativePronoun)
• rdf:type(olia:DemonstrativeDeterminer)
• olia:hasCase(olia:Accusative)
• olia:hasCase(olia:Nominative)
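As a minimal sketch of this ranking, simple voting can be expressed as counting how many tools predict each description. The two input lists are the Connexor and RFTagger descriptions for "Diese"; the function name `rank_by_votes` is a hypothetical choice for the example.

```python
from collections import Counter

CONNEXOR = [
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Pronoun)",
    "rdf:type(olia:DemonstrativePronoun)",
    "olia:hasNumber(olia:Singular)",
    "olia:hasGender(olia:Feminine)",
    "olia:hasCase(olia:Nominative)",
]
RFTAGGER = [
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Determiner)",
    "rdf:type(olia:DemonstrativeDeterminer)",
    "olia:hasNumber(olia:Singular)",
    "olia:hasGender(olia:Feminine)",
    "olia:hasCase(olia:Accusative)",
]

def rank_by_votes(*analyses):
    """Return (description, votes) pairs, most frequently predicted first."""
    votes = Counter()
    for analysis in analyses:
        # dict.fromkeys removes duplicates within one tool, keeping order
        for description in dict.fromkeys(analysis):
            votes[description] += 1
    # sorted() is stable, so equally ranked descriptions keep first-seen order
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)

for description, count in rank_by_votes(CONNEXOR, RFTAGGER):
    print(count, description)
```

The three descriptions with two votes come out on top; the order among the one-vote descriptions depends only on the order in which the tools are passed in.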
Comparing and combining heterogeneous linguistic analyses
disambiguation: create the maximal consistent set S of descriptions
• S is empty
• process descriptions with decreasing confidence (ranking by simple voting)
• if the current description is consistent with all descriptions in S, then add it to S
• if not, skip it
• iterate until all descriptions are processed

Comparing and combining heterogeneous linguistic analyses
identify inconsistent descriptions
• check consistency conditions in the ontology
Comparing and combining heterogeneous linguistic analyses
structure-based consistency heuristic:*
• concept A is consistent with concept B if A ⊑ B or B ⊑ A
• otherwise A and B are inconsistent
example
• olia:Case is-a olia_top:MorphosyntacticFeature
• olia:Nominative, olia:Genitive, olia:Dative, olia:Accusative is-a olia:Case
• olia:hasCase(olia:Accusative) and olia:hasCase(olia:Nominative): siblings are inconsistent
* no formal consistency constraints specified in OLiA, GOLD or the DCR
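The heuristic can be illustrated with a small Python sketch. It assumes a single-parent hierarchy stored in a dictionary, which simplifies what an OWL/DL reasoner would provide; `subsumed_by` and `consistent` are hypothetical names.

```python
# Assumed toy hierarchy: concept -> direct superconcept
PARENT = {
    "olia:Nominative": "olia:Case",
    "olia:Genitive": "olia:Case",
    "olia:Dative": "olia:Case",
    "olia:Accusative": "olia:Case",
    "olia:Case": "olia_top:MorphosyntacticFeature",
}

def subsumed_by(a, b):
    """True if concept a equals b or lies below b in the is-a hierarchy."""
    while a is not None:
        if a == b:
            return True
        a = PARENT.get(a)
    return False

def consistent(a, b):
    """Two concepts are consistent if one subsumes the other."""
    return subsumed_by(a, b) or subsumed_by(b, a)

print(consistent("olia:Accusative", "olia:Case"))        # True
print(consistent("olia:Accusative", "olia:Nominative"))  # False (siblings)
```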
Comparing and combining heterogeneous linguistic analyses
disambiguation: consistency
• from every equally-ranked pair of inconsistent descriptions: first come, first served (simple voting with random tie resolution)

Comparing and combining heterogeneous linguistic analyses
Diese nicht neue Erkenntnis
• PronounOrDeterminer & Determiner & DemonstrativeDeterminer
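The greedy selection described on these slides might be sketched as follows. Several points are assumptions made for the illustration: descriptions are modelled as (property, concept) pairs, two descriptions are only compared when they share the same property, and the order of the one-vote descriptions in `RANKED` is one possible tie resolution, chosen so that the determiner reading is encountered first.

```python
# Assumed toy hierarchy: concept -> direct superconcept
PARENT = {
    "olia:Pronoun": "olia:PronounOrDeterminer",
    "olia:Determiner": "olia:PronounOrDeterminer",
    "olia:DemonstrativePronoun": "olia:Pronoun",
    "olia:DemonstrativeDeterminer": "olia:Determiner",
    "olia:Nominative": "olia:Case",
    "olia:Accusative": "olia:Case",
}

def subsumed_by(a, b):
    """True if concept a equals b or lies below b in the is-a hierarchy."""
    while a is not None:
        if a == b:
            return True
        a = PARENT.get(a)
    return False

def consistent(d1, d2):
    """Structure-based heuristic, applied to descriptions with the same property."""
    (prop1, c1), (prop2, c2) = d1, d2
    if prop1 != prop2:
        return True  # assumption: different properties never conflict
    return subsumed_by(c1, c2) or subsumed_by(c2, c1)

def disambiguate(ranked):
    """Greedily build a consistent set from descriptions in decreasing confidence."""
    selected = []
    for description in ranked:
        if all(consistent(description, kept) for kept in selected):
            selected.append(description)  # consistent with everything kept so far
        # otherwise skip it
    return selected

RANKED = [
    ("rdf:type", "olia:PronounOrDeterminer"),      # predicted by both tools
    ("olia:hasNumber", "olia:Singular"),
    ("olia:hasGender", "olia:Feminine"),
    ("rdf:type", "olia:Determiner"),               # predicted by one tool
    ("rdf:type", "olia:Pronoun"),
    ("rdf:type", "olia:DemonstrativeDeterminer"),
    ("rdf:type", "olia:DemonstrativePronoun"),
    ("olia:hasCase", "olia:Accusative"),
    ("olia:hasCase", "olia:Nominative"),
]

for prop, concept in disambiguate(RANKED):
    print(f"{prop}({concept})")
```

With this ordering the pronoun readings and the nominative are skipped. A different tie order among the one-vote descriptions would select the pronoun reading instead, which is what random tie resolution implies.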
Ensemble combination with ontologies: Experiments (Chiarcos 2010)
• tested for POS & morphology
  • three German newspaper corpora
  • TIGER/NEGRA-style gold annotation
  • 7 NLP tools, 4 annotation schemes (POS)
• simple voting
  • increase of recall
    • with growing number of tools
    • 5-6 tools outperform best-performing tool
  • decrease of precision
    • more detailed analysis than gold annotation
Ch. Chiarcos (2010), Towards Robust Multi-Tool Tagging. An OWL/DL-Based Approach. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, July 2010, 659-670.

Summary
• OLiA architecture
  • "Reference Model" mediates between annotations and multiple terminology repositories
  • OWL/RDF as common representation format
• applications ...
  • abstraction from string-based annotations
  • corpus querying
  • NLP tasks
• ... not tied to the OLiA Reference Model
  • the linking makes it possible to operate with the concepts of another terminology repository on a concept-based level
for those who may have wondered about the „mascot“: a metaphor for linguistic ontologies plant with white fruits „a tree growing out of text“* t‘ziib „script, written text“ t‘zi ba * inspired by the Madrid Codex, Yucatán, ~ 1450
OLiA: Annotation documentation
• OLiA-specific HTML export
• concepts linked via hyperlinks
• used for documentation purposes in SPLICR (Rehm et al. 2008)
• corpus metadata includes annotation model URL
[Screenshot: STTS Annotation Model concepts, Reference Model concepts, and comments (excerpt from the original documentation)]
G. Rehm et al. (2008). SPLICR: A Sustainability Platform for Linguistic Corpora and Resources. In: Proceedings of the 9th Conference on Natural Language Processing (KONVENS 2008). Ergänzungsband Textressourcen und lexikalisches Wissen. Berlin, Sep 2008, 86-95.