The GOLD Effort So Far
Terry Langendoen, Brian Fitzsimons, Emily Kidder
Department of Linguistics, University of Arizona
E-MELD 2005 Ontologies in Linguistic Annotation
Acknowledgments
• Everyone else who’s worked on E-MELD at U Arizona 2001-05, especially:
  • Graduate students: Scott Farrar, Will Lewis, Peter Norquest, Ruby Basham
  • Undergraduate students: Jesse Kirchner, Shauna Eggers, Alexis Lanham, Sandy Chow
• Everyone who’s worked on E-MELD elsewhere, especially:
  • Gary, Helen, Anthony, Laura, Zhenwei, Baden, Doug
Whalen’s problem
• “We want to be able to describe the data in just the way we want, but we don’t want to program it.”
  • Doug Whalen, at the 2001 E-MELD Workshop
Our problem
• We want to be able to describe the data in just the way we want, and we want to be able to use everybody else’s data described in just the way they want, and we want to be able to process it in all kinds of ways that make sense to us as scientists and teachers.
• Call this the interoperability problem.
TEI’s data interchange solution
• Create a “data interchange” format such as the Text Encoding Initiative’s P3.
• Require projects that wish to share data to define mappings to and from the interchange format.
  X --φ--> P3 --ψ⁻¹--> Y
  Y --ψ--> P3 --φ⁻¹--> X
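The interchange idea can be sketched in a few lines of Python. Everything here is hypothetical: the record layouts, the field names, and the plain dict standing in for P3 are illustrative assumptions, not anything defined by the TEI. The sketch shows only that each project supplies one mapping to and one mapping from the interchange format, and that X-to-Y conversion is then a composition.

```python
# Hypothetical project formats: X stores (form, gloss) tuples, Y stores dicts.
# The "pivot" dict below merely stands in for an interchange format such as P3.

def x_to_pivot(rec):
    """phi: project X record -> interchange record."""
    form, gloss = rec
    return {"form": form, "gloss": gloss}

def pivot_to_x(p):
    """phi inverse: interchange record -> project X record."""
    return (p["form"], p["gloss"])

def y_to_pivot(rec):
    """psi: project Y record -> interchange record."""
    return {"form": rec["orth"], "gloss": rec["meaning"]}

def pivot_to_y(p):
    """psi inverse: interchange record -> project Y record."""
    return {"orth": p["form"], "meaning": p["gloss"]}

def x_to_y(rec):
    """X -> P3 -> Y, i.e. psi inverse composed with phi."""
    return pivot_to_y(x_to_pivot(rec))

def y_to_x(rec):
    """Y -> P3 -> X, i.e. phi inverse composed with psi."""
    return pivot_to_x(y_to_pivot(rec))

x_record = ("gatos", "cat-PL")
y_record = x_to_y(x_record)
round_trip = y_to_x(y_record)
```

With n projects this needs 2n mappings (one pair per project) rather than a converter for each of the n(n-1) ordered pairs of projects.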
Two lessons from the TEI
• Use a standard markup language.
  • Our choice (like theirs): XML.
  • Individual projects don’t have to use XML, but their software should export to XML.
XML markup is syntax
• In TEI, the tags <s>, <w> and <m> were designed to delimit sentences, words and morphemes respectively.
• But they can be used to describe any three-level hierarchy over character strings, such as:
  • <s> = sentence, <w> = word, <m> = morpheme
  • <s> = paragraph, <w> = sentence, <m> = word
  • <s> = chapter, <w> = paragraph, <m> = morpheme
  • <s> = big chunk, <w> = middle-size chunk, <m> = small chunk
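A small Python sketch of the point that the tags carry no meaning of their own. The sample document and the two "readings" are made up for illustration; the parser sees only a three-level nesting, and either reading can be laid over it equally well.

```python
import xml.etree.ElementTree as ET

# A made-up fragment using the three tags discussed above.
doc = ET.fromstring("<s><w><m>cat</m><m>s</m></w><w><m>sleep</m></w></s>")

# Two hypothetical readings of the same tags; the XML itself favors neither.
reading_a = {"s": "sentence", "w": "word", "m": "morpheme"}
reading_b = {"s": "paragraph", "w": "sentence", "m": "word"}

def describe(elem, reading, depth=0):
    """Return one indented label per element, using the given reading."""
    lines = ["  " * depth + reading[elem.tag]]
    for child in elem:
        lines.extend(describe(child, reading, depth + 1))
    return lines

labels_a = describe(doc, reading_a)
labels_b = describe(doc, reading_b)
```

Both calls walk exactly the same tree; only the externally supplied dictionary decides whether the outermost element counts as a sentence or a paragraph.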
Two avenues to markup semantics
• The syntax is the semantics (SIS)
  • This is essentially the TEI solution.
• Leave the semantics to us (LSU)
  • Essentially the “Semantic Web” idea
Problems with SIS
• Hard sell. Based on the TEI experience, it’ll be hard to convince linguists to use it.
• Expensive. It will be costly to retrofit existing resources to conform to it.
• Fragile. Future changes will be likely to break existing applications.
Advantages of LSU
• Easier sell. There can be many special-purpose markup schemas, each easier to use than a single general one.
• Cheap. Migration to best practice is much less costly.
• Robust. Changes are less likely to break existing applications.
Place of a linguistic ontology as part of LSU
• The central component of LSU is a linguistic ontology that:
  • defines the common concepts used in linguistic analysis and description,
  • expresses the relations that hold among those concepts,
  • relates those concepts to concepts of common-sense understanding (“upper” ontology) and concepts in other disciplines.
Proof of concept that it works
• Last year, the Arizona team, together with Gary, Scott, and Will’s team at CSU Fresno, showed that GOLD could be used for smart searching across massive cross-linguistic databases created from XML documents of different types.
  • Interlinear glossed texts
  • Lexicons
The GOLD Summit
• Last November, Will hosted a summit meeting of the researchers most involved with GOLD to plan for its further development and maintenance once Arizona’s E-MELD funding ran out, which it did yesterday. The summit recommended:
  • Creating a GOLD website.
  • Forming a GOLD Council with oversight responsibility, and putting procedures in place using the OLAC model to foster and evaluate development and maintenance.
  • Focusing the E-MELD 2005 workshop on GOLD.
Current state of play
• We’re proposing to move GOLD “out of the lab” effective with this meeting despite the fact that:
  • GOLD version 0.2 has very small coverage, even within morphosyntax, and many areas of the field are not covered at all.
  • Several important design issues have not been settled.
    • What upper ontology should we use? (Currently SUMO)
  • Some “core GOLD” concepts are in flux.
    • We broke last year’s applications with our redesign of the treatment of grammatical features.
Classes and instances in GOLD 0.1 (“Old GOLD”)
• Reasoning with classes and instances
  • If i is of type A and A is a subclass of B, then i is of type B.
  • For example, a search for instances of Verb will find all instances of both TransitiveVerb and IntransitiveVerb.
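The inference rule above can be sketched in Python as follows. The class names Verb, TransitiveVerb and IntransitiveVerb come from the slide; PartOfSpeech, Noun, the instance names, and the dictionary layout are hypothetical choices made for this illustration and are not GOLD's actual representation.

```python
# Hypothetical fragment of a class hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "TransitiveVerb": "Verb",
    "IntransitiveVerb": "Verb",
    "Verb": "PartOfSpeech",
}

# Hypothetical instances, each asserted to be of exactly one class.
TYPE_OF = {
    "x_eat": "TransitiveVerb",
    "x_sleep": "IntransitiveVerb",
    "x_dog": "Noun",
}

def superclasses(cls):
    """Yield cls and every class above it in the hierarchy."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)

def instances_of(cls):
    """Instances whose asserted class is cls or a subclass of cls."""
    return sorted(i for i, c in TYPE_OF.items() if cls in superclasses(c))

verbs = instances_of("Verb")
transitive = instances_of("TransitiveVerb")
```

A search for Verb returns instances asserted only as TransitiveVerb or IntransitiveVerb, because membership is propagated up the subclass chain.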
A problem with saying what we want about language X
• In language X, verbs are inflected only for tense.
• Verb inflectedFor Tense?
  • This won’t do: both the subject and the object of the relation are classes.
  • It also fails to represent the claim that tense is the only feature that verbs are inflected for in X.
• XVerb inflectedFor XTense?
  • OK, since XVerb and XTense are both instances (of the GOLD classes Verb and Tense respectively).
  • The lack of other inflectional features will show up in the response to a query.
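A minimal Python sketch of the instance-level statement, assuming a simple set of triples as the data store. The store, the function names, and the rule that rejects class-level statements are assumptions made for illustration, not GOLD's implementation. Reading an empty or short query result as "no other features" also assumes the store is complete for language X.

```python
# GOLD classes and language-X instances named on the slide.
CLASSES = {"Verb", "Tense"}
INSTANCE_OF = {"XVerb": "Verb", "XTense": "Tense"}

triples = set()

def assert_relation(subj, pred, obj):
    """Store a triple, accepting only known instances as subject and object."""
    for term in (subj, obj):
        if term in CLASSES or term not in INSTANCE_OF:
            raise ValueError(f"{term} is not a known instance")
    triples.add((subj, pred, obj))

def objects_of(subj, pred):
    """All objects related to subj by pred."""
    return sorted(o for s, p, o in triples if s == subj and p == pred)

# The instance-level statement is accepted.
assert_relation("XVerb", "inflectedFor", "XTense")
features = objects_of("XVerb", "inflectedFor")

# The class-level statement is rejected.
try:
    assert_relation("Verb", "inflectedFor", "Tense")
    class_level_ok = True
except ValueError:
    class_level_ok = False
```

Querying `objects_of("XVerb", "inflectedFor")` returns only XTense, which is how the absence of other inflectional features becomes visible.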
A problem with saying what we want in GOLD
• XTense hasValue XFutureTense
  • OK, since hasValue relates instances.
• Tense hasValue FutureTense
  • Not OK, since here hasValue would relate classes.
Parallel structures for GOLD and language-specific concepts
• Allow certain GOLD concepts to be instances of other GOLD classes. In particular, define atomic feature values as instances of particular feature classes.
• Allow certain language-specific concepts to be classes that are instantiated by other language-specific concepts. In particular, define language-specific features as classes instantiated by their language-specific values.
Feature systems as substructures
          Any
        /  |  \
    NonP  HodP  PreHodP
TenseSystem-x as a substructure of TenseFeature
Mapping from a language class to a GOLD class
  TenseSystem-x      XTense
  Any        <--     XAny
  NonP       <--     XPres
  HodP       <--     XRecP
  PreHodP    <--     XRemP
Mapping to GOLD TenseSystem-x from XTense
Isomorphism between a language system and a GOLD system
          XAny
        /  |  \
  XPres  XRecP  XRemP
XTense system isomorphic to TenseSystem-x
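The isomorphism claim can be checked mechanically. The Python sketch below assumes each system is represented as a set of (parent, child) edges, with a flat tree under Any and XAny as drawn on the slides. That representation and the function names are illustrative assumptions; the node names and the mapping are the ones shown above.

```python
# Each system as a set of (parent, child) edges; names follow the slides.
GOLD_SYSTEM = {("Any", "NonP"), ("Any", "HodP"), ("Any", "PreHodP")}
X_SYSTEM = {("XAny", "XPres"), ("XAny", "XRecP"), ("XAny", "XRemP")}

# The mapping from language-X classes to GOLD classes.
X_TO_GOLD = {"XAny": "Any", "XPres": "NonP", "XRecP": "HodP", "XRemP": "PreHodP"}

def nodes(edges):
    """All node names that occur in a set of edges."""
    return {n for edge in edges for n in edge}

def is_isomorphism(mapping, source, target):
    """True if mapping is a bijection on nodes that maps edges exactly onto edges."""
    if set(mapping) != nodes(source):
        return False  # mapping must be defined on exactly the source nodes
    if len(set(mapping.values())) != len(mapping):
        return False  # not one-to-one
    if set(mapping.values()) != nodes(target):
        return False  # not onto
    image = {(mapping[p], mapping[c]) for p, c in source}
    return image == target

iso = is_isomorphism(X_TO_GOLD, X_SYSTEM, GOLD_SYSTEM)

# A deliberately broken mapping: two X values collapse onto one GOLD value.
collapsed = dict(X_TO_GOLD, XRecP="NonP")
not_iso = is_isomorphism(collapsed, X_SYSTEM, GOLD_SYSTEM)
```

A language whose tense system merged two of these values would fail the check, which is the case where the mapping to GOLD is a homomorphism rather than an isomorphism.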
Future of GOLD?