Explore linguistic ontology and endangered language data sharing over the Semantic Web. Challenges in data encoding and concept modeling. Strategies for broad access.
An Ontology for Linguistic Representation Scott Farrar, Terry Langendoen, William Lewis University of Arizona
Overview • Discuss a proposal for how linguistic data can be shared over the Semantic Web. • Endangered language data (EMELD) • Special focus on a linguistic ontology (conceptual modeling in linguistics)
Ontology and Linguistics • How is the lexicon reflected in top-level distinctions? (Gangemi, Guarino, Masolo, and Oltramari 2001) • How are functional/grammatical categories reflected in the ontology?
EMELD • EMELD (Electronic Metastructure for Endangered Languages Data) • As many as half of the world’s languages (3000 out of 6000) are in danger of disappearing (LaPolla 1998). • The purpose is to preserve endangered language data by creating a community of practice via the Semantic Web.
Linguistic Data • Field linguists collect data. • Hopi (subject language): sivu-’ikwiw-ta-qa • Analysis (markup): [vessel-carry:on:back-DUR-REL] • Gloss (markup): ‘kachina’ • Linguistic data includes grammars, dictionaries, texts, sound and video recordings, and glossed corpora.
Grammatical Descriptions • English has subject-verb agreement in number and person. • Warumungu is ergative-absolutive. • Archi has extensive spatial cases. • Spanish is SVO.
Facts about Language • Nouns represent objects. • Tense relates an event to a point in time. • Case is a relation between a predicate and its argument, e.g., He knows him. • SOV, SVO, and OVS are possible natural language word orders. • These facts involve both general concepts and linguistic concepts.
Challenges to Creating a Community of Practice • Language data should be searchable and comparable: broad access. • There are few standardized methods for encoding language data (cf. EAGLES and TEI). • Authors or communities want control over their data. • Local control should be balanced with data interoperability (Semantic Web).
Example of the Problem • Google query: [“past tense” Australian languages] • On the Web, the same notion is tagged differently: Umpila [Prehodiernal Tense], Warumungu [PAST], Dyirbal [PST]
Other Examples of the Problem • homonymous terminology: A search for PA intended to mean ‘Partitive’ might return PA meaning Past or PerfectiveAspect. • “covert” markup: Hopi CAUS really means Causative combined with PerfectiveAspect
Some Further Complications • A search for present tense forms in English returns forms with future reference, e.g., “Tomorrow, John goes to Holland.” • Some languages have past, present, and habitual analyzed as grammatical tenses (e.g., Hopi). • Searching for habitual aspect in Hopi therefore does not guarantee results that have anything to do with aspect.
Solution Strategy for Providing Broad Access to Language Data • Integrate the data using metastructure—a linguistic ontology. • Make the data available in standard format without imposing a standard for markup. • Build tools to access and process the data—query engines, expert systems…
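The integration idea can be illustrated with a minimal Python sketch. It assumes a hypothetical table that maps each source's local tag to a shared concept; the concept names (`PastTense`, `PrehodiernalPastTense`) and the helpers `is_a` and `find_sources` are illustrative only, not EMELD's actual vocabulary or tooling. Each source keeps its own markup, and the query runs over the shared concepts.

```python
# Hypothetical mapping from (language, local tag) to a shared ontology concept.
# The tags come from the search example; the concept names are assumptions.
LOCAL_TAG_TO_CONCEPT = {
    ("Umpila", "Prehodiernal Tense"): "PrehodiernalPastTense",
    ("Warumungu", "PAST"): "PastTense",
    ("Dyirbal", "PST"): "PastTense",
}

# Assumed subclass links among the shared concepts (child -> parent).
SUBCLASS_OF = {
    "PrehodiernalPastTense": "PastTense",
    "PastTense": "Tense",
}


def is_a(concept, target):
    """Return True if concept equals target or is a descendant of it."""
    while concept is not None:
        if concept == target:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def find_sources(target):
    """List (language, local tag) pairs whose concept falls under target."""
    return sorted(
        key
        for key, concept in LOCAL_TAG_TO_CONCEPT.items()
        if is_a(concept, target)
    )


if __name__ == "__main__":
    for language, tag in find_sources("PastTense"):
        print(language, tag)
```

A query for `PastTense` finds all three sources even though none of them shares a tag string, which is the point of mapping markup to a metastructure instead of standardizing the markup itself.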
Challenges to Conceptual Modeling of Linguistic Domain • domain is large • most concepts are abstract • linguistic objects are hierarchical • language is symbolic • field is fragmented—few standards (cf. chemistry, computer science)
Linguistic Ontology • Our starting point is morpho-syntax (word parts, tense, case, aspect, inflection) • linguistic segments, grammatical concepts, data structures • Built on top of the Suggested Upper Merged Ontology (SUMO)
Suggested Upper Merged Ontology (SUMO) • extensible resource • already includes a number of concepts related to semiotics and linguistics • connection with the NLP community (WordNet; B. Levin’s verb classes; Allen’s tense logic) • developed within an IEEE working group and freely available (http://suo.ieee.org) (Niles and Pease 2001)
Physical Entities What entities are physical, i.e., exist in space-time? • The word and its parts (written or spoken): Stems, Affixes, Roots, Phonetic segments • Larger constituents: Phrases, Clauses, Texts
Taxonomy for LinguisticExpression
Entity
  Physical
    Object
      SelfConnectedObject
        ContentBearingObject
          Icon
          SymbolicString
          LinguisticExpression
            WrittenLinguisticExpression
            SpokenLinguisticExpression
Taxonomy for LinguisticExpression • WrittenLinguisticExpression • WordPart: SimpleWordPart, Root, Affix (Prefix, Infix, Suffix), Clitic, Stem • Word: SimpleWord, ComplexWord, Compound • Phrase • Instances: ‘un-’, ‘dog’
Other Possibilities • Recall: Word: SimpleWord, ComplexWord, Compound • Alternative: Word: Noun (SimpleNoun, ComplexNoun), Verb (SimpleVerb, ComplexVerb), Adjective, … • Instances: ‘dog’, ‘jump’
Part of Speech as Property • Language exhibits categorial ambiguity (e.g., “fish”). • Some languages have no noun/verb distinction, so the distinction is language specific (e.g., Lummi). • Related to the notions of “rigid” and “anti-rigid” properties in OntoClean (Guarino and Welty 2002)
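A small Python sketch of the modeling choice, under stated assumptions: the `WordForm` class and its `can_be` method are hypothetical names, not part of the ontology. Part of speech is stored as a set of attributes on a word form, so an ambiguous form carries several values and a form in a language without the distinction could carry none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordForm:
    """A word form whose part of speech is an attribute, not its class."""

    form: str
    language: str
    # Empty by default: a language may draw no part-of-speech distinction.
    parts_of_speech: frozenset = frozenset()

    def can_be(self, pos: str) -> bool:
        """Return True if this form may bear the given part of speech."""
        return pos in self.parts_of_speech


# "fish" is categorially ambiguous in English; "dog" is treated here as a noun only.
fish = WordForm("fish", "English", frozenset({"noun", "verb"}))
dog = WordForm("dog", "English", frozenset({"noun"}))

if __name__ == "__main__":
    print(fish.can_be("noun"), fish.can_be("verb"), dog.can_be("verb"))
```

Had `Noun` and `Verb` been classes in the taxonomy, ‘fish’ would need to be an instance of both; treating part of speech as a property avoids that.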
Mereological Relations for LinguisticExpression • A Stem has-part Root.
(=> (instance ?STEM Stem)
    (exists (?PART)
        (and (part ?PART ?STEM)
             (instance ?PART Root))))
Mereological Relations for LinguisticExpression • A Word has-part WordPart (for example, a Stem).
(=> (instance ?WORD Word)
    (exists (?PART)
        (and (part ?PART ?WORD)
             (instance ?PART WordPart))))
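The two has-part axioms can be read as checks over a parsed word. The Python sketch here is an illustration only: the `Expression` class, the `satisfies_axioms` helper, and the analysis of ‘dogs’ are assumptions made for the example, and `part` is taken to be transitive.

```python
from dataclasses import dataclass, field

# Kinds counted as WordPart, following the class names on the taxonomy slide.
WORD_PART_KINDS = {"Root", "Affix", "Prefix", "Infix", "Suffix", "Clitic", "Stem"}


@dataclass
class Expression:
    form: str
    kind: str  # e.g. "Word", "Stem", "Root", "Suffix"
    parts: list = field(default_factory=list)


def all_parts(expr):
    """Yield every part of expr, following the part relation transitively."""
    for part in expr.parts:
        yield part
        yield from all_parts(part)


def satisfies_axioms(expr):
    """Check the Stem and Word has-part axioms for expr and all its parts."""
    kinds = {part.kind for part in all_parts(expr)}
    if expr.kind == "Stem" and "Root" not in kinds:
        return False  # a Stem must have some part that is a Root
    if expr.kind == "Word" and not (kinds & WORD_PART_KINDS):
        return False  # a Word must have some part that is a WordPart
    return all(satisfies_axioms(part) for part in expr.parts)


# Illustrative analysis: the word 'dogs' = stem 'dog' (root 'dog') + suffix '-s'.
dogs = Expression(
    "dogs",
    "Word",
    [Expression("dog", "Stem", [Expression("dog", "Root")]), Expression("-s", "Suffix")],
)

if __name__ == "__main__":
    print(satisfies_axioms(dogs))
    print(satisfies_axioms(Expression("dog", "Stem")))
```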
Other Axioms for LinguisticExpression • None for WrittenLinguisticExpression without referring to the abstract section of the ontology. • Phonetic overlay (intonation contour)
Abstract Entities • What entities are abstract, i.e., qualities or attributes? Grammatical attributes: Tense, Aspect, Case Linguistic data structures: Paradigms, Feature Structures, PhonemeTables, Derivations (as in “Minimalism”) Mental entities: Morpheme, Phoneme, Lexeme
Taxonomy of GrammaticalProperty • Abstract: Relation, Proposition, Attribute • Attribute: InternalAttribute, RelationalAttribute • GrammaticalAttribute (internal or relational?): PartOfSpeech, Tense, Aspect, Case, …
Relational • SyntacticRole: subject, object • CaseRole: agent, patient • These are relational in much the same way as a PositionalAttribute, e.g., above.
Internal • PartOfSpeech: noun, verb, determiner • Within a grammar, these do not appear to be relational, cf. ShapeAttribute.
Problematic Cases: Tense, Aspect, Case • As grammatical attributes, these do not appear to be relational. • But what about meaning? • After all, that’s the goal of the Semantic Web.
Grammar-Meaning Distinction • GrammaticalAttribute should be tied to a language (e.g., to construct paradigms). • Semantic notions do not have to be tied to a specific language. • Tendency in language is to match these up, e.g., verbs pick out predicates, nouns pick out terms and functions. • Salishan languages—no verb-noun distinction.
Tense • AbsolutePastTense:
(=> (AbsolutePastTense ?SENTENCE)
    (exists (?INTERVAL)
        (and (instance ?INTERVAL TimeInterval)
             (during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE)))
             (before ?INTERVAL (WhenFn ?SENTENCE)))))
• HodiernalPastTense: ?INTERVAL has to fall within ‘today’, the day of utterance • RemotePastTense: ?INTERVAL has to be before a certain time point (language-specific)
Tense • RelativePastTense:
(=> (RelativePastTense ?SENTENCE)
    (exists (?INTERVAL ?POINT)
        (and (instance ?INTERVAL TimeInterval)
             (instance ?POINT TimePoint)
             (during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE)))
             (before ?INTERVAL ?POINT))))
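The difference between the two tense axioms can be sketched numerically in Python. This is a simplification under stated assumptions: times are plain numbers, the `Interval` class and both functions are hypothetical names, and "some interval during the event lies before the point" is reduced to the event starting before that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float
    end: float  # assumed to satisfy start < end

    def has_subinterval_before(self, point: float) -> bool:
        """True if some stretch of this interval lies before the given point."""
        return self.start < point


def is_absolute_past(event: Interval, utterance_time: float) -> bool:
    """Absolute past: part of the event precedes the time of the utterance."""
    return event.has_subinterval_before(utterance_time)


def is_relative_past(event: Interval, reference_point: float) -> bool:
    """Relative past: part of the event precedes some reference point."""
    return event.has_subinterval_before(reference_point)


if __name__ == "__main__":
    # Utterance at t=0, event during t=5..6, reference point at t=10,
    # as in "By then she will have left."
    leaving = Interval(5.0, 6.0)
    print(is_absolute_past(leaving, utterance_time=0.0))
    print(is_relative_past(leaving, reference_point=10.0))
```

The event counts as past relative to the reference point but not relative to the utterance, which is the case the second axiom covers and the first does not.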
Taxonomy of Case • Case: EventiveCase (AffectorCase, AffecteeCase), NoneventiveCase (xxxxCase, SpatialCase (DirectionalCase, PositionalCase)) • Instances: ErgativeCase, DativeCase, ExistentialCase, IllativeCase
Predicates in SUMO • Linguistic/semiotic predicates: refers, represents, containsInformation, realization, representsInLanguage • (containsInformation ?SENT ?PROP) • (representsInLanguage ?THING ?ENTITY ?LANGUAGE)
Additional Predicates • We need a more linguistically centered predicate that acts deictically and ‘picks out’ or points to a particular instance. • (designates ?LinguisticExpression ?Entity)
Future directions • Extend ontology into the domains of phonology and syntax. • Recommend markup approach (XML) • Explore applications of the ontology beyond the immediate EMELD project
Contact Info • Terry Langendoen: langendt@u.arizona.edu • Scott Farrar: farrar@u.arizona.edu • See our website: http://emeld.douglass.arizona.edu:8080