
An Ontology for Linguistic Representation

Explores a linguistic ontology for sharing endangered language data over the Semantic Web, the challenges of data encoding and conceptual modeling, and strategies for providing broad access.

pgardner


Presentation Transcript


  1. An Ontology for Linguistic Representation • Scott Farrar, Terry Langendoen, William Lewis • University of Arizona

  2. Overview • Discuss a proposal of how linguistic data can be shared over the Semantic Web. • Endangered language data (EMELD) • Special focus on a linguistic ontology (conceptual modeling in linguistics)

  3. Ontology and Linguistics • How is the lexicon reflected in top-level distinctions? (Gangemi, Guarino, Masolo, and Oltramari 2001) • How are functional/grammatical categories reflected in the ontology?

  4. EMELD • EMELD (Electronic Metastructure for Endangered Languages Data) • As many as half of the world’s languages (3000 out of 6000) are in danger of disappearing (LaPolla 1998). • The purpose is to preserve endangered language data by creating a community of practice via the Semantic Web.

  5. Linguistic Data • Field linguists collect data. • Hopi: sivu-’ikwiw-ta-qa [vessel-carry:on:back-DUR-REL] ‘kachina’ • Linguistic data includes: grammars, dictionaries, text, sound and video recordings, glossed corpora • Diagram labels: subject language, analysis (markup), gloss (markup)

  6. Grammatical Descriptions • English has noun-verb agreement in number and person. • Warumungu is ergative-absolutive. • Archi has extensive spatial cases. • Spanish is SVO.

  7. Facts about Language • Nouns represent objects. • Tense relates an event with a point in time. • Case is a relation between a predicate and its argument, e.g., He knows him. • SOV, SVO, OVS are possible natural language word orders. • Diagram labels: general concepts, linguistic concepts

  8. Challenges to Creating a Community of Practice • Language data should be searchable and comparable (broad access). • There are few standardized methods for encoding language data (cf. EAGLES and TEI). • Authors or communities want control over their data. • Local control should be balanced with data interoperability: the Semantic Web.

  9. Example of the Problem • Google query: [“past tense” Australian languages] • Web: Umpila [Prehodiernal Tense]; Warumungu [PAST]; Dyirbal [PST]

  10. Other Examples of the Problem • homonymous terminology: A search for PA intended to mean ‘Partitive’ might return PA meaning Past or PerfectiveAspect. • “covert” markup: Hopi CAUS really means Causative combined with PerfectiveAspect

  11. Some Further Complications • A search for present tense forms in English returns forms with future reference, e.g., “Tomorrow, John goes to Holland.” • Some languages (e.g., Hopi) have past, present, and habitual analyzed as grammatical tenses. • Searching for habitual aspect in Hopi therefore does not guarantee results that have anything to do with aspect.

  12. Solution Strategy for Providing Broad Access to Language Data • Integrate the data using metastructure—a linguistic ontology. • Make the data available in standard format without imposing a standard for markup. • Build tools to access and process the data—query engines, expert systems…
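
The integration idea on slides 9 to 12 can be sketched as a lookup from local markup to shared concepts. This is a minimal illustration, not the EMELD implementation: the table name `LOCAL_TAG_TO_CONCEPT`, the function `languages_marking`, the concept names `PastTense` and `PartitiveCase`, and the two "ExampleLanguage" entries are all assumptions made for the example. Only the Umpila, Warumungu, and Dyirbal tags come from slide 9.

```python
from collections import defaultdict

# Hypothetical table: (language, local gloss tag) -> shared ontology concept.
# Concept names are illustrative placeholders, not taken from the ontology.
LOCAL_TAG_TO_CONCEPT = {
    ("Umpila", "Prehodiernal Tense"): "PastTense",
    ("Warumungu", "PAST"): "PastTense",
    ("Dyirbal", "PST"): "PastTense",
    # Homonymous tag "PA" (slide 10): two invented languages use it differently.
    ("ExampleLanguageA", "PA"): "PartitiveCase",
    ("ExampleLanguageB", "PA"): "PastTense",
}


def languages_marking(concept, tag_map):
    """Return {language: [local tags]} for every tag mapped to `concept`."""
    result = defaultdict(list)
    for (language, tag), mapped_concept in tag_map.items():
        if mapped_concept == concept:
            result[language].append(tag)
    return dict(result)


# A query by concept finds all three Australian languages despite the
# different local tags, and does not confuse the two uses of "PA".
past = languages_marking("PastTense", LOCAL_TAG_TO_CONCEPT)
partitive = languages_marking("PartitiveCase", LOCAL_TAG_TO_CONCEPT)
```

Each community keeps its own tags; only the mapping to the shared concept has to be agreed on, which is the balance between local control and interoperability described on slide 8.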

  13. Challenges to Conceptual Modeling of Linguistic Domain • domain is large • most concepts are abstract • linguistic objects are hierarchical • language is symbolic • field is fragmented—few standards (cf. chemistry, computer science)

  14. Linguistic Ontology • Our starting point is morpho-syntax (word parts, tense, case, aspect, inflection) • linguistic segments, grammatical concepts, data structures • Built on top of the Suggested Upper Merged Ontology (SUMO)

  15. Suggested Upper Merged Ontology (SUMO) • extensible resource • already includes a number of concepts related to semiotics and linguistics • connection with the NLP community (WordNet; B. Levin’s verb classes; Allen’s tense logic) • developed by an IEEE working group and freely available (http://suo.ieee.org) (Niles and Pease 2001)

  16. Physical Entities What entities are physical, i.e., exist in space-time? • The word and its parts (written or spoken): Stems, Affixes, Roots, Phonetic segments • Larger constituents: Phrases, Clauses, Texts

  17. Taxonomy for LinguisticExpression • Entity > Physical > Object > SelfConnectedObject > ContentBearingObject > {Icon, SymbolicString, LinguisticExpression} • LinguisticExpression > {WrittenLinguisticExpression, SpokenLinguisticExpression}
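
A taxonomy like the one on slide 17 can be represented as a child-to-parent table and queried by walking upward. The sketch below is only an illustration: the parent assignments are one reading of the flattened slide diagram, and the names `SUBCLASS_OF`, `ancestors`, and `is_subclass` are invented for the example.

```python
# Child -> parent links, as read from slide 17 (an assumed reconstruction).
SUBCLASS_OF = {
    "Physical": "Entity",
    "Object": "Physical",
    "SelfConnectedObject": "Object",
    "ContentBearingObject": "SelfConnectedObject",
    "Icon": "ContentBearingObject",
    "SymbolicString": "ContentBearingObject",
    "LinguisticExpression": "ContentBearingObject",
    "WrittenLinguisticExpression": "LinguisticExpression",
    "SpokenLinguisticExpression": "LinguisticExpression",
}


def ancestors(cls):
    """Return the superclasses of `cls`, nearest first, up to the root."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_subclass(cls, ancestor):
    """True if `cls` equals `ancestor` or inherits from it transitively."""
    return cls == ancestor or ancestor in ancestors(cls)


written_chain = ancestors("WrittenLinguisticExpression")
```

Because every linguistic class ultimately inherits from Physical, any axiom stated for physical objects in the upper ontology also applies to written and spoken expressions.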

  18. Taxonomy for LinguisticExpression WrittenLinguisticExpression WordPart SimpleWordPart Root Affix Prefix Infix Suffix Clitic Stem Word SimpleWord ComplexWord Compound Phrase instance-of ‘un-’ instance-of ‘dog’

  19. Other Possibilities recall: Word SimpleWord ComplexWord Compound Word Noun SimpleNoun ComplexNoun Verb SimpleVerb ComplexVerb Adjective … instance-of ‘dog’ instance-of ‘jump’ instance-of ‘dog’ instance-of ‘jump’

  20. Part of Speech as Property • Language exhibits categorial ambiguity (e.g., “fish”). • Some languages have no noun/verb distinction (e.g., Lummi), so part of speech is language-specific. • Related to the notions of “rigid” and “anti-rigid” properties in OntoClean (Guarino and Welty 2002)

  21. Mereological Relations for LinguisticExpression A Stem has-part Root. (=> (instance ?STEM Stem) (exists (?PART) (and (part ?PART ?STEM) (instance ?PART Root))))

  22. Mereological Relations for LinguisticExpression A Word has-part WordPart (e.g., a Stem). (=> (instance ?WORD Word) (exists (?PART) (and (part ?PART ?WORD) (instance ?PART WordPart))))
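
The two has-part axioms on slides 21 and 22 can be read as constraints that annotated data must satisfy. The following sketch checks them over a small in-memory structure. It is an illustration under assumptions: the `Expression` class, the `satisfies_axioms` function, and the set `WORD_PART_KINDS` are invented here, the set's membership is a guess at which classes count as WordPart, and only direct parts are inspected.

```python
from dataclasses import dataclass, field

# Assumed: these classes all count as WordPart for the purposes of the check.
WORD_PART_KINDS = {"WordPart", "Root", "Stem", "Affix", "Prefix", "Infix", "Suffix"}


@dataclass
class Expression:
    form: str                                   # the written form, e.g. "dogs"
    kind: str                                   # ontology class name, e.g. "Word"
    parts: list = field(default_factory=list)   # direct parts only


def satisfies_axioms(expr):
    """Check the slide 21 and slide 22 axioms against direct parts."""
    if expr.kind == "Stem":
        # A Stem must have some part that is a Root.
        return any(p.kind == "Root" for p in expr.parts)
    if expr.kind == "Word":
        # A Word must have some part that is a WordPart.
        return any(p.kind in WORD_PART_KINDS for p in expr.parts)
    return True  # no axiom stated for other classes


root = Expression("dog", "Root")
stem = Expression("dog", "Stem", [root])
suffix = Expression("-s", "Suffix")
word = Expression("dogs", "Word", [stem, suffix])
bare_word = Expression("dogs", "Word")  # annotated without any parts
```

A validator of this kind would flag `bare_word` as incomplete markup while accepting `word` and `stem`.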

  23. Other Axioms for LinguisticExpression • None for WrittenLinguisticExpression, without referring to abstract section of ontology. • Phonetic overlay (intonation contour)

  24. Abstract Entities • What entities are abstract, i.e., qualities or attributes? • Grammatical attributes: Tense, Aspect, Case • Linguistic data structures: Paradigms, Feature Structures, PhonemeTables, Derivations (as in “Minimalism”) • Mental entities: Morpheme, Phoneme, Lexeme

  25. Taxonomy of GrammaticalProperty Abstract Relation Proposition Attribute InternalAttribute RelationalAttribute ? ( ) GrammaticalAttribute PartOfSpeech Tense Aspect Case …

  26. Relational • SyntacticRole: subject, object • CaseRole: agent, patient • These are relational, much the same as PositionalAttribute, e.g., above.

  27. Internal • PartOfSpeech: noun, verb, determiner • Within a grammar, these do not appear to be relational, cf. ShapeAttribute.

  28. Problematic Cases • Tense, Aspect, Case • As grammatical attributes, these do not appear to be relational. • But what about meaning? • After all, that’s the goal of the Semantic Web.

  29. Grammar-Meaning Distinction • GrammaticalAttribute should be tied to a language (e.g., to construct paradigms). • Semantic notions do not have to be tied to a specific language. • Tendency in language is to match these up, e.g., verbs pick out predicates, nouns pick out terms and functions. • Salishan languages—no verb-noun distinction.

  30. Tense • AbsolutePastTense: (=> (AbsolutePastTense ?SENTENCE) (exists (?INTERVAL) (and (instance ?INTERVAL TimeInterval) (during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE))) (before ?INTERVAL (WhenFn ?SENTENCE))))) • HodiernalPastTense: ?INTERVAL has to be during ‘today’ • RemotePastTense: ?INTERVAL has to be before a certain time point (language-specific)

  31. Tense • RelativePastTense: (=> (RelativePastTense ?SENTENCE) (exists (?INTERVAL ?POINT) (and (instance ?INTERVAL TimeInterval) (instance ?POINT TimePoint) (during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE))) (before ?INTERVAL ?POINT))))
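
The contrast between the two tense axioms on slides 30 and 31 can be made concrete with numeric time intervals. This is a simplified reading, not a translation of the axioms: it assumes that absolute past places the event before the time of utterance, while relative past places it before some other reference point. The names `Interval`, `before`, `is_absolute_past`, and `is_relative_past` are invented for the sketch.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def before(a: Interval, b: Interval) -> bool:
    """True if interval `a` ends strictly before interval `b` starts."""
    return a.end < b.start


def is_absolute_past(event: Interval, utterance: Interval) -> bool:
    """Simplified absolute past: the event precedes the utterance itself."""
    return before(event, utterance)


def is_relative_past(event: Interval, reference_point: float) -> bool:
    """Simplified relative past: the event precedes a given reference point."""
    return event.end < reference_point


utterance = Interval(10.0, 11.0)

# "John left": the leaving precedes the moment of speaking.
left = Interval(1.0, 2.0)

# "John will have left by then": the leaving follows the moment of speaking
# but precedes a later reference point (here 15.0).
will_have_left = Interval(12.0, 13.0)
```

Under this reading `will_have_left` is past relative to the reference point 15.0 without being past relative to the utterance, which is why the two tenses need separate definitions.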

  32. Taxonomy of Case Case EventiveCase AffectorCase AffecteeCase NoneventiveCase xxxxCase SpatialCase DirectionalCase PositionalCase instance-of ErgativeCase instance-of DativeCase instance-of ExistentialCase instance-of IllativeCase

  33. Predicates in the SUMO • Linguistic/Semiotic Predicates: refers, represents, containsInformation, realization, representsInLanguage • (containsInformation ?SENT ?PROP) • (representsInLanguage ?THING ?ENTITY ?LANGUAGE)

  34. Additional Predicates Need a more linguistically-centered predicate that acts deictically and ‘picks out’ or points to a particular instance. (designates ?LinguisticExpression ?Entity)

  35. Future directions • Extend ontology into the domains of phonology and syntax. • Recommend markup approach (XML) • Explore applications of the ontology beyond the immediate EMELD project

  36. Contact Info • Terry Langendoen: langendt@u.arizona.edu • Scott Farrar: farrar@u.arizona.edu • See our website: http://emeld.douglass.arizona.edu:8080
