A Common Ontology for Linguistic Concepts Scott Farrar University of Arizona
Overview • Discuss a project for the preservation of endangered language data. • Endangered language data • Overview of the EMELD system • Special focus on the linguistic ontology: motivation, detailed examples, special problems
Endangered Languages • As many as half of the world’s languages (3000 out of 6000) are in danger of disappearing (LaPolla 1998). • Examples include languages in Africa (Kiong), the Americas (Hopi), Australia (Mara), and Southeast Asia (Biao Min). • Many are not written languages. • Language data is of interest because of its cultural and scientific significance.
EMELD • EMELD (Electronic Metastructure for Endangered Languages Data) • An NSF-funded effort • Its purpose is to preserve endangered language data.
Linguistic Data • Field linguists collect data. • Hopi: sivu-’ikwiw-ta-qa [vessel-carry:on:back-DUR-REL] ‘kachina’ • Linguistic data includes: grammars, dictionaries, text, sound recordings, glossed corpora • [Slide labels on the Hopi example: object language, markup, markup]
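The Hopi example above pairs each morpheme of the object language with one gloss. A minimal sketch of that alignment follows, assuming hyphens separate morphemes, the gloss is written without internal spaces, and upper-case glosses are grammatical tags. The function names `align_gloss` and `grammatical_tags` are hypothetical and not part of EMELD.

```python
def align_gloss(form: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each hyphen-separated morpheme with its gloss."""
    morphemes = form.split("-")
    glosses = gloss.strip("[]").split("-")
    if len(morphemes) != len(glosses):
        raise ValueError("form and gloss have different numbers of morphemes")
    return list(zip(morphemes, glosses))


def grammatical_tags(pairs: list[tuple[str, str]]) -> list[str]:
    """Keep only glosses written in capitals (assumed to be markup tags)."""
    return [gloss for _, gloss in pairs if gloss.isupper()]


# The Hopi word from the slide, glossed as on the slide.
pairs = align_gloss("sivu-’ikwiw-ta-qa", "[vessel-carry:on:back-DUR-REL]")
tags = grammatical_tags(pairs)
```

Here `tags` holds the two grammatical tags of the example, DUR and REL, which are the kind of markup the ontology is meant to interpret.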
Linguistic Markup • Purpose: classifying languages, discovering universals, preserving knowledge • Form: XML, HTML, etc. (stand-off vs. in-line) • Content: grammatical categories and relations, speech acts, cultural content
Challenges to Preserving Linguistic Data • Authors or communities want exclusive rights. • No standardized method for linguistic encoding. • Language data should be searchable and comparable—broad access. • Local control should be balanced with data interoperability.
Attempts at Standardization • Text Encoding Initiative (TEI) (Sperberg-McQueen and Burnard 1994) • Corpus Encoding Standard (CES) (Ide and Romary 2000)
Solution Strategy for Preserving Language Data • Integrate the data using metadata—a linguistic ontology. • Make the data available on the Semantic Web. • Build tools to access and process the data—query engines, expert systems.
EMELD Architecture • [Diagram: a UI sends a query (example terms: Australian, SOV, Text) to the EMELD Query Engine, which uses the Linguistic Ontology to reach Hopi, Mocovi, and Biao Min data on the Semantic Web.]
Markup is Mapped to the Linguistic Ontology • Hopi: <DUR>, <REL>, <ANMT> • Biao Min: <ILL>, <NOUN>, <CAUS> • Mocovi: <SENT>, <REL>, <MOOD> • [Diagram: the tags of each dataset point to concepts in the Linguistic Ontology.]
Interoperability • [Diagram: a single ontology sits above several markup schemes, each of which annotates its own data.]
Ontology Formalizes Metadata • Nouns represent objects. • Tense relates an event to a point in time. • Case is a relation between a predicate and its argument, e.g., He knows him. • SOV, SVO, OVS are possible natural language word orders. • [Slide labels: general concepts, linguistic concepts]
Linguistic Ontology • Conceptual model for the linguistics domain • Starting point is morpho-syntax (tense, case, aspect, causality, inflection) • Built on top of the Suggested Upper Merged Ontology (SUMO) (Niles and Pease 2001)
Suggested Upper Merged Ontology (SUMO) • extensible resource • incorporates concepts from a number of top-level ontologies • already includes a number of concepts related to semiotics and linguistics • peer-reviewed (IEEE working group) and freely available (http://suo.ieee.org)
The Upper Ontology contains nearly 700 terms and 2500 assertions • [Diagram: excerpt of the class hierarchy rooted at Entity] • Abstract: Class, Proposition, Quantity, PhysicalQuantity, Number, …, Attribute, BiologicalProperty, … • Physical: Object, Organic, Inorganic, …, Collection, Process, NaturalProcess, Motion, …
Predicates in SUMO • Structural predicates: instance, domain, domainSubclass, subclass, subrelation, valence, range, documentation, disjoint • Example: (subclass Virus Microorganism)
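A structural predicate such as subclass is useful because it chains: one stored fact licenses further inferences. The sketch below is a toy fact store, not the SUMO/KIF machinery itself, and `TinyKB` is a hypothetical name. The first fact is the one on the slide; the second is added only so that there is something to chain.

```python
from collections import defaultdict


class TinyKB:
    """Toy store for (subclass Child Parent) facts."""

    def __init__(self) -> None:
        self._parents: dict[str, set[str]] = defaultdict(set)

    def add_subclass(self, child: str, parent: str) -> None:
        self._parents[child].add(parent)

    def is_subclass(self, child: str, ancestor: str) -> bool:
        """Reflexive, transitive check: walk upward from child."""
        stack, seen = [child], set()
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._parents.get(current, ()))
        return False


kb = TinyKB()
kb.add_subclass("Virus", "Microorganism")     # fact shown on the slide
kb.add_subclass("Microorganism", "Organism")  # assumed extra fact, for illustration only

virus_is_organism = kb.is_subclass("Virus", "Organism")
organism_is_virus = kb.is_subclass("Organism", "Virus")
```

The walk follows subclass links in one direction only, so the second query does not succeed even though the first does.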
Predicates in SUMO • Linguistic/semiotic predicates: refers, represents, containsInformation, realization, representsForAgent, representsInLanguage • Example: (containsInformation ?SENT ?PROP)
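Upper Linguistic Ontology • Entity > Physical > Object > ContentBearingObject • ContentBearingObject: Icon, SymbolicString, LinguisticExpression • LinguisticExpression: WrittenLinguisticExpression, SpokenLinguisticExpression • WrittenLinguisticExpression: Text, Sentence, Phrase, Word, Morpheme • SpokenLinguisticExpression: Dialogue, Sentence, Phrase, Word, Morpheme • [Slide label: SUMO Concepts]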
Upper Linguistic Ontology (cont.) • Abstract > Class > Relation > Predicate > GrammaticalRelation: Aspect, Tense, Case, Agreement • Abstract > Attribute > GrammaticalAttribute: Gender, Person, Number • Extended SUMO contains 300 concepts
Simple Query Example • Query: (same-concept ?Morpheme PT), where PT is the EMELD markup for ‘past tense’ • Result: examples containing synonymous tags in other datasets: PAST, PST, … ,TENSE
Simple Query Example • Tags: ContentBearingObject > SymbolicString > Tag, with KiongTag ‘PT’ and MaraTag ‘PAST’ • Concept: Class > Relation > Predicate > GrammaticalRelation > Tense > PastTense • Both tags are linked to PastTense by represents.
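The same-concept query can be read as a lookup through the represents links: find the concept a tag represents, then collect every other tag that represents it. The sketch below assumes a flat table of such links. `REPRESENTS` and `same_concept` are hypothetical names, and the Hopi row, including the concept name DurativeAspect, is invented purely to show a non-match.

```python
# (dataset, tag) -> ontology concept the tag represents
REPRESENTS = {
    ("Kiong", "PT"): "PastTense",       # from the slide
    ("Mara", "PAST"): "PastTense",      # from the slide
    ("Hopi", "DUR"): "DurativeAspect",  # hypothetical row and concept name
}


def same_concept(dataset: str, tag: str) -> list[tuple[str, str]]:
    """Return other (dataset, tag) pairs that represent the same concept."""
    concept = REPRESENTS.get((dataset, tag))
    if concept is None:
        return []
    return sorted(
        key
        for key, other in REPRESENTS.items()
        if other == concept and key != (dataset, tag)
    )


matches = same_concept("Kiong", "PT")
```

Because the comparison happens at the level of concepts, the tag strings themselves never need to agree across datasets.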
Query Example • Subdomain: Language and Space • Expressed by case, prepositions, and verbs
Morphosyntactic Case • Case: InherentCase, StructuralCase • InherentCase: Spatio-KineticCases (PositionalCase: InessiveCase; DirectionalCase: IllativeCase), ExistentialCase, AbessiveCase, PartitiveCase, InstrumentalCase • StructuralCase: GenitiveCase, AccusativeCase, NominativeCase • Instances of Case express propositions, e.g., spatio-kinetic relations or predicate-argument relations (“who did what to whom”).
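A hierarchy like this lets a query name a general category and receive the specific cases beneath it. The sketch below encodes part of the case hierarchy as a child map and collects the leaves under any node. The parent-child links are read off the flattened slide and are an assumption; cases whose position is unclear (ExistentialCase, AbessiveCase, PartitiveCase, InstrumentalCase) are left out. `CHILDREN` and `leaf_cases` are hypothetical names.

```python
# Parent -> children, as far as the slide layout can be recovered (assumed).
CHILDREN = {
    "Case": ["InherentCase", "StructuralCase"],
    "InherentCase": ["Spatio-KineticCases"],
    "Spatio-KineticCases": ["PositionalCase", "DirectionalCase"],
    "PositionalCase": ["InessiveCase"],
    "DirectionalCase": ["IllativeCase"],
    "StructuralCase": ["GenitiveCase", "AccusativeCase", "NominativeCase"],
}


def leaf_cases(node: str) -> list[str]:
    """Return the most specific cases below node, in listed order."""
    children = CHILDREN.get(node, [])
    if not children:
        return [node]
    leaves: list[str] = []
    for child in children:
        leaves.extend(leaf_cases(child))
    return leaves


spatial = leaf_cases("Spatio-KineticCases")
structural = leaf_cases("StructuralCase")
```

A query about spatial meaning could then be expanded to the individual cases in `spatial` before it is matched against tags in each dataset.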
Query Example (cont.) • Query: (same-concept ?Morpheme ‘into’) • [Diagram: EMELD runs the query against glossed text from Hopi and a word list from Biao Min.]
Query Example (cont.) • LinguisticExpression > Word > Preposition: ‘into’ • Case > InherentCase > SpatioKineticCase > … > IllativeCase • Axiom as shown on the slide: (=> (exists ?OBJ1 ?OBJ2 ?EVENT) (instance ?OBJ1 SelfConnectedObject) (instance ?OBJ2 SelfConnectedObject) (penetrable ?OBJ2) (instance ?EVENT Motion) (outside ?OBJ1 ?OBJ2) (inside ?OBJ1 ?OBJ2))
Challenges to Conceptual Modeling • Ontological commitment: text and speech can express the same proposition, but are they the same? • Linguistic theory: what is a Sentence anyway? • Linguistics touches on many domains: causality, time and space, action
Future directions • Extend the ontology into the domains of phonology and discourse analysis. • Recommend a markup approach (XML). • Explore applications of the ontology beyond the immediate EMELD project: • Topic Maps: Consider the linguistic ontology as part of a Published Subject for linguistics. • Machine Translation: Explore the need for such an ontology in MT.
Contact Info • Scott Farrar farrar@u.arizona.edu • See our website: http://emeld.douglass.arizona.edu