
Data Category Registry: Morpho-syntactic Profile

Gil Francopoulo (TAGMATICA + CNRS-LIMSI, France), with the help of the rather active Morphosyntactic group: Nuria Bel (Univ. Pompeu Fabra, Spain), Thierry Declerck (DFKI, Germany), Aida Khemakhem (Miracle/Sfax, Tunisia)


Presentation Transcript


  1. Data Category Registry: Morpho-syntactic Profile
     Gil Francopoulo (TAGMATICA + CNRS-LIMSI, France), with the help of the rather active Morphosyntactic group:
     • Nuria Bel (Univ. Pompeu Fabra, Spain)
     • Thierry Declerck (DFKI, Germany)
     • Aida Khemakhem (Miracle/Sfax, Tunisia)
     • Monte George (ANSI, USA)
     • Sue Ellen Wright (Kent Univ., USA)
     • Chu Ren Huang (Polytechnical Univ., Hong-Kong)
     • Monica Monachini (CNR-ILC, Italy)
     • Tokunaga Takenobu (TIT, Japan)
     • Adam Przepiórkowski (Poland)
     • Tomaz Erjavec (JSI, Slovenia)
     • Daniel Zeman (Univ. Karlova, Czech Rep.)
     • Gunta Nespore (Univ. of Latvia, Latvia)
     • Karlheinz Mörth (Vienna, Austria)
     • Karin Beck (Univ. of Tübingen, Germany)

  2. Introduction
     • Work is in progress to define an initial set of morpho-syntactic data categories dedicated to NLP applications
     • The aim is to improve interoperability among language resources and to optimize the process leading to their integration in applications
     • The main point is to make sure that when a language resource uses a value, the other language resources and programs interpret that value in the same way
     • These values have been collected from existing lists, discussed, extended, and then recorded in a freely accessible database: the ISO Data Category Registry (DCR)

  3. Context
     • This work is done within the context of ISO-TC37
     • The TC37 standards are currently being elaborated as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO-MAF 24611, ISO-LAF 24612 and ISO-SynAF 24615), feature structures (ISO-FSD 24610), and lexicons (ISO-LMF 24613)
     • These standards rely on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), script codes (ISO 15924), country codes (ISO 3166) and Unicode (ISO 10646)

  4. Context (cont.)
     • This bi-level approach will form a coherent family of standards with the following common and simple rules:
       1) the low level specifications provide the constants
       2) the high level specifications provide structural elements that are decorated by the constants
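
The bi-level rule on slide 4 can be sketched in Python as follows. This is an illustration only, not part of any ISO specification: the class name `WordForm`, its fields, the identifier `partOfSpeech` and the sample language codes are assumptions made for the sketch. The other data category names (`commonNoun`, `grammaticalGender`, `feminine`) appear on later slides.

```python
from dataclasses import dataclass, field

# Low level: constants only. A real system would load these from the registry
# and from the code standards; here they are hard-coded samples (assumption).
DATA_CATEGORIES = {"partOfSpeech", "commonNoun", "grammaticalGender", "feminine"}
LANGUAGE_CODES = {"fr", "en"}  # sample language codes


@dataclass
class WordForm:
    """High level: a hypothetical structural element decorated by constants."""
    written_form: str
    language: str
    features: dict = field(default_factory=dict)

    def unknown_constants(self):
        """Return every constant used here that the low level does not define."""
        unknown = []
        if self.language not in LANGUAGE_CODES:
            unknown.append(self.language)
        for attribute, value in self.features.items():
            unknown.extend(c for c in (attribute, value) if c not in DATA_CATEGORIES)
        return unknown


entry = WordForm(
    "table",
    "fr",
    {"partOfSpeech": "commonNoun", "grammaticalGender": "feminine"},
)
print(entry.unknown_constants())  # an empty list: every constant is defined
```

The structure itself carries no linguistic vocabulary; swapping in a different set of constants does not require changing the class.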

  5. Data model: notion of profile
     • The registry is divided into profiles
     • A profile is a set of data categories
     • Each profile is associated with a team of experts (with a convenor) who collectively represent a community of practice in the area of language resources
     • There are currently fourteen profiles, such as terminology, metadata etc., covering all activities of ISO-TC37. The current presentation focuses on one profile dedicated to NLP: the morpho-syntactic profile
     • Note: in most cases a DC belongs to only one profile, but some of them belong to several profiles (e.g. part of speech)

  6. Methodology: phases
     • We proceeded in four phases:
       • Phase-1: collating of candidate data categories (2006)
       • Phase-2: grouping, discussing, structuring, and drafting of definitions (2007-2008)
       • Phase-3: global revision (2009)
       • Phase-4: welcoming a group of newcomers for another revision (2010)

  7. Methodology: sources
     • For the morpho-syntactic profile, a long list has been collected from:
       • ISO-12620:1999
       • Eagles and Multext-East
       • Some values for Semitic languages coming from Sfax Univ.
       • Some values needed for ISO-TC37 standards (MAF, SynAF, LMF), which were also added
       • Some isolated values that came from various remarks in 2010
     • These values have been collected in close coordination with the syntactic profile in order to distinguish the morpho-syntactic values from the syntactic ones. For the syntactic values, an initial list was collected, based on:
       • Eagles
       • Tiger (German project)
       • Technolangue/Easy (French project)

  8. Methodology: detail of recording
     • Each DC has an English-based identifier in camel case style (e.g. commonNoun), as specified in the revision of ISO-12620
     • Each DC has a definition in English and French. The text respects the ISO rules for definitions. A definition may be complemented by a note.
     • A DC may be linked through a broader link to another DC. A DC may have a value domain.
     • Each DC has at least one name in English and one in French, which may be used directly for display without any transformation (e.g. « common noun »)
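
The recording rules on slide 8 can be pictured as a small record type. The sketch below is an assumption-laden illustration, not the isocat data model: the class `DataCategory`, its field names and the `problems` check are hypothetical, the definition texts are placeholders, and the French name and the broader link to `noun` are guesses made only to fill in the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Lower camel case: a lowercase start, then capitalized chunks (e.g. commonNoun).
CAMEL_CASE = re.compile(r"[a-z]+(?:[A-Z][a-z]*)*")
REQUIRED_LANGUAGES = ("en", "fr")


@dataclass
class DataCategory:
    """Hypothetical record mirroring the fields listed on the slide."""
    identifier: str
    definitions: dict               # language code -> definition text
    names: dict                     # language code -> display name
    note: Optional[str] = None      # optional complement to the definition
    broader: Optional[str] = None   # identifier of a broader DC, if any
    value_domain: tuple = ()        # identifiers of permitted values, if any

    def problems(self):
        """List the recording rules that this record breaks."""
        issues = []
        if not CAMEL_CASE.fullmatch(self.identifier):
            issues.append(f"identifier {self.identifier!r} is not camel case")
        for lang in REQUIRED_LANGUAGES:
            if not self.definitions.get(lang):
                issues.append(f"missing definition in {lang}")
            if not self.names.get(lang):
                issues.append(f"missing name in {lang}")
        return issues


common_noun = DataCategory(
    identifier="commonNoun",
    definitions={"en": "<English definition>", "fr": "<French definition>"},  # placeholders
    names={"en": "common noun", "fr": "nom commun"},  # French name is an assumption
    broader="noun",  # illustrative broader link, not taken from the registry
)
print(common_noun.problems())  # an empty list when all rules are met
```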

  9. Current registry
     • The 12620 revision work started in 2003 and a lot of energy has been spent to find an operational consensus
     • The model is implemented in a system called « isocat », which is currently running and located at: « http://www.isocat.org »
     • About a dozen people have entered values, mainly in the domains of metadata, terminology, morpho-syntax, and syntax. The other profiles are almost empty.
     • The number of values is rather large (468), so in order to facilitate management, a series of sub-profiles was created

  10. Practical organization of data
      Morpho-syntactic profile, by sub-profile (number of values in parentheses):
      • Basics (61): general purpose linguistic constants, like comment, derivation, elision, foreignText, and label.
      • Cases (33): examples of values are ablativeCase or dativeCase.
      • FormRelated (36): constants for the specification of forms, like spokenForm, writtenForm, abbreviation, expansionVariation, transliteration, romanization, transcription, script.
      • Morphological features excluding cases (82): attributes include, for instance, grammaticalGender, mood and tense. Values include, for instance, feminine, indicative, present.
      • Operations (29): constants include, for instance, addAffix, addLemma.
      • Part of speech (120): part of speech values are structured with a top level set composed of 10 values like noun or verb. A very precise ontology is specified for grammatical words. Most parts of speech are common to lexicons and annotations, but two sets of values (i.e. punctuation and residual) are specific to annotation and are not usually used in lexical descriptions.
      • Register, dating and frequency (19): constants include, for instance, slangRegister or rarelyUsed.
      • Total: 380
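
As a quick arithmetic check, the seven sub-profile counts on slide 10 do add up to the stated total of 380. The short Python sketch below recomputes the sum; the dictionary name `SUB_PROFILES` is just a label chosen for this example.

```python
# Sub-profile sizes as given on slide 10.
SUB_PROFILES = {
    "Basics": 61,
    "Cases": 33,
    "FormRelated": 36,
    "Morphological features excluding cases": 82,
    "Operations": 29,
    "Part of speech": 120,
    "Register, dating and frequency": 19,
}

total = sum(SUB_PROFILES.values())
largest = max(SUB_PROFILES, key=SUB_PROFILES.get)

print(total)    # 380
print(largest)  # Part of speech
print(f"{SUB_PROFILES[largest] / total:.1%}")  # 31.6%
```

Part of speech is the largest sub-profile, holding just under a third of the values.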

  11. Extract: Cases
      abessiveCase, ablativeCase, absolutiveCase, accusativeCase, adessiveCase, aditiveCase, allativeCase, benefactiveCase, causativeCase, comitativeCase, dativeCase, delativeCase, elativeCase, equativeCase, ergativeCase, essiveCase, genitiveCase, illativeCase, inessiveCase, instrumentalCase, lativeCase, locativeCase, nominativeCase, obliqueCase, partitiveCase, prolativeCase, sociativeCase, sublativeCase, superessiveCase, terminativeCase, translativeCase, vocativeCase

  12. Extract: Form related values
      affix, infix, prefix, suffix, affixRank, allomorph, apocope, componentRank, conjugated, contextualVariation, expansionVariation, geographicalVariant, graphicalSeparator, homograph, homonym, homophone, lemma, lexicalType, morpheme, etymologicalRoot, native, orthographyName, patternType, phoneticForm, phoneticSeparator, pinyin, nonSpacedPinyin, spacedPinyinAndTone, reduplication, root, script, stem, stemRank, symbol, token, writtenForm

  13. Problems encountered
      • As said earlier, we started from existing lists that are rather stable, like those of Eagles or Multext-East
      • The main problem we encountered was that we had to write definitions. We searched various sources and found some definitions that looked fine in isolation but did not collectively constitute a coherent set
      • Linguistics is not a field with common agreement on basic terms, e.g. paradigm, collocation, morpheme, ergative
      • As an example, look at the entry « morphology » in Wikipedia
      • Another problem we faced was that we had to write definitions that are valid for both lexicon and annotation activities, e.g. « word »
      • To deal with this problem, we carefully avoided some dangerous terms

  14. Forthcoming data
      • The current database records values for West/East European languages and, to a certain extent, for Semitic languages
      • We know that this is clearly not enough
      • Two parallel tasks are currently being conducted
      • One task deals with Asian values within the NEDO project. A small set of DCs has been entered in the database
      • The other task deals with the DCs specifically needed for African languages: a study is being conducted by the ISO South African delegation, but the values have not yet been entered in the database

  15. Conclusion
      • The registry is far from complete, but it is beginning to be used and tested within different applications
      • The idea is to progressively increase the number and coverage of these data categories
      • The ambition is that the registry will become the reference point when using linguistic terms and data elements in lexicons and annotations in an NLP context
      • Thank you for your attention
