
Standards for language resources: the ISO/TC 37(/SC 4) perspective

Presentation Transcript


  1. Standards for language resources: the ISO/TC 37(/SC 4) perspective
     Laurent Romary, Research Director (Directeur de Recherche), INRIA; chair of ISO/TC 37/SC 4

  2. Context
     • ISO TC37 - Terminology and other language resources
       • SC3 - Computer applications in terminology
         • ISO 12200 - MARTIF
         • ISO 12620 - Data categories (under revision)
         • ISO 16642 - TMF (Terminological Markup Framework)
       • SC4 - Language Resource Management (www.tc37sc4.org)

  3. An example scenario: information extraction
     [Diagram labels, in transcript order]
     • Semantic content
     • Content analysis
     • Syntactic structures
     • Chunk parsing
     • Part-of-speech tagging
     • POS tagging
     • Primary Data

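  4. Horizontal view (W3C perspective)
     [Diagram: the information extraction stack annotated with W3C technologies]
     • W3C technologies: XML, RDF, OWL, SOAP
     • Stack: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data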

  5. Vertical view (ISO/TC 37/SC 4 perspective)
     [Diagram: the information extraction stack annotated with SC 4 concerns]
     • SC 4 concerns: Evaluation; Linguistic models and descriptors (Data Categories); Lexica
     • Stack: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data

  6. Linguistic information sources …and initiatives
     • Access protocols [CORBA, SOAP]
     • Primary resources (text, dialogues) [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
       • Structural mark-up
       • Basic annotations
     • Knowledge structures [Topic Maps, OIL, RDF]
       • Hierarchies of types
       • Relations between concepts (subjects/topics etc.)
       • Links to primary resources
     • Links
     • NLP structures (annotations) [Eagles/ISLE, CES, MATE, …]
       • POS tagging
       • Chunks (cf. Named Entities)
       • Deep syntactic structures
       • Co-references etc.
     • Lexical structures (language models) [TBX, OLIF, Eagles/ISLE (Genelex)]
       • Terminologies
       • Transfer lexica
       • LTAG/HPSG/LFG lexica
     • Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]

  7. SC4 Approach
     • Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
     • In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
     • No provision of new formats
     • Situate development squarely in the framework of XML and related standards
     • Ensure compatibility with established and widely accepted web-based technologies
     • Ensure feasibility of transduction from legacy formats into newly defined formats

  8. SC4 and other standardizing bodies
     [Diagram labels: Contributing organizations; Technical background; Oscar Text]
     • TEI: text representation
       • Reference for primary sources
       • e.g.: text archives
     • W3C: basic protocols and formats
       • XML (Schemas)
       • XPath
       • XPointer
       • + RDF, SVG, SMIL, SOAP
     • ISO TC37/SC4: language resources, NLP perspective
       • e.g. linguistic annotations, lexical formats
     • MPEG: multimedia, XML based
       • e.g. MPEG7-4
       • Word and phone lattices
       • Audio/Speech

  9. ISO/TC 37/SC 4 structure
     • WG1: Basic descriptors and mechanisms for language resources
     • WG2: Representation schemes
     • WG3: Multilingual text representation
     • WG4: Lexical databases
     • WG5: Workflow of language resource management
     • [Diagram label: Data categories]

  10. On-going activities
     • Feature structure representation (in collaboration with the TEI - Text Encoding Initiative)
       • ISO DIS 24610
     • Morpho-syntactic annotation
       • ISO NP 24611
     • Lexical markup framework
       • ISO NP 24612 (+ ISO NP 12620-3)
     • Task force on meta-data for language resources (OLAC + IMDI)
     • ACL/Sigsem working group on multimodal content representation
     • Data category registry for ISO/TC 37
       • ISO CD 12620-1 on ballot (deadline Jan. 2004)

  11. Modeling linguistic annotation structures

  12. General framework - 1
     • Model for linguistic annotation that can
       • be instantiated in a standard representational format
         • GMT: Generic Mapping Tool
       • serve as a pivot format into and out of which proprietary formats may be transduced, to enable
         • comparison, merging, manipulation via common tools
     • Reference: ISO 16642 - Terminological Markup Framework
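
The pivot idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two input formats ("slash" and "column"), the tuple-based pivot layout, and all function names are hypothetical and are not taken from GMT or ISO 16642. The point is that each format needs one reader and one writer against the pivot, after which merging and comparison use common code.

```python
# Hypothetical sketch of a pivot format; none of these layouts come from ISO 16642.
# Pivot representation: a list of (start, end, label) annotation tuples.

def from_slash_format(text):
    """Read a hypothetical 'word/TAG word/TAG' format into pivot tuples."""
    pivot, offset = [], 0
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pivot.append((offset, offset + len(word), tag))
        offset += len(word) + 1  # +1 for the separating space
    return pivot

def from_column_format(lines):
    """Read a hypothetical 'start<TAB>end<TAB>TAG' format into pivot tuples."""
    return [(int(s), int(e), t) for s, e, t in (l.split("\t") for l in lines)]

def to_column_format(pivot):
    """Write pivot tuples back out in the hypothetical column format."""
    return [f"{s}\t{e}\t{t}" for s, e, t in pivot]

def merge(a, b):
    """Merge two pivot annotation sets, dropping exact duplicates."""
    return sorted(set(a) | set(b))

slash = from_slash_format("the/DET cat/NOUN sleeps/VERB")
column = from_column_format(["0\t3\tDET", "4\t7\tNOUN"])
merged = merge(slash, column)
```

With N formats, this design needs N readers and N writers rather than a converter for every pair of formats.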

  13. General framework - 2
     • A meta-model
       • A general, underlying model that informs current practice
     • A set of data categories
       • Provides the precise semantics of the format
       • Obtained:
         • By sub-setting a Data Category Registry
         • By providing application-specific categories
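
The two ways of obtaining a set of data categories (sub-setting a registry, adding application-specific ones) can be sketched as follows. The registry content, the identifiers, and the function name are hypothetical and chosen only for illustration; they are not taken from ISO 12620.

```python
# Hypothetical registry content; identifiers are illustrative, not taken from ISO 12620.
REGISTRY = {
    "partOfSpeech": {"profile": "morpho-syntax"},
    "grammaticalGender": {"profile": "morpho-syntax"},
    "subjectField": {"profile": "terminology"},
}

def select_data_categories(registry, wanted, application_specific=None):
    """Build a data category selection: a registry subset plus local additions."""
    missing = [name for name in wanted if name not in registry]
    if missing:
        raise KeyError(f"not in registry: {missing}")
    selection = {name: registry[name] for name in wanted}
    selection.update(application_specific or {})
    return selection

selection = select_data_categories(
    REGISTRY,
    ["partOfSpeech", "grammaticalGender"],
    application_specific={"projectCode": {"profile": "local"}},
)
```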

  14. ISO 16642: A family of formats
     [Diagram: TMF above a family of TMLs (TML1, TML2, TML3, …, TMLi), with Geneter and TBX shown as examples of TMLs, alongside GMT]

  15. Meta-model
     Terminological Data Collection (TDC)
     • Global Information (GI)
     • Complementary Information (CI)
     • * Terminological Entry (TE)
       • * Language Section (LS)
         • * Term Section (TS)
           • * Term Component Section (TCS)
     (* marks a level that may be repeated)
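
One way to picture the meta-model is as nested containers, each level holding any number of instances of the level below. The sketch below is an assumption-laden illustration: the class names, the `info` dictionaries, and the `count_terms` helper are hypothetical; ISO 16642 defines the levels, not this code. The sample values are taken from the TMF example slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Class and field names are illustrative; ISO 16642 defines the levels, not this code.

@dataclass
class TermComponentSection:
    info: Dict[str, str] = field(default_factory=dict)

@dataclass
class TermSection:
    info: Dict[str, str] = field(default_factory=dict)
    components: List[TermComponentSection] = field(default_factory=list)

@dataclass
class LanguageSection:
    info: Dict[str, str] = field(default_factory=dict)
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:
    info: Dict[str, str] = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)

@dataclass
class TerminologicalDataCollection:
    global_info: Dict[str, str] = field(default_factory=dict)
    complementary_info: Dict[str, str] = field(default_factory=dict)
    entries: List[TerminologicalEntry] = field(default_factory=list)

def count_terms(tdc):
    """Count term sections across all entries and language sections."""
    return sum(len(ls.terms) for te in tdc.entries for ls in te.languages)

term = TermSection(info={"term": "alpha smoothing factor", "termType": "fullForm"})
entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing"},
    languages=[LanguageSection(info={"lang": "en"}, terms=[term])],
)
tdc = TerminologicalDataCollection(entries=[entry])
```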

  16. TMF: example
     TE: id='ID67', subjectField='manufacturing', definition='A value…'
     • LS: lang='en'
       • TS: term='alpha smoothing factor', termType='fullForm'
     • LS: lang='hu'
       • TS: term='…'

  17. Implementation in TBX (cf. www.lisa.org)
     <termEntry id='ID67'>
       <descrip type='subjectField'>manufacturing</descrip>
       <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
       <langSet lang='en'>
         <tig>
           <term>alpha smoothing factor</term>
           <termNote type='termType'>fullForm</termNote>
         </tig>
       </langSet>
       <langSet lang='hu'>
         <tig>
           <term>Alfa ...</term>
         </tig>
       </langSet>
     </termEntry>
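
To show how such an entry maps back onto the TE / LS / TS levels of the meta-model, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It reads a shortened copy of the slide's entry (the elided definition and the Hungarian section are left out). The `read_entry` function and the dictionary layout it returns are assumptions made for illustration, and the `lang` attribute follows the slide rather than any particular TBX release.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's entry; element and attribute names follow the slide.
TBX_ENTRY = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
</termEntry>
"""

def read_entry(xml_text):
    """Map a TBX-like entry back onto the TE / LS / TS levels (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    entry = {"id": root.get("id"), "languages": {}}
    for descrip in root.findall("descrip"):          # TE-level information
        entry[descrip.get("type")] = descrip.text
    for lang_set in root.findall("langSet"):         # one LS per langSet
        terms = []
        for tig in lang_set.findall("tig"):          # one TS per tig
            term = {"term": tig.findtext("term")}
            for note in tig.findall("termNote"):
                term[note.get("type")] = note.text
            terms.append(term)
        entry["languages"][lang_set.get("lang")] = terms
    return entry

entry = read_entry(TBX_ENTRY)
```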

  18. Implementing a Data Category Registry for ISO TC37

  19. Data Category
     • Definition:
       • Elementary descriptor used in a linguistic description or annotation scheme
     • Example:
       • /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
     • Background:
       • Experience gained from ISO 16642 in linguistic format specification
       • Wider notion of data categories as meta-data for tagged language resources

  20. Multiple uses of data categories
     [Diagram labels: Documentation; Meta-data; XML schemas; Data category selection; Meta model; XSL filters]

  21. Application domains
     • Terminological data collection (TC 37/SC 3)
       • Cf. “old” ISO 12620 set of data categories for terminology
     • Language codes (TC 37/SC 2)
       • Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4
     • On-going and future SC4 activities (TC 37/SC 4)
       • Meta-data for language resources
       • Morpho-syntax/Syntax, Discourse level annotation
       • NLP lexica, MT lexica
       • Multilingual data representations (e.g. translation memories) and access (query languages)

  22. Technical background
     • ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
       • Provides mechanisms for the management of data categories
     • ISO 16642 (ISO TC 37/SC 3): terminology view
       • Provides ways of dealing with multilingual issues
     • OWL (W3C Semantic Web activity): ontology view
       • Provides a framework for dealing with hierarchies and expressing constraints on data categories
         • E.g. a /noun/ can be described by means of /gender/ and /number/ in French

  23. Relation to ISO 11179
     • Conceptual domain: set of simple datcats, e.g. /masculine/, /feminine/, /neuter/
     • Data element concept: complex datcat, e.g. /gender/
     • Value domain: list of values, e.g. m, f, n
     • Data element: XML object (XML schema declaration), e.g. implemented as an XML attribute named 'gen'
     • Example: <w lemme="vert" gen="f">verte</w>
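
The link between the value domain (m, f, n) and the conceptual domain (/masculine/, /feminine/, /neuter/) can be made concrete with a small check on the slide's example element. This is a sketch under assumptions: the `GENDER_VALUES` table and the `gender_of` function are hypothetical names, and the code-to-category mapping is the obvious reading of the slide rather than a published schema.

```python
import xml.etree.ElementTree as ET

# Assumed reading of the slide: attribute 'gen' with values m, f, n implements /gender/.
GENDER_VALUES = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}

def gender_of(xml_text):
    """Return the simple data category encoded by the 'gen' attribute."""
    element = ET.fromstring(xml_text)
    code = element.get("gen")
    if code not in GENDER_VALUES:
        raise ValueError(f"'{code}' is not in the value domain of /gender/")
    return GENDER_VALUES[code]

category = gender_of('<w lemme="vert" gen="f">verte</w>')
```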

  24. The ISO 12620-1 proposal
     Entry Identifier: gender
     Profile: morpho-syntax
     Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
     Definition (en): Grammatical category… (Source: TLFi (Trad.))
     Conceptual Domain: {/feminine/, /masculine/, /neuter/}
     • Object Language: fr
       Name: genre
       Conceptual Domain: {/feminine/, /masculine/}
     • Object Language: en
       Name: gender
     • Object Language: de
       Name: Geschlecht
       Conceptual Domain: {/feminine/, /masculine/, /neuter/}
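
The entry above combines a general conceptual domain with narrower, language-specific ones. A minimal sketch of how an application might look up the values allowed for a given object language follows. The values are copied from the slide; the dictionary layout, the `allowed_values` function, and the rule of falling back to the general domain when a language section lists none (as for English here) are assumptions, not part of the ISO 12620-1 proposal.

```python
# Values copied from the slide's /gender/ entry; the dictionary layout is an assumption.
GENDER_ENTRY = {
    "identifier": "gender",
    "profile": "morpho-syntax",
    "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"},
    "object_languages": {
        "fr": {"name": "genre", "conceptual_domain": {"/feminine/", "/masculine/"}},
        "en": {"name": "gender"},
        "de": {
            "name": "Geschlecht",
            "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"},
        },
    },
}

def allowed_values(entry, lang):
    """Language-specific conceptual domain, falling back to the general one (assumed rule)."""
    section = entry["object_languages"].get(lang, {})
    return section.get("conceptual_domain", entry["conceptual_domain"])

french = allowed_values(GENDER_ENTRY, "fr")
english = allowed_values(GENDER_ENTRY, "en")
```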

  25. Perspectives
     • ISO/TC 37/SC 4 in a wider picture
       • Basic building blocks to bring coherence in the representation of linguistic information in a variety of application domains
         • E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
     • Provide vertical solution to linguistically based applications
       • E.g. information extraction, indexing
