Standards for language resources: the ISO/TC 37(/SC 4) perspective. Laurent Romary, Directeur de Recherche, INRIA; ISO/TC 37/SC 4 chair
Context
• ISO/TC 37 - Terminology and other language resources
  • SC3 - Computer applications in terminology
    • ISO 12200 - MARTIF
    • ISO 12620 - Data categories (under revision)
    • ISO 16642 - TMF (Terminological Markup Framework)
  • SC4 - Language Resource Management (www.tc37sc4.org)
An example scenario: information extraction
[Figure: processing stack, top to bottom: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data]
Horizontal view (W3C perspective)
[Figure: the information extraction stack (Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data) overlaid with W3C technologies: XML, RDF, OWL, SOAP]
Vertical view (ISO/TC 37/SC 4 perspective)
[Figure: the information extraction stack (Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data) with additional labels: Evaluation; Linguistic models and descriptors (Data Categories); Lexica]
Linguistic information sources …and initiatives
• Access protocols [Corba, SOAP]
• Primary resources (text, dialogues)
  • Structural mark-up
  • Basic annotations
  [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
• Knowledge structures
  • Hierarchies of types
  • Relations between concepts (subjects/topics etc.)
  • Links to primary resources
  [Topic Maps, OIL, RDF]
• Links
• NLP structures (annotations)
  • POS tagging
  • Chunks (cf. Named Entities)
  • Deep syntactic structures
  • Co-references etc.
  [Eagles/ISLE, CES, MATE,…]
• Lexical structures (language models)
  • Terminologies
  • Transfer lexica
  • LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]
• Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]
SC4 Approach
• Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
• In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
• No provision of new formats
• Situate development squarely in the framework of XML and related standards
• Ensure compatibility with established and widely accepted web-based technologies
• Ensure feasibility of transduction from legacy formats into newly defined formats
SC4 and other standardizing bodies
Contributing organizations
• TEI
  • text representation
  • Reference for primary sources
  • e.g.: text archives
  [Figure label: Oscar Text]
• W3C
  • basic protocols and formats
  • XML (Schemas)
  • XPath
  • XPointer
  • + RDF, SVG, SMIL, SOAP
• ISO TC37/SC4 - language resources, NLP perspective, e.g. linguistic annotations, lexical formats
Technical background
• MPEG - Multimedia, XML based, e.g. MPEG7-4; word and phone lattices; audio/speech
ISO/TC 37/SC 4 structure
• WG1 Basic descriptors and mechanisms for language resources
• WG2 Representation schemes
• WG3 Multilingual text representation
• WG4 Lexical databases
• WG5 Workflow of language resource management
[Figure label: Data categories]
On-going activities
• Feature structure representation (in collaboration with the TEI - Text Encoding Initiative)
  • ISO DIS 24610
• Morpho-syntactic annotation
  • ISO NP 24611
• Lexical markup framework
  • ISO NP 24612 (+ ISO NP 12620-3)
• Task force on meta-data for language resources (OLAC+IMDI)
• ACL/SIGSEM working group on multimodal content representation
• Data category registry for ISO/TC 37
  • ISO CD 12620-1 on ballot (deadline Jan. 2004)
General framework - 1
• Model for linguistic annotation that can
  • be instantiated in a standard representational format
    • GMT: Generic Mapping Tool
  • serve as a pivot format into and out of which proprietary formats may be transduced, to enable
    • comparison, merging and manipulation via common tools
• Reference: ISO 16642 - Terminological Markup Framework
General framework - 2
• A meta-model
  • A general, underlying model that informs current practice
• A set of data categories
  • Provides the precise semantics of the format
  • Obtained:
    • by sub-setting a Data Category Registry
    • by providing application-specific categories
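The two ways of obtaining a data category selection can be sketched as follows. This is only an illustration in plain Python: the registry content, the function name `select_data_categories` and the category `/my project tag/` are assumptions made for the example, not part of ISO 12620 or ISO 16642.

```python
# Hypothetical registry content; the category names are illustrative only.
DATA_CATEGORY_REGISTRY = {
    "/part of speech/",
    "/grammatical gender/",
    "/grammatical number/",
    "/feminine/",
    "/plural/",
}


def select_data_categories(registry, subset, application_specific=()):
    """Build a selection by sub-setting a registry and adding local categories."""
    unknown = set(subset) - set(registry)
    if unknown:
        # A sub-set may only contain categories the registry knows about.
        raise KeyError(f"not in registry: {sorted(unknown)}")
    return set(subset) | set(application_specific)


selection = select_data_categories(
    DATA_CATEGORY_REGISTRY,
    subset={"/part of speech/", "/grammatical gender/"},
    application_specific={"/my project tag/"},  # hypothetical local category
)
print(sorted(selection))
```

Keeping the registry check separate from the application-specific additions makes it visible which categories are shared and which are local to one format.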
ISO 16642: A family of formats
[Figure: TMF and a family of TMLs: TML1, TML2, TML3, …, TMLi; labels (Geneter) and (TBX); GMT]
Meta-model (* = may occur several times)
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  * Terminological Entry (TE)
    * Language Section (LS)
      * Term Section (TS)
        * Term Component Section (TCS)
TMF: example
TE: id='ID67', subjectField='manufacturing', definition='A value…'
  LS: lang='hu'
    TS: term='…'
  LS: lang='en'
    TS: term='alpha smoothing factor', termType='fullForm'
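A minimal sketch of the meta-model levels as Python dataclasses, populated with the example entry. The class and field names are assumptions chosen for this illustration (ISO 16642 does not prescribe a Python binding), and the elided values ('A value…', '…') are kept exactly as on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermSection:  # TS
    info: Dict[str, str] = field(default_factory=dict)
    # Term Component Sections (TCS), each a plain dict in this sketch
    components: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class LanguageSection:  # LS
    info: Dict[str, str] = field(default_factory=dict)
    terms: List[TermSection] = field(default_factory=list)


@dataclass
class TerminologicalEntry:  # TE
    info: Dict[str, str] = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)


entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing", "definition": "A value…"},
    languages=[
        LanguageSection(info={"lang": "hu"}, terms=[TermSection(info={"term": "…"})]),
        LanguageSection(
            info={"lang": "en"},
            terms=[
                TermSection(
                    info={"term": "alpha smoothing factor", "termType": "fullForm"}
                )
            ],
        ),
    ],
)


def terms_by_language(te: TerminologicalEntry) -> Dict[str, List[str]]:
    """Collect the term strings of each language section."""
    return {ls.info["lang"]: [ts.info["term"] for ts in ls.terms] for ls in te.languages}


print(terms_by_language(entry))
```

Each level carries its data categories in a generic `info` dictionary, which mirrors the idea that the structure is fixed while the set of data categories varies from one format to another.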
Implementation in TBX (cf. www.lisa.org)
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
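To show how the TBX markup maps back onto the TE / LS / TS levels, the entry can be read with Python's standard `xml.etree.ElementTree`. This is a sketch only: `read_entry` is a hypothetical helper, and the snippet follows the slide as written (plain `lang` attribute, elided texts left as they are) rather than a complete TBX document.

```python
import xml.etree.ElementTree as ET

TBX_SNIPPET = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
"""


def read_entry(xml_text):
    """Return the entry-level descriptors and the terms of each language section."""
    root = ET.fromstring(xml_text)
    entry = {"id": root.get("id")}
    # Entry level (TE): one key per <descrip type='...'>
    for descrip in root.findall("descrip"):
        entry[descrip.get("type")] = descrip.text
    # Language level (LS) and term level (TS)
    entry["terms"] = {
        lang_set.get("lang"): [term.text for term in lang_set.findall("tig/term")]
        for lang_set in root.findall("langSet")
    }
    return entry


print(read_entry(TBX_SNIPPET))
```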
Data Category
• Definition:
  • Elementary descriptor used in a linguistic description or annotation scheme
• Example:
  • /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
• Background:
  • Experience gained from ISO 16642 in linguistic format specification
  • Wider notion of data categories as meta-data for tagged language resources
Multiple uses of data categories
[Figure labels: Documentation; Meta-data; XML schemas; Data category selection; Meta model; XSL filters]
Application domains
• Terminological data collection (TC 37/SC 3)
  • Cf. "old" ISO 12620 set of data categories for terminology
• Language codes (TC 37/SC 2)
  • Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4
• On-going and future SC4 activities (TC 37/SC 4)
  • Meta-data for language resources
  • Morpho-syntax/syntax, discourse-level annotation
  • NLP lexica, MT lexica
  • Multilingual data representations (e.g. translation memories) and access (query languages)
Technical background
• ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
  • Provides mechanisms for the management of data categories
• ISO 16642 (ISO TC 37/SC 3): terminology view
  • Provides ways of dealing with multilingual issues
• OWL (W3C Semantic Web activity): ontology view
  • Provides a framework for dealing with hierarchies and expressing constraints on data categories
  • E.g. in French, a /noun/ can be described by means of /gender/ and /number/
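The kind of constraint mentioned in the last example can be sketched as a simple lookup. This is a plain-Python stand-in, not OWL: the table layout, the function name `disallowed_descriptors` and the extra category `/tense/` are assumptions made for the illustration.

```python
# Hypothetical constraint table: (object language, category) -> allowed descriptors
CONSTRAINTS = {
    ("fr", "/noun/"): {"/gender/", "/number/"},
}


def disallowed_descriptors(lang, category, descriptors):
    """Return the descriptors that the constraint table does not allow."""
    allowed = CONSTRAINTS.get((lang, category), set())
    return sorted(set(descriptors) - allowed)


# A French noun described by gender and number satisfies the constraint.
print(disallowed_descriptors("fr", "/noun/", {"/gender/", "/number/"}))
# /tense/ (hypothetical here) is not among the allowed descriptors.
print(disallowed_descriptors("fr", "/noun/", {"/gender/", "/tense/"}))
```

An ontology language would express the same constraint declaratively and add hierarchies on top; the lookup above only shows the check itself.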
Relation to ISO 11179
• Conceptual domain: set of simple datcats (/masculine/, /feminine/, /neuter/)
• Data element concept: complex datcat (/gender/)
• Value domain: list of values (m, f, n)
• Data element: XML object (XML schema declaration), implemented as an XML attribute named 'gen'
• Example: <w lemme="vert" gen="f">verte</w>
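A small sketch of how the value domain (m, f, n) relates back to the conceptual domain when reading the annotated word. The pairing of each code with a simple data category is an assumption suggested by the slide, and `gender_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed pairing between the XML value domain and the simple data categories
GEN_VALUES = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}


def gender_of(xml_text):
    """Map the 'gen' attribute of a <w> element to its simple data category."""
    word = ET.fromstring(xml_text)
    code = word.get("gen")
    if code not in GEN_VALUES:
        # The value is outside the declared value domain.
        raise ValueError(f"unexpected value for 'gen': {code!r}")
    return GEN_VALUES[code]


print(gender_of('<w lemme="vert" gen="f">verte</w>'))
```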
The ISO 12620-1 proposal
Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: {/feminine/, /masculine/, /neuter/}
  Object Language: fr
    Name: genre
    Conceptual Domain: {/feminine/, /masculine/}
  Object Language: en
    Name: gender
  Object Language: de
    Name: Geschlecht
    Conceptual Domain: {/feminine/, /masculine/, /neuter/}
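The entry can be pictured as a record with one general conceptual domain and optional per-language restrictions. The dictionary layout and the helper `allowed_values` are assumptions made for this sketch; in particular, falling back to the general domain when a language section declares none is an assumption, not something stated in the proposal.

```python
GENDER = {
    "identifier": "gender",
    "profile": "morpho-syntax",
    "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"},
    "languages": {
        "fr": {"name": "genre", "conceptual_domain": {"/feminine/", "/masculine/"}},
        "en": {"name": "gender"},
        "de": {
            "name": "Geschlecht",
            "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"},
        },
    },
}


def allowed_values(entry, lang):
    """Return the conceptual domain that applies to one object language."""
    section = entry["languages"].get(lang, {})
    # Assumption: without a language-specific domain, the general one applies.
    return section.get("conceptual_domain", entry["conceptual_domain"])


for lang in ("fr", "en", "de"):
    print(lang, sorted(allowed_values(GENDER, lang)))
```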
Perspectives
• ISO/TC 37/SC 4 in a wider picture
  • Basic building blocks to bring coherence to the representation of linguistic information in a variety of application domains
    • E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
  • Provide vertical solutions for linguistically based applications
    • E.g. information extraction, indexing