Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry
Menzo Windhouwer
The Language Archive – DANS
tla.mpi.nl
menzo.windhouwer@dans.knaw.nl
eHg - New Trends in e-Humanities
The Language Archive
• Founded in September 2011
• Supported by MPG, BBAW and KNAW (DANS)
• Grown out of the Technical Group at the MPI for Psycholinguistics
• Since the 1990s: the challenge of archiving digital data
• 2000 – 2016: Volkswagen Foundation DOBES project on Endangered Languages
• Active in many European infrastructure projects: CLARIN, EUDAT, DASISH, …
Language Archiving Technology
• Full lifecycle support
• Core: resources
• Key: metadata
• ‘New’: CMDI, ISOcat, AV recognition, …
• Archive size:
  • 70 TB of resources
  • 22,000 hours of AV recordings
  • 75,000 sessions (metadata)
  • 5 million annotated segments
  • 50 lexica
• My focus: Knowledge Systems
  • LEXUS, an online lexicon tool
  • ISOcat and companions
Typological Database Nijmegen
TOP NOTION tds:Noun
GROUPS{
  NOTION tdn:GrammaticalDistinctions
  LABEL "Grammatical distinctions for nouns."
  GROUPS {
    NOTION tdn:AgentNouns
    LABEL "Agent nouns."
    DESCRIPTION "Nouns can function as the agent of a clause."
    LINK TO CONCEPT agentRole
    GROUPS {
      NOTION tdn:v098_plusAffix
      LABEL "Agent nouns formed by verb stem plus affix."
      LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix)
      DESCRIPTION <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p>
      NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX"
      IS FIELD v098;
      ...
Explicit semantics!
Notes: TDN is not archived in TLA, but curated in TDS, a previous project I worked on, and now archived at DANS; also, this is not a TDN punchcard.
DOBES corpora
Shared semantics!
Explicit semantics!
Oxford English Dictionary
Source: http://www.oxford-royale.co.uk/news/2010/12/04/new-online-edition-of-oxford-english-dictionary.html
Terminology Community of Practice
• The community started out on paper (A5 fiches), just like the OED
• 1980s – 1990s: projects to standardize data category names, i.e., the names of the ‘fields’ on the fiches, in the files, or in the database records
• ISO 12620:1999 Data Categories: a companion standard to ISO 12200 Machine-readable terminology interchange format (MARTIF)
ISO 12620:1999
Towards a Data Category Registry
• Problems with ISO 12620:1999, a hardcoded list of data categories
  • Not easily extensible
  • Ordering heavily debated
  • Outdated and limited in range at the moment of release
• Developments
  • In the SALT project an interchange model (TBX) based on MARTIF/data categories was created, which was widely adopted
  • ISO 11179 Metadata Registries was released, which describes the standardization of data element concepts for metadata
  • ISO released Annex ST Standards as databases, which describes an ISO procedure to standardize registry entries
  • In the LIRICS project a pilot Data Category Registry, SYNTAX, was created
ISO 12620:2009
• Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources
• A data model for data category specifications, inspired by ISO 11179
• A procedure to standardize data category specifications, compliant with Annex ST
• Each data category gets a unique Persistent Identifier (PID)
• The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR
• In use by a growing number of ISO TC 37 standards
  • Lexical Markup Framework (LMF)
  • Linguistic Annotation Framework (LAF)
  • Morpho-syntactic Annotation Framework (MAF)
  • …
  • could be more, e.g., Feature System Declarations (FSD)
Example Data Category specification
• Data category: /Grammatical gender/
• Administrative part:
  • Identifier: grammaticalGender
  • PID: http://www.isocat.org/datcat/DC-1297
• Descriptive part:
  • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
  • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
• Linguistic part:
  • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
  • French conceptual domain: /masculine/, /feminine/
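The three parts of the specification above can be pictured as a small record type. The following is only a minimal sketch in Python, not the ISO 12620:2009 data model itself: the class name `DataCategory`, its field names, and the `allowed_values` helper are assumptions made for illustration, and the definitions are abbreviated.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Illustrative (hypothetical) model of a data category specification."""

    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by language code
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or language name
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)

    def allowed_values(self, scope: str) -> List[str]:
        """Return the conceptual domain for a profile or language, or [] if none is given."""
        return self.conceptual_domains.get(scope, [])


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

print(grammatical_gender.allowed_values("French"))
```

Keeping the conceptual domains keyed by profile or language mirrors the slide: the same data category can have a broader value domain in the Morphosyntax profile than in a language-specific section.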
Standardization procedure
[Flow diagram. Groups shown: Decision Group, Submission group, Thematic Domain Group, Data Category Registry Board, Stewardship group. Stages and outcomes shown: Evaluation, Validation, rejected (twice), Publication.]
Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of members assigned by the SC P-members
• A number of expert members invited by the chair (up to 50%)
• TDGs are constituted at the TC 37/SC plenary
• New TDGs need to be proposed by an SC
  • Translation
  • (Sign language)
ISOcat - the ISO TC 37/DCR
• A (coherent) set of Data Categories, in our case for linguistic resources
• A system to manage this set:
  • Create and edit Data Categories
  • Share Data Categories, e.g., resolve PID references
  • Standardize Data Categories
• An API for tools to access the DCR
• Grass roots approach
  • Anyone can access the DCR and use or create the data categories (s)he needs
Referring to ISOcat data categories
• PIDs of data categories can easily be embedded in XML documents
<lmf:LexicalEntry>
  <tei:f name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-1345" fVal="commonNoun" dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm" dcr:datcat="http://www.isocat.org/datcat/DC-1836" fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
• Embedding in other formats is also possible, e.g., via comments
• Preferably annotate schemas, so a whole range of resources is annotated in one go
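A tool that consumes such annotated XML could collect the embedded PIDs roughly as follows. This is a sketch under stated assumptions, using only the Python standard library: the slide's fragment declares no namespaces, so the namespace URIs below (in particular the `lmf` one, which is a placeholder) and the helper name `collect_datcat_refs` are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dcr: prefix; the slide does not declare it.
DCR_NS = "http://www.isocat.org/ns/dcr"

# The slide's fragment with namespace declarations added so that it parses.
# The lmf namespace URI is a placeholder, not an official one.
SAMPLE = """\
<lmf:LexicalEntry xmlns:lmf="http://example.org/ns/lmf"
                  xmlns:tei="http://www.tei-c.org/ns/1.0"
                  xmlns:dcr="http://www.isocat.org/ns/dcr">
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
"""


def collect_datcat_refs(xml_text):
    """Return (name, datcat PID, valueDatcat PID or None) for each annotated element."""
    root = ET.fromstring(xml_text)
    refs = []
    for elem in root.iter():  # document order, root included
        datcat = elem.get(f"{{{DCR_NS}}}datcat")
        if datcat is None:
            continue  # element carries no data category reference
        value_datcat = elem.get(f"{{{DCR_NS}}}valueDatcat")
        refs.append((elem.get("name"), datcat, value_datcat))
    return refs


for name, datcat, value_datcat in collect_datcat_refs(SAMPLE):
    print(name, datcat, value_datcat)
```

Because the references are plain attribute values, the same lookup works on an annotated schema as well as on instance documents, which is why annotating schemas covers a whole range of resources at once.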
A glimpse of ISOcat
Collaboration in ISOcat
• Registered users can contact each other via mediated email
  • Ask the owner if a data category can be adapted a little to your needs
• Registered users can start up a group and invite other users to join
  • Work together on a set of data categories
  • Interact via a public and/or private forum
• A group can submit data categories for ISO standardization
Component MetaData Infrastructure
• CMDI is developed by CLARIN and is on its way to standardization by ISO TC 37
• Limitations of existing metadata schemas (DC/OLAC, IMDI, TEI header):
  • Inflexible: too many (IMDI) or too few (OLAC) metadata elements
  • Limited interoperability (both semantic and syntactic)
  • Problematic (unfamiliar) terminology for some sub-communities
  • Limited support for LT tool & service descriptions
• The idea is to address this by:
  • Explicitly defined schema & semantics
  • User/project/community defined components
CMDI architecture
[Architecture diagram. Labels shown: ISOcat; metadata catalogue; component registry & editor; metadata curator (twice); metadata creator; metadata modeler; metadata user; Relation Registry; metadata editor; search & semantic mapping; Joint metadata repository; Local metadata repository; OAI-PMH Service provider; OAI-PMH Data provider; DATA.]
Athens Core
• Bootstrapped the Metadata data categories selection in ISOcat
• Based on existing metadata standards, e.g., DC, OLAC, IMDI, TEI
• Many translations into European languages
• Users add the data categories they need to the Metadata profile and use them in CMDI
CMDI architecture
[Architecture diagram. Labels shown: ISOcat; metadata catalogues (VLO, MI); component registry & editor; metadata curator (twice); metadata creator; metadata modeler; metadata user; Relation Registry; metadata editor; search & semantic mapping; Joint metadata repository; Local metadata repository; OAI-PMH Service provider; OAI-PMH Data provider; DATA.]
CMDI (intermediate) results
• Diverse metadata profiles
  • Centers or projects create specific ones, but reuse components where possible
• Shared and explicit semantics help to overcome
  • Terminological differences
  • Differences in structure
• Future
  • Become more context sensitive
    • e.g. documentation language vs. speaker language
  • Crosswalks
    • equivalent metadata data categories are easily introduced due to the open nature of ISOcat
  • User specific relationships
    • e.g. theory specific differences can be more important to one user than to another
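How shared data categories help to overcome terminological differences can be sketched with a tiny crosswalk: two schemas name their elements differently, but elements annotated with the same data category can be mapped onto each other. Everything in this Python sketch is hypothetical and only illustrates the idea: the element names, the `urn:example:` identifiers (placeholders, not real ISOcat PIDs), and the helper functions are not part of CMDI or ISOcat.

```python
from typing import Dict, List

# Hypothetical annotations: schema element name -> data category identifier.
# The urn:example: identifiers are placeholders, not ISOcat PIDs.
SCHEMA_A = {
    "Title": "urn:example:datcat:resourceTitle",
    "Lang": "urn:example:datcat:languageName",
    "Funder": "urn:example:datcat:funder",
}
SCHEMA_B = {
    "name": "urn:example:datcat:resourceTitle",
    "language": "urn:example:datcat:languageName",
}


def build_crosswalk(source: Dict[str, str], target: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each source element to the target elements that share its data category."""
    by_category: Dict[str, List[str]] = {}
    for element, category in target.items():
        by_category.setdefault(category, []).append(element)
    return {
        element: sorted(by_category.get(category, []))
        for element, category in source.items()
    }


def translate_record(record: Dict[str, str], crosswalk: Dict[str, List[str]]) -> Dict[str, str]:
    """Rewrite a record's element names; elements without an equivalent are dropped."""
    translated: Dict[str, str] = {}
    for element, value in record.items():
        for target_element in crosswalk.get(element, []):
            translated[target_element] = value
    return translated


crosswalk = build_crosswalk(SCHEMA_A, SCHEMA_B)
record_a = {"Title": "Example corpus", "Lang": "Example language", "Funder": "Example funder"}
print(translate_record(record_a, crosswalk))
```

This sketch only handles exact identity of data categories. Equivalences between different data categories, or user specific relationships, would need an additional relation table between identifiers, which is the role the slides assign to the Relation Registry.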
Metadata TDG
• Standardization efforts of the Metadata TDG stalled
• Large overlap with the work/people at the Athens-Core meetings
• Community level agreement is maybe enough
• Activity motivation should not depend on a single person, the TDG chair
• The need for explicit and shared semantics is not clear enough yet … more evangelization needed
• Unfamiliarity with the work
• Terminologists are more used to this kind of review work
• Online review vs. old ISO ‘paper’ process
• Members have little time, and it is difficult to sync schedules
• TDG experts tend to be senior scientists
• Continuous process vs. sporadic bursts of activity
• Unpaid work
• Project funding vs. wide acceptance in the community
• However, a project might bootstrap a thematic domain
• The same problems hold for other TDGs
• Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX
• Redesign of the standardization process is coming up
• ISO is not actively supporting Annex ST Standards as Databases anymore
Community efforts
• LMF-related: UBY, RELISH/GOLD
• Sign Language
• CLARIN
• CMDI, Athens Core
• CLARIN-NL/VL
• Call 1 – 4 projects created CMDI and annotated resources/schemas
• ISOcat content coordinator: Ineke Schuurman
• Tutorials, guidelines (do’s and don’ts) and feedback
• Better community support in ISOcat
• Views, e.g., CLARIN-NL/VL
• Recommended by, e.g., DC-4949
• …
Conclusions and future work
• Communities can already create a coherent view on ISOcat
  • the CMDI use case shows potential
  • maybe funder support is needed to bootstrap specific domains
• The standardized core will take (a long) time
  • like all standardization work
• Next to metadata, also content
  • explicit semantics would be profitable even when not shared and/or used for resource discovery
  • tools that support ISOcat will make it easier to create such resources
• Companion registries:
  • relations between data categories (RELcat)
  • annotated schemas for language resources (SCHEMAcat)
  • interaction with the CLARIN vocabulary service (CLAVAS)
• Data categories vs. concepts
Detour: ISOcat and LOD/Semantic Web
• Archives and infrastructures look at the resources as they are, i.e., in general no conversions to triples
• However, ISOcat data categories can easily be used in RDF resources
:partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ;
    rdfs:label "part of speech"@en ;
    rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .
• The Relation Registry, which is a triple store, will in general support lightweight, semi-formal ontologies
M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.
Thank you for your attention!
Visit www.isocat.org
Questions? www.isocat.org/forum/ isocat@mpi.nl
Acknowledgements
Thanks to everyone at TLA, Sue Ellen Wright, Ineke Schuurman, Marc Kemps-Snijders, CLARIN-NL, CLARIN, ISO TC 37
A whole litter of cats!
[Diagram. Labels shown: Linguistic resource (schema); Linguistic knowledge base; Data categories; Containers; Concepts; Relation; Schema Registry - SCHEMAcat; Data Category Registry - ISOcat; Concept Registry; Relation Registry - RELcat.]
ISO 11179: concepts vs. data elements/categories
ISO 12620 Data Categories