ISOcat Data Category Registry • Defining widely accepted linguistic concepts • Menzo Windhouwer
ISOcat: a reference implementation • ISO 12620:2009 • Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources • ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures • ISO Technical Committee 37 • Terminology and other language and content resources
ISO 24613:2008 Lexical Markup Framework • [Figure: LMF class diagram with the classes Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form (multiplicities 1..* and 0..*), annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType]
Data categories • “result of the specification of a given data field” (ISO 12620:2009) • data element concept (ISO 11179) • “concept for which the definition, identification and conceptual domain are specified independently of any particular representation” • complex data categories are data element concepts
Data category types • complex: • open: writtenForm (string) • constrained: email (string, constraint: .+@.+) • closed: grammaticalGender (string, values: neuter, feminine, masculine) • simple: neuter, feminine, masculine
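The three complex types on this slide can be sketched as a small data structure. This is only an illustration, not the DCR data model of ISO 12620: the class names `SimpleDC` and `ComplexDC` and the `accepts` method are hypothetical, and the constraint is assumed to be a regular expression that must match the whole value.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SimpleDC:
    """A simple data category: an atomic value such as 'neuter'."""
    identifier: str


@dataclass(frozen=True)
class ComplexDC:
    """A complex data category with a data type and an optional restriction."""
    identifier: str
    data_type: str = "string"
    constraint: Optional[str] = None            # constrained: pattern on the value
    value_domain: Tuple[SimpleDC, ...] = ()     # closed: enumerated simple DCs

    @property
    def kind(self) -> str:
        if self.value_domain:
            return "closed"
        if self.constraint is not None:
            return "constrained"
        return "open"

    def accepts(self, value: str) -> bool:
        """Check a value against the restriction implied by the kind."""
        if self.kind == "closed":
            return any(v.identifier == value for v in self.value_domain)
        if self.kind == "constrained":
            return re.fullmatch(self.constraint, value) is not None
        return True  # open: any value of the data type


# The three examples from the slide.
written_form = ComplexDC("writtenForm")
email = ComplexDC("email", constraint=r".+@.+")
gender = ComplexDC(
    "grammaticalGender",
    value_domain=(SimpleDC("neuter"), SimpleDC("feminine"), SimpleDC("masculine")),
)
```

With these definitions, `gender.accepts("neuter")` holds while `gender.accepts("dual")` does not, and `email.accepts("isocat@mpi.nl")` passes the `.+@.+` constraint.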
Data category relationships • Value domain membership • Subsumption relationships between simple data categories • Relationships between complex data categories are not stored in the DCR • [Figure: partOfSpeech (string) with pronoun in its value domain; pronoun subsumes personal pronoun]
Data category specification • Administration Information Section • Description Section • Data Element Name • Language Section • Name Section • Conceptual Domain • Linguistic Section • Conceptual Domain • Mandatory: a mnemonic identifier, an English definition, an English name, a conceptual domain
Guidelines for data categories (I) • Identifier: • camel case and XML-valid element name (without a namespace) • valid: partOfSpeech • invalid: my:POS, 123POS • Data Element Name: • language-independent name for the data category used in a specific application domain (specified in the source) • e.g. PoS in TBX
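The identifier guideline can be checked mechanically. The sketch below is an approximation, not ISOcat's own validation: it assumes ASCII-only lower camel case, which is stricter than the full XML name production (every string it accepts is a valid XML element name without a namespace prefix, but not the other way round). The function name `is_valid_identifier` is hypothetical.

```python
import re

# Assumption: ASCII lower camel case. A lowercase first word followed by
# zero or more capitalised words; no colon, no leading digit.
_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the whole string is a lower-camel-case ASCII name."""
    return _IDENTIFIER.fullmatch(identifier) is not None


# The examples from the slide.
checks = {name: is_valid_identifier(name) for name in ("partOfSpeech", "my:POS", "123POS")}
```

Here `partOfSpeech` passes, `my:POS` fails because of the namespace prefix, and `123POS` fails because an XML name cannot start with a digit.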
Guidelines for data categories (II) • Name Section in a Language Section • legible name • ‘part of speech’ in the English language section • ‘partie du discours’ in the French language section • Definition: • intensional definitions (ISO 704) • should consist of a single sentence fragment • Source: • add a source for any quoted material
Guidelines for data categories (III) • Justification: • a simple statement justifying the relevance of the data category to the field of language resources • especially needed for standardization
Private versus standard • The standard subset of data categories in the registry should be coherent • The coherency is guarded by Thematic Domain Groups and the DCR Board • Standard data categories need to meet more constraints than private ones: • mandatory justification • DC relations demand profile overlap • …
Data Category Selections • Anyone • can register with ISOcat • can create data categories • can create data category selections (DCSs) • can share DCSs • can make DCSs public • may submit DCSs for standardization
Profiles versus DCSs • Profile membership is part of the DC specification • the profile indicates the thematic domain of the DC • the profile view in the UI is created by a query • there are a limited number of profiles • A DCS is a collection of DCs • hand-picked by a user for a specific purpose • can contain DCs from various profiles • there can be an unlimited number of DCSs • There isn’t (yet) a profile-specific view on a DCS
ISO standardization process • [Figure: workflow involving the Submission group, Thematic Domain Group, Data Category Registry Board, Stewardship group and ISO, with the stages Evaluation, Validation and Publication]
Submission group • The owner, possibly together with a group of users, who submits a DCS for standardization • The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible) • justification • profile(s) • …
Thematic Domain Groups • TDG 1: Metadata • TDG 2: Morphosyntax • TDG 3: Semantic Content Representation • TDG 4: Syntax • TDG 5: Machine Readable Dictionary • TDG 6: Language Resource Ontology • TDG 7: Lexicography • TDG 8: Language Codes • TDG 9: Terminology • TDG 11: Multilingual Information Management • TDG 12: Lexical Resources • TDG 13: Lexical Semantics • TDG 14: Source Identification • TDGs are the owners and guardians of a coherent subset of the DCR • TDGs own one or more profiles • Each TDG has: • a chair • a number of judges (assigned by SC P members) • a number of expert members (up to 50%) • TDGs are constituted at the TC37/SC plenary • New TDGs need to be proposed by an SC
Harmonization • When a DC belongs to multiple profiles belonging to different TDGs harmonization may be needed • one TDG becomes the owner of the DC • judges from the other TDG(s) are involved in the evaluation process
Stewardship group • Members of the TDG who will maintain the data category • The TDG becomes the owner of a standardized data category • Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)
Using data categories (I) • Each data category has a Persistent Identifier (PID): http://www.isocat.org/datcat/DC-1297 • once a data category has been created it can never be deleted, only deprecated or superseded • the registration authority of 12620 is obliged to keep these URLs working
Using data categories (II) • This PID can be embedded in the schemata of linguistic resources: • CMD <CMD_Element name="Role" ValueScheme="string" ConceptLink="…/DC-1234"> • Relax NG <rng:element name="gender" dcr:datcat="…/DC-1297"> • XML Schema, TEI ODD, TBX, RDF, XML, … • DC Reference vocabulary: • http://www.isocat.org/12620/
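Once PIDs are embedded in a schema, they can be collected again with a few lines of code. The sketch below does this for a Relax NG grammar. It makes two assumptions: the `dcr` prefix is bound to the DC Reference vocabulary URL given on the slide (the slide does not show the namespace declaration), and the sample PID is the full one quoted on the previous slide. The function name `dc_references` and the sample schema are illustrative only.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
# Assumption: the dcr prefix is bound to the DC Reference vocabulary URL.
DCR_NS = "http://www.isocat.org/12620/"

# Illustrative schema; the PID is the example from "Using data categories (I)".
SCHEMA = f"""<rng:grammar xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}">
  <rng:start>
    <rng:element name="gender" dcr:datcat="http://www.isocat.org/datcat/DC-1297">
      <rng:text/>
    </rng:element>
  </rng:start>
</rng:grammar>"""


def dc_references(schema_xml: str) -> dict:
    """Map each annotated rng:element name to its data category PID."""
    root = ET.fromstring(schema_xml)
    refs = {}
    for element in root.iter(f"{{{RNG_NS}}}element"):
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:  # skip elements without a DC reference
            refs[element.get("name")] = pid
    return refs


references = dc_references(SCHEMA)
```

For the sample schema, `references` maps `gender` to the DC-1297 PID. The same idea carries over to the other schema languages listed on the slide, with a different attribute to look up.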
Using data categories (III) • The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF) • DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information • DCIF vocabulary: • http://www.isocat.org/12620/
Usage scenarios • DC references only: • find semantic overlap between two or more resources by comparing their DC references • DC references and a schema/component registry: • find interesting resource (types) by comparing the DC references of schemas/components in the registry • DC references and a network of registries: • find (in)directly related resources by related DCs
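The first scenario, comparing DC references only, amounts to intersecting the sets of PIDs two resources refer to. A minimal sketch, assuming each resource has been reduced to a mapping from its local element names to PIDs and that each PID is referenced at most once per resource; the resource contents, the local names and the two `example.org` PIDs are hypothetical.

```python
GENDER = "http://www.isocat.org/datcat/DC-1297"  # PID quoted in the slides


def shared_categories(resource_a: dict, resource_b: dict) -> dict:
    """Map each PID used by both resources to the (name in A, name in B) pair."""
    name_by_pid_b = {pid: name for name, pid in resource_b.items()}
    return {
        pid: (name, name_by_pid_b[pid])
        for name, pid in resource_a.items()
        if pid in name_by_pid_b
    }


# Two hypothetical resources that use different local names.
lexicon = {"gender": GENDER, "pos": "http://example.org/hypothetical/pos"}
corpus = {"genus": GENDER, "lemma": "http://example.org/hypothetical/lemma"}

overlap = shared_categories(lexicon, corpus)
```

The two resources share no element names, yet `overlap` shows that `gender` in the lexicon and `genus` in the corpus point to the same data category.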
Relation Registry • ISOcat contains a ‘flat’ list of concepts • The Relation Registry will support storing (user-specific) relations between these concepts • is-a • part-of • equivalent-to • related-to • … • Will support: • ontologies and taxonomies on top of data categories • searches across related data categories • …
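A search across related data categories can be pictured as a graph traversal over stored relations. The sketch below is not the Relation Registry's design, which the slides do not specify. The class and method names are hypothetical, data categories are identified by plain strings, and treating equivalent-to and related-to as symmetric is an assumption.

```python
from collections import defaultdict, deque

SYMMETRIC = {"equivalent-to", "related-to"}  # assumption: these hold both ways


class RelationRegistry:
    """Stores typed relations between data category identifiers."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self._edges[subject].add((relation, obj))
        if relation in SYMMETRIC:
            self._edges[obj].add((relation, subject))

    def related(self, start: str, relations: set) -> set:
        """Breadth-first search following only the given relation types."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for relation, target in self._edges.get(current, ()):
                if relation in relations and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen - {start}


registry = RelationRegistry()
registry.relate("personalPronoun", "is-a", "pronoun")
registry.relate("pronoun", "equivalent-to", "pron")  # 'pron': hypothetical private DC

found = registry.related("personalPronoun", {"is-a", "equivalent-to"})
```

Starting from `personalPronoun`, the search reaches `pronoun` through is-a and then `pron` through equivalent-to. Because the relations are user-specific, each user or project could keep its own registry instance on top of the same flat list of data categories.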
Registry network • [Figure: three layers. Relation registries: MPI RR, Typological Database System RR. Data category registries: MPI DCR, ISO DCR. Linguistic resources: TDS database, MPI archive resource]
Status of ISOcat • ISOcat is under active development: • Now: • You can access public data categories and selections • You can create your own data categories and selections • You can share your data categories and selections with others (everyone, or a specified group) • Future: • Some social features (forum to discuss specific data categories) • Cleanup of profiles by TDGs • Import external ‘data category’ sets, such as: • parts of the ISO Concept Database • Dublin Core • TEI • Standardization workflow • High availability (mirrors) • Relation registry
Thanks for your attention! http://www.isocat.org/ isocat@mpi.nl Menzo.Windhouwer@mpi.nl