ISOcat Data Category Registry Defining widely accepted linguistic concepts



Presentation Transcript


  1. ISOcat Data Category Registry: Defining widely accepted linguistic concepts • Menzo Windhouwer

  2. ISOcat: a reference implementation • ISO 12620:2009 • Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources • ISO 12620:1999 was a fixed list of data categories, this revision provides a data model and management procedures • ISO Technical Committee 37 • Terminology and other language and content resources

  3. ISO 24613:2008 Lexical Markup Framework (class diagram) • Classes: Lexicon, Lexical Entry, Form, Lemma, Word Form, Sense • Multiplicities shown: 1..*, 0..* • Data categories attached to the classes: partOfSpeech, writtenForm, grammaticalNumber, lexicalType

  4. Data categories • “result of the specification of a given data field” (ISO 12620:2009) • data element concept (ISO 11179) • “concept for which the definition, identification and conceptual domain are specified independently of any particular representation” • complex data categories are data element concepts

  5. Data category types • complex: • open: writtenForm (string) • constrained: email (string, constraint: .+@.+) • closed: grammaticalGender (string, values: neuter, feminine, masculine) • simple: • neuter, feminine, masculine
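
The three kinds of complex data category on this slide can be sketched in Python. This is only an illustration: the class name `ComplexDataCategory` and its fields are hypothetical and are not part of ISOcat or the DCR data model; only the example identifiers, the `.+@.+` constraint and the gender values come from the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComplexDataCategory:
    """Hypothetical model of a complex data category (not the DCR data model)."""
    identifier: str
    data_type: str = "string"
    # constrained: a rule (here a regular expression) restricts the values
    constraint: Optional[str] = None
    # closed: the value domain is an explicit set of simple data categories
    value_domain: frozenset = frozenset()

    @property
    def kind(self) -> str:
        if self.value_domain:
            return "closed"
        if self.constraint is not None:
            return "constrained"
        return "open"

    def accepts(self, value: str) -> bool:
        """Check a value against the conceptual domain of this data category."""
        if self.kind == "closed":
            return value in self.value_domain
        if self.kind == "constrained":
            return re.fullmatch(self.constraint, value) is not None
        return True  # open: any value of the data type is acceptable


written_form = ComplexDataCategory("writtenForm")
email = ComplexDataCategory("email", constraint=r".+@.+")
grammatical_gender = ComplexDataCategory(
    "grammaticalGender",
    value_domain=frozenset({"neuter", "feminine", "masculine"}),
)
```

With these definitions, `email.accepts("isocat@mpi.nl")` holds while `email.accepts("nobody")` does not, and `grammatical_gender` accepts only the three simple data categories listed in its value domain.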

  6. Data category relationships • Value domain membership • Subsumption relationships between simple data categories • Relationships between complex data categories are not stored in the DCR • Example (diagram): partOfSpeech (string) has pronoun in its value domain; personal pronoun is subsumed by pronoun
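
A minimal sketch of the two relationship types stored in the DCR, using the example from this slide. The dictionaries and function names are hypothetical, and the identifier `personalPronoun` is an assumed camel-case spelling of "personal pronoun".

```python
from typing import Dict, Optional, Set

# Value domain membership: complex data category -> simple data categories.
VALUE_DOMAIN: Dict[str, Set[str]] = {"partOfSpeech": {"pronoun"}}

# Subsumption between simple data categories: child -> parent.
IS_A: Dict[str, str] = {"personalPronoun": "pronoun"}


def is_subsumed_by(simple_dc: str, ancestor: str) -> bool:
    """Walk the is-a chain upwards from simple_dc looking for ancestor."""
    current: Optional[str] = simple_dc
    seen: Set[str] = set()
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guard against accidental cycles
        current = IS_A.get(current)
    return False


def in_value_domain(complex_dc: str, simple_dc: str) -> bool:
    """Direct value domain membership only; subsumption is kept separate."""
    return simple_dc in VALUE_DOMAIN.get(complex_dc, set())
```

Whether a value domain should implicitly include subsumed data categories is a design decision this sketch leaves open, which is why the two checks are separate functions.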

  7. Data category specification • Administration Information Section • Description Section • Data Element Name • Language Section • Name Section • Conceptual Domain • Linguistic Section • Conceptual Domain • Mandatory: • a mnemonic identifier • an English definition • an English name • a conceptual domain
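
The mandatory parts listed on this slide can be checked mechanically. The sketch below assumes a simplified, hypothetical record (`DataCategorySpec`) rather than the real DCR sections, and the definition text is a placeholder, not the registered ISOcat definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataCategorySpec:
    """Hypothetical, flattened stand-in for a data category specification."""
    identifier: str = ""
    names: Dict[str, str] = field(default_factory=dict)        # language -> name
    definitions: Dict[str, str] = field(default_factory=dict)  # language -> definition
    conceptual_domain: Optional[str] = None


def missing_mandatory(spec: DataCategorySpec) -> List[str]:
    """Return the mandatory parts that are absent from the specification."""
    missing = []
    if not spec.identifier:
        missing.append("mnemonic identifier")
    if not spec.definitions.get("en"):
        missing.append("English definition")
    if not spec.names.get("en"):
        missing.append("English name")
    if spec.conceptual_domain is None:
        missing.append("conceptual domain")
    return missing


complete = DataCategorySpec(
    identifier="partOfSpeech",
    names={"en": "part of speech", "fr": "partie du discours"},
    definitions={"en": "placeholder definition text"},  # not the real definition
    conceptual_domain="string",
)
draft = DataCategorySpec(identifier="partOfSpeech", names={"fr": "partie du discours"})
```

Here `missing_mandatory(complete)` is empty, while the draft is reported as lacking an English definition, an English name and a conceptual domain.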

  8. Guidelines for data categories (I) • Identifier: • camel case and XML-valid element name (without a namespace) • valid: partOfSpeech • invalid: my:POS (namespace prefix), 123POS (starts with a digit) • Data Element Name: • language-independent name for the data category used in a specific application domain (specified in the source) • PoS in TBX
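
The identifier guideline can be approximated with a regular expression. This sketch assumes ASCII-only lower camel case, which is stricter than what XML allows for element names; the pattern and the function name are illustrative and are not ISOcat's actual validation rule.

```python
import re

# Assumption: a lowercase ASCII letter followed by ASCII letters and digits.
# This excludes ":" (so no namespace prefix) and a leading digit, both of
# which would make the identifier unusable as a plain XML element name.
IDENTIFIER = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(candidate: str) -> bool:
    """Return True if candidate fits the assumed camel-case identifier pattern."""
    return IDENTIFIER.fullmatch(candidate) is not None


for candidate in ("partOfSpeech", "my:POS", "123POS"):
    print(candidate, is_valid_identifier(candidate))
```

Of the three examples from the slide, only `partOfSpeech` passes.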

  9. Guidelines for data categories (II) • Name Section in a Language Section • legible name • ‘part of speech’ in the English language section • ‘partie du discours’ in the French language section • Definition: • intensional definitions (ISO 704) • should consist of a single sentence fragment • Source: • add a source for any quoted material

  10. Guidelines for data categories (III) • Justification: • a simple statement justifying the relevance of the data category to the field of language resources • especially needed for standardization

  11. Private versus standard • The standard subset of data categories in the registry should be coherent • The coherence is guarded by Thematic Domain Groups and the DCR Board • Standard data categories need to meet more constraints than private ones: • mandatory justification • DC relations demand profile overlap • …

  12. Data Category Selections • Anyone • can register with ISOcat • can create data categories • can create data category selections (DCSs) • can share DCSs • can make DCSs public • may submit DCSs for standardization

  13. Profiles versus DCSs • Profile membership is part of the DC specification • the profile indicates the thematic domain of the DC • the profile view in the UI is created by a query • there are a limited number of profiles • A DCS is a collection of DCs • hand-picked by a user for a specific purpose • can contain DCs from various profiles • there can be an unlimited number of DCSs • There isn’t (yet) a profile-specific view on a DCS

  14. ISO standardization process (diagram) • Actors: Submission group, Thematic Domain Group, Data Category Registry Board, Stewardship group, ISO • Stages: Evaluation, Validation, Publication

  15. Submission group • The owner, possibly together with a group of users, who submits a DCS for standardization • The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible) • justification • profile(s) • …

  16. Thematic Domain Groups • TDG 1: Metadata • TDG 2: Morphosyntax • TDG 3: Semantic Content Representation • TDG 4: Syntax • TDG 5: Machine Readable Dictionary • TDG 6: Language Resource Ontology • TDG 7: Lexicography • TDG 8: Language Codes • TDG 9: Terminology • TDG 11: Multilingual Information Management • TDG 12: Lexical Resources • TDG 13: Lexical Semantics • TDG 14: Source Identification • TDGs are the owners and guardians of a coherent subset of the DCR • TDGs own one or more profiles • Each TDG has a chair • A number of judges (assigned by SC P-members) • A number of expert members (up to 50%) • TDGs are constituted at the TC37/SC plenary • New TDGs need to be proposed by an SC

  17. Harmonization • When a DC belongs to multiple profiles owned by different TDGs, harmonization may be needed • one TDG becomes the owner of the DC • judges from the other TDG(s) are involved in the evaluation process

  18. Stewardship group • Members of the TDG who will maintain the data category • The TDG becomes the owner of a standardized data category • Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)

  19. Using data categories (I) • Each data category has a Persistent Identifier (PID): http://www.isocat.org/datcat/DC-1297 • once a data category has been created it can never be deleted, only deprecated or superseded • the registration authority of ISO 12620 is obliged to keep these URLs working

  20. Using data categories (II) • This PID can be embedded in the schemata of linguistic resources: • CMD <CMD_Element name="Role" ValueScheme="string" ConceptLink="…/DC-1234"> • Relax NG <rng:element name="gender" dcr:datcat="…/DC-1297"> • XML Schema, TEI ODD, TBX, RDF, XML, … • DC Reference vocabulary: • http://www.isocat.org/12620/
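
A sketch of how the embedded references could be read back out of a Relax NG schema with Python's standard library. The namespace URI bound to the `dcr` prefix is an assumption (the slide does not show it), and the sample schema is made up for illustration; it reuses the `gender` element from this slide and the PID from slide 19.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumed namespace for dcr:datcat

# Hypothetical schema fragment, written out in full for the example.
SCHEMA = f"""<rng:grammar xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}">
  <rng:start>
    <rng:element name="gender" dcr:datcat="http://www.isocat.org/datcat/DC-1297">
      <rng:text/>
    </rng:element>
  </rng:start>
</rng:grammar>"""


def dc_references(schema_xml: str) -> dict:
    """Map element names to the data category PIDs they are annotated with."""
    root = ET.fromstring(schema_xml)
    refs = {}
    for element in root.iter(f"{{{RNG_NS}}}element"):
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid:
            refs[element.get("name")] = pid
    return refs


print(dc_references(SCHEMA))
```

The same idea carries over to the other schema languages on the slide, with the attribute or element that holds the reference changed accordingly.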

  21. Using data categories (III) • The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF) • DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information • DCIF vocabulary: • http://www.isocat.org/12620/

  22. Usage scenarios • DC references only: • find semantic overlap between two or more resources by comparing their DC references • DC references and a schema/component registry: • find interesting resources (or resource types) by comparing the DC references of schemas/components in the registry • DC references and a network of registries: • find directly or indirectly related resources by following related DCs
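
The first scenario amounts to a set intersection over PIDs. In this sketch the two resources and their reference sets are invented for illustration; DC-1297 and DC-1234 appear on earlier slides, while DC-9999 is a made-up placeholder.

```python
from typing import Iterable, Set

BASE = "http://www.isocat.org/datcat/"

# Hypothetical resources and the data categories their schemas refer to.
lexicon_refs = {BASE + "DC-1297", BASE + "DC-1234"}
corpus_refs = {BASE + "DC-1297", BASE + "DC-9999"}  # DC-9999 is a placeholder


def semantic_overlap(refs_a: Iterable[str], refs_b: Iterable[str]) -> Set[str]:
    """Data categories referenced by both resources."""
    return set(refs_a) & set(refs_b)


print(semantic_overlap(lexicon_refs, corpus_refs))
```

Because PIDs are persistent, a shared PID indicates that both resources point at the same concept, even if their element or field names differ.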

  23. Relation Registry • ISOcat contains a ‘flat’ list of concepts • The Relation Registry will support storing (user-specific) relations between these concepts • is-a • part-of • equivalent-to • related-to • … • Will support: • Ontologies and taxonomies on top of data categories • Searches across related data categories • …
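
One way to picture a search across related data categories is a graph traversal over stored relation triples. This is not the Relation Registry's design, which the slide describes only as planned; the triples, the placeholder identifiers and the choice to follow relations in both directions are all assumptions made for the sketch.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (subject, relation type, object)

# Hypothetical user-specific relations between placeholder data categories.
RELATIONS = [
    ("DC-A", "equivalent-to", "DC-B"),
    ("DC-B", "is-a", "DC-C"),
    ("DC-D", "part-of", "DC-E"),
]


def related(start: str,
            relations: Iterable[Relation],
            follow: Tuple[str, ...] = ("is-a", "equivalent-to")) -> Set[str]:
    """Breadth-first search over the selected relation types.

    Simplification: every followed relation is treated as undirected.
    """
    neighbours: Dict[str, Set[str]] = {}
    for subject, relation, obj in relations:
        if relation in follow:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


print(sorted(related("DC-A", RELATIONS)))
```

Restricting `follow` to a subset of relation types lets a user decide, per search, which kinds of relatedness count.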

  24. Registry network (diagram) • Relation registries: MPI RR, Typological Database System RR • Data category registries: MPI DCR, ISO DCR • Linguistic resources: resource, TDS database, MPI archive

  25. Status of ISOcat • ISOcat is under active development: • Now: • You can access public data categories and selections • You can create your own data categories and selections • You can share your data categories and selections with others (everyone, or a specified group) • Future: • Some social features (forum to discuss specific data categories) • Cleanup of profiles by TDGs • Import external ‘data category’ sets, such as: • parts of the ISO Concept Database • Dublin Core • TEI • Standardization workflow • High availability (mirrors) • Relation registry

  26. Thanks for your attention! http://www.isocat.org/ isocat@mpi.nl Menzo.Windhouwer@mpi.nl
