Beyond ISOcat

Learn how to associate data categories with your linguistic resources using Persistent Identifiers (PIDs). Discover where to place PIDs and how to make semantics explicit using schemas, metadata, and annotation properties.

linob

Presentation Transcript


  1. Beyond ISOcat CLARIN-NL 2012 ISOcat tutorial

  2. Vision
     [Diagram: three layers of the envisioned infrastructure]
     • Relation registries: MPI RR, Typological Database System RR
     • Data category registries: MPI DCR, ISO DCR
     • Linguistic resources: resource, TDS database, MPI archive

  3. How to make semantics explicit?
     • Associate data categories with your resources
       • using the PIDs
     • Where to put the PIDs?
       • Preferably in a schema
       • Or in the resource itself (redundant)
       • Or in the metadata of the resource (less specific)

  4. What is a schema?
     • "comes from the Greek word "σχήμα" (skhēma), which means shape, or more generally, plan." (wikipedia)
     • A collection of building blocks and rules on how to combine them into a valid resource
     • XML document:
       • DTD, XML Schema, Relax NG, …
       • easy; see http://www.isocat.org/12620/
     • RDF graph
       • annotation property
       • easy; see http://www.isocat.org/ns/dcr.rdf
     • Text document:
       • A grammar
       • Extended Backus–Naur Form (EBNF)
       • ...
       • how to embed Data Category PIDs?
     • …

  5. XML resource
     <lmf:lexicon xml:lang="jp" alphabet="ipa">
       <lmf:entry>
         <lmf:lemma>
           <lmf:writtenForm>nihongo</…>
           …
         </…>
         …
       </…>
       …
     </…>

  6. XML resource
     <lmf:lexicon xml:lang="jp" alphabet="ipa">
       <lmf:entry>
         <lmf:lemma>
           <lmf:writtenForm
               dcr:datcat="http://www.isocat.org/datcat/…">
             nihongo
           </…>
           …
         </…>
         …
       </…>
       …
     </…>
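
Once a resource carries dcr:datcat attributes as on the slide above, a consumer can list the referenced data categories with any namespace-aware XML parser. The following is a minimal sketch, not part of the tutorial. It assumes that the dcr prefix is bound to http://www.isocat.org/ns/dcr; the LMF namespace URI and the PID are hypothetical placeholders, because the slide elides the real PID.

```python
import xml.etree.ElementTree as ET

# Assumption: the dcr prefix is bound to this namespace URI.
DCR_NS = "http://www.isocat.org/ns/dcr"
# Hypothetical placeholders: the slide shows neither the LMF namespace nor the PID.
LMF_NS = "http://example.org/lmf"
PLACEHOLDER_PID = "http://www.isocat.org/datcat/DC-EXAMPLE"

SAMPLE = f"""
<lmf:lexicon xmlns:lmf="{LMF_NS}" xmlns:dcr="{DCR_NS}"
             xml:lang="jp" alphabet="ipa">
  <lmf:entry>
    <lmf:lemma>
      <lmf:writtenForm dcr:datcat="{PLACEHOLDER_PID}">nihongo</lmf:writtenForm>
    </lmf:lemma>
  </lmf:entry>
</lmf:lexicon>
"""


def collect_datcats(xml_text):
    """Return (local element name, PID, text) for each element with dcr:datcat."""
    root = ET.fromstring(xml_text)
    attr = f"{{{DCR_NS}}}datcat"  # ElementTree spells qualified names as {uri}local
    found = []
    for elem in root.iter():
        pid = elem.get(attr)
        if pid is not None:
            local_name = elem.tag.split("}", 1)[-1]
            found.append((local_name, pid, (elem.text or "").strip()))
    return found


annotations = collect_datcats(SAMPLE)
for name, pid, text in annotations:
    print(f"{name} = {text!r} is an instance of {pid}")
```

Annotating every instance element this way repeats the same PID throughout the resource, which is why slide 3 calls this option redundant and prefers the schema.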

  7. XML Relax NG schema
     <rng:attribute name="alphabet"
         dcr:datcat="http://www.isocat.org/datcat/…">
       <rng:value dcr:datcat="http://www.isocat.org/datcat/…">
         ipa
       </…>
       …
     </…>
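
When the PIDs live in the schema, a tool can build one lookup table per schema and reuse it for every resource that conforms to it. The sketch below illustrates that idea under stated assumptions: the Relax NG namespace is the standard one, the dcr namespace URI is assumed to be http://www.isocat.org/ns/dcr, and both PIDs are hypothetical placeholders standing in for the values elided on the slide.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"  # standard Relax NG namespace
DCR_NS = "http://www.isocat.org/ns/dcr"         # assumed dcr namespace
# Hypothetical placeholder PIDs (the slide elides the real ones).
ATTR_PID = "http://www.isocat.org/datcat/DC-ATTRIBUTE-EXAMPLE"
VALUE_PID = "http://www.isocat.org/datcat/DC-VALUE-EXAMPLE"

SCHEMA = f"""
<rng:attribute xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}"
               name="alphabet" dcr:datcat="{ATTR_PID}">
  <rng:value dcr:datcat="{VALUE_PID}">ipa</rng:value>
</rng:attribute>
"""


def schema_datcats(schema_text):
    """Map (pattern kind, name or literal value) to the annotated PID."""
    root = ET.fromstring(schema_text)
    datcat = f"{{{DCR_NS}}}datcat"
    mapping = {}
    for elem in root.iter():
        pid = elem.get(datcat)
        if pid is None:
            continue
        kind = elem.tag.split("}", 1)[-1]
        # Named patterns are keyed by @name, value patterns by their literal text.
        key = elem.get("name") or (elem.text or "").strip()
        mapping[(kind, key)] = pid
    return mapping


mapping = schema_datcats(SCHEMA)
# Resolve alphabet="ipa" from a resource without touching the resource itself.
print(mapping[("attribute", "alphabet")], mapping[("value", "ipa")])
```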

  8. CGN/DCOI grammar with DC references
     http://lux13.mpi.nl/schemacat/schema/CGN (early alpha version)
     (* @dcr:datcat 'N' http://www.isocat.org/datcat/DC-4909 *)
     ...
     tag = 'N', '(', NTYPE, ',', GETAL, ',', GRAAD, ',', GENUS, ',', NAAMVAL, ')'
     ...
     (* @dcr:datcat NTYPE http://www.isocat.org/datcat/DC-4908 *)
     (* @dcr:datcat 'soortnaam' http://www.isocat.org/datcat/DC-4910 *)
     (* @dcr:datcat 'eigennaam' http://www.isocat.org/datcat/DC-4911 *)
     NTYPE = 'soortnaam' | 'eigennaam' ;
     ...
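
Because the data category references in this grammar sit inside EBNF comments, they can be recovered with a plain pattern match before any real EBNF parsing happens. The sketch below assumes the comment convention is exactly the one visible on the slide, `(* @dcr:datcat SYMBOL PID *)`; the regular expression and function name are illustrative, not part of SCHEMAcat.

```python
import re

# Excerpt of the annotated grammar as shown on the slide.
GRAMMAR = """
(* @dcr:datcat 'N' http://www.isocat.org/datcat/DC-4909 *)
tag = 'N', '(', NTYPE, ',', GETAL, ',', GRAAD, ',', GENUS, ',', NAAMVAL, ')'
(* @dcr:datcat NTYPE http://www.isocat.org/datcat/DC-4908 *)
(* @dcr:datcat 'soortnaam' http://www.isocat.org/datcat/DC-4910*)
(* @dcr:datcat 'eigennaam' http://www.isocat.org/datcat/DC-4911*)
NTYPE = 'soortnaam' | 'eigennaam' ;
"""

# One annotation comment: symbol, then PID, with optional space before "*)".
ANNOTATION = re.compile(r"\(\*\s*@dcr:datcat\s+(\S+)\s+(\S+?)\s*\*\)")


def datcat_annotations(grammar_text):
    """Map each annotated grammar symbol to its data category PID.

    Quotes are kept so that the terminal 'N' stays distinct from a
    non-terminal that might be called N.
    """
    return {symbol: pid for symbol, pid in ANNOTATION.findall(grammar_text)}


datcats = datcat_annotations(GRAMMAR)
for symbol, pid in datcats.items():
    print(f"{symbol:12} -> {pid}")
```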

  9. Multiple DCRs?
     • Actually, we don't need multiple DCRs to have overlapping subsets
     • Overlaps arise because:
       • Data categories are typed, and might not have the type you need
         • POS field (closed DC) of the lexical entry "walk" gets the value 'verb' (simple DC)
           • PoS = 'verb'
         • Verb (open DC) feature of a feature structure gets the value "walk"
           • Verb = 'walk'
       • External sets are imported just as they are
         • NKJP, GOLD, STTS, …
         • Only some owners make the effort to also provide mappings
       • There might be very fine differences between your data category and an existing one, and the owner doesn't want to adapt
     • Still, we would like to know that these data categories are the same or almost the same!

  10. Relation Registry - RELcat
     • http://lux13.mpi.nl/relcat/
       • (alpha version)
     • Stores user specific sets of relations:
       • language ID (isocat:DC-2482) relcat:sameAs dc:language
       • language name (isocat:DC-2484) relcat:sameAs dc:language
       • time coverage (isocat:DC-1502) relcat:subClassOf dc:coverage
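
A user-specific relation set like the one on this slide can be pictured as a list of subject, relation, object triples. The sketch below reads the slide's diagram that way (an interpretation of the flattened figure, not RELcat's actual storage format or API) and shows how a search on a Dublin Core element could be widened to the ISOcat data categories related to it.

```python
from collections import defaultdict

# Triples as read from the slide; the pairing of subjects and objects is an
# interpretation of the diagram.
RELATIONS = [
    ("isocat:DC-2482", "relcat:sameAs", "dc:language"),     # language ID
    ("isocat:DC-2484", "relcat:sameAs", "dc:language"),     # language name
    ("isocat:DC-1502", "relcat:subClassOf", "dc:coverage"),  # time coverage
]


def index_by_target(relations):
    """Group (subject, relation type) pairs by the object they point to."""
    index = defaultdict(list)
    for subject, rel_type, obj in relations:
        index[obj].append((subject, rel_type))
    return dict(index)


by_target = index_by_target(RELATIONS)
# A query on dc:language can now also match both ISOcat language categories.
print(by_target["dc:language"])
```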

  11. Relation types
     • There already exist large collections of relations with their own vocabularies, e.g., OWL (2), SKOS, ...
     • RELcat has a basic relation type hierarchy
       • rel:related
         • rel:sameAs
         • rel:almostSameAs
         • rel:broaderThan
           • rel:superClassOf
           • rel:hasPart
         • rel:narrowerThan
           • rel:subClassOf
           • rel:partOf
     • which can be extended for other vocabularies
       • rel:sameAs
         • owl:sameAs
         • skos:exactMatch
       • rel:almostSameAs
         • skos:closeMatch
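
The point of a relation type hierarchy is that a client can ask for a general type and still find relations stated with a more specific or vocabulary-specific type. The sketch below encodes the hierarchy as a child-to-parent table; the nesting (for example rel:hasPart under rel:broaderThan) is inferred from the order of the bullets on the slide and should be treated as an assumption, and the function names are illustrative.

```python
# Child -> parent links; nesting inferred from the slide (assumption).
PARENT = {
    "rel:sameAs": "rel:related",
    "rel:almostSameAs": "rel:related",
    "rel:broaderThan": "rel:related",
    "rel:narrowerThan": "rel:related",
    "rel:superClassOf": "rel:broaderThan",
    "rel:hasPart": "rel:broaderThan",
    "rel:subClassOf": "rel:narrowerThan",
    "rel:partOf": "rel:narrowerThan",
    # Extensions for other vocabularies
    "owl:sameAs": "rel:sameAs",
    "skos:exactMatch": "rel:sameAs",
    "skos:closeMatch": "rel:almostSameAs",
}


def ancestors(rel_type):
    """Return the chain of more general relation types, nearest first."""
    chain = []
    while rel_type in PARENT:
        rel_type = PARENT[rel_type]
        chain.append(rel_type)
    return chain


def is_a(rel_type, general):
    """True if rel_type equals general or is a specialisation of it."""
    return rel_type == general or general in ancestors(rel_type)


# A strict search accepts only equivalence; a lenient one accepts any relation.
strict = [t for t in PARENT if is_a(t, "rel:sameAs")]
print(strict)
print(is_a("skos:closeMatch", "rel:related"))
```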

  12. RELcat usage
     • RELcat is still in an alpha phase
       • no user interface yet
       • upload of relations via the system administrator
         • isocat@mpi.nl
       • however, there is a read-only API, which is used by (experimental) parts of the CLARIN infrastructure, e.g., the CMDI semantic mapping component

  13. Another new kitten: SCHEMAcat
     • Resource schemata of any type should be stored somewhere persistently
       • Get a PID
     • These schemata are preferably annotated with data categories
       • SCHEMAcat → ISOcat
     • These data categories will then have (typed) relationships among each other
       • SCHEMAcat → RELcat
     • Status: very early alpha, but some schemata are already available
       • CGN: http://lux13.mpi.nl/schemacat/schema/CGN

  14. A whole litter!
     [Diagram: how the registries relate to a resource]
     • Resources: Linguistic resource (schema), Linguistic knowledge base
     • Links: Data categories, Containers, Concepts, Relation
     • Registries: Schema Registry - SCHEMAcat, Data Category Registry - ISOcat, Concept Registry, Relation Registry - RELcat
