2013-05-17 - Utrecht
Matej Ďurčo, ICLTT, Vienna
Controlled Vocabularies and SMC4LRT
Semantic Mapping in CMDI
Context
Activities:
• CLARIN taskforce (within SCCTC), building on CLAVAS - Vocabulary Alignment Service for CLARIN
• DARIAH joint taskforce, VCC1/Task 5: Data federation and interoperability and VCC3/Task 3: Reference Data Registries (and external partners). Goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community.
• SMC - Semantic Mapping Component, a module in the CMD-Infrastructure. Goal: "semantic search" = enhance the search in the heterogeneous data collection (of CMDI)
  a) by exploiting the shared data categories (SMC on schema level)
  b) by expressing the data in RDF (SMC on instance level)
Context II - CLARIN-AT
• CCV - CLARIN Center Vienna: Center Profile CMD record, http://clarin.aac.ac.at/ccv/index.html, expected ready by: 2013-06
Infrastructure services:
• CLARIN Metadata Repository
• SMC - Semantic Mapping Component
• SMC-Browser
• Controlled Vocabularies: engagement in CLARIN + DARIAH task forces
Old vision
[Figure: conceptualization sketch from 2009]
Potential usages for CV
• Metadata Generation, Curation
• Data-Enrichment / Annotation
• Data Analysis
• Search (query expansion, autocomplete, facets etc.)
• needed for CMD2RDF
  - provide identifiers for entities
  - provide equivalencies between concepts/entities from different vocabularies (concept schemes), like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe): GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065
Related Activities
• DARIAH Schema Registry + Crosswalk Registry
• LT-World @DFKI: full-blown ontology with People, Projects, Organisations, Events, LR; integration would have to happen at another level (RDF/LOD).
• CoNE - Control of Named Entities @MPDL/eSciDoc, http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities
• EATS - Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC), http://eats.readthedocs.org/en/latest/
• TextGrid
• http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html
• FRBR - Functional Requirements for Bibliographic Records
• RDA - Resource Description and Access
• http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
Candidate Vocabularies
• Data Categories / Concepts - ISOcat
• Languages - ISO-639
• Countries - country codes
• Persons - GND, VIAF, dbpedia?
• Organizations - GND, VIAF, dbpedia?
• Subject headings (Schlagwörter) - GND, LCSH
• Resource Typology -
• Tagsets!? (with mappings between tags)
AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File
ISOcat and CLAVAS
• export closed + simple DCs (perhaps even better to manually select)
• Third-party applications use
  - ISOcat for explain() function
  - CLAVAS for value(/entity)-lists
Informed query input
Information about available data categories and values for those categories can be used as the base for a complex query-input widget with context-sensitive autocomplete; however, this serves rather only as a fallback to autocomplete based on actual data.
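A minimal sketch of such a context-sensitive autocomplete, assuming a plain in-memory vocabulary. The names `VOCABULARY`, `suggest` and `observed` are hypothetical, and the values are illustrative only; suggestions come from the actual data when available and fall back to the controlled vocabulary otherwise.

```python
# Hypothetical controlled vocabulary: data category -> known values (illustrative only).
VOCABULARY = {
    "languageName": ["Dutch", "English", "German", "Greek"],
    "resourceType": ["corpus", "lexicon", "tool"],
}


def suggest(category, prefix, observed=None, limit=5):
    """Return up to `limit` completions of `prefix` for the given category.

    `observed` maps a category to values found in the actual data; it is
    preferred when it has entries for the category, otherwise the
    controlled vocabulary serves as the fallback.
    """
    source = observed.get(category) if observed else None
    if not source:
        source = VOCABULARY.get(category, [])
    needle = prefix.casefold()
    matches = [value for value in source if value.casefold().startswith(needle)]
    return sorted(matches)[:limit]
```

Calling `suggest("languageName", "g")` without observed data falls back to the vocabulary; passing `observed` restricts suggestions to values that actually occur in the records.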
CMD RDF
• Semantic Mapping on instance level: express MD records in RDF (for LOD) => bind also values in MD fields to concepts
• Modelling aspects
  • CMD Specification
  • Data Categories
  • CMD instances:
    - Identifier, Provenance, Hierarchy,
    - Components, Elements,
    - Values, Literal Values, Mapping to entities - Vocabularies => CLAVAS
  • Ontological Relations
[Figure: used namespaces]
Approach - Individuals/Instance Level
One step when (pre)processing incoming new MD-sets:
• Express MD-Records as RDF-triples: <#mdrecord #property "string-value">
• Identify potential target Domain Ontologies/Vocabularies
• Create inverted Index: label → entity
• Define lookup function: lookup(category, string-value) → <external-entity, measure>
• Enrich dataset with new facts: <#mdrecord #property #external-entity>
• Property-values of Metadata-Records are linked to individuals of domain ontologies
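The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: triples are plain tuples, the entity URIs under `example.org` are made up, and string similarity from the standard library `difflib` stands in for whatever matching measure would really be used.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary entries: (category, label, entity URI). URIs are invented.
ENTRIES = [
    ("Organisation", "Meertens Instituut", "http://example.org/org/meertens"),
    ("Organisation", "DFKI", "http://example.org/org/dfki"),
    ("Language", "German", "http://example.org/lang/deu"),
]


def build_index(entries):
    """Create the inverted index: category -> {normalized label -> entity}."""
    index = {}
    for category, label, entity in entries:
        index.setdefault(category, {})[label.casefold()] = entity
    return index


def lookup(index, category, value):
    """lookup(category, string-value) -> (external-entity, measure)."""
    best = (None, 0.0)
    needle = value.casefold()
    for label, entity in index.get(category, {}).items():
        score = SequenceMatcher(None, needle, label).ratio()
        if score > best[1]:
            best = (entity, score)
    return best


def enrich(triples, index, category_of, threshold=0.9):
    """Add <record property entity> facts for string values that match well enough.

    `category_of` maps a property to the vocabulary category to search in;
    properties without a category are left untouched.
    """
    new_facts = []
    for record, prop, value in triples:
        category = category_of.get(prop)
        if category is None:
            continue
        entity, measure = lookup(index, category, value)
        if entity is not None and measure >= threshold:
            new_facts.append((record, prop, entity))
    return triples + new_facts
```

The original string-valued triples are kept and the entity links are added alongside them, so a wrong match can be traced back and removed; the `threshold` value is an arbitrary choice for the sketch.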
Candidate Categories/Properties
• ResourceType, Format, AnnotationLevelType → map to: isocat-DataCategories (Profiles: Metadata, Morphosyntax, ...)
• Genre, Topic, Subject → map to: Taxonomies, Library Classification systems (LCSH, DDC, Dornseiff, ...)
• Project, Institution, Person, Publisher: open controlled vocabularies (real entities) → map to: CLAVAS-organisations, LT-World (perhaps others: LCCN, DBPedia?)
Next Steps
• Install current OpenSKOS at CCV - CLARIN Center Vienna
  • synchronize 3 current datasets via OAI-PMH with sister instance at Meertens, also to test the synchronization process (and implications)
• CMD2RDF
• "special groups vocabularies" in CLARIN-AT
  • Plant names
  • Instruments
Appendix
Explanations to SMC and CMDI
Semantic Mapping (schema level) - concept
• metadata fields in (completely) different profiles but bound to (the same) data categories (ConceptLinks)
• use this linkage when searching in the data, i.e. allow the user to search
  • "in the data category"
  • in a MD field but also all related fields from other profiles
• Multiple mapping levels:
  • just mapping based on the ConceptLink resolvable via Component Registry: different elements pointing to the same DatCat
  • use equivalence relations between DatCats from Relation Registry
  • use equivalence relations also between Container DatCats
  • use also other relations in Relation Registry (subClassOf, almostSameAs, ...)
  • apply selected (user defined) relation sets from Relation Registry
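The mapping levels can be illustrated with a small sketch. The ConceptLinks below are taken from the examples in this deck; the relation set `demo` and its single `almostSameAs` pair are hypothetical, as are all function names. As a simplification, every relation is followed in both directions, which would not be appropriate for a directed relation such as subClassOf.

```python
# ConceptLinks: CMD element path -> data category (subset of the deck's examples).
CONCEPT_LINKS = {
    "Session.Name": "isocat:DC-2544",
    "collection.GeneralInfo.Name": "isocat:DC-2544",
    "ToolService.Documentation.DocumentationLanguages.Language.LanguageName": "isocat:DC-2484",
    "OLAC-DcmiTerms.language": "dct:language",
}

# Hypothetical relation set; the pairing is illustrative, not from the Relation Registry.
RELATION_SETS = {
    "demo": [("isocat:DC-2484", "almostSameAs", "dct:language")],
}


def related_categories(datcat, relation_sets=(), relation_types=("sameAs",)):
    """Collect data categories reachable from `datcat` via the selected relations."""
    found = {datcat}
    frontier = [datcat]
    while frontier:
        current = frontier.pop()
        for name in relation_sets:
            for left, relation, right in RELATION_SETS.get(name, []):
                if relation not in relation_types:
                    continue
                # Simplification: relations are treated as symmetric.
                for source, target in ((left, right), (right, left)):
                    if source == current and target not in found:
                        found.add(target)
                        frontier.append(target)
    return found


def fields_for(datcat, relation_sets=(), relation_types=("sameAs",)):
    """Return all CMD element paths bound to `datcat` or to a related category."""
    categories = related_categories(datcat, relation_sets, relation_types)
    return sorted(path for path, dc in CONCEPT_LINKS.items() if dc in categories)
```

With no relation set selected, `fields_for` only uses the ConceptLinks (first mapping level); passing `relation_sets=("demo",)` and `relation_types=("almostSameAs",)` widens the result to fields bound to related data categories.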
CMDI linking
• components and elements in CMD profiles are bound to data categories
• the CMD records reference their profiles
• in Relation Registry data categories are related to each other in separate (possibly overlapping/contradicting) relation sets
Semantic Mapping Component
• separate CMDI module
• relies on information from Component Registry, DCR, Relation Registry
• is used by Metadata Repository / Service / Browser
• Task: resolution: dcrIndex ↔ cmdIndex
  dcrIndex :: (abstract) data category defined in DCR
  cmdIndex :: path to a field in a MD record
• (different from
  - query expansion: CQL(datcat) → CQL(cmdIndex[])
  - query translation: e.g. CQL → XPath)
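A rough sketch of the resolution task and of a query expansion built on top of it. The table `DCR_TO_CMD` holds two paths from this deck's languageID example; the function names are hypothetical, and the generated query string is schematic only (real CQL index naming and escaping rules may differ).

```python
# dcrIndex -> cmdIndex[] (two paths from the deck's languageID example).
DCR_TO_CMD = {
    "isocat:DC-2482": [
        "Session.Resources.WrittenResource.LanguageId",
        "GTRP.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code",
    ],
}


def dcr_to_cmd(dcr_index):
    """Resolve a data category to the CMD field paths bound to it."""
    return list(DCR_TO_CMD.get(dcr_index, []))


def cmd_to_dcr(cmd_index):
    """Resolve a CMD field path back to its data category, if any."""
    for dcr_index, paths in DCR_TO_CMD.items():
        if cmd_index in paths:
            return dcr_index
    return None


def expand_query(dcr_index, term):
    """Query expansion: one clause per resolved cmdIndex, joined with 'or'."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = [f'{path} = "{escaped}"' for path in dcr_to_cmd(dcr_index)]
    return " or ".join(clauses)
```

The resolution functions only map between indexes; `expand_query` is a separate step that consumes the resolved paths, which mirrors the distinction drawn on the slide.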
Examples of DCR use in CMD metadata
resourceName isocat:DC-2544
• CorpusProfile.Corpus.Metadata.Name
• CorpusProfile.Corpus.SourceList.Source.Name
• collection.GeneralInfo.Name
• Session.Name
• imdi-corpus.Corpus.Name
• ToolService.GeneralInfo.Name
• GTRP.Collection.GeneralInfo.Name
• DIDDD.Collection.GeneralInfo.Name
• Soundbites.Collection.GeneralInfo.Name
• DynaSAND.Collection.GeneralInfo.Name
BUT:
CMD Element: "Name"
• http://www.isocat.org/datcat/DC-2544
• http://www.isocat.org/datcat/DC-2536
• http://www.isocat.org/datcat/DC-4160
• http://www.isocat.org/datcat/DC-4176
• http://www.isocat.org/datcat/DC-4180
• http://purl.org/dc/elements/1.1/rights
• http://purl.org/dc/elements/1.1/contributor
• http://www.isocat.org/datcat/DC-2454
• http://www.isocat.org/datcat/DC-2557
• …
Examples of DCR use in CMD metadata II
languageID isocat:DC-2482
• LrtInventoryResource.LrtCommon.Languages.ISO639.iso-639-3-code
• Session.MDGroup.Content.Content_Languages.Content_Language.Id
• Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
• Session.Resources.WrittenResource.LanguageId
• ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
• ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
• GTRP.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
• DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
• DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
languageName isocat:DC-2484
• ToolService.Documentation.DocumentationLanguages.Language.LanguageName
• ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
• GTRP.Collection.DocumentationLanguages.Language.LanguageName
• DIDDD.Collection.DocumentationLanguages.Language.LanguageName
• DynaSAND.Collection.DocumentationLanguages.Language.LanguageName
dct:language
• OLAC-DcmiTerms.language
metadataLanguage isocat:DC-2543
• CorpusProfile.Corpus.Metadata
dominantLanguage isocat:DC-2468
• Session.MDGroup.Content.Content_Languages.Content_Language.Dominant
sourceLanguage isocat:DC-2494
• Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage
targetLanguage isocat:DC-2499
• Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage
implementationLanguage isocat:DC-3798
• ToolService.Tool.Implementation.implementationLanguage
[Figure: DCR usage in Component Registry, as of 2012-05]
[Figure: Components structure]
SMC Browser
Explore the Component Metadata Framework
• Profile specifications from Component Registry visualized as interactive graphs
• statistics (about reuse of Components)
http://clarin.aac.ac.at/smc-browser/
TODO
• feed with statistics of the instance data
• add relations from RELcat
• add operations on graphs (intersection, difference, ...)