SemanTic Interoperability To access Cultural Heritage Lourens van der Meij Antoine Isaac Marjolein van Gendt OLP AIO Workshop January 27th 2006
STITCH Pilot Project Outline • Pilot Project introduction • Goals • Collection selection • Mapping aspect of Pilot Project • Thesauri formalisation • Mapping tools • Output of mapping task • Lessons learned
STITCH Pilot Project Current Cultural Heritage (CH) Situation
STITCH Pilot Project Research and development in CH • Portals for access to heterogeneous collections: different databases, vocabularies and metadata (MD) schemes • Syntactic interoperability: access can be granted • Semantic interoperability: links with the original vocabularies and MD structures are lost
STITCH Pilot Project Current Development in CH
STITCH Pilot Project Pilot Project Goals • Show in a small use case, using • Two Cultural Heritage collections • Two controlled vocabularies • Existing mapping tools • Existing SW techniques – SKOS, RDF, RDFS, Sesame • (representation, reasoning, storage, mapping) • That: Semantic links between controlled vocabularies can result in integrated access to heterogeneous Cultural Heritage collections
STITCH Pilot Project STITCH ultimate goal
STITCH Pilot Project Pilot Project Modules
STITCH Pilot Project Collection selection (1/3) • Domain: Cultural Heritage • Collections: • Medieval Illuminated Manuscripts from KB • Masterpieces from Rijksmuseum
STITCH Pilot Project Collection selection (2/3) • Controlled vocabularies: Iconclass • Illuminated Manuscripts • > 24,000 concepts • 10+ levels • Keys • Structural digits • Cross-references • Bracketed text • etc. • Example record: Lupus (wolf); Fol. 62r: column min. 50x60; Iconclass: 25F23(WOLF)47I2133
STITCH Pilot Project Collection selection (3/3) • Controlled vocabularies: ARIA • Masterpieces • < 500 terms, some of them redundant • 2 levels • Fuzzy multi-inheritance • Top and Topia Terms • Example record: Title: The Artist Painting a Cow in a Meadow Landscape; Year: 1850; Artist: Hendrikus van de Sande Bakhuyzen; Technique: Oil on panel; Dimensions: 73.2 x 96.7 cm; Object number: SK-A-4163; Catalogue: Man, Self portraits, Cattle, Dutch landscapes, Fields and meadows
STITCH Pilot Project Outline • Pilot Project introduction • Goals • Collection selection • Mapping aspect of Pilot Project • Thesauri formalisation • Mapping tools • Output of mapping task • Lessons learned
STITCH Pilot Project Thesauri Formalisation • ARIA • CHIP issued a SKOS version • Only used Topia Terms • Iconclass • SKOS • Only used basic hierarchy • No keys/structural digits/keywords
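A minimal sketch of what a "basic hierarchy only" SKOS formalisation could look like, written as plain Turtle text. The function name, the `ex:` namespace and the concept identifiers are hypothetical; only `skos:Concept`, `skos:prefLabel` and `skos:broader` come from the SKOS vocabulary itself. It assumes identifiers are simple alphanumeric strings, which real Iconclass notations are not.

```python
def to_skos_turtle(concepts, base="http://example.org/voc/"):
    """Serialise {id: (label, parent_id or None)} as SKOS in Turtle syntax."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
        "",
    ]
    for cid, (label, parent) in sorted(concepts.items()):
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"ex:{cid} a skos:Concept ;")
        if parent is not None:
            lines.append(f'    skos:prefLabel "{escaped}"@en ;')
            lines.append(f"    skos:broader ex:{parent} .")
        else:
            lines.append(f'    skos:prefLabel "{escaped}"@en .')
    return "\n".join(lines)


# Hypothetical two-concept hierarchy, not taken from ARIA or Iconclass.
concepts = {"animals": ("Animals", None), "wolf": ("Wolf", "animals")}
turtle = to_skos_turtle(concepts)
```

Keys, structural digits and keywords have no place in this representation, which mirrors the simplification described on the slide.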
STITCH Pilot Project Mapping tools • S-Match, Trento • Required input: TAB-indented trees • Tree-like structures mapper • http://dit.unitn.it/~accord/ • Falcon-AO, Nanjing • Required input: standard RDFS Class/subClassOf • Subdivision of Iconclass • Standard OWL ontology mapper • http://xobjects.seu.edu.cn/project/falcon/falcon.htm • Method • Lexical/element-level matching • Oracle (e.g. WordNet) • Structure matching
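The pre-processing step for a tree mapper can be sketched as follows: a thesaurus hierarchy is flattened into a TAB-indented tree, one label per line. This is an illustration only. The slide says no more about the S-Match input than "TAB indented trees", so the exact layout, the function name and the sample data are assumptions. The sketch also assumes each concept has a single parent, which does not hold for a vocabulary with multi-inheritance.

```python
from collections import defaultdict


def to_tab_indented(broader, labels):
    """Serialise a single-parent hierarchy as a TAB-indented tree.

    broader: dict mapping concept id -> parent id (None for a root)
    labels:  dict mapping concept id -> display label
    """
    children = defaultdict(list)
    for concept, parent in broader.items():
        children[parent].append(concept)

    lines = []

    def walk(concept, depth):
        # One TAB per hierarchy level, then the concept label.
        lines.append("\t" * depth + labels[concept])
        for child in sorted(children[concept], key=labels.get):
            walk(child, depth + 1)

    for root in sorted(children[None], key=labels.get):
        walk(root, 0)
    return "\n".join(lines)


# Hypothetical mini-hierarchy, not taken from ARIA or Iconclass.
broader = {"nature": None, "animals": "nature", "wolf": "animals", "landscapes": "nature"}
labels = {"nature": "Nature", "animals": "Animals", "wolf": "Wolf", "landscapes": "Landscapes"}
tree_text = to_tab_indented(broader, labels)
```

Everything except labels and parent links is dropped in this conversion, which is one reason the input needs pre-processing from thesaurus data.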
STITCH Pilot Project Output of mapping task • Output format • S-Match: • Less General (LG) • More General (MG) • Equivalence • Iconclass vs. ARIA only yields "Iconclass (IC) less general than ARIA" • Falcon-AO: • Equivalence • Confidence measure (always 1) • Sequence of mappings might indicate usefulness • Application-specific requirements • UI needs precision • Annotators might need recall
STITCH Pilot Project Output of mapping task (S-Match) – nice results
STITCH Pilot Project Output of mapping task (S-Match) – awful results
STITCH Pilot Project Lessons learned • Annotation of results • Lexical matching • Gloss vs. label • NLP • Inconvenient priorities are given to lexical elements • rdfs:label vs. rdf:about/ID • Oracle-based matching • WordNet sense disambiguation • Structure-based matching • Structure overvaluation (BT vs. NT vs. EQ) • Thesaurus simplicity makes it (almost?) useless • No attributes, fuzzy hierarchies • Differences in hierarchical structure levels • Complex structure-based algorithms are not always intuitive
STITCH Pilot Project Lessons learned • Annotation of results (cont'd) • Output format • Wrong kind of relation (RT, siblings) • 1-1 mapping • Precision: • S-Match: 41% (subset of IC) • Falcon-AO: from 1 in 1000 (subset of IC) • to 5% if the data is tricked • to 52% if an artificial but realistic threshold is introduced • Manual cleaning needed for use in UI • Expert mapping • Size of vocabularies • Ambiguous cases, e.g.: is Nature/World as celestial body/Animals equal to or a subclass of Animals? • To be continued
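The precision figures on this slide are of the form "correct mappings divided by proposed mappings". The sketch below shows that computation, together with one possible reading of the threshold: a cut-off on the rank of a mapping in the tool's output sequence. That reading is an assumption, since the slides do not say how the threshold was defined, and the sample pairs are invented, not project results.

```python
def precision(proposed, correct):
    """Share of proposed mappings that an annotator judged correct."""
    proposed = list(proposed)
    if not proposed:
        return 0.0
    hits = sum(1 for pair in proposed if pair in correct)
    return hits / len(proposed)


# Hypothetical ranked tool output and manual judgements.
ranked = [
    ("ic:1", "aria:A"),
    ("ic:2", "aria:B"),
    ("ic:3", "aria:C"),
    ("ic:4", "aria:D"),
]
judged_correct = {("ic:1", "aria:A"), ("ic:2", "aria:B")}

overall = precision(ranked, judged_correct)
top_two = precision(ranked[:2], judged_correct)  # rank cut-off as the threshold
```

Raising precision this way discards mappings, so it suits a UI better than an annotation tool that needs recall.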
STITCH Pilot Project Lessons learned: Improvements • Lexical matching • Introduce NLP • Let only complete concepts match • … Further research (decipher black boxes) • Oracle-based matching • Stricter WordNet interpretation • Include other oracles • Structure-based matching • Create thesaurus-based structure mapping (RT, keywords, siblings) • … Further research (decipher black boxes)
STITCH Pilot Project Lessons learned: Conclusion We have ontology mappers, not thesaurus mappers • Input: needs pre-processing from thesaurus data • Output: needs re-interpretation of mapping relations • Mapping process • Using resources that may be absent from thesauri • E.g. properties • Not (properly) using all information found in thesauri • E.g. synonyms, RT, textual descriptions Leads to ‘low-quality’ thesaurus mapping
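The "re-interpretation of mapping relations" step could look roughly like this: ontology-style relations are translated into thesaurus-style links, and the reciprocal link is added. The codes `LG`, `MG` and `EQ` are shorthand for the S-Match relation names listed earlier, not the tool's actual output syntax, and reading subsumption as BT/NT is itself an assumption that the lessons learned call into question.

```python
# Ontology-style relation -> thesaurus-style relation (assumed reading).
THESAURUS_RELATION = {"LG": "BT", "MG": "NT", "EQ": "EQ"}
# BT and NT are reciprocals; equivalence is its own reciprocal.
RECIPROCAL = {"BT": "NT", "NT": "BT", "EQ": "EQ"}


def reinterpret(mappings):
    """Turn (source, relation, target) mappings into thesaurus links.

    A link (a, "BT", b) reads: concept a has broader term b.
    """
    links = set()
    for source, relation, target in mappings:
        thesaurus_relation = THESAURUS_RELATION[relation]
        links.add((source, thesaurus_relation, target))
        links.add((target, RECIPROCAL[thesaurus_relation], source))
    return links


# Hypothetical mapping: an Iconclass concept less general than an ARIA term.
links = reinterpret([("ic:wolf", "LG", "aria:Animals")])
```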
STITCH Pilot Project Thanks! Any questions? • User Interface • Future work
STITCH Pilot Project Collections Access: Single View • Facets based on 1 point of view and its associated concept scheme(s) • Access to objects indexed against concepts from other schemes, provided a mapping exists between their index concepts and the concepts of the single view • Result: a single point of view on an integrated data set
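A minimal sketch of single-view access, assuming the mappings are available as a lookup from a concept of the view scheme to concepts of the other scheme. All identifiers and the data layout are hypothetical; the prototype's actual storage (Sesame, per the project goals) is not modelled here.

```python
def objects_for(concept, index, mappings):
    """Return ids of objects reachable from one concept of the view scheme.

    index:    dict mapping object id -> set of concepts it is indexed with
    mappings: dict mapping view concept -> set of mapped foreign concepts
    """
    targets = {concept} | mappings.get(concept, set())
    return sorted(obj for obj, concepts in index.items() if concepts & targets)


# Hypothetical data: one object per collection, each indexed with its own vocabulary.
index = {
    "painting-1": {"aria:Animals"},
    "manuscript-1": {"ic:animals"},
    "manuscript-2": {"ic:saints"},
}
mappings = {"aria:Animals": {"ic:animals"}}

with_mapping = objects_for("aria:Animals", index, mappings)
without_mapping = objects_for("aria:Animals", index, {})
```

Without the mapping only the painting is found; with it, the manuscript indexed in the other vocabulary appears under the same facet.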
STITCH Pilot Project Collections Access: Combined View • Search based on 2 points of view • One facet uses 1 vocabulary from 1 point of view • Facets attached to the different points of view are presented • Simultaneous access to different points of view of the same data
STITCH Pilot Project Collections Access: Merged View • Facets using a merged concept scheme • Mapping leads to hierarchical links between schemes • Making the links between vocabularies more visible during search • A way to ‘enrich’ weakly structured vocabularies
STITCH Pilot Project Future work • A lot to do for the rest of STITCH! • Method • Thinking about roadmap for using ontology matching techniques for CH voc. • Taking into account MD schemes (structure) • Evaluation of mappings • Use cases • KB • Other institutions and projects • Practical • Scalability of tools • Deployment for SW data (distributed/centralized) • Implementation of thesaurus-specific (adaptations of) tools
STITCH Pilot Project Future work • Concerning PP: • Mappings • Assessing criteria for proper application-specific evaluation • (Keep on) tuning tools to obtain better results for PP collections • Interface • Dynamic view switching/facet activation • Better use of all kinds of exploitable relationships • RT-like • Expert evaluation of the whole prototype • Integrating other collections
STITCH Pilot Project What’s a thesaurus (Wikipedia) • A list of every important term (single-word or multi-word) in a given domain of knowledge; and • A set of related terms for each term in the list. • Possible relations and additions: • Scope Note • Related Term (RT) • Broader Term (BT) • Narrower Term (NT) • BT and NT are reciprocals • Use (USE) = non-preferred term -> preferred term • Used For (UF) = preferred term -> non-preferred term
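The relations listed above can be captured in a small data structure. This is a sketch under assumptions: the class and method names are hypothetical, and the sample terms are only illustrative. It keeps BT/NT reciprocal and records USE/UF as a pair of opposite pointers.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    scope_note: str = ""
    broader: set = field(default_factory=set)   # BT
    narrower: set = field(default_factory=set)  # NT
    related: set = field(default_factory=set)   # RT
    used_for: set = field(default_factory=set)  # UF


class Thesaurus:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # USE: non-preferred label -> preferred label

    def add_term(self, label, scope_note=""):
        return self.terms.setdefault(label, Term(label, scope_note))

    def add_broader(self, narrow, broad):
        # BT and NT are reciprocals, so both directions are recorded.
        self.add_term(narrow).broader.add(broad)
        self.add_term(broad).narrower.add(narrow)

    def add_use(self, non_preferred, preferred):
        # USE points to the preferred term; UF points back.
        self.use[non_preferred] = preferred
        self.add_term(preferred).used_for.add(non_preferred)

    def preferred(self, label):
        return self.use.get(label, label)


thesaurus = Thesaurus()
thesaurus.add_broader("Cattle", "Animals")
thesaurus.add_use("Cows", "Cattle")
```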