Enrichment and Structuring of Archival Description Metadata

Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**
* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL (K.Zervanou@uvt.nl, Antal.vdnBosch@uvt.nl)
** National Centre for Text Mining, The University of Manchester, UK (Ioannis.Korkontzelos@manchester.ac.uk, Sophia.Ananiadou@manchester.ac.uk)

Research on Metadata • Developing standards: • collection-specific (e.g. EAD, MARC21) • cross-collection (e.g. Dublin Core) • Providing mappings: • across schemas • to ontologies (ad hoc or standard, e.g. CIDOC-CRM) • Discarding metadata for IR (Koolen et al., 2007) • Exploiting metadata for IR (Zhang & Kamps, 2009)

The IISH EAD dataset • EAD: XML standard for encoding archival descriptions • Challenges: • Variety of languages used • Varying type and amount of information • Style: enumerations, lists, incomplete sentences
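To make the format concrete, here is a minimal sketch of pulling free text out of an EAD record with Python's standard XML parser. The `<scopecontent>` and `<bioghist>` elements are part of EAD, but treating exactly these two as the free-text fields is an assumption for illustration; the sample record and the names `FREE_TEXT_TAGS`, `SAMPLE_EAD` and `extract_free_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed selection of free-text elements; the slides do not list the actual ones.
FREE_TEXT_TAGS = {"scopecontent", "bioghist"}

# Hypothetical, heavily simplified EAD record used only for this sketch.
SAMPLE_EAD = """<ead><archdesc level="collection">
  <did><unittitle>Example collection</unittitle></did>
  <bioghist><p>Founded in 1900; active in <geogname>Amsterdam</geogname>.</p></bioghist>
  <scopecontent><p>Minutes, correspondence, reports.</p></scopecontent>
</archdesc></ead>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{urn:...}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_free_text(xml_string, tags=FREE_TEXT_TAGS):
    """Return (element name, whitespace-normalised text) pairs in document order."""
    root = ET.fromstring(xml_string)
    results = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in tags:
            # itertext() also collects text inside inline markup such as <geogname>.
            text = " ".join("".join(elem.itertext()).split())
            results.append((name, text))
    return results


for name, text in extract_free_text(SAMPLE_EAD):
    print(f"{name}: {text}")
```

The enumeration-like style noted above is visible even in this toy record: the scope note is a bare list of document types rather than a full sentence.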

Motivation & Objectives • Improved search and retrieval • content-based metadata document clustering • content-based/semantic search • support exploratory search • link across collections, metadata formats & institutions • create unified metadata knowledge resources

Method overview

Pre-processing • EAD/XML element selection & extraction • EAD elements containing free text & archive content information • Language identification (n-gram method) • Identifier trained on the Europarl corpus • Text snippet length: ~20 tokens
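The language identification step can be illustrated with a rank-based character n-gram profile comparison. The slides only say "n-gram method", so the out-of-place rank distance used here is an assumption, not necessarily the identifier actually used. The two training sentences are toy stand-ins for the Europarl corpus, and all function and variable names are hypothetical.

```python
from collections import Counter


def snippets(text, size=20):
    """Split text into consecutive snippets of about `size` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def char_ngrams(text, n_max=3):
    """Yield character n-grams (n = 1..n_max) from space-padded, lowercased tokens."""
    for token in text.lower().split():
        padded = f" {token} "
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def profile(text, top_k=300):
    """Map the top_k most frequent n-grams of `text` to their frequency rank."""
    counts = Counter(char_ngrams(text))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams missing from the language cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(snippet, lang_profiles, top_k=300):
    """Return the language whose profile is closest to the snippet's profile."""
    doc = profile(snippet, top_k)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k))


# Toy training data (assumption): one sentence per language instead of Europarl.
TOY_TRAINING = {
    "en": "the archive contains the minutes of the meetings and the "
          "correspondence of the trade union with its members",
    "nl": "het archief bevat de notulen van de vergaderingen en de "
          "correspondentie van de vakbond met zijn leden",
}
LANG_PROFILES = {lang: profile(text) for lang, text in TOY_TRAINING.items()}

for snippet in ["the minutes of the meetings", "de notulen van de vergaderingen"]:
    print(snippet, "->", identify(snippet, LANG_PROFILES))
```

With realistic training data, each extracted EAD text would first be cut with `snippets(...)` and every snippet classified separately, which is one way to handle descriptions that mix languages.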

Snippet length based on language

Enrichment & Structuring • Topic detection: automatic term recognition using the C-value method • Agglomerative hierarchical term clustering: • complete, single & average linkage criteria • document co-occurrence & lexical similarity measures
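For reference, the C-value score of a candidate term a with frequency f(a) is log2|a| · f(a) when a never appears nested in a longer candidate, and log2|a| · (f(a) − (1/|T_a|) · Σ f(b)) over the longer candidates b in T_a that contain a otherwise. The sketch below computes exactly that on hand-made counts. The frequencies are invented toy values, the linguistic filter that produces candidates is omitted, and the function names are hypothetical.

```python
import math


def is_nested(short, long):
    """True if token tuple `short` is a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return n < len(long) and any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Score each candidate term (a tuple of tokens) from its corpus frequency.

    Single-word candidates get a score of 0 because log2(1) = 0; the measure
    is aimed at multi-word terms.
    """
    scores = {}
    for term, f_term in freq.items():
        containers = [other for other in freq if is_nested(term, other)]
        if containers:
            # Discount by the mean frequency of the longer candidates containing `term`.
            mean_container_freq = sum(freq[c] for c in containers) / len(containers)
            scores[term] = math.log2(len(term)) * (f_term - mean_container_freq)
        else:
            scores[term] = math.log2(len(term)) * f_term
    return scores


# Invented toy frequencies (assumption), not taken from the IISH data.
TOY_FREQ = {
    ("trade", "union"): 10,
    ("trade", "union", "congress"): 4,
    ("international", "trade", "union", "congress"): 2,
}

for term, score in sorted(c_value(TOY_FREQ).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

Here "trade union" is nested in two longer candidates, so its frequency of 10 is reduced by their mean frequency (3) before weighting by length.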

Term results (automatic evaluation)

Results • C-value: best performance for candidates that occur as non-nested at least once • Average linkage criterion & document co-occurrence: provide broader and richer hierarchies
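The average-linkage, document co-occurrence configuration can be sketched in a few lines of plain Python. The slides do not say how document co-occurrence is turned into a distance, so the Jaccard distance over the sets of documents containing each term is an assumption. The term-to-document table is invented, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(docs_a, docs_b):
    """1 minus the share of documents in which both terms occur (assumed measure)."""
    union = docs_a | docs_b
    return 1.0 - len(docs_a & docs_b) / len(union) if union else 1.0


def average_linkage(cluster_a, cluster_b, term_docs):
    """Mean pairwise distance between the terms of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(jaccard_distance(term_docs[a], term_docs[b]) for a, b in pairs) / len(pairs)


def agglomerate(term_docs, linkage=average_linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(term,) for term in sorted(term_docs)]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], term_docs),
        )
        distance = linkage(clusters[i], clusters[j], term_docs)
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented toy data (assumption): term -> ids of the documents it occurs in.
TOY_TERM_DOCS = {
    "trade union": {1, 2, 3},
    "labour movement": {1, 2, 3, 4},
    "party congress": {5, 6},
    "election campaign": {5, 6, 7},
}

for left, right, distance in agglomerate(TOY_TERM_DOCS):
    print(left, "+", right, "at distance", round(distance, 3))
```

Swapping `average_linkage` for a function that returns the minimum or maximum pairwise distance gives the single and complete linkage criteria, so the three can be compared on the same input.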

Questions? Check out our poster!
