Quality Enhancement in Metadata Repositories. Metadata Repositories for Interoperable/Shareable Metadata Part 2. A. The case of metadata repositories. A New Model for Multipurpose Union Catalog. Metadata repositories are established on open source principles
Quality Enhancement in Metadata Repositories Metadata Repositories for Interoperable/Shareable Metadata Part 2.
A New Model for a Multipurpose Union Catalog • Metadata repositories are established on open source principles • Metadata repositories encourage simplicity and productivity by inviting participating units and authors to contribute based on their own practices
Metadata Repositories: Loosely-controlled Environments • Each discrete metadata dataset retains its independent identity • can be used without the support of the union catalog • exists in two places • records may not be identical
… a dataset exists in two places; the records may not be identical. [Diagram labels: the original, local dataset; the dataset merged to the repository]
Metadata Repositories: Loosely-controlled Environments • Each discrete metadata dataset retains its independent identity • Multiple standards were applied
Europeana Heterogeneity: records grouped by metadata format Source: Concordia: Integration of Heterogeneous Metadata in Europeana http://dublincore.org/groups/tools/docs/LIDA09WorkshopC_1.pdf
Metadata Repositories: Loosely-controlled Environments • Each discrete metadata dataset retains its independent identity • Multiple standards were applied • Voluntary metadata sharing • no central authority control or management • Records were created by trained and untrained metadata authors
Does Metadata Quality Matter? • Our purpose is to improve information access and the use of open resources. • Low quality metadata can fail to provide access to relevant resources. • Poor searching and browsing can result in a negative impact on the cost of searching, in both time and money. • This may result in users having negative perceptions of the open repository.
Looking at quality issues • 1. Duplication problems • 2. Issues between system’s functions and the supporting data • 3. The causes of missing information • 4. Solving the problems
1. Duplication problems Different collections describing the same source:
1. Duplication problems Summary: Duplicates • can be identical records, • can describe the same source, • with different metadata, or • with links to different or slightly different location identifiers • (e.g. index page vs. splash page).
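A minimal sketch of how the simplest kind of duplicate (the same source reached through slightly different location identifiers) might be flagged. The record layout (`id`, `identifier`) and the normalization rules are assumptions made for illustration, not the method used in the study; duplicates that describe the same source with different metadata would need further matching, for example on titles.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key (hypothetical rules):
    lowercase host, no scheme, no leading 'www.', no index page,
    no trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    # Treat an index page and its directory as the same location.
    for index_name in ("index.html", "index.htm"):
        if path.lower().endswith("/" + index_name):
            path = path[: -len(index_name)]
    return host + path.rstrip("/")


def find_duplicates(records):
    """Group record ids by normalized location; keep groups larger than one."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_url(rec["identifier"])].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Invented sample records from two hypothetical collections.
sample = [
    {"id": "a", "identifier": "http://www.example.org/mars/index.html"},
    {"id": "b", "identifier": "http://example.org/mars/"},
    {"id": "c", "identifier": "http://example.org/moon"},
]
duplicates = find_duplicates(sample)
```

Records `a` and `b` end up in one group because their locations differ only by the `www.` prefix and the index page. A splash page at an unrelated path would not be caught by this key alone.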
2. Issues between system’s functions and the supporting data A wish list of Functions We would like users to be able to: Search records by: Title, Author name, Keyword, Type of document, Publication, Conference name, and Year Browse records by: Year, Department, Classification, Object type, Subject matter, etc. View latest additions to the archive We would like to be able to: Link together records from the same Conference, Publication Filter by: Year and Language -- Based on Guy, Powell, and Day. "Improving the Quality of Metadata in Eprint Archives." Ariadne, no. 38. 2004.
2. Issues between system’s functions <==> the supporting data Let's search by Format, find Interactive Resource Does the data really support these filtered searches? Will the designed functions be supported by the data?
Will the metadata support these functions? This was an earlier interface (2003-5). 11/50 collections did not have FORMAT information.
Advanced Search: Search by Format • Collections which did not provide FORMAT data are excluded from being searched. [Figure labels: What is actually searched. What should be searched. Image credit: Luke-Chueh*]
Advanced Search: Search by Education level • Collections which did not provide EDUCATION LEVEL information are excluded from being searched. [Figure labels: What is actually searched. What should be searched. Image credit: Luke-Chueh*] Note: education levels were added to current displays.
3. The causes of missing information (a) "There is no such element" Your metadata element set (or data dictionary) might not include this element. e.g., Your data does not include an element to hold FORMAT information.
3. The causes of missing information (b) "It is not a required element" Although the metadata element set we used does include this element … -- there is no authority or guideline that enforces the use of this element; -- therefore, all or some of the records do not include this particular group of data.
Mars Exploration duplicate records Record #1 No date. No format, type, grade level, language, rights.
Record #2 No date. Format missed images. No grade level, rights.
Record #3 Recorded grade level, format, type + No date, language, rights.
The causes of missing information (c) "Whoops, a wrong place" Examples: • In some datasets, the information about FORMAT is put under a wrong element, e.g. TYPE, so that the values in this field will not be correctly indexed for FORMAT. • In one dataset, the values of SUBJECT were mismatched to the LANGUAGE element. • In quite a few cases, the information about FORMAT is mapped to a wrong place in the repository. [Figure labels: Wrong LANGUAGE values; SUBJECT count = 0]
The causes of missing information (d) Incorrect mapping/conversion The values are lost when the metadata records are converted from one database to another (due to incorrect mapping).
AUTHOR mapped to DESCRIPTION: before … after … no CREATOR: 3357 records!
TYPE values got lost Original Record Converted Record • These records’ TYPE values were not converted to the repository. • These resources are excluded when users search by resource TYPE.
Incorrect element mapping: before … after … OPTIONS mapped to SUBJECT, missing all KEYWORDS. [Figure labels: missed?! missing keywords]
Incorrect values: before … after … If the records are regenerated from the embedded metadata, all of these values can be corrected.
Inappropriate mapping: before … after … CLASSIFICATION was mapped to SUBJECT, and all the KEYWORDS were missed. [Figure label: missed?]
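The mapping failures shown in the slides above (AUTHOR landing in DESCRIPTION, TYPE values dropped) are the kind of problem a small crosswalk test can surface before records are merged. The sketch below is illustrative only: the record layout, the `crosswalk` dictionary, and the `expected` targets are assumptions, not the repository's actual conversion rules.

```python
def apply_crosswalk(record, crosswalk):
    """Convert a source record ({element: [values]}) to target elements.
    Elements with no entry in the crosswalk are dropped, which is how
    values get lost silently."""
    converted = {}
    for element, values in record.items():
        target = crosswalk.get(element)
        if target is None:
            continue
        converted.setdefault(target, []).extend(values)
    return converted


def lost_values(record, crosswalk, expected):
    """List (source element, intended target, value) triples for values
    that did not arrive in the target element named in `expected`."""
    converted = apply_crosswalk(record, crosswalk)
    problems = []
    for element, target in expected.items():
        for value in record.get(element, []):
            if value not in converted.get(target, []):
                problems.append((element, target, value))
    return problems


# Invented source record and a deliberately faulty crosswalk.
source = {"AUTHOR": ["Smith, J."], "TYPE": ["Image"], "KEYWORDS": ["Mars"]}
faulty_crosswalk = {"AUTHOR": "description", "KEYWORDS": "subject"}
expected_targets = {"AUTHOR": "creator", "TYPE": "type", "KEYWORDS": "subject"}

problems = lost_values(source, faulty_crosswalk, expected_targets)
```

Running such a check over a sample of records from each collection would report both misrouted values (AUTHOR) and values that disappear entirely (TYPE).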
The causes of missing information (e) "I did not use a controlled vocabulary" Examples: values associated with FORMAT element found in the research samples: 3.6 megabytes 1000149 bytes language/java Application/JAVA applet Java CLASS Model/VRML AVI, MOV, QTM 1 v. (various pagings) 10 p., [6] p. of plates p.461-470 viii, 82 p. MPEG-4 text text/html text/plain plain digital TIFF image/tiff other application/msword application/Flash (animation) ascii pdf ps
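One way to reduce this variety is to normalize free-text FORMAT values against a controlled list, such as Internet media types. The sketch below assumes a small controlled set and a hand-made alias table; both are hypothetical and would have to be built from the values actually found in a repository. Values that describe extent rather than format (for example "3.6 megabytes") are left unresolved for manual review.

```python
# Hypothetical controlled list: a few registered Internet media types.
CONTROLLED = {
    "text/html",
    "text/plain",
    "image/tiff",
    "application/pdf",
    "application/postscript",
    "application/msword",
}

# Hypothetical alias table mapping informal values to controlled terms.
ALIASES = {
    "pdf": "application/pdf",
    "ps": "application/postscript",
    "plain": "text/plain",
    "ascii": "text/plain",
    "tiff": "image/tiff",
}


def normalize_format(value):
    """Return the controlled term for a FORMAT value, or None if the
    value cannot be resolved and needs manual review."""
    key = value.strip().lower()
    if key in CONTROLLED:
        return key
    return ALIASES.get(key)


raw_values = ["pdf", " image/tiff ", "ascii", "3.6 megabytes"]
normalized = [normalize_format(v) for v in raw_values]
unresolved = [v for v in raw_values if normalize_format(v) is None]
```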
The causes of missing information (f) "I did not follow any rule" Examples from values associated with COVERAGE element (from one collection): • California • USA, California, Stanford • Kobe and Awaji-shima, Japan, • Lake Cumberland, Kentucky • Pennsylvania, Johnstown • Pittsburgh, Pennsylvania • American Midwest (from various collections): • California (United States) • Assateague Island, Virginia and Maryland, United States • Hot Spring National Park, Arkansas, United States • a-ii--- a pk--- • In----- • n-us-ga • Everglades (FL) • China (People Republic of China) • New York (NY) • New York (State)
4. Solving the problems 4.1 Enriching and enhancing harvested records Aggregation [Diagram labels: record; Metadata Repository; enriched record]
Aggregation Hillmann et al. (2005) identified four categories of problems that limit metadata usefulness: • Missing data: elements not present • Incorrect data: values not conforming to proper usage • Confusing data: embedded HTML tags, improper separation of multiple elements, etc. • Insufficient data: no indication of controlled vocabularies, formats, etc.
Aggregation These problems can be reduced to a certain extent through a process called ‘aggregation’ in a metadata repository. The notion behind this process is that a metadata record, “a series of statements about resources,” can be aggregated to build a more complete profile of a resource.
Completeness measurement • Questions asked: • How many records did not provide RIGHTS information? • How many collections did not provide RIGHTS information? • How many collections have no or have <1% records that provided RIGHTS information? • Why? Count of element occurrence in the 50 collections, 180479 records after normalization. Source: Zeng and Shreve, NSF report, 2007
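The questions above can be answered with a simple count of element occurrence per collection. The following is a sketch under assumed names: each record is a dictionary with a `collection` key and optional element keys such as `rights`; the sample data are invented and do not reproduce the figures from the report.

```python
from collections import Counter


def element_coverage(records, element):
    """Return {collection: fraction of its records that carry a
    non-empty value for `element`}."""
    totals = Counter()
    present = Counter()
    for rec in records:
        coll = rec["collection"]
        totals[coll] += 1
        value = rec.get(element)
        if value is not None and str(value).strip():
            present[coll] += 1
    return {coll: present[coll] / totals[coll] for coll in totals}


def collections_below(records, element, threshold=0.01):
    """Collections where fewer than `threshold` of the records
    provide the element (default: under 1%)."""
    coverage = element_coverage(records, element)
    return sorted(c for c, frac in coverage.items() if frac < threshold)


# Invented sample: collection A supplies RIGHTS in half of its records,
# collection B never does (a whitespace-only value counts as missing).
sample = [
    {"collection": "A", "rights": "Public domain"},
    {"collection": "A"},
    {"collection": "B", "rights": "  "},
    {"collection": "B"},
]
coverage = element_coverage(sample, "rights")
weak = collections_below(sample, "rights")
```

The same function can be run for every element in the set to build an occurrence table per collection.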
Sources, storage, and redistribution of augmented metadata in the metadata registry. Source: Hillmann, et al. 2005
4. Solving the problems 4.2 Correcting the errors • Checking and testing crosswalks!!! • Re-harvesting • Training on how to use OAI tools • Enforcing the collaboration between the content team and the IT team
4. Solving the problems 4.3 Implement Measurement Metrics: CCCD • Completeness • minimum level • core elements • Correctness • content-info • format • input • browser interpretation • mapping/integration • redundancy • Consistency • Data Recording • Source Links • Identification and Identifiers • Description of Sources • Metadata Representation (in the search results display) • Data Syntax • Duplication • intra-collection • inter-collection Zeng, 2004. Zeng and Shreve. 2007
Can any quality measurement be done automatically? Manual Machine-assisted • (a) completeness x • (b) correctness X x • (c) consistency x X • (d) duplication x X Zeng and Shreve. 2007
4. Solving the problems 4.4 Ensuring the Quality Data providers: • Use well-established standard metadata element sets • Implement authority for some data values • Follow best practices guides • Use templates for inputting records, with suggested syntax, vocabularies, and built-in values • Output with standardized encoded data Repository Host: • Checking and testing crosswalks!!! • Training • minimum quality requirements, • quality measurement instruments, • quality enforcement policies, • quality enhancement actions, and • the training of metadata creators. • Enforcing the collaboration between the content team and the IT team
4. Solving the problems • 4.5 Implementing Guidelines • See: LODE-BD
References • Hillmann, Diane I., Naomi Dushay, and Jon Phipps. (2005). Improving Metadata Quality: Augmentation and Recombination. DC-2004 International Conference on Dublin Core and Metadata Applications, 11-14 October 2004, Shanghai, China. http://students.washington.edu/jtennis/dcconf/Paper_21.pdf • Zeng, Marcia Lei and Gregory Shreve. (2007). Quality Analysis of Metadata Records in the NSDL Metadata Repository. NSF Award Number DUE #0333572. A research report submitted to the National Science Foundation, 2007-02-28. 73 pages. • Best Practices for Shareable Metadata. (2005--). http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/ShareableMetadataPublic • Carpenter, L. (2003). OAI for beginners: overview. http://www.oaforum.org/tutorial/english/page1.htm • Arms, W.Y., Dushay, N., Fulker, D. & Lagoze, C. (2003). A case study in metadata harvesting: the NSDL. Library Hi Tech, 21(2). http://www.cs.cornell.edu/lagoze/papers/Arms-et-al-LibraryHiTech.pdf