10 likes | 128 Views
Clifford B. Anderson Curator of Special Collections Princeton Theological Seminary Library Princeton, NJ (USA). Improving Metadata Quality: An Iterative Approach Using Scrum. 4. Conclusion. The digital team approaches the problem of quality control in four stages:
E N D
Clifford B. AndersonCurator of Special CollectionsPrinceton Theological Seminary Library Princeton, NJ (USA) Improving Metadata Quality: An Iterative Approach Using Scrum 4 Conclusion • The digital team approaches theproblem of quality control in fourstages: • Identifying metadataproblems and lacunae • Prioritizing the most importantstories and sending the rest tothe product backlog • Improving the product by implementing the metadata story during a single sprint, if possible • Assessing the outcome and continuing to identify new metadata problems and lacunae. • An iterative approach allows us to regularly improve metadata quality while continuously reevaluating our metadata priorities in response to stakeholders’ needs. 3 Results • Here are some lessons we’ve learned as a team: • The deeper you dig, the more you will find to improve. The quality improvement process is infinite. • Aim shallow rather than deep so that you cover all documents in a single sprint. • Keep a sharp focus—scope creep also affects metadata cleanup. • When merging metadata from different sources, watch out for inconsistencies (even when all validate against the same schema). • Don’t be parsimonious—if you think you might need some external metadata later, just build it into your document with a different namespace. • Visualizing metadata problems can help you to understand them more intuitively (see Fig. 3) • Aim to increase the percentage of computational approaches (hurray for XQuery!) with every sprint. • Set user expectations with respect to metadata flaws (see Fig. 1) 1 Background Fig. 3: An iterative cycle A primary challenge when developing a large scale digital library is balancing metadata quantity and quality. The Theological Commons is a digital library (built on a MarkLogic Server) with ~50,000 books (or ~16,000,000 pages). Since its release in March 2012, we have been iteratively improving its metadata quality using the Scrum process. Fig. 1: Metadata for a search result: example of framing user expectations 2 Methods Fig. 4: A scatter plot showing the est. distribution of errors in a book Scrum is a form of agile project management, which works in fixed iterations (“sprints”) to develop projects and features. As problems in metadata are identified, team members add them to the product backlog. The problems are described in story form—i.e. from the perspective of end users rather than technologists. The product owner takes responsibility for ordering these stories from most to least significant. Generally, our team tackles metadata problems using some combination of computational methods and hand editing, with a preference for the former (for obvious reasons). We seek to pass through all the data during a single sprint. At the end of the sprint, we review our work with the entire library staff. It’s not good enough to say that we’ve improved the metadata quality. We aim to demonstrate a new feature that exemplifies the improved quality of the metadata. 5 Questions • Here are some questions our team is thinking about: • Should we develop a test suite for our metadata to flag obvious errors (beyond validation)? • Are there reliable natural language processing toolkits to assist with improving automatically generated metadata (i.e. OCR)? • How can we best frame user expectations when dealing with metadata deficiencies? Thanks to the members of our digital team: Cortney Frank, Greg Murray, Donna Quick, and Chris Schwartz Fig. 3: The Theological Commons site (http://commons.ptsem.edu/)