Accommodating Diverse Search Requirements over a Fedora Repository
Michael Durbin and Jon W. Dunn
Fedora User Group – Open Repositories 2008
April 3, 2008

Fedora Users Group - Open Repositories 2008

Background
• Indiana University Digital Library Program
  • Started in 1997
• Diversity of formats and collections
  • Text, image, musical scores, audio, video, …
• Diversity of search systems
  • DLXS, XTF, Lucene, DB2 NSE, Oracle Text
• Current project to unify architecture for storage, discovery, and delivery around Fedora

Search System Development
• Phase one: create a search architecture and template for an image-based search and discovery application
• Phase two: extend the template and architecture to support more advanced search and discovery applications over different object types

PHASE I: CREATING A BASIC IMAGE SEARCH

Phase One: Simple Image Search
• Slocum puzzle collection: ideal test case
  • Small number of objects
  • Simple content model
  • Each object represents a single physical puzzle
  • Basic metadata: METS, MODS, DC
  • RELS-EXT isMemberOf relationship with a collection object
  • Pre-scaled derivative images

Requirements: Identifier Resolution
• External identifiers rather than Fedora PIDs
  • Seamless migration to Fedora
  • No commitment to any underlying repository architecture
• Requirement: quickly resolve our identifier (PURL) to the Fedora PID

Requirements: PURL Identifier Resolution
[Diagram: the PURL http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696 passes through the OCLC PURL Resolver and a hypothetical ID resolution service to reach http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:19794/THUMBNAIL]

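The resolution step can be sketched as a lookup from the external identifier to a Fedora PID, followed by construction of a Fedora datastream URL. This is a minimal illustration in Python, not the service the authors built: the in-memory table, the function name, and the rule that the view segment of the PURL becomes the upper-cased datastream ID are all assumptions inferred from the single example pair of URLs on the slide.

```python
from urllib.parse import urlparse

# Hypothetical lookup table standing in for the ID resolution service.
# Its single entry pairs the identifier and PID from the slide's example.
ID_TO_PID = {"LL-SLO-004696": "iudl:19794"}

FEDORA_BASE = "http://fedora.dlib.indiana.edu:8080/fedora"


def resolve_purl(purl: str) -> str:
    """Map a PURL ending in /<view>/<identifier> to a Fedora datastream URL."""
    parts = urlparse(purl).path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"unexpected PURL layout: {purl}")
    view, identifier = parts[-2], parts[-1]
    pid = ID_TO_PID[identifier]  # raises KeyError for unknown identifiers
    # Assumption: the view name maps to a datastream ID by upper-casing it.
    return f"{FEDORA_BASE}/get/{pid}/{view.upper()}"


print(resolve_purl(
    "http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696"
))
```

In a real deployment the dictionary would be replaced by a query against whatever index holds the identifier-to-PID mapping, which is why fast lookup is listed as a requirement.
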
Requirements: Keyword and Fielded Search
• Very basic search requirements for any discovery and delivery web application
  • Keyword search should maximize discovery
  • MODS fields should be searchable to maximize accuracy of matches
  • Search results paging
  • Support for simple Boolean operators
  • Wildcard searches are a requirement
  • Full metadata record (MODS) returned

Remaining Requirements
• User interface
  • Extensible, reusable, customizable
• Service-oriented approach
  • Centralize core search system
  • Standards-based access for integration with other services and end-user tools

Requirements: Search System
[Diagram labels: UI Layer; Search Layer; Slocum Webapp; PURL Resolution; Fielded Search; Fedora Integration; Generic Search Webapp]

Solutions: Search Protocol
• Search and Retrieve via URL (SRU)
  • One of very few standard search protocols
  • Extremely powerful and flexible query language (CQL)
  • Can return records of any type
    • Most commonly used with DC, MODS, MARCXML
  • Has mechanisms for extension in case special needs arise

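To make the protocol concrete, here is a rough sketch of how a client might build an SRU searchRetrieve request that covers the earlier requirements (fielded search, Boolean operators, wildcards, paging, MODS records). The request parameters are the standard SRU 1.1 ones; the endpoint URL, the index names in the query, and the "mods" schema short name are assumptions, since the talk does not give the actual configuration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the talk does not give the real SRU base URL.
SRU_BASE = "http://example.org/sru/slocum"


def build_sru_url(cql_query, page=1, page_size=10, record_schema="mods"):
    """Build an SRU 1.1 searchRetrieve URL for one page of results."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        # SRU record positions are 1-based.
        "startRecord": (page - 1) * page_size + 1,
        "maximumRecords": page_size,
        # Assumed short name; each server defines its own schema names.
        "recordSchema": record_schema,
    }
    return f"{SRU_BASE}?{urlencode(params)}"


# Fielded search with a Boolean operator and a wildcard (index names assumed).
query = 'dc.title = "puzzle*" and dc.subject any "wood metal"'
print(build_sru_url(query, page=2))
```

Because the whole request is a URL, the same query can be issued by a web application, a federated search service, or a browser, which is the integration benefit the slide points to.
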
Search System Solutions: SRU
[Diagram labels: UI Layer; Search Layer; Slocum Webapp; PURL Resolution; Fielded Search; SRU; Fedora Integration; Generic Search Webapp; SRU]

Solutions: Existing Products
• Fedora Search
  • Good for finding items based on basic Fedora metadata, but not for more sophisticated searching
• Fedora Resource Index Search
  • Also limited to searching basic metadata, not the content of datastreams

Solutions: Existing Products
• Fedora Generic Search Service (GSearch)
  • Hooks into Fedora
  • Works with Lucene
  • Easy to customize search fields through XSLT transformation of existing metadata
• OCLC SRU/W Implementation
  • Relatively complete implementation in Java, with ongoing development
  • Others have had success using it with Lucene

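The idea of deriving search fields from existing metadata can be illustrated as follows. GSearch does this with XSLT; the sketch below uses Python in its place purely to show the mapping from MODS elements to flat index fields. The field names, the element paths chosen, and the sample record are invented for illustration and are not taken from the Slocum collection or from the authors' stylesheet.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, used only to exercise the mapping.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Wooden burr puzzle</title></titleInfo>
  <subject><topic>Puzzles</topic></subject>
  <subject><topic>Wood</topic></subject>
</mods>"""

# Hypothetical index field names mapped to MODS element paths.
FIELD_PATHS = {
    "title": "mods:titleInfo/mods:title",
    "subject": "mods:subject/mods:topic",
}


def mods_to_fields(mods_xml):
    """Flatten a MODS record into a dict of index field name -> list of values."""
    root = ET.fromstring(mods_xml)
    fields = {
        name: [el.text.strip() for el in root.findall(path, MODS_NS) if el.text]
        for name, path in FIELD_PATHS.items()
    }
    # A catch-all keyword field holding every text node, to maximize discovery.
    fields["keyword"] = [" ".join(t.strip() for t in root.itertext() if t.strip())]
    return fields


print(mods_to_fields(SAMPLE_MODS))
```

Keeping the mapping in a declarative table mirrors the appeal of the XSLT approach: adding a searchable field means adding one entry rather than changing indexing code.
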
Search System
[Diagram labels: OCLC SRU Implementation; Fedora Generic Search Service; SRU; Lucene Database extension; Updates; Reads index]

Phase 1 Solution: General Applicability
• Pieces of this solution have been used for other image collections
• SRU is used to expose these collections to OneSearch@IU, our federated search service
• The XSLT that assigned metadata to Lucene index fields was a solid base for the indexing needs of other collections

Phase 1 Solution: Lingering Problems
• Our XSLT for the Generic Search Service wasn’t perfect
• Some complications prevented full automation
• We punted on getting the perfect Lucene analyzer configuration

PHASE II: EXTENDING FOR DIFFERENT COLLECTIONS

EVIA Digital Archive

Requirement: EVIADA Video Annotation Collection
[Diagram labels: Field Collection; Video Object; Custom Annotation Software; Video Object; Video Object; Field Collection Object]

Requirement: EVIADA Video Annotation Collection
• Complex Data model
  • One Fedora object which is addressable and discoverable in parts
• New features
  • Faceted Search and Browse
  • Extensive custom fields

Requirements: IN Harmony Sheet Music Collection

Requirements: IN Harmony Sheet Music Collection
• Complex Content model
  • Three types of objects below the collection
    • Sheet music
    • Individual Score
    • Page Image
[Image label: Chariot Race March]

Requirements: IN Harmony Sheet Music Collection
• New Features
  • Faceted Search and Browse
  • Exact match searches
  • Date range searches
  • Dozens of very specific fields
  • Sorting by date or title

Options
• Extend our existing implementation
  • All too appealing because of familiarity and “sunk costs”
  • Major conflicts between existing model and desired model could result in unmaintainable “hackish” implementations
• Switch to a new infrastructure
  • Would be great, if something existed that met our needs without having to rework everything
• Some combination
  • Best of both worlds?

Options: Faceted Search and Browse
• Use Solr
  • Built-in support for facets
  • Is a service layer with an XML response
  • But do we really want to abandon SRU, or maintain two search service protocols?

Options: Faceted Search and Browse
• Extend SRU Implementation
  • Avoids the need for yet another service layer
  • Has wide reuse potential
  • Could be backed by Solr without substantially more effort

Solution: Faceted Search over SRU
[Diagram label: SRU Service (now with facet support)]

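Whatever protocol carries them, facets come down to counting the distinct field values across the records that match a query. The sketch below shows only that counting step, under the assumption that each matching record is available as a dict of multi-valued fields; the function name, field names, and sample records are illustrative, and how the authors attached the counts to the SRU response is not described in the talk.

```python
from collections import Counter


def facet_counts(records, facet_fields, limit=10):
    """Count how many matching records carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            # Use a set so a value repeated within one record counts once.
            counts[field].update(set(record.get(field, [])))
    return {field: counter.most_common(limit) for field, counter in counts.items()}


# Invented sample result set.
results = [
    {"genre": ["march"], "decade": ["1900s"]},
    {"genre": ["march", "two-step"], "decade": ["1910s"]},
    {"genre": ["waltz"], "decade": ["1910s"]},
    {"genre": ["march"], "decade": ["1910s"]},
]
print(facet_counts(results, ["genre", "decade"]))
```

A facet-capable engine such as Solr computes these counts inside the index, which is far cheaper than iterating over results as done here.
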
Solution: Other SRU Improvements
• More complete CQL support
  • Easy Improvements
    • Operators (and, or, not, any, all)
    • Application-specific fields

Solutions: Other SRU Improvements
• More complete CQL support
  • Difficult Improvements
    • “cql.exact” relation
    • facet implementation
    • sort support
[Diagram: the query dc.subject exact “United Kingdom” shown with the index fields dc.subject, dc.subject.exact, and dc.subject.sort]

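One common way to support exact matching and sorting on top of a tokenizing index, and the one the field names dc.subject.exact and dc.subject.sort suggest, is to index each value several times in different forms. The following sketch assumes that reading; the tokenization rule, the case-sensitive exact comparison, and the sort-key normalization are guesses rather than the authors' analyzer configuration.

```python
import re


def index_variants(field, value):
    """Return the tokenized, exact, and sort forms of one field value."""
    return {
        field: re.findall(r"\w+", value.lower()),          # tokens for keyword matching
        f"{field}.exact": value,                            # whole value for exact matching
        f"{field}.sort": " ".join(value.lower().split()),   # normalized key for ordering
    }


def matches(doc, field, relation, term):
    """Evaluate one CQL-like clause (relation 'exact' or 'all') against a document."""
    if relation == "exact":
        return doc[f"{field}.exact"] == term
    if relation == "all":
        return all(tok in doc[field] for tok in re.findall(r"\w+", term.lower()))
    raise ValueError(f"unsupported relation: {relation}")


subjects = ("united states", "United Kingdom History", "United Kingdom")
docs = [index_variants("dc.subject", s) for s in subjects]

# dc.subject exact "United Kingdom" hits only the record with that whole value.
exact_hits = [d["dc.subject.exact"] for d in docs
              if matches(d, "dc.subject", "exact", "United Kingdom")]
# dc.subject all "united kingdom" also hits the longer subject heading.
all_hits = [d["dc.subject.exact"] for d in docs
            if matches(d, "dc.subject", "all", "united kingdom")]
# Sorting uses the normalized key, so case differences do not affect order.
sorted_subjects = [d["dc.subject.exact"]
                   for d in sorted(docs, key=lambda d: d["dc.subject.sort"])]
print(exact_hits, all_hits, sorted_subjects)
```

The cost of this approach is a larger index and a query translator that must route each relation to the right field, which may be part of why the slide files it under difficult improvements.
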
Options: Index Generation
[Diagram labels: Fedora Generic Search Service; Homegrown Solution]

Reconsideration: GSearch
• Limited by the one-to-one relationship between Lucene documents and Fedora objects
• Wrapping valid XML in CDATA so that it can be stored in Lucene is messy and prone to error as the metadata becomes more diverse
• We really only use it to generate a Lucene index

Consideration: Solr
• Robust wrapper for Lucene
  • Exposes service to update index
  • Exposes search features as a service
  • Abstracts away many of the complexities of Lucene
• Migrating existing search indexes would be prohibitively time-consuming, but it might be the best tool to bring up new collections

Solution: Custom index service
• A service whose initial functionality is simply to create and maintain Lucene index directories that are served by SRU
• Can easily be extended/configured to use different search engines or to delegate the process entirely (perhaps to Solr)
• Support for existing GSearch-style XSLT
• Simple Java interface to allow for easy index implementations

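The shape of such a service can be sketched as a small writer interface plus a dispatcher that forwards repository changes to every registered writer. The talk describes a Java interface; Python is used here only for brevity, and every class and method name below is hypothetical. The in-memory writer stands in for a writer that would maintain a Lucene directory or delegate to Solr.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical contract each index implementation fulfils."""

    @abstractmethod
    def index_object(self, pid, fields):
        """Add or replace the index entry for a repository object."""

    @abstractmethod
    def remove_object(self, pid):
        """Drop the index entry for a purged repository object."""


class InMemoryIndexWriter(IndexWriter):
    """Stand-in for a Lucene- or Solr-backed writer; keeps selected fields only."""

    def __init__(self, field_names):
        self.field_names = set(field_names)
        self.documents = {}

    def index_object(self, pid, fields):
        self.documents[pid] = {k: v for k, v in fields.items() if k in self.field_names}

    def remove_object(self, pid):
        self.documents.pop(pid, None)


class IndexService:
    """Forwards each repository change to all registered writers."""

    def __init__(self):
        self._writers = {}

    def register(self, name, writer):
        self._writers[name] = writer

    def object_changed(self, pid, fields):
        for writer in self._writers.values():
            writer.index_object(pid, fields)

    def object_purged(self, pid):
        for writer in self._writers.values():
            writer.remove_object(pid)


service = IndexService()
id_index = InMemoryIndexWriter(["identifier"])
search_index = InMemoryIndexWriter(["title", "subject"])
service.register("id-resolution", id_index)
service.register("basic-search", search_index)
service.object_changed("iudl:19794", {"identifier": "LL-SLO-004696", "title": "Sample title"})
print(id_index.documents, search_index.documents)
```

Keeping the writer contract this small is what lets one service feed several differently configured indexes from the same repository update.
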
Search Service
[Diagram: OCLC SRU Implementation and Custom Index Service, with each index writer feeding its own index]
• Basic Index Writer: Lucene database configured for quick ID resolution
• GSearch-style XSLT Index Writer: Lucene database configured for basic search
• New-style XSLT Index Writer: Lucene database configured for advanced search
• Compound Model Java Index Writer: Lucene database configured for compound model searches

Search Service
[Diagram: OCLC SRU Implementation and Custom Index Service, with each index writer feeding its own index, now including Solr]
• Basic Index Writer: Lucene database configured for quick ID resolution
• GSearch-style XSLT Index Writer: Lucene database configured for basic search
• New-style XSLT Index Writer: Lucene database configured for advanced search
• Compound Model Java Index Writer: Lucene database configured for compound model searches
• Solr Wrapping Index: Solr database configured to interface with Solr

Future Plans
• Full Text searching
  • Search text of entire books or journals
  • Determine where in the hierarchy the match occurred
  • Provide snippets with highlighted matches in context for the search results listing
• Solutions
  • XTF, Solr through our custom index service

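As a rough idea of what a snippet with a highlighted match in context involves, the sketch below cuts a window of text around the first occurrence of a term and wraps the match in markers. It is a plain-string illustration with assumed names and markup; tools such as XTF and Solr provide their own highlighting and would normally be used instead.

```python
import re


def make_snippet(text, term, context=30, marker=("<b>", "</b>")):
    """Return the first match of term with surrounding context, or None if absent."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (
        prefix
        + text[start:match.start()]
        + marker[0] + match.group(0) + marker[1]
        + text[match.end():end]
        + suffix
    )


print(make_snippet("The quick brown fox", "quick", context=4))
```
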
Conclusion
• Most of the work is configuring the index, which is a requirement that cannot be avoided
• Migration doesn’t have to be difficult or disruptive
• Always be willing and able to consider new products and technologies

Thanks! Any Questions?
• www.dlib.indiana.edu
• wiki.dlib.indiana.edu/confluence/x/AQI
• midurbin@indiana.edu
• jwd@indiana.edu