
A Collections Searching Center Using Lucene – Solr

Ching-hsien Wang, Smithsonian Institution, Collections.si.edu, wangch@si.edu. Background Information: Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge.


Presentation Transcript


  1. A Collections Searching Center Using Lucene – Solr Ching-hsien Wang Smithsonian Institution Collections.si.edu wangch@si.edu

  2. Background Information • Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge, • 19 museums and 9 research institutes, • 136 million collection objects, • 12 major museum collection information systems (with 30 databases), • Hundreds of other databases.

  3. Issues we faced Users want information now! • Google Effect and user’s mentality: “if it is not online, it does not exist.” • Users want immediate access to digital documents. • Separate databases are confusing to the public. We must act now!

  4. Smithsonian’s Collection Searching Center Overview • a discovery center for information with a single searching point • faceted searching and content-sensitive navigation • positive and negative browse & select options • relevancy ranking of search results • automatic stemming for word matching
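The positive and negative select options listed on this slide map naturally onto Solr filter queries. The following is a minimal Python sketch, assuming each facet selection is sent as an `fq` parameter; the helper name `build_filter_queries` and the sample values are illustrative and are not taken from the Smithsonian implementation.

```python
def build_filter_queries(include, exclude):
    """Turn facet selections into Solr fq strings.

    include / exclude are lists of (field, value) pairs. A leading "-"
    negates a clause in Lucene query syntax, which gives the
    "negative select" behaviour.
    """
    def quote(value):
        # Escape backslashes and double quotes so the value is a valid phrase.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    filters = [f"{field}:{quote(value)}" for field, value in include]
    filters += [f"-{field}:{quote(value)}" for field, value in exclude]
    return filters


filters = build_filter_queries(
    include=[("object_type", "Photographs")],
    exclude=[("place", "South Africa")],
)
print(filters)
```

Each string in the result would be passed as its own `fq` parameter, so every selection narrows the result set independently of the keyword query.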

  5. Smithsonian’s Cross Searching Catalog Overview (continued) • integrated searching of data from multiple types of databases • scalability for large data sets • a metadata center which interacts with other online applications

  6. Project Team and Resources • Andrew Gunther – Software development and implementation • Jim Felley – Data conversion and implementation • George Bowman – Database management and security configuration • Randy Arnold – Project support • Ching-hsien Wang – Program Manager. Since August 2007, we have integrated data from 12 major databases with 2 million records.

  7. Starting from Multiple databases

  8. Transform into a single Search Center

  9. Cross Searching Demo – simple opening screen

  10. Demo – search result screen

  11. Demo – search history

  12. Process Flow Diagram [figure; recoverable labels] • Sources: Horizon (three instances), Digital Library, Digital Archives, Digital Museum • Data Extract and Transformation • XML documents • Solr • Lucene Index • Cross Searching Catalog • Output data in XML, in JSON, in Python • Open Access Applications: Virtual Museum in 2nd Life, Online Exhibition, Education Interface

  13. Automated Process: XML Data Transformation [figure; recoverable labels] • Triggers: Library, Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Smithsonian History, Research Bibliographies, Airplane Directory, ……. • Solr_Index_Pending DB table • A Perl program converts records based on BIB# • XML Documents • Horizon Archives
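The trigger-and-pending-table flow on this slide can be sketched in a few lines. The example below uses Python with an in-memory SQLite database purely for illustration; the slide itself describes database triggers feeding a Solr_Index_Pending table and a Perl converter. The table layout, column names, and trigger definition are assumptions, and the record values are sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table standing in for one catalog (e.g. Library).
    CREATE TABLE library_bib (bib_no INTEGER PRIMARY KEY, title TEXT);
    -- Queue of records waiting to be converted to XML and re-indexed.
    CREATE TABLE solr_index_pending (source TEXT, bib_no INTEGER);

    CREATE TRIGGER library_trigger AFTER UPDATE ON library_bib
    BEGIN
        INSERT INTO solr_index_pending VALUES ('library', NEW.bib_no);
    END;
""")

conn.execute("INSERT INTO library_bib VALUES (905285, 'Story of West Point')")
conn.execute(
    "UPDATE library_bib SET title = 'Story of West Point: 1802-1943' "
    "WHERE bib_no = 905285"
)

# The converter drains the queue; here it only reports what it would convert.
pending = conn.execute("SELECT source, bib_no FROM solr_index_pending").fetchall()
conn.execute("DELETE FROM solr_index_pending")
print(pending)
```

Queuing only the changed record numbers means the converter touches just the records that need re-indexing rather than re-exporting a whole database.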

  14. Define an Index Metadata Model: Free text data fields used for keyword searching & display • Record Link • Title/Object-name • Identifier • Physical Description • Gallery Label • Notes • Publisher • Object Type • Taxonomic Name • Language • Topic • Place • Date • Name • Culture • Set Name • Data Source • Credit Line • Online Media Group

  15. Facet data fields used for browsing and limiting • Taxon-Kingdom • Taxon-Phylum • Taxon-Division • Taxon-Class • Taxon-Order • Taxon-Family • Taxon-Sub-Family • Scientific_name • Common name • Geo-age-Era • Geo-Age-System • Geo-Age-Series • Geo-Age-Stage • Strat-Group • Strat-Formation • Strat-Member • Record ID • Object Type • Language • Topic • Place • Date • Name • Culture • Data Source • Online Media Type • Rights for Online Media File • Related Record • Usage Flag

  16. Getting help from Solr (diagram labels: Solr, Lucene Index) • Task-specific handlers: request handler, response handler, update handler • Schema.xml file defines fields to be indexed, displayed, and searchable. • Solrconfig.xml file defines cache size, faceted field type, request handler customization.

  17. Solrconfig.xml Example facet field definition • <str name="facet.field">object_type</str> • <str name="facet.field">language</str> • <str name="facet.field">topic</str> • <str name="facet.field">place</str> • <str name="facet.field">date</str> • <str name="facet.field">name</str> • <str name="facet.field">culture</str> • <str name="facet.field">online_media_type</str> • <str name="facet.field">set_name</str> • <str name="facet.field">data_source</str> • <str name="facet.field">tax_kingdom</str> • <str name="facet.field">tax_phylum</str> • <str name="facet.field">tax_division</str> • <str name="facet.field">tax_class</str> • <str name="facet.field">tax_order</str> • <str name="facet.field">tax_family</str> • <str name="facet.field">tax_sub-family</str> • <str name="facet.field">common_name</str> • <str name="facet.field">scientific_name</str> • <str name="facet.field">freetext</str> • <str name="facet.field">text</str> • </lst> • </requestHandler>
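The same `facet.field` parameters can also be supplied per request rather than as handler defaults. The following Python sketch builds such a request URL; the endpoint `http://localhost:8983/solr/select` and the function name `build_facet_query` are assumptions for illustration, while `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; replace with a real host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_query(keywords, facet_fields, rows=10):
    """Return a Solr select URL that asks for facet counts per field."""
    params = [
        ("q", keywords),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field is repeated once per field, as in the solrconfig.xml list.
    params += [("facet.field", field) for field in facet_fields]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query("west point", ["object_type", "topic", "place"])
print(url)
```

Declaring the fields in solrconfig.xml, as the slide does, keeps client requests short; passing them per request is useful when different screens need different facets.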

  18. Data Example (abbreviated) – a Library Book <doc boost="1"> <descriptiveNonRepeating> <record_ID>siris_sil_905285</record_ID> <unit_code>SIL</unit_code> <data_source>Smithsonian Institution Libraries</data_source> <title_sort>STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE</title_sort> <title label="Title">Story of West Point: 1802-1943; the West Point tradition in American life</title> </descriptiveNonRepeating> <descriptiveOptional> <freetext category="dataSource" label="Data Source">Smithsonian Institution Libraries</freetext> <freetext category="objectType" label="Type">Books</freetext> <freetext category="date" label="Date">1943</freetext> </descriptiveOptional> <indexedStructured> <object_type>Books</object_type> <date>1943</date> </indexedStructured> </doc>

  19. Data Example (abbreviated) – a Photograph <doc boost="6.4"> <descriptiveNonRepeating> <record_ID>siris_arc_104765</record_ID> <unit_code>EEPA</unit_code> <data_source>Eliot Elisofon Photographic Archives</data_source> <title_sort>AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE</title_sort> <title label="Title">Aerial view of downtown Johannesburg, South Africa, [slide]</title> <online_media mediaCount="1"> <media thumbnail="http://sirismm.si.edu/eepa/eepthb/eepa_05859thb.jpg" type="Images">http://sirismm.si.edu/eepa/eep/eepa_05859.jpg</media> </online_media> </descriptiveNonRepeating> <descriptiveOptional> <freetext category="dataSource" label="Data Source">Eliot Elisofon Photographic Archives</freetext> <freetext category="identifier" label="Local number">EEPA EECL 15973</freetext> <freetext label="photographer" category="name">Elisofon, Eliot</freetext> <freetext category="physicalDescription" label="Physical description">slide : col</freetext> <freetext category="notes" label="Summary">This photograph was taken when Eliot Elisofon was on assignment for Life magazine and traveled to Africa from August 18, 1959 to December 20, 1959</freetext> <freetext category="objectType" label="Type">Photographs</freetext> <freetext category="topic" label="Topic">Mod. architecture/cityscape</freetext> <freetext category="place" label="Place">South Africa</freetext> <freetext category="date" label="Date">1959</freetext> <freetext category="setName" label="See more items in">Eliot Elisofon Field photographs 1942-1972</freetext> </descriptiveOptional> <indexedStructured> <name>Elisofon, Eliot</name> <object_type>Color slides</object_type> <object_type>Photographs</object_type> <object_type>Archival materials</object_type> <topic>Mod. architecture/cityscape</topic> <topic>Cultural landscapes</topic> <topic>Aerial photography</topic> <place>Africa</place> <place>South Africa</place> <date>1959</date> <online_media_type>Images</online_media_type> </indexedStructured> </doc>

  20. Data Example (abbreviated) – a sculpture <doc boost="6.4"> <descriptiveNonRepeating> <record_ID>siris_ari_7985</record_ID> <unit_code>ARI</unit_code> <data_source>Art Inventories</data_source> <title_sort>DREXEL MONUMENT SCULPTURE</title_sort> <title label="Title">The Drexel Monument, (sculpture)</title> <record_link>http://siris-artinventories.si.edu/ipac20/ipac.jsp?&profile=all&source=~!siartinventories&uri=full=3100001~!7985~!0#focus</record_link> <online_media mediaCount="7"> <media thumbnail="http://sirismm.si.edu/saam/scan3thb/S75004286_1bthb.jpg" type="Images">http://americanart.si.edu/images/1966/1966.47.36_1b.jpg</media> </online_media> </descriptiveNonRepeating> <descriptiveOptional> <freetext category="dataSource" label="Data Source">Art Inventories</freetext> <freetext category="identifier" label="Control number">IAS 75004286</freetext> <freetext label="sculptor" category="name">Manger, Heinrich b. 1833</freetext> <freetext label="founder" category="name">Chas. F. Heaton</freetext> <freetext category="title" label="title">Francis M. Drexel Monument, (sculpture)</freetext> <freetext category="physicalDescription" label="Physical description">metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete</freetext> <freetext category="notes" label="Description">Index of American Sculpture, University of Delaware, 1985</freetext> <freetext category="objectType" label="Type">Sculptures-Fountain</freetext> <freetext category="name" label="Subject">Drexel, Francis M</freetext> <freetext category="place" label="Place">Illinois</freetext> <freetext category="date" label="Date">1881. Cast 1882. Dedicated 1883</freetext> </descriptiveOptional> <indexedStructured> <name>Manger, Heinrich</name> <name>Chas. F. Heaton</name> <object_type>Sculptures</object_type> <topic>Portrait male</topic> <name>Drexel, Francis M</name> <place>Illinois</place> <date>1880s</date> <online_media_type>Images</online_media_type> </indexedStructured> </doc>
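The three-part record layout in these examples keeps display text (`descriptiveNonRepeating`, `descriptiveOptional`) separate from normalized facet terms (`indexedStructured`). A small Python sketch of reading the facet terms out of such a record follows; the record is a further-abbreviated copy of the photograph example, and the function name `facet_values` is illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated record based on the photograph example.
RECORD = """
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_arc_104765</record_ID>
    <unit_code>EEPA</unit_code>
  </descriptiveNonRepeating>
  <indexedStructured>
    <name>Elisofon, Eliot</name>
    <object_type>Color slides</object_type>
    <object_type>Photographs</object_type>
    <place>Africa</place>
    <place>South Africa</place>
    <date>1959</date>
  </indexedStructured>
</doc>
"""


def facet_values(record_xml):
    """Collect repeatable facet terms from <indexedStructured> by field name."""
    doc = ET.fromstring(record_xml)
    facets = {}
    for element in doc.find("indexedStructured"):
        facets.setdefault(element.tag, []).append(element.text.strip())
    return facets


facets = facet_values(RECORD)
print(facets["place"])
```

Because the facet fields repeat, a record can appear under both a broad term (Africa) and a narrow one (South Africa) without any change to the display text.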

  21. A system is only as good as the data that is in it.

  22. Data mapping for multiple databases (truncated)

  23. Faceted Categories • Determine the most useful facets; more is not better. • Number of unique facets will affect system response time • Smithsonian has 4.6 million unique terms. Among them: • 864,000 names, • 126,000 topics, • 47,000 places, • 139 dates (down from 40,000 before cleanup), • 1,000 types (down from 2,000 before cleanup)

  24. Build the facet terms • MARC field: 650 $a Art $z Africa, North $v Periodicals. • Resulting facet terms: <topic> Art </topic> <place> Africa, North </place> <object_type> Periodicals </object_type>
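The subfield-to-facet mapping shown on this slide can be sketched as a simple lookup. The Python below is an illustration only (the slides describe a Perl converter); the mapping table covers just the three subfields in the example, and the function name `facets_from_650` is hypothetical.

```python
import re

# Assumed mapping from MARC 650 subfield codes to facet fields, per the slide.
SUBFIELD_TO_FACET = {"a": "topic", "z": "place", "v": "object_type"}


def facets_from_650(field_text):
    """Split a 650 field written as '$a ... $z ...' into (facet, term) pairs."""
    pairs = []
    for code, value in re.findall(r"\$(\w)\s*([^$]*)", field_text):
        facet = SUBFIELD_TO_FACET.get(code)
        if facet:
            # Drop surrounding whitespace and the trailing full stop.
            pairs.append((facet, value.strip().rstrip(".").strip()))
    return pairs


print(facets_from_650("$a Art $z Africa, North $v Periodicals."))
```

Subfields without an entry in the table are skipped, so the mapping can be extended one code at a time as more facets are agreed on.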

  25. Build the facet terms 655 $a Photographs $y 1840-1860. <type> Photographs </type> <date> 1840s </date> <date> 1850s </date> <date> 1860s </date>
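Expanding a date range into decade terms, as in the 1840-1860 example, is one way the number of unique date facets can be kept small. A minimal Python sketch, assuming dates arrive as a single year or as `YYYY-YYYY`; the function name `decades` is illustrative.

```python
def decades(date_range):
    """Expand 'YYYY' or 'YYYY-YYYY' into decade labels such as '1840s'."""
    parts = date_range.strip().rstrip(".").split("-")
    start, end = int(parts[0]), int(parts[-1])
    first = start - start % 10  # round down to the start of the decade
    last = end - end % 10
    return [f"{year}s" for year in range(first, last + 1, 10)]


print(decades("1840-1860."))
```

A single year such as `1881` yields one decade term, which matches the `1880s` facet in the sculpture record.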

  26. Challenges • Adapting LCSH and AAT terms in a whole new way • Still seeking a good way to use See and See Also reference data • Reducing data inconsistency in our records for better quality facet terms • Character conversion challenge with MARC-8, Unicode, and UTF-8

  27. Future plans • Continue to add data from more digital library databases and museum collection databases • Working on National History museum, and American Indian museum. • Complete the implementation of the capability to interact with external applications • Plan to support “American Art and Artist” application • Add new functionality such as my-list, list-sharing, social tagging. • Support more visual displays such as Google map and time slider

  28. A Collections Searching Center Using Lucene – Solr Ching-hsien Wang Smithsonian Institution www.siris.si.edu wangch@si.edu
