A Mini Experiment Win Shih, John Pardavila, Krishna Rayavaram University at Albany, SUNY LiSUG Conference, October 12, 2009 2009 LiSUG Conference
Project Overview • Scope • Resources • Time
Content Acquisition • Crawls 220 file types • File system crawling • Direct connection to databases • Content feed API
Query Processing • Google Algorithm • Keymatch • Self-learning spell checker • Suggested Queries • Language support • Google Stemming
Results Display • Google Standard • Templates/Wizard • Output in XML
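Because results can be output as XML, a front end can read them without the wizard templates. The sketch below is a minimal illustration, not the presenters' code: the sample document, its URLs, and the element names (GSP, RES, R, U, T, S) are assumptions based on the appliance's documented result format and should be checked against the XML reference for the installed version.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the Mini's XML output. The host, file
# names, and snippets are made up for illustration; the element names
# (GSP, RES, R, U, T, S) are assumptions to verify against the XML reference.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>harvey milk</Q>
  <RES SN="1" EN="2">
    <R N="1">
      <U>http://mini.example.edu/asp/1978_11_28.pdf</U>
      <T>Albany Student Press</T>
      <S>... first snippet ...</S>
    </R>
    <R N="2">
      <U>http://mini.example.edu/asp/1978_12_01.pdf</U>
      <T>Albany Student Press</T>
      <S>... second snippet ...</S>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return one dict per <R> element: rank, URL, title, and snippet."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "rank": int(r.get("N", "0")),
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results


if __name__ == "__main__":
    for hit in parse_results(SAMPLE_XML):
        print(hit["rank"], hit["url"])
```

A front end written in another language, such as PHP in Drupal, would follow the same pattern: fetch the XML, walk the result elements, and render its own markup.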
Sample Mini Libraries • Denver Public Library (http://denverlibrary.org/) • University of Colorado Health Sciences Library (http://hsclibrary.uchsc.edu/) • New York State Archives (http://www.archives.nysed.gov/aindex.shtml) • Combined Arms Research Library (http://cgsc.leavenworth.army.mil/carl/)
Albany Student Press • 1919-2009: Celebrating 90 years of service • Size: 2,288 PDF files • Coverage: 1916 - 1985
Google Mini Features • Crawl URLs • Collections • Front Ends: Key Matching, Related Queries, Result Biasing • Status and Reports: Search Reports, Logs, and Events • Server Administration: Networks, Accounts, Notifications, SSH, more
Technologies Used • XML, XSLT, and XSL for the interface. • You do not need a coder to generate results (wizard vs. coding). • The library front ends pointing to the Google Mini are developed using combinations of PHP, JavaScript, XHTML, and CSS in the Drupal content management system.
Mini ASP Search • Demonstrating the Search • Strings (Harvey Milk) • By Date (Specific Date, Month & Year)
Lessons Learned • Quality of OCR scan: incorrect character recognition reduces the accuracy of search results, and documents that were never OCRed cannot be searched at all, because the Google Mini cannot index image-only PDFs. • Metadata (search engine optimization): metadata is a good mechanism for improving the visibility of a posted web page in search engine results, and it can enhance the search ranking and results of PDF files. None of the PDFs contained metadata, so metadata was added to the Title, Subject, and Keywords attributes.
Lessons Learned • File naming convention: the Google Mini does index file names. ASP files are named in the format yyyy_mm_dd. • Granularity: PDF files at the issue level, rather than the article level, are not granular enough, and this affects the search experience. In a keyword search, the search terms can appear in several articles within the same issue, yet the issue appears as only one entry in the Google search result listing. Patrons have to use the Adobe Reader search function to locate each appearance of the search term.
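Since the issue date is encoded in each file name, it can be recovered programmatically and reused when filling in metadata. The following is a minimal sketch, assuming the files carry a .pdf extension after the yyyy_mm_dd stem; the function names and the metadata values are hypothetical, and the slides do not say what text was actually entered in the Title, Subject, and Keywords attributes. Writing the values into a PDF would need a separate PDF tool, which is not shown.

```python
import re
from datetime import date

# Assumption: issue files look like "1978_11_28.pdf" (yyyy_mm_dd plus ".pdf").
FILENAME_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})\.pdf$", re.IGNORECASE)


def issue_date(filename):
    """Return the issue date encoded in the file name, or None if it does not fit."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None


def issue_metadata(filename):
    """Build illustrative Title/Subject/Keywords values for one issue."""
    issued = issue_date(filename)
    if issued is None:
        return None
    return {
        "Title": f"Albany Student Press, {issued.isoformat()}",
        "Subject": "Albany Student Press",
        "Keywords": f"Albany Student Press, student newspaper, {issued.year}",
    }


if __name__ == "__main__":
    print(issue_metadata("1978_11_28.pdf"))
```

Files that do not follow the convention return None, which makes it easy to list the ones that need renaming before a crawl.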
Lessons Learned • Clustering (automatic filtering): add the parameter “filter=0” to turn off automatic filtering of results. • Proxyreload: add the parameter “proxyreload=1” to preview an updated XSL stylesheet immediately rather than waiting for the 15 minutes set by the XSLT server. • Image quality: some of the scanned images are quite light, which might affect the quality of OCR and can also make them difficult to read.
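Both parameters are simply appended to the search request URL. The sketch below shows one way to assemble such a request. Only filter=0 and proxyreload=1 come from the slides; the host name is hypothetical, and the other parameters (q, site, client, output, proxystylesheet) and their default values are assumptions based on the appliance's search protocol that should be confirmed against the documentation for the installed version.

```python
from urllib.parse import urlencode


def build_search_url(host, query, collection="default_collection",
                     frontend="default_frontend", unfiltered=True,
                     reload_stylesheet=False):
    """Assemble a search request URL for a Google Mini (illustrative only)."""
    params = {
        "q": query,
        "site": collection,           # assumed name of the collection to search
        "client": frontend,           # assumed name of the front end
        "output": "xml_no_dtd",
        "proxystylesheet": frontend,  # render results through the front end's XSLT
    }
    if unfiltered:
        params["filter"] = "0"        # from the slides: disable automatic filtering
    if reload_stylesheet:
        params["proxyreload"] = "1"   # from the slides: pick up stylesheet edits now
    return f"http://{host}/search?{urlencode(params)}"


if __name__ == "__main__":
    # "mini.example.edu" is a placeholder host, not the project's server.
    print(build_search_url("mini.example.edu", "harvey milk",
                           reload_stylesheet=True))
```

Leaving proxyreload off by default keeps normal searches on the cached stylesheet; it is only worth adding while a stylesheet is being edited.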
Lessons Learned • Ranking of results: most of the time, the expected document is the first link on the results page, provided the string is indexed properly. • Word spacing: if the text in an issue has more than one space between words, the Mini does not seem to index it or show accurate results.
Future Plans • Rescan the whole collection • Explore other products, including open source discovery tools • Raise funds for expansion and sustainability of the project • Continue the collaboration • Beta test other digitized collections • Incorporate user feedback