Creating a Path through the Labyrinth: Using Sitemaps to Enhance Discoverability. Sandra McIntyre, Mountain West Digital Library; Anne Morrow, University of Utah; Patrick O'Brien, University of Utah (volunteer); Lisa Chaufty, University of Utah. Western CONTENTdm Users Group
Why is this a priority? • For our digital collections • The J. Willard Marriott Library hosts over 100 outstanding digital collections, containing over 1 million digital items. • For USpace, one of our large collections • Our mission is to collect, preserve, and provide access to the intellectual capital of the University of Utah, to reflect the University’s excellence, and to share that work with others.
Sharing • Share what? • Unique materials, such as theses and dissertations • Above all, the work of our faculty • Q: Why do faculty submit to institutional repositories? • A: One primary motivator for faculty IR contribution is to make sure other scholars can find and cite their work. • A: Faculty will access and contribute to an IR if they see significant input activity. • How are people going to find what we want to share?
De Rosa, C., et al. Perceptions of Libraries and Information Resources. OCLC Membership Report, 1-17, 1-18. http://www.oclc.org/reports/pdfs/Percept_all.pdf
User discovery • Library Website • OAI harvesters (e.g., Mountain West Digital Library) • WorldCat • Search engines • Google • Google Scholar • Google Images • Yahoo • Bing • Baidu • And more…
Focus on search engine discoverability • Googlebot • Crawls websites and harvests links • Used by Google to build its search index • We want to lend the bot a hand • What can we do to improve the bot-crawling of the digital library? • Understanding the relationship between Googlebot and CONTENTdm
Cross-departmental collaboration • Search Engine Optimization (SEO) A-Team • Collection Managers • Lisa Chaufty, Coordinator of the Institutional Repository • Mountain West Digital Library • Sandra McIntyre, MWDL’s Program Director • IT Division • Kenning Arlitsch, Associate Director for IT Services • Systems Development • Application Development • Anne Morrow, Digital Initiatives Librarian • Search Engine Marketing (SEM) expertise • Patrick O'Brien, Volunteer Advisor
Google and CONTENTdm’s OAI • Big change in mid-2008: Google announced it would no longer crawl Open Archives Initiative (OAI) streams • Before then, Google would crawl OAI metadata for digital collections and index it • Many digital collections have been slowly “disappearing” from Google since then
Web crawlers – how they work • Crawl pages, then index and store the data
Dynamic pages • CONTENTdm constructs pages in HTML on the fly • Header • Record retrieved from database and formatted • Footer
Dynamic page • We have to tell the crawler how to assemble it (via a URL)
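As an illustration, an item-page URL carries everything CONTENTdm needs to assemble the page; the anatomy below uses the Dard Hunter example URL that appears later in this deck:

```
http://content.lib.utah.edu/cdm4/document.php   <- script that builds the page on the fly
    ?CISOROOT=/DardHunter                       <- collection alias (which database)
    &CISOPTR=1919                               <- item pointer (which record)
```

Giving the crawler a list of such URLs (a Sitemap) is what lets it reach every dynamically assembled item page.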
CONTENTdm: More challenges for crawlers • Compound objects use a frameset that makes it harder to get to the content of the page • Table-based layout, JavaScript, and CSS clutter the code and are used only for styling and management, not for the structure of the text • Except for the splash page, there is no hierarchy of linking that gives any context – pages are “islands” • Usable content is often far down in the code of the delivered page
In the works • OCLC is working with Google and others to enhance visibility of resources in WorldCat • Expect more of a linkage between WorldCat and search engines
Google Sitemaps • Instructions to Google’s web crawler: Crawl these URLs to get my content • One Sitemap for each collection • Not the same as a “site map” (contents page for website) “Here is a list of the URLs of the dynamic pages that I want you to crawl, one for each item.”
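A minimal Sitemap file, per the sitemaps.org protocol, looks like the sketch below (the item URL is the Dard Hunter example used elsewhere in this deck; note that the `&` in the query string must be escaped as `&amp;` in XML):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://content.lib.utah.edu/cdm4/document.php?CISOROOT=/DardHunter&amp;CISOPTR=1919</loc>
  </url>
</urlset>
```

One `<url>` entry per item page; optional elements such as `<lastmod>` can be added per the protocol.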
Google Sitemap – example http://content.lib.utah.edu/sitemaps/sitemap_ir-main-001.xml
Sitemap Index • XML file listing all the Sitemaps on your server “Here is a list of all the Sitemap files.”
Sitemap Index - example http://content.lib.utah.edu/cdm4/autositemap/sitemapindex.xml
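The Sitemap Index itself is a short XML file in the same protocol; a minimal sketch, using the Sitemap URL from the earlier example:

```
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://content.lib.utah.edu/sitemaps/sitemap_ir-main-001.xml</loc>
  </sitemap>
</sitemapindex>
```

Add one `<sitemap>` entry per collection-level Sitemap file.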
Implementing Google Sitemaps • Create Sitemaps, one for each collection, and Sitemap Index. • Register with Google Webmaster Tools. • Inform Google about the location of your Sitemap Index. • In Webmaster Tools • In the robots.txt file on the server • Monitor crawler results.
Step 1: Create Sitemaps and Index • According to the protocol at http://www.sitemaps.org: • Create a Sitemap file for each collection. • Create a Sitemap Index file. • See Terry Reese’s “makemap” script athttp://digitalcollections.library.oregonstate.edu/php/makemap.txt
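The deck points to Terry Reese’s PHP makemap script; as an independent illustration, here is a minimal Python sketch of the same idea. The collection alias and item pointers are hypothetical examples, and the URL pattern is the `document.php?CISOROOT=…&CISOPTR=…` form shown elsewhere in this deck:

```python
# Minimal sketch of a per-collection Sitemap generator for CONTENTdm.
# The alias and item pointers below are illustrative, not real data.
from xml.sax.saxutils import escape

BASE = "http://content.lib.utah.edu"  # server root from the examples in this deck

def make_sitemap(collection_alias, item_pointers):
    """Return Sitemap XML listing one item-page URL per item pointer (CISOPTR)."""
    urls = []
    for ptr in item_pointers:
        loc = (f"{BASE}/cdm4/document.php"
               f"?CISOROOT=/{collection_alias}&CISOPTR={ptr}")
        # escape() turns the raw '&' into '&amp;' so the XML stays well-formed
        urls.append(f"  <url><loc>{escape(loc)}</loc></url>")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(urls) + "\n</urlset>\n")

# Example: three hypothetical items from the Dard Hunter collection
print(make_sitemap("DardHunter", [1919, 1920, 1921]))
```

In practice the item pointers would be read from the collection’s database or export, and the script would be re-run whenever items are added.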
Step 2: Webmaster Tools Registration • Register (free) with Google Webmaster Tools at http://www.google.com/webmasters/tools • You will need a Google account • Follow the directions to prove your control over the site by adding a <meta> tag to the home page
Step 3: Inform Google • Step 3A: Submit the address of Sitemap Index file on Webmaster Tools.
Step 3: Inform Google • Step 3B: Modify the robots.txt file at the root of your CONTENTdm server to specify the location of the Sitemaps Index.
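A sketch of what that robots.txt addition looks like, assuming the server layout from the Utah examples (the `Disallow` line is just a placeholder; the `Sitemap:` directive is the autodiscovery mechanism defined by the sitemaps.org protocol):

```
User-agent: *
Disallow: /cdm4/admin/

Sitemap: http://content.lib.utah.edu/cdm4/autositemap/sitemapindex.xml
```

With this line in place, any crawler that honors the protocol can find the Sitemap Index without a separate submission.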
Step 4: Monitor crawler results • Monitor crawler results on Webmaster Tools. • Top search queries • Links to your site • Keywords • Internal links • Crawl errors • Crawl stats • HTML suggestions
Using and Maintaining Sitemaps • Re-generating (updating) Sitemaps and Sitemap Index frequently • Checking the crawler stats in Google Webmaster Tools and initiating changes as needed • Noticing the impact in Google searches
Initial approach to SEO • Generated Sitemaps and Sitemap Index file: Applied a variation of Terry Reese’s makemap.php script • Edited robots.txt file • Registered with Webmaster Tools • Started observing crawler statistics
The Next Phase • Enlisted expertise of Patrick O'Brien • Sitemaps • Diagnose crawl error reports • Make recommendations • Proposing strategy for the future of SEO
Know your customers and what they value.
Faculty (high value): Publication Page Views • Publication Downloads • Requests for Information • Publication Citations
Collection Donors (high value): Digital Collection Pages Indexed • Digital Collection Page Views • Digital Collection Visitors • Requests for More Info • Physical Collection Visitors • Reproductions Ordered
Why can’t the public find our content? What do search engines value? • Are you worthy enough for their customer (i.e., worth indexing)? • How much will their customer value the introduction (i.e., visibility)?
Are you worthy enough for their customer? Can they trust you with their customer? Is your content worth an investment of their resources?
Check the Crawl Errors page • Unauthorized (401 errors) • Forbidden (403 errors) • Network unreachable (5xx errors) • Page Not Found (404 errors)
Address errors and don’t leave users stranded! (Low-trust example: a 403 error page)
Eliminate sitemap and robots.txt conflicts
robots.txt:
User-agent: *
Disallow: /dmscripts/
Disallow: /cdm4/admin/
Disallow: /cdm4/client/
Disallow: /cdm4/cqr/
Disallow: /cdm4/images/
Disallow: /cdm4/includes/
Disallow: /cdm4/jscripts/
Disallow: /cdm-diagnostics/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /u/
Sitemap URL blocked by the Disallow: /cgi-bin/ rule above:
http://content.lib.utah.edu/cgi-bin/browseresults.exe?CISOROOT=/DC_Beckwith
Provide sitemaps linking context with simple URLs http://content.lib.utah.edu/ http://content.lib.utah.edu/cdm4/az.php#D http://content.lib.utah.edu/cdm4/az_details.php?id=44 http://content.lib.utah.edu/cdm4/browse.php?CISOROOT=/DardHunter http://content.lib.utah.edu/cdm4/document.php?CISOROOT=/DardHunter&CISOPTR=1919
Increase page crawl efficiency (example item: “A Papermaking Pilgrimage to Japan, Korea and China”)
Are you worthy enough for their customer? • Can they trust you with their customer? • Check the Crawl Errors in Google Webmaster • Address errors and don’t leave their customers stranded! • Is your content worth an investment of their resources? • Eliminate sitemap & robots.txt conflicts • Provide sitemaps linking context with simple URLs • Increase Page Crawl efficiency
How much will their customer value the introduction (i.e., Visibility)? Is your content relevant? Is your content credible? Is your content accessible?
Recommendations for CONTENTdm managers • Assemble the right players: your SEO team • Set your priorities • Create linking strategy from home page, to index of collections, to collection “splash” page, and to item pages • Create Sitemaps and Sitemap Index file • Set up a regular process to update Sitemaps • Eliminate any conflicts between robots.txt and Sitemaps • Get involved with CONTENTdm’s new Web templates • Set up to monitor results!