Content Rules Again: Evolution of the NREL Search Engine And Search Engine Services InterLab 2002 12/5/02 Marsha Luevane National Renewable Energy Laboratory
What this presentation covers • Introduction • Twelve-step evolution of the NREL search engine and search engine services • What’s next
1: Commitment • NREL used the Harvest search engine until 1997 • Harvest had a lot of problems • We needed a better search engine • We were ready to commit the time and money to research, evaluate, and implement a good search engine • We used Excite in the interim
2: Research and evaluation • NREL developed a list of criteria • We researched which search engines were available that met the criteria • We evaluated Verity and Infoseek • We selected and implemented Infoseek Ultraseek • For details, see our InterLab ’98 presentation “After the Harvest, Getting Excited About Infoseek”
3: Collection development • We needed Ultraseek to collect and index content from three mothership sites • SOURCE intranet - thesource.nrel.gov • NREL - www.nrel.gov • EREN - www.eren.doe.gov • SOURCE and NREL were easy because most of the content lives on two servers
3: Collection development (cont’d) • Energy Efficiency and Renewable Energy Network (EREN) was a different matter • EREN is the official site for DOE Office of Energy Efficiency and Renewable Energy • Integrates information from Web sites at NREL, other labs, and DOE • It is also a portal for information on energy efficiency and renewable energy technologies • Includes information from government sites, state energy offices, universities, trade associations, research organizations, etc. • EREN includes content from >600 Web sites
3: Collection development (cont’d) • EREN was a different matter (cont’d) • Portal complicates things • Portal indexes content from hundreds of sites, most of which are produced outside of NREL • There are lots of content challenges and surprises • We have filters on some non-DOE content to keep costs down
4: Optimizing content for Ultraseek • Ultraseek collected and indexed 75K documents • For the first time, we got a good look at our content • We needed more descriptive titles and summaries for search results • We made “meaningful and unique” titles standard on SOURCE, NREL, EREN • We developed optimizing guidelines
4: Optimizing content for Ultraseek (cont’d) • Basic optimizing guidelines • Focus on your content • Determine key terms that describe the “aboutness” of the document • Think about terms that people use in searches
4: Optimizing content for Ultraseek (cont’d) • Basic optimizing guidelines (cont’d) • Position key terms in headers, beginning text, and throughout your content • Position key terms in titles and make sure titles describe the content • Meta tag your home pages • Meta tags for other important pages are optional
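The title and meta tag guidelines above lend themselves to a quick automated check. The following is a minimal sketch using only Python's standard library, not a tool described in the presentation; the names `HeadAudit` and `check_page` are hypothetical, and the choice to check the `description` and `keywords` meta tags is an assumption.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            if name:
                self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(html, key_terms):
    """Report whether key terms appear in the title and which meta tags exist."""
    parser = HeadAudit()
    parser.feed(html)
    title = parser.title.strip()
    # Case-insensitive substring match; a real check might tokenize instead.
    missing = [t for t in key_terms if t.lower() not in title.lower()]
    return {
        "title": title,
        "key_terms_missing_from_title": missing,
        "has_description": bool(parser.meta.get("description", "").strip()),
        "has_keywords": bool(parser.meta.get("keywords", "").strip()),
    }
```

Running `check_page` over a site's home pages would give a rough list of pages whose titles lack the terms people search for.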
5: Optimizing content for Web-wide search engines • We also wanted our pages to rank high and display well in Web-wide search engines • Techniques we used to optimize pages for Ultraseek work well in Web-wide search engines • We were getting lots of traffic from Web-wide search engines, and started monitoring and reporting our page rankings • We do searches on terms related to key content • We report to site managers how their pages fare in searches
6: Content classification • By 2000, EREN portal was so large – over 80K documents – that we needed a better way to get users into content • We implemented the Content Classification Engine (CCE), an add-on to Ultraseek that helps organize content into browsable topics • We spent a year developing 850 topics for eleven energy efficiency and renewable energy technologies
6: Content classification (cont’d) • Content classification teams and tasks • Content team – NREL science writers • Researched technologies; determined topics; wrote text and technology scope notes; coordinated NREL and DOE topic reviews • CCE team – me • Consulted on topics; developed topic structure; created, tested, and edited topic rules (>3K searches); reviewed topic results
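The idea of a topic defined by search rules can be illustrated with a toy example. This sketch is not the CCE rule syntax, which the slides do not show; it assumes each topic is a list of phrases and that a document belongs to a topic if any phrase appears in its text. The function name `classify` and the sample topics are hypothetical.

```python
def classify(text, topic_rules):
    """Return the sorted topics whose rule phrases appear in the text.

    topic_rules maps a topic name to a list of phrases (an assumed,
    simplified stand-in for real topic rules).
    """
    text_lower = text.lower()
    return sorted(
        topic
        for topic, phrases in topic_rules.items()
        if any(phrase.lower() in text_lower for phrase in phrases)
    )


# Hypothetical rules for two topics.
rules = {
    "Photovoltaics": ["solar cell", "photovoltaic"],
    "Wind Turbines": ["wind turbine"],
}
matched = classify("How a photovoltaic module works", rules)
```

Testing a rule then amounts to running it against known documents and reviewing which ones land in the topic, which mirrors the "created, tested, and edited" loop described above.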
6: Content classification (cont’d) • Topic considerations • We developed topics for several audiences, including energy professionals, homeowners, and students • We listed topics on technology pages, not on a search page • Energy professionals know the terminology but homeowners, students, and other users don’t • We want technology pages to educate users as well as guide them to content
6: Content classification (cont’d) • Benefits • EREN content is organized, so users don’t have to know the terminology • Users learn about energy technologies from topics and scope notes • Topics get users into our content • Topics bring lots of traffic to our sites • Topics give site managers a good look at their content
6: Content classification (cont’d) • Benefits (cont’d) • Site managers can use statistics on topic usage for content development • Webmasters find topics useful for responding to user inquiries • EREN Webmaster inquiries have decreased because users need less help finding information
6: Content classification (cont’d) • Benefits (cont’d) • Need for optimization is reinforced • Optimized pages rank higher and display better in topic search results • Site managers who have optimized their pages see the rewards • Site managers who have not optimized see how their pages fare in searches and understand better why they need to optimize
7: Optimizing audits • By 2001, Ultraseek (now called Inktomi) was indexing >100K documents for SOURCE, NREL, and EREN • Web-wide search engines were indexing our content and billions of other documents • Content optimization was more important than ever before • To help site managers know where to concentrate their optimizing efforts, we developed optimizing audits
7: Optimizing audits (cont’d) • Basic optimizing audits • How many documents are on your site • What are your document formats • What content should you focus on • How do your pages fare out of context • Do your pages have descriptive titles • Does beginning text on pages tell users what the page is about
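Several of the audit questions above (document counts, formats, descriptive titles) can be answered from a simple list of URLs and titles. The sketch below assumes such a list has already been exported from the search index or a crawl; the function name `audit_site` and the input shape are hypothetical, and the slides do not describe how the actual audits were run.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def audit_site(documents):
    """Summarize a site from an iterable of (url, title) pairs.

    URLs are assumed to include a scheme (e.g. https://) so the path
    can be separated from the host name.
    """
    formats = Counter()
    titles = Counter()
    for url, title in documents:
        # File extension of the URL path; "(none)" for directory-style URLs.
        ext = PurePosixPath(urlparse(url).path).suffix.lower() or "(none)"
        formats[ext] += 1
        titles[title.strip().lower()] += 1
    untitled = titles.pop("", 0)
    duplicates = sorted(t for t, n in titles.items() if n > 1)
    return {
        "total": sum(formats.values()),
        "formats": dict(formats),
        "untitled": untitled,
        "duplicate_titles": duplicates,
    }
```

Duplicate and missing titles are the cases that fail the "meaningful and unique" title standard, so they are a reasonable place to start optimizing.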
8: Optimizing PDFs and native documents • We learned from optimizing audits that we had lots of important content in PDFs and Word, Excel, and PowerPoint documents • Our search engine indexed these formats but most results were not pretty • We researched how Inktomi indexes and displays these formats, then developed optimizing standards and guidelines
8: Optimizing PDFs and native documents (cont’d) • PDFs • Titles • Serve as captions for search results • Should contain key terms and describe content • Boost ranking if they contain search terms • Are required per SOURCE, NREL, and EREN standards • Subjects • Display as result summaries in Inktomi • Must be at least 71 characters or won’t display • Should contain key terms and describe content • Boost ranking in Inktomi if they contain search terms • Are required per standards • Authors and keywords are optional
8: Optimizing PDFs and native documents (cont’d) • Word, Excel, and PowerPoint documents • Titles • Serve as captions for search results • Should contain key terms and describe content • Boost ranking if they contain search terms • Are required per SOURCE standards and recommended practices for NREL and EREN • Subjects • Display as result summaries in Inktomi • Should contain key terms and describe content • Boost ranking in Inktomi if they contain search terms • Are required per SOURCE standards and recommended practices for NREL and EREN • Authors and keywords are optional
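The title and subject rules above can be expressed as a small check on a document's properties. This is a sketch under assumptions: the properties are assumed to have been extracted into a dictionary already (no PDF or Office library is used here), the function name `check_properties` is hypothetical, and for simplicity it treats a subject as required for every format even though the slides make it a recommended practice for Word, Excel, and PowerPoint on NREL and EREN. The 71-character minimum is the figure given for PDF subjects.

```python
MIN_PDF_SUBJECT_CHARS = 71  # minimum stated in the slides for PDF subjects


def check_properties(props, is_pdf, key_terms=()):
    """Return a list of problems found in a document's title and subject.

    props is an assumed dict such as {"title": "...", "subject": "..."}.
    """
    problems = []
    title = (props.get("title") or "").strip()
    subject = (props.get("subject") or "").strip()
    if not title:
        problems.append("missing title")
    if not subject:
        problems.append("missing subject")
    elif is_pdf and len(subject) < MIN_PDF_SUBJECT_CHARS:
        problems.append(
            f"subject too short ({len(subject)} < {MIN_PDF_SUBJECT_CHARS} characters)"
        )
    # Key terms should appear somewhere in the title or subject.
    text = f"{title} {subject}".lower()
    for term in key_terms:
        if term.lower() not in text:
            problems.append(f"key term not found: {term}")
    return problems
```

An empty list means the document passes these particular checks; it says nothing about whether the title and subject actually describe the content well.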
9: Dynamic URL conversion • We created a lot of dynamic Web pages using ColdFusion • Our search engine can index dynamic URLs • Web-wide search engines either don’t index dynamic URLs, want sites to individually submit dynamic URLs, or limit the number of dynamic URLs they index • Best solution is to convert URLs to a format that all search engines can index • Before: www.oit.doe.gov/cfm/fullarticle.cfm?id=355 • After: www.oit.doe.gov/cfm/fullarticle.cfm/id=355
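The before and after URLs above differ only in the `?` becoming a `/`. A minimal sketch of that rewrite and its inverse follows. The slides show a single parameter only, so replacing `&` with `/` for additional parameters is an assumption, as are the function names; the server-side configuration that maps the converted URL back to the ColdFusion template is not covered here.

```python
def to_static_url(url):
    """Rewrite 'page.cfm?id=355' as 'page.cfm/id=355'."""
    base, _, query = url.partition("?")
    if not query:
        return base
    # Assumption: multiple parameters are also separated by slashes.
    return base + "/" + query.replace("&", "/")


def to_dynamic_url(url, script_ext=".cfm"):
    """Inverse rewrite: turn 'page.cfm/id=355' back into 'page.cfm?id=355'."""
    head, sep, tail = url.partition(script_ext + "/")
    if not sep:
        return url  # nothing to convert
    return head + script_ext + "?" + tail.replace("/", "&")


converted = to_static_url("www.oit.doe.gov/cfm/fullarticle.cfm?id=355")
# converted == "www.oit.doe.gov/cfm/fullarticle.cfm/id=355"
```

The inverse function assumes parameter values contain no slashes; values that do would need escaping before conversion.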
10: Optimizing classes • We had a great deal of optimizing information to share • We went from doing lots of one-on-one consulting to offering formal, regular classes • We offer entire classes on optimizing content, titles, and PDFs • We cover the basics of optimizing Word, Excel, and PowerPoint documents in the content and title classes
11: Search log and statistics analysis • Search logs and site statistics provide great information about user interests, and we started offering analysis services • Search log analysis • Inktomi automatically creates query logs • We developed a basic tool to manipulate query information • We tell site managers how people are using their search features • How people search • What terms people use in searches • What topics people search for
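A basic query-log summary of the kind described above might look like the following. This is a sketch, not the tool mentioned in the slides: it assumes the query strings have already been pulled out of the log (the Inktomi log format is not shown here), and it treats quotation marks or a leading `+` or `-` as signs of an "advanced" search, which is an assumption. The function name `summarize_queries` is hypothetical.

```python
from collections import Counter


def summarize_queries(queries, top_n=5):
    """Count terms, query lengths, and advanced-syntax use in search queries."""
    term_counts = Counter()
    length_counts = Counter()
    advanced = 0
    total = 0
    for query in queries:
        query = query.strip()
        if not query:
            continue  # skip empty searches
        total += 1
        tokens = query.lower().split()
        # Assumed markers of advanced syntax: quotes, or a leading + or -.
        if '"' in query or any(t[0] in "+-" for t in tokens):
            advanced += 1
        words = [w for w in (t.strip('"+-') for t in tokens) if w]
        length_counts[len(words)] += 1
        term_counts.update(words)
    return {
        "total": total,
        "one_to_three_terms": sum(length_counts[n] for n in (1, 2, 3)),
        "advanced": advanced,
        "top_terms": term_counts.most_common(top_n),
    }
```

The `one_to_three_terms` and `advanced` counts correspond to the "how people search" question, and `top_terms` to "what terms people use".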
11: Search log and statistics analysis (cont’d) • Site statistics analysis • WebTrends provides information on popular documents, popular paths, search engine referrals, search terms, etc. • We tell site managers • What topics are covered in their popular documents • What paths people follow to their popular documents • What terms people use in Web-wide search engines • What topics people search for in Web-wide search engines • Which Web-wide search engines send the most traffic to their sites
11: Search log and statistics analysis (cont’d) • We correlate the analyses and tell site managers • What topics are popular in site searches and Web-wide searches • Which topics are popular in site searches but not Web-wide searches (and vice versa) • What terms people use in searches • What relationships exist between popular topics and popular documents • We also do searches on popular topics to see what results (documents) people get
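Comparing what is popular in site searches with what is popular in Web-wide searches reduces to set operations once each source has topic counts. The sketch below assumes those counts are already available as dictionaries; the function name `correlate_topics`, the `top_n` cutoff for "popular", and the sample numbers are all hypothetical.

```python
def correlate_topics(site_counts, web_counts, top_n=10):
    """Compare the top topics from site searches and Web-wide searches."""

    def top(counts):
        # Rank by count (descending), then by name so ties are stable.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return {topic for topic, _ in ranked[:top_n]}

    site, web = top(site_counts), top(web_counts)
    return {
        "both": sorted(site & web),
        "site_only": sorted(site - web),
        "web_only": sorted(web - site),
    }


# Hypothetical counts for illustration only.
site = {"solar": 50, "wind": 30, "jobs": 20}
web = {"solar": 40, "geothermal": 25, "wind": 5}
result = correlate_topics(site, web, top_n=2)
```

Topics in `site_only` or `web_only` are the ones worth a closer look, since they show where visitors arriving from Web-wide search engines differ from people searching on the site itself.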
11: Search log and statistics analysis (cont’d) • Some of the things we have learned • Most people do basic searches using 1-3 terms • Searchers rarely use advanced techniques • On some sites, users search for general info more frequently than technical info • Documents that rank high in popular searches are also popular documents in statistics • Google sends a lot of traffic to our sites (from pages that rank high in search results) • On SOURCE and NREL, people use Web search to look for employee information, publications, photos, etc.
11: Search log and statistics analysis (cont’d) • How we have applied information from analyses • We use information on search terms and popular topics for content development, optimization, home page redesigns, etc. • We list popular topics on search pages • On SOURCE and NREL search pages, we make it clear where to search Web sites, employee information, publications, photos, etc.
12: De-optimizing content • After studying search logs and statistics, we have de-optimized some content • For example, old NREL fact sheets on renewable energy that ranked high in searches and were popular documents • We removed the fact sheets • We targeted other content, i.e., we optimized other content that we want to rank high in searches and bring people to the NREL site
What’s next • Evolution of NREL search engine and search services will continue • New Inktomi features and products will drive new services • Search and optimization for an enterprise portal? • Search and optimization integrated with a content management system? • Search personalization – search my email or hard drive?