CSC 96 Building and Managing Web Sites with Microsoft Technologies
Week 9: Search Engines and Microsoft Index Server
Search Engines
• Search engines are important for web sites larger than 100 pages.
• However, they should not be a replacement for a good site structure and navigation scheme.
• They provide an alternative content-discovery mechanism for advanced users.
• Search engines create entries automatically.
How Search Engines Work
• Search engines are an information-gathering and filtering subsystem.
• Robots/spiders gather data from remote/local web repositories into a local indexing database.
• Gathered documents are converted and key information is extracted for indexing.
• Search engines periodically revisit records to update the database.
Formatting Pages for Search Engines
• Use a short, descriptive TITLE element.
• <TITLE> should be the first element of the <HEAD> section.
• Use the META description element to provide an abstract of the page.
• Break content into smaller pages for more precise searching.
• Use the META keywords element. Most search engines limit it to the first 25 words.
• Ranking algorithms are typically based on keyword frequency and location on the page.
• See the sketch below for a <HEAD> section that follows these guidelines.
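A minimal sketch of a search-friendly <HEAD> section (the title, description, and keyword text shown are hypothetical):

<HTML>
<HEAD>
<TITLE>Acme Widgets Product Catalog</TITLE>
<META name="description"
      content="Catalog of Acme widget models, prices, and ordering information.">
<META name="keywords"
      content="widgets, Acme, catalog, prices, ordering">
</HEAD>
<BODY>
...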
Robots Exclusion Standard
• Most search engines look for a text file called robots.txt in your site's root directory.
• robots.txt tells robots/spiders what they can and can't index.
• Most, but not all, robots abide by this standard.
• Only one robots.txt file per web site -- any others are ignored.
• Wild cards are not supported. Truncate the path instead (e.g., /help disallows both /help.html and /help/index.html).
• Notes are indicated with #.
Sample robots.txt File

# Test robots.txt file
# this section restricts /temp and /current for all agents
User-agent: *               # applies to all robots
Disallow: /temp/            # restrict /temp
Disallow: /current/         # restrict /current
Allow: /current/allow.htm

# restrict BadSpider from all content
User-agent: BadSpider
Disallow: /                 # BadSpider restricted
How to Identify Visiting Spiders
• Check server logs for sites that retrieve many documents, especially in a short time.
• If your server supports User-agent logging, check for retrievals with unusual User-agent header values.
• Look for sites that repeatedly check for the file /robots.txt (see the sketch below).
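One quick way to spot those requests, assuming IIS's W3C extended logging and its usual log location (the path shown is hypothetical for your server), is NT's findstr command:

findstr /i "robots.txt" C:\WINNT\system32\LogFiles\W3SVC1\ex*.log

Client addresses that appear here often but request little other content are likely robots.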
Robots META Element
• Can direct robots at the page level using the Robots META tag.
• No server administrator action is required.
• Only some robots implement this.

<HTML>
<Head>
<Title>Robots Test Page</Title>
<META name="robots" content="noindex,nofollow">
</Head>
<Body>
...
Robots Information
For more information:
General Resource: http://wdvl.internet.com/Location/Search/Robots.html
Directory of Robots: http://info.webcrawler.com/mak/projects/robots/active.html
Microsoft Index Server
• Excellent indexing server packaged free with IIS/NT 4.
• Use only Version 2.0 with the NT 4 Option Pack.
• Once installed, runs automatically with virtually no attention required.
• Spins through content, indexing all words in each document.
• Can create multiple indexes for different webs and/or portions of webs.
• Use IIS to turn off indexing of specific directories.
• Indexing occurs during less busy times.
• Occasionally you will need to rebuild the index(es).
Using Index Server
• Three different methods to use Index Server:
• Forms using .htx and .idq files
• ASP pages that access index contents using the supplied Index Server objects.
• ASP pages and ADO that access index contents with SQL statements.
• Basic forms are easiest, but ASP pages provide the most power (sketches of the two ASP approaches follow below).
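A minimal sketch of the object approach, assuming Index Server 2.0's IXSSO.Query object is registered on the server (the form field name q and the column list are illustrative choices):

<%
' Query the index through the supplied Index Server object
Dim Q, RS
Set Q = Server.CreateObject("IXSSO.Query")
Q.Query = Request.QueryString("q")           ' user's search terms
Q.Columns = "DocTitle, vpath, size, write"   ' columns to return
Q.MaxRecords = 50
Set RS = Q.CreateRecordSet("nonsequential")
Do While Not RS.EOF
  Response.Write "<A HREF=""" & RS("vpath") & """>" & _
                 RS("DocTitle") & "</A><BR>"
  RS.MoveNext
Loop
%>

And a sketch of the ADO approach, assuming the MSIDXS OLE DB provider installed with Index Server 2.0 (a real page should validate the user's input before building the SQL string):

<%
' Query the index through ADO with an Index Server SQL statement
Dim Conn, RS, SQL
Set Conn = Server.CreateObject("ADODB.Connection")
Conn.Open "Provider=MSIDXS"
SQL = "SELECT DocTitle, vpath FROM SCOPE() " & _
      "WHERE CONTAINS('""" & Request.QueryString("q") & """')"
Set RS = Conn.Execute(SQL)
Do While Not RS.EOF
  Response.Write RS("DocTitle") & "<BR>"
  RS.MoveNext
Loop
%>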
Accessing Index Server with Forms
• Create a search form that references an .IDQ parameters file.
• Create an .IDQ file that passes information to Index Server, including the output template file (.HTX).
• Create an .HTX file to format the output.
• A sketch of all three pieces follows below.
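A minimal sketch of the three pieces, assuming the files live at /scripts/search.idq and /scripts/search.htx (the file names and scope are hypothetical):

The search form:

<FORM ACTION="/scripts/search.idq" METHOD="GET">
Search for: <INPUT TYPE="TEXT" NAME="CiRestriction">
<INPUT TYPE="SUBMIT" VALUE="Search">
</FORM>

search.idq, which passes the query parameters to Index Server:

[Query]
CiColumns=DocTitle,vpath,size,write
CiRestriction=%CiRestriction%
CiMaxRecordsPerPage=20
CiScope=/
CiFlags=DEEP
CiTemplate=/scripts/search.htx

search.htx, which formats one line of output per hit between the begindetail/enddetail markers:

<HTML><BODY>
<H2>Search Results</H2>
<%begindetail%>
<A HREF="<%vpath%>"><%DocTitle%></A><BR>
<%enddetail%>
</BODY></HTML>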