Web indexing. ICE0534 – Web-based Software Development July 21. 2005 Seonah Lee. Contents. News related to Web Indexing Web Indexing? Web Indexing: Styles Web Indexing: Tools Web Indexing in Search Engine Web Indexing in Google Summary References Question.
Web indexing ICE0534 – Web-based Software Development July 21, 2005 Seonah Lee
Contents • News related to Web Indexing • Web Indexing? • Web Indexing: Styles • Web Indexing: Tools • Web Indexing in Search Engine • Web Indexing in Google • Summary • References • Question
News related to Web Indexing • "Google tests tool to aid Web indexing" • By Dawn Kawamoto, CNET News.com, Monday, June 6, 2005, 12:00 AM
Web Indexing? • Creating indexes for • individual web sites • intranets • collections of HTML documents • collections of web sites • Purpose • helping users find information through a variety of keywords • gathering similar information together
Web Indexing? • Indexes • systematically arranged items • entry points to go directly to desired information within a larger document or set of documents • Indexing • an analytic process of determining which concepts are worth indexing, what entry labels to use, and how to arrange the entries.
Web Indexing: Styles (1/2) • Back-of-the-Book Style Web Indexing • Includes “A-Z indexes” to websites or intranets • Some web indexes take the form of a list of hierarchical categories arranged in alphabetical order
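A back-of-the-book style A-Z index can be sketched in a few lines. This is only an illustration: the function name `build_az_index` and the sample entries and URLs are hypothetical and do not come from any tool mentioned in these slides.

```python
from collections import defaultdict

def build_az_index(entries):
    """Group (label, url) pairs under their uppercase first letter.

    Letters are returned in alphabetical order, and the entries under
    each letter are sorted case-insensitively by label.
    """
    groups = defaultdict(list)
    for label, url in entries:
        groups[label[0].upper()].append((label, url))
    return {
        letter: sorted(items, key=lambda item: item[0].lower())
        for letter, items in sorted(groups.items())
    }

# Hypothetical index entries, for illustration only.
entries = [
    ("PageRank", "/pagerank.html"),
    ("metadata", "/metadata.html"),
    ("HITS", "/hits.html"),
    ("hubs", "/hits.html#hubs"),
]

az_index = build_az_index(entries)
for letter, items in az_index.items():
    print(letter)
    for label, url in items:
        print(f"  {label} -> {url}")
```

A real A-Z index would also need cross-references and subentries, which this sketch leaves out.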
Web Indexing: Styles (2/2) • Metadata and Web Indexing • assigning keywords or phrases to web pages or web sites within a meta-tag field • so that the web page or web site can be retrieved with a search engine that is customized to search the keywords field.
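To illustrate how a search tool might read the keywords field, here is a minimal sketch using Python's standard `html.parser` module. The class name `KeywordMetaParser`, the helper `extract_keywords`, and the sample page are assumptions made for this example; real search engines handle metadata in their own ways.

```python
from html.parser import HTMLParser

class KeywordMetaParser(HTMLParser):
    """Collect the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split on commas, trim whitespace, and normalize to lowercase.
            self.keywords.extend(
                term.strip().lower() for term in content.split(",") if term.strip()
            )

def extract_keywords(html):
    parser = KeywordMetaParser()
    parser.feed(html)
    return parser.keywords

# Hypothetical page, for illustration only.
page = (
    '<html><head>'
    '<meta name="keywords" content="web indexing, PageRank, HITS">'
    '</head><body></body></html>'
)
keywords = extract_keywords(page)
print(keywords)
```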
Web Indexing: The Most Famous Tool • HTML Indexer, by Brown Inc. • http://www.html-indexer.com/index.html
Web Indexing in Search Engine • Phases of work of a Web search engine • Document gathering • Document indexing • Searching in response to a query • Visualization of search results • [Figure: search engine pipeline; labels: The Web, Gathering, Parse, Indexing, Query, Rank or Match, Visualization]
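The phases above can be sketched with a toy inverted index. This is a simplified illustration, not how any particular search engine works: the `pages` dictionary stands in for the gathering phase, and the names `parse`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web" standing in for the document gathering phase.
pages = {
    "a.html": "<h1>Web indexing</h1><p>Indexes help users find information.</p>",
    "b.html": "<h1>PageRank</h1><p>Google ranks web pages by links.</p>",
}

def parse(html):
    """Parsing: drop HTML tags and return lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexing: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for url, html in docs.items():
        for term in parse(html):
            index[term].add(url)
    return index

def search(index, query):
    """Matching: return documents that contain every query term."""
    terms = parse(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)

index = build_index(pages)
# Visualization: here simply a printed list of matching documents.
for query in ("web", "web indexing"):
    print(query, "->", search(index, query))
```

This sketch only matches documents; a ranking step would order the matches before they are shown.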
Web Indexing in Search Engine • Almost every Web search engine uses a slightly different technique • Parsing discards some HTML markup • Some give different weights to terms in different HTML fields • Some do not index the full text of the document, but only part of it • Some make full use of “metadata” • Very few make use of the link structure between pages; the examples here are HITS and PageRank (Google)
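As a rough illustration of giving different weights to terms in different fields, the sketch below scores a document by counting query terms per field and multiplying by a field weight. The weights in `FIELD_WEIGHTS`, the field names, and the sample document are assumptions chosen for the example, not values used by any real engine.

```python
import re
from collections import Counter

# Assumed weights: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(document, query):
    """Sum weighted occurrences of the query terms across the fields."""
    query_terms = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(document.get(field, "")))
        total += weight * sum(counts[term] for term in query_terms)
    return total

# Hypothetical document already split into fields, for illustration only.
doc = {
    "title": "Web indexing",
    "keywords": "indexing, search",
    "body": "Indexing helps users find pages on the web.",
}

print(score(doc, "indexing"))
print(score(doc, "web"))
```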
Web Indexing in Google • PageRank • Google assigns a number called the PageRank to every web page that it knows about. • Assumption: A page is important if other important web pages link to it • Each page = a node • Directed edge = a link from one page to another • [Figure: example link graph with nodes labeled Main Page, Google, This Page, Yahoo]
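Web Indexing in Google • PageRank: Example • Assumption: an average page has a PageRank of 1, so the three ranks sum to 3 • Each page divides its rank equally among its outgoing links • R1 = R3 • R2 = R1 / 2 • R3 = R1 / 2 + R2 • Hence R1 = 2 R2 and R3 = R1 • R1 + R2 + R3 = 3 • Solution: R1 = 1.2, R2 = 0.6, R3 = 1.2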
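The example can be checked numerically with a short power iteration. The link structure below is inferred from the equations on the slide (page 1 links to pages 2 and 3, page 2 links to page 3, page 3 links to page 1). Like the slide, the sketch uses the simplified model without a damping factor, and it assumes every page has at least one outgoing link. The function name `pagerank` is hypothetical.

```python
def pagerank(links, iterations=100):
    """Simplified PageRank: each page splits its rank equally among its links.

    Every page starts with rank 1, so the total always equals the number
    of pages. No damping factor is applied, and every page is assumed to
    have at least one outgoing link.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Link structure inferred from the slide's equations.
links = {"R1": ["R2", "R3"], "R2": ["R3"], "R3": ["R1"]}

ranks = pagerank(links)
for page, rank in ranks.items():
    print(page, round(rank, 3))
```

The iteration settles on the same values as solving the equations by hand: 1.2 for R1 and R3, and 0.6 for R2.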
Web Indexing in Google • HITS (Hyperlink-Induced Topic Search) • Divides pages relating to a topic into two groups • Authorities: pages with good content about a topic • Hubs: pages that link to many authority pages on a topic (directories) • Iteratively calculates hub and authority scores for each page in the neighborhood and ranks results accordingly • A document that many pages point to is a good authority • A document that points to many authorities is a good hub; pointing to many good authorities makes for an even better hub
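The iterative hub and authority calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hits`, the fixed iteration count, the Euclidean normalization, and the sample link graph are all choices made for this example rather than details given on the slide.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {page: 1.0 for page in pages}
    auths = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auths = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auths[target] += hubs[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hubs = {
            page: sum(auths[target] for target in links.get(page, []))
            for page in pages
        }
        # Normalize both score sets so the values stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hubs, auths

# Hypothetical neighborhood of pages on one topic, for illustration only.
links = {
    "directory": ["siteA", "siteB"],
    "blog": ["siteA"],
    "siteA": [],
    "siteB": [],
}

hubs, auths = hits(links)
print("best hub:", max(hubs, key=hubs.get))
print("best authority:", max(auths, key=auths.get))
```

In this small graph the page that links to both sites ends up as the strongest hub, and the site with the most incoming links ends up as the strongest authority.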
Summary • Web Indexing • Web Indexing Styles • Back-of-the-Book Style Web Indexing • Metadata and Web Indexing • Web Indexing Techniques in Google • HITS • PageRank
References • News • http://news.com.com/2100-1032_3-5730744.html • Definition • http://www.marisol.com/websiteindexing.html • http://taxonomist.tripod.com/indexing/paperless.html • http://en.wikipedia.org/wiki/Web_indexing • Tools • www.stcsig.org/idx/articles/webindexing.pdf • Theory • http://amath.colorado.edu/outreach/demos/hshi/2001Sum/pagerank.html • http://www.cis.strath.ac.uk/~fabioc/04-mia/lects/11.pdf