
Searching the Web (CS3352)


Presentation Transcript


    Slide 1:Searching the Web

    CS3352

    Slide 2:Searching the Web

    Three forms of searching:
    - Specific queries → encyclopaedias, libraries; exploit the hyperlink structure
    - Broad queries → web directories, which classify web documents by subject
    - Vague queries → search engines, which index portions of the web

    Slide 3:Problem with the data

    - Distributed data
    - High percentage of volatile data – much is generated dynamically
    - Large volume: in June 2000, Google's full-text index covered 560 million URLs
    - Unstructured data: GIFs, PDFs, etc.
    - Redundant data: mirrors (about 30% of pages are near duplicates)
    - Quality of data: false, poorly written, invalid, mis-spelt
    - Heterogeneous data: media, formats, languages, alphabets
    (Speaker note: no point in giving web statistics, because they will be out of date the minute I give them.)

    Slide 4:Users and the Web

    How to specify a query? How to interpret answers?
    - Ranking
    - Relevance selection
    - Summary presentation
    - Large-document presentation
    Main purposes: research, leisure, business, education.
    - 80% of users do not modify their query
    - 85% look at the first screen of results only
    - 64% of queries are unique
    - 25% of users use single keywords – a problem for polysemous words and synonyms

    Slide 5:Web search

    - All queries answered without accessing texts – by indices alone
    - Local copies of web pages are expensive (Google cache); remote page access is unrealistic
    - Links: link topology, link popularity, link quality, who links to whom
    - Page structure: words in headings weigh more than words in body text, etc.
    - Sites: sub-collections of documents, mirror-site detection
    - Names; presenting summaries; community identification
    - Indexing: refresh rate, similarity engine, ranking scheme, caching and popularity measures

    Slide 6:Spamming

    Most search engines have rules against invisible text, meta-tag abuse and heavy repetition. "Domain spam" is the overt submission of "mirror" sites in an attempt to dominate the listings for particular terms.

    Slide 7:Excite Spamming

    Excite screens out spamming before adding a page to its web page index. If it finds a string of words such as:
    money money money money money money money
    it will replace the excess repetition, so that the string essentially becomes:
    money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx
    The more unusual repetition Excite detects, the more heavily it penalises a page. Excite does not penalise the use of hidden text as such, but penalties apply if hidden text is used to disguise spam content. Excite also penalises "domain spam".
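    The repetition-collapsing idea above can be sketched as a small filter. This is a hypothetical illustration of the behaviour described on the slide, not Excite's actual code; the function name and mask token are made up.

```python
def collapse_repetition(text, max_repeats=1, mask="xxxxx"):
    """Replace excess consecutive repeats of a word with a mask token.

    A sketch of the kind of spam filtering described above: the first
    occurrence in a run is kept, the rest are masked out.
    """
    out, run_word, run_len = [], None, 0
    for w in text.split():
        if w.lower() == run_word:
            run_len += 1
        else:
            run_word, run_len = w.lower(), 1
        out.append(w if run_len <= max_repeats else mask)
    return " ".join(out)

print(collapse_repetition("money money money money money money money"))
# -> money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx
```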

    Slide 8:Centralised architecture

    Crawler-indexer architecture (used by most search engines):
    - Crawler (also called robot, spider, wanderer, walker, knowbot): a program that traverses the web to send new or updated pages to a main server, where they are indexed
    - Crawlers run on a local server and send requests to remote servers
    - A centralised index is used to answer queries
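    The traversal at the heart of a crawler-indexer can be sketched as a breadth-first walk. This is a minimal illustration, not a production crawler: the `fetch` and `extract_links` callbacks stand in for real HTTP requests to remote servers and real HTML link extraction, and the toy in-memory "web" below is invented for the example.

```python
from collections import deque

def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawler sketch: fetch each page, hand it to the
    index, then follow its links, never visiting a URL twice."""
    seen, frontier, indexed = {start_url}, deque([start_url]), {}
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        indexed[url] = page          # "send to the main server" step
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed

# Toy in-memory "web": url -> (text, outgoing links).
web = {"a": ("page a", ["b", "c"]), "b": ("page b", ["c"]), "c": ("page c", [])}
index = crawl("a", fetch=lambda u: web.get(u), extract_links=lambda p: p[1])
print(sorted(index))  # -> ['a', 'b', 'c']
```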

    Slide 9:Example (AltaVista)

    1998: 20 multi-processor machines, 130 GB of RAM, 500 GB of disk space

    Slide 10:Distributed architecture

    Harvest (harvest.transarc.com):
    - Gatherers: collect and extract indexing information from one or more web servers at periodic intervals
    - Brokers: provide the indexing mechanism and query interface to the gathered data; retrieve information from gatherers or other brokers, incrementally updating their indices

    Slide 11:Harvest architecture

    Slide 12:Ranking algorithms

    Variations of the Boolean and vector-space models: TF × IDF plus hyperlinks between pages
    - pages pointed to by a retrieved page
    - pages that point to a retrieved page
    - Popularity: number of hyperlinks to a page
    - Relatedness: number of hyperlinks in common between pages, or pages referenced by the same pages
    Examples: WebQuery, PageRank (Google), HITS (Clever)
    (Speaker note: not very much information available – intellectual property! Difficult to compare. Recall?)
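    One way the "TF × IDF plus hyperlinks" combination could look is a text score with an additive link-popularity boost. This is a hypothetical formula for illustration only; as the slide notes, the real weighting schemes are trade secrets, and the weight below is invented.

```python
import math

def score(tf, df, n_docs, inlinks, link_weight=1.0):
    """Hypothetical ranking score: classic TF x IDF for the text match,
    plus a logarithmic boost for the number of incoming hyperlinks."""
    tfidf = tf * math.log(n_docs / df)          # text-matching component
    popularity = math.log(1 + inlinks)          # link-popularity component
    return tfidf + link_weight * popularity

# A page matching a term twice but with 100 in-links can outrank
# a page matching three times with no in-links.
print(score(tf=2, df=50, n_docs=1000, inlinks=100) >
      score(tf=3, df=50, n_docs=1000, inlinks=0))  # -> True
```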

    Slide 13:Lets Use Links!

    "... are scattered over the Internet with little structure, making it difficult for a person in the center of this electronic clutter to find only the information desired. Although this diagram shows just hundreds of pages, the World Wide Web currently contains more than 300 million of them. Nevertheless, an analysis of the way in which certain pages are linked to one another can reveal a hidden order."
    From an article in Scientific American on Clever, which implements HITS: http://www.sciam.com/1999/0699issue/0699raghavan.html

    Slide 14:Metasearch

    A web server that sends the query to several search engines, web directories and databases, collects the results and unifies them (data fusion).
    Aim: better coverage.
    Issues:
    - translation of the query
    - uniform results (fusing rankings, e.g. pages retrieved by several engines)
    - wrappers
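    The rank-fusion step can be sketched with a simple Borda count: each engine's ranking awards points by position, so pages retrieved by several engines accumulate more. This is one common fusion heuristic shown for illustration, not the method of any particular metasearcher; the engine result lists are invented.

```python
def borda_fuse(rankings):
    """Fuse several ranked result lists: position k in a list of
    length n earns n - k points; final order is by total points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, url in enumerate(ranking):
            scores[url] = scores.get(url, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["p1", "p2", "p3"]
engine_b = ["p2", "p4", "p1"]
# p2 and p1 rise to the top because both engines returned them.
print(borda_fuse([engine_a, engine_b]))  # -> ['p2', 'p1', 'p4', 'p3']
```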

    Slide 15:Google

    - The best web engine: comprehensive and relevant results
    - Biggest index: 580 million pages visited and recorded; link data is used to reach another 500 million pages
    - Different kinds of index: smaller indexes contain a higher proportion of the web's most popular pages, as determined by Google's link-analysis system
    - Index refresh: updated monthly/weekly; daily for popular pages
    - Serves queries from three data centres: two on the West Coast of the US, one on the East Coast
    (Speaker note: the benefit for customers in using a smaller index is savings – it costs more to query against the biggest collection of documents.)

    Slide 16:Google: let this inspire you…

    Larry Page, Co-founder & Chief Executive Officer Sergey Brin, Co-founder & President PhD students at Stanford

    Slide 17:Google Overview

    Crawls the web to create its listings. Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed. Other services also use link popularity, but none to the extent that Google does.

    PageRank explained: PageRank relies on the uniquely democratic nature of the web, using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyses the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important". Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query, so Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine whether it's a good match for your query.

    Slide 18:Citation Importance Ranking

    http://hci.stanford.edu/~page/papers/pagerank/ppframe.htm

    Slide 19:Google links

    Submission: Add URL page (no need to do a "deep" submit). The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
    Crawling and index depth: Google aims to refresh its index on a monthly basis. Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks. This text is associated with the pages the links point at, making it possible for Google to find matching pages even when those pages cannot themselves be indexed.
    Pages from your site appear in about 4 to 6 weeks. Google has a fairly large index, so it should gather a significant number of your pages.

    Slide 20:Google Relevancy (1)

    Google ranks web pages based on the number, quality and content of links pointing at them (citations).
    - Number of links: all things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
    - Link quality: numbers aren't everything – a single link from an important site might be worth more than many links from relatively unknown sites.

    Slide 21:Google Relevancy (2)

    - Link content: the text in and around links relates to the page they point at. For a page to rank well for "travel", it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
    - Ranking boosts on text styles: the appearance of terms in bold text, in header text or in a large font size is all taken into account. None of these are dominant factors, but they do figure into the overall equation.

    Slide 22:PageRank

    Usage simulation and citation-importance ranking, based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others. The user navigates randomly:
    - jumps to a random page with probability p
    - follows a random hyperlink from the current page with probability 1 - p
    - never goes back to a previously visited page by following a previously traversed link backwards
    Google finds a single type of universally important page – intuitively, locations that are heavily visited in a random traversal of the web's link structure.
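    The random-surfer model above can be simulated directly: run a walker over a link graph and count how often each page is visited. This Monte Carlo sketch is an illustration of the model, not how Google computes PageRank in practice (that is done analytically, as the next slide shows); the toy graph is invented.

```python
import random

def simulate_surfer(graph, p=0.15, steps=100_000, seed=0):
    """Simulate the random surfer: with probability p jump to a random
    page, otherwise follow a random out-link from the current page.
    Returns the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = list(graph)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        links = graph[current]
        if links and rng.random() > p:
            current = rng.choice(links)     # follow a random hyperlink
        else:
            current = rng.choice(pages)     # haphazard jump (or dead end)
    return {page: v / steps for page, v in visits.items()}

# Toy web: both "a" and "b" link to "c", so the surfer visits "c" most often.
freq = simulate_surfer({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(freq, key=freq.get))  # -> c
```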

    Slide 23:PageRank

    The process is modelled by a Markov chain; the probability of being in each page is computed, with p set by the system. Let Wj be the PageRank of page j, ni the number of outgoing links on page i, and m the number of nodes in G (i.e. the number of web pages in the collection). The rank a page passes on is normalised by the number of links on that page, and the ranks are computed using an iterative method:

        Wj = p/m + (1 − p) · Σ_{(i,j) ∈ G} Wi / ni

    In practice, for each web page Google sums the scores of the other locations pointing to it. When presented with a specific query, it responds by quickly retrieving all pages containing the search text and listing them according to their preordained ranks.
    (Speaker notes: p is the probability that at each page the random surfer gets bored and requests another random page; (1 − p) is the dampening factor. By taking into account the PageRank of the pages pointing to the page we are ranking, the quality of pages is reinforced: high rankings generate higher rankings. The PageRank vector corresponds to the principal eigenvector of the normalised link matrix of the web, which is the transition matrix of the Markov chain. The PageRanks form a probability distribution over web pages, so the sum of all pages' PageRanks is 1.)
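    The iterative method can be sketched in a few lines. This is a minimal implementation of the formula above on a toy graph (invented for the example), ignoring refinements such as dangling-page handling.

```python
def pagerank(graph, p=0.15, iters=50):
    """Power iteration for W_j = p/m + (1 - p) * sum_{(i,j) in G} W_i / n_i,
    where n_i is the number of out-links of page i and m = len(graph)."""
    m = len(graph)
    rank = {page: 1 / m for page in graph}      # start from a uniform distribution
    for _ in range(iters):
        new = {page: p / m for page in graph}   # random-jump contribution
        for i, links in graph.items():
            for j in links:                     # page i spreads its rank evenly
                new[j] += (1 - p) * rank[i] / len(links)
        rank = new
    return rank

rank = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
# The ranks form a probability distribution, as the slide notes.
print(round(sum(rank.values()), 6))  # -> 1.0
```

Note that "b", which no page links to, ends up with only the baseline rank p/m, while "c", pointed to by both other pages, ranks highest.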

    Slide 24:PageRank

    [Diagram: page j receives PageRank contributions Wi1/ni1 + Wi2/ni2 + Wi3/ni3, weighted by (1 − p), from the pages i1, i2, i3 that link to it; j in turn passes rank on to the pages k1, k2 it links to.]

    Slide 25:Google Content

    Performs a full-text index of the pages it visits, gathering all visible text. It reads neither the meta keywords tag nor the meta description tag. Descriptions are formed automatically by extracting the most relevant portions of pages; if a page has no description, it is probably because Google has never actually visited it.

    Slide 26:Google Spamming

    Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques, because it goes beyond the text on pages to decide how good they are: no links, low rank. A common spam idea is to create many new pages within a site that all link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites. This is unlikely to work; do real link building instead, with non-competitive sites that are related to yours.

    Slide 27:Site identification

    Slide 28:AltaVista

    Slide 29:HITS: Hypertext Induced Topic Search

    The ranking scheme depends on the query. It considers the set S of pages that point to, or are pointed at by, pages in the answer. Implemented in IBM's Clever prototype. Scientific American article: http://www.sciam.com/1999/0699issue/0699raghavan.html
    - Authorities: should have relevant content
    - Hubs: should point to similar content

    Slide 30:HITS (2)

    - Authorities: pages in S that have many links pointing to them
    - Hubs: pages that have many outgoing links
    Positive two-way feedback:
    - better authority pages come from incoming edges from good hubs
    - better hub pages come from outgoing edges to good authorities

    Slide 31:Authorities and Hubs

    "... help to organize information on the Web, however informally and inadvertently. Authorities (blue) are sites that other Web pages happen to link to frequently on a particular topic. For the subject of human rights, for instance, the home page of Amnesty International might be one such location. Hubs (red) are sites that tend to cite many of those authorities, perhaps in a resource list or in a 'My Favorite Links' section on a personal home page."

    Slide 32:HITS two step iterative process

    - Assign initial scores to candidate hubs and authorities on a particular topic in a set of pages S.
    - Use the current guesses about the authorities to improve the estimates of hubs: locate all the best authorities, see which pages point to them, and call those pages good hubs.
    - Use the updated hub information to refine the guesses about the authorities: determine where the best hubs point most heavily and call these the good authorities.
    - Repeat until the scores converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs:

        H(p) = Σ_{u ∈ S | p → u} A(u)
        A(p) = Σ_{v ∈ S | v → p} H(v)

    Reference: Jon M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SIAM/ACM-SIGACT, 1998).
    (Speaker notes on Clever: initially, pick up a set of about 200 pages from a standard text index such as AltaVista. The system then augments these by adding all pages that link to and from those 200. In our experience, the resulting collection, called the root set, typically contains between 1,000 and 5,000 pages. To start off, we look at a set of candidate pages about a particular topic and, for each one, make our best guess about how good a hub it is and how good an authority it is. We then use these initial estimates to jump-start a two-step iterative process: first, we use the current guesses about the authorities to improve the estimates of hubs; second, we take the updated hub information to refine our guesses about the authorities. Repeating these steps several times fine-tunes the results.)
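    The two-step iteration of HITS can be sketched directly from the formulas: recompute authority scores from hub scores, then hub scores from authority scores, normalising each round. This is a minimal illustration on an invented toy graph, without the root-set construction Clever performs.

```python
def hits(graph, iters=50):
    """Iterate H(p) = sum of A(u) over pages u that p points to, and
    A(p) = sum of H(v) over pages v that point to p, with L2
    normalisation each round, until the scores settle."""
    pages = list(graph)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iters):
        auth = {q: sum(hub[p] for p in pages if q in graph[p]) for q in pages}
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for k in scores:
                scores[k] /= norm
    return hub, auth

# "h" points to both authorities, so it emerges as the best hub;
# "x" has the most (and best) incoming links, so it is the top authority.
hub, auth = hits({"h": ["x", "y"], "g": ["x"], "x": [], "y": []})
print(max(hub, key=hub.get), max(auth, key=auth.get))  # -> h x
```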

    Slide 33:HITS issues (3)

    - Restrict the set of pages to a maximum number
    - Doesn't work with non-existent, repeated or automatically generated links
    - Weight links by their surrounding content
    - Diffusion of the topic: a more general topic contains the original answer; analyse the content of each page and score it, combining link weight with page score
    - Sub-grouping of links
    - HITS is also used for web community identification

    Slide 34:Cybercommunities

    Slide 35:Google vs Clever

    Google assigns initial rankings and retains them independently of any queries, which enables faster response; it looks only in the forward direction, from link to link. Clever assembles a different root set for each search term and then prioritises those pages in the context of that particular query; it also looks backward from an authoritative page to see what locations point there. Humans are innately motivated to create hub-like content expressing their expertise on specific topics.

    Slide 36:Autonomy

    - High-performance pattern matching based on Bayesian inference networks
    - Identifies patterns in text, based on usage and term frequency, that correspond to concepts: X% probability that a document is about a subject
    - Encodes the signature, categorises it, and links it to related documents with the same signature
    - Not just a search-engine tool
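    The "X% probability that a document is about a subject" idea can be illustrated with a toy naive-Bayes calculation: combine per-word likelihoods under a subject model against a uniform background. This is purely a sketch of the style of Bayesian inference described, not Autonomy's actual representation; the word-probability table and background rate are invented.

```python
import math

def subject_probability(doc_words, subject_model, prior=0.5):
    """Toy naive-Bayes score: accumulate log-odds that the document is
    about the subject, one word at a time, then convert back to a
    probability with the logistic function."""
    background = 1 / 1000    # assumed uniform background word probability
    log_odds = math.log(prior / (1 - prior))
    for w in doc_words:
        log_odds += math.log(subject_model.get(w, background) / background)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical model of the "web search" concept.
model = {"pagerank": 0.02, "crawler": 0.01, "index": 0.01}
p = subject_probability(["pagerank", "crawler", "banana"], model)
print(p > 0.9)  # two strong concept words push the probability high -> True
```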

    Slide 37:Human powered searching

    Ask Jeeves: an answer service using human editors to build the knowledge base. Editors proactively suggest questions and watch what people are actually searching for.
    Yahoo uses humans to organise the web: human editors find sites or review submissions, then place those sites in one or more "categories" that are relevant to them.

    Slide 38:Research Issues

    - Modelling
    - Querying
    - Distributed architecture
    - Ranking
    - Indexing
    - Dynamic pages
    - Browsing
    - User interfaces
    - Duplicated data
    - Multimedia
    (Speaker note: we don't have to index everything – just popular pages.)

    Slide 39:Further reading

    http://searchenginewatch.com/ http://www.clpgh.org/clp/Libraries/search.html
