Searching the Web

Searching the Web Mark Levene (Follow the links to learn more!)

Mechanics of a Typical Search Query submitted to Google

Mechanics of a Typical Search Google results for the query

Search Engines as Information Gatekeepers • Search engines are becoming the primary entry point for discovering web pages. • Ranking of web pages influences which pages users will view. • Exclusion of a site from search engines will cut off the site from its intended audience. • The privacy policy of a search engine is important.

Search Engine Wars • The battle for domination of the web search space is heating up! • The competition is good news for users! • The way in which advertising is combined with search results is crucial! • There are serious implications if one of the search engines will manage to dominate the space!

Google • Verb “google” has become synonymous with searching for information on the web. • Has raised the bar on search quality, • Has been the most popular search engine in the last few years. • Had a very successful IPO in August 2004. • Is innovative and dynamic.

Yahoo! • Synonymous with the dot-com boom, probably the best known brand on the web. • Started off as a web directory service. • Has very strong advertising and e-commerce partnerships. • Acquired leading search engine technology in 2003.

MSN Search • Synonymous with PC software. • Remember its victory in the browser wars with Netscape. • Developed its own search engine technology only recently, officially launched in Feb. 2005. • May link web search into its next version of Windows.

Others • Ask Jeeves • Specialises in natural language question answering. • Search driven by Teoma. • Looksmart • Has its own directory service. • Search driven by Wisenut. • …

Statistics from search engine logs

Experiment with search engine query syntax • Default is AND, e.g. “computer chess” normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits. • “+chess” in a query means the user insists that “chess” be present in all hits. • “computer OR chess” means either keywords must be present in all hits. • “”computer chess”” means that the phrase “computer chess” must be present in all hits.

The most popular search keywords

Search Engine Architecture

Crawler Algorithm • A crawler is a program that traverses web pages, downloads them for indexing and follows (or harvests) the hyperlinks on the downloaded pages. • A crawler will typically start from a multitude of web pages and aims to cover as much of the indexable web as possible. • Standard algorithm used breadth-first strategy. • Focused crawlers use best-first strategy.

Search Index - Inverted File • Also store position of word in web page and info. on HTML structure.

The query engine • The interface between the search index, the user and the web. • Algorithmic details of commercial search engines kept as trade secrets. • First step is retrieval of potential results from the index. • Second step is the ranking of the results based on their “relevance” to the query.

Vector Space Model –Content Relevance

Term Frequency (TF) • Count number of occurrences of each term. • Bag of words approach • Ignore stopwords such as is, a, of, the, … • Stemming - computer is replaced by comput, as are its variants: computers, computingcomputation,computer and computed. • Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. is a chess game game computer chess programming chess

Inverse Document Frequency (IDF) • N is number of documents in the corpus. • ni is number of docs in which word i appears. • Log dampens the effect of IDF. • IDF is also number of bits to represent the term.

Ranking with TF-IDF • j – refers to document j • i – refers to word (or term) i in doc j • q – is the query which is a sequence of terms • scorej -is the score for document j given q • Rank results according to the scoring function.

Content Relevance • Phrase matching. • Synonyms. • URL analysis. • Date last updated. • Spell checking. • Home page detection.

Link Text (Anchor Text) • Include link text for a link pointing to a web page, say P, as part of the content of P • Link text is very useful in finding home pages. • Link text behaves like user queries • They act as short summaries • They often match query terms

HTML Weighting • Normal retrieval = (111101) ranking with TF-IDF • (181882) – 39.6% improvement. • (181782) – 48.3% improvement – C2, C4 and C5. • (181582) - 43.5% improvement • Meta tag text is mostly ignored by search engines

Factor in Link Metrics • Multilply by PageRank of document (web page). • We do not know exactly how Google factors in the PR, it may be that log(PR) is used.

Popularity Based Metrics • Factor in users’ opinions as represented in the query logs. • Document space modification adjusts the weights of keywords in popular pages. • Clickthrough data can also be taken into account to improve the ranking of search engine query results.

Precision and Recall • Precision is Overlap/Retrieved (first results page retrieved is most important). • Recall is Overlap/Relevant (for web search recall is related to index coverage).

Typical Recall-Precision Curve • Top-n precision – proportion of relevant for top n ranked results. • Measure top-n precision at fixed recall point for n being 0% to 100% of the ranked results.

Probabilistic IR • Basic question: What is the probability that a document, D, is relevant to a query Q? • Probability ranking assumption: If docs retrieved are ordered by decreasing probability of relevance then the overall effectiveness of the system is the best obtainable given the input documents.

Bayes Formulation of Relevance • R – relevance of D with respect to a query Q • D – document (web page) • P(R|D) – probability that a page is relevant given its description (or representation) • NR – D not relevant with respect to Q

Naïve Bayes Independence Assumption • n – the number of words in D • wi – the word in position i in D • Also assume that the probability of a word is independent of its position in the document.

Computing the probabilities • ri – number of times wi occurs in relevant docs • RW - number of words in relevant docs(counting duplicate words multiple times, since docs are bags) • nri – number of times wi occurs in relevant docs • NRW – number of words in non relevant docs • P(R) is the number of relevant docs with respect to Q. • P(NR) is the number of docs which are not relevant with respect to Q. • c – is a smoothing constant greater or equal to one

Is a Document Relevant? • Assume we have a set of training examples of relevant and non-relevant documents to compute P(wi|R) and P(wi|NR) for words wi. • The user could mark docs as R or NR. • Choose the class (R or NR) which has higher probability.

Ranking Documents • D1 is more relevant than D2 given Q, where • nD1 - the number of words in D1 • nD2 – the number of words in D2 • If we are ranking, then P(wi|R) can be approximated on the basis of a document with a weighting function such as TF-IDF.

Other types of Search Engine • Directory – e.g. Yahoo! (Open Directory) • MetaSearch – e.g. Dogpile (Mamma) • Clustering – e.g. Clusty • Question Answering – e.g. Ask JeevesWolframAlpha, True Knowledge, & Google, Yahoo and Bing • Visual – e.g. Quintura • Collaborative – e.g, Omgili,Sproose, AfterVote • Human Input – e.g. ChaCha • Social Tagging – e.g. Blekko, MrTaggy

Directions in Search • Mobile search – e.g. Google Mobile • Local search – e.g. Google Local • Video search – e.g. YouTube • Image search – e.g. Picsearch • Audio search – e.g. Yahoo Audio • Blog search – e.g. Technorati • Social bookmarking - e.g. Delicious • Some new ideas - e.g. Cuil

Paid Inclusion and Paid Placement • Paid inclusion – payment to speed up inclusion in the search index. • Pay-Per-Click (PPC) or Cost-Per-Click (CPC) – payment for being advertised on the search engine’s sponsored results list. • The sponsored list should be separated from the organic list. • PPC is a major revenue source for search engines. • Click fraud is a problem!

Behavioural Targeting • Contextual targeting is a weaker form based on a single user session. • Personalised advertising in order to increase the effectiveness of advertising. • Data collected from individual users, normally through cookies.

Pay-Per-Action (PPA) • Charge the advertiser only when an action takes place such as a purchase, a download or any other trackable action. • Ad network will require the advertiser to place a script in the web page triggering the action. • Some level of trust is needed.

Searching the Web