Web Information Retrieval Evolution: Challenges, Search Engines, and Tools

Advanced information retrieval Computer engineering department Chapter 13 – Web IR

Searching the web Introduction (History, Definitions, Aims, Statistics) Tasks of search engines - Gathering - Indexing - “Searching” (Querying and ranking algorithms) - Document and query management Metasearch Browsing and web directory Users and web search Research issues

Web Searching and Classical IR TREC Classic IR research 1970s 1980s 1990s 2000s then came the web web searching

Terminology and Definitions • A web page corresponds to a document in traditional IR • Web pages are different in their size and in the types of files that constitute them • text, graphics, sound, video, GIF, JPEG, ASCII,PDF, etc • IR on the web considers as collection of documents the part of the web that is publicly indexable • exclude pages that cannot be indexed (authorization, dynamic pages) • Location of web pages by navigation • Location of web pages by searching (IR on the web)

Challenges for web search • Problem with the data • Distributed data • High percentage of volatile data • Large volume • Unstructured data • Redundant data • Quality of data • Heterogeneous data • Problem faced by users • How to specify a query? • How to interpret answers?

Search engines • Portals • identification of real names • links to books from Amazon.com • send electronic postcards • translation to other languages • search of other media (metadata) • language-specific searches • weather, stock price, traffic • Business model • targeted advertising with revenue from clickthroughs • word of mouth (since no industry standard evaluation) • fast response (<1s) 24 hours / 7 days a week • filter out spam

Examples Search engines URL AltaVista www.altavista.com Excite www.excite.com Google www.google.com Infoseek www.infoseek.com Lycos www.lycos.com NorthernLight www.nlsearch.com • 25-55% of web covered only! • Some search engines powered by same IR engine • Most based in US, English

Other search engines • Specialised in different countries/languages (e.g., http://www.iol.it/) • Specific • Rank according to popularity (e.g., DirectHit) • Topic oriented (e.g., SearchBroker) • Personnel or institutional pages, electronic mail addresses, images, software applets

Tasks of a web search engine • Document gathering • select the documents to be indexed • Document indexing • represent the content of the selected documents • often 2 indices maintained (full + small for frequent queries) • Searching • represent the user information need into a query • retrieval process (search algorithms, ranking of web pages) • Document and query management • display the results • virtual collection (documents discarded after indexing) vs. physical collection (documents maintained after indexing)

Document gathering • Document gathering = crawling the web • Crawler • Robot, spider, wanderer, walker, knowbot, web search agent • Program that traverses the web to send new or updated pages to be indexed • Run on local server and send requests to remote servers

Crawling the web (1) • Crawling process • Start with set of URLs • Submitted by users or companies • Popular URLs • Breath-first or depth-first • Extract further URLs • Up to 10 millions pages per day • Several crawlers • Problem of redundancy • Web partition  robot per partition

Crawling the web (2) • Up-to-date? • Non-submitted pages need up to 2 months to be indexed • Search engine learns change frequency of pages • Popular pages (having many links to them) crawled more frequently • Guideline for robot behaviours • File placed at root of web server • Indicate web pages that should not be indexed • Avoid overloading servers/sites

Typical anatomy of a large-scale crawler.

Document indexing • Document indexing = building the indices • Indices are variant of inverted files • meta tag analysis • stop words removal + stemming • position data (for phrase searches) • weights • tf x idf; • downweight long URLs (not important page) • upweight terms appearing at the top of the documents, or emphasised terms • use de-spamming techniques • hyperlink information • count link popularity • anchor text from source links • hub and authority value of a page

Maintaining indices over dynamic collections.

Stop-press index • Collection of document in flux • Model document modification as deletion followed by insertion • Documents in flux represented by a signed record (d,t,s) • “s” specifies if “d” has been deleted or inserted. • Getting the final answer to a query • Main index returns a document set D0. • Stop-press index returns two document sets • D+ : documents not yet indexed in D0 matching the query • D- : documents matching the query removed from the collection since D0 was constructed. • Stop-press index getting too large • Rebuild the main index • signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records • Stop-press index can be emptied out.

Searching • Querying • 1 word or all words must be in the retrieved pages • normalisation (stop words removal, stemming, etc) • complex queries (date, structure, region, etc) • Boolean expressions (advanced search) • metadata • Ranking algorithms • use of web links • web page authority analysis • HITS (Hyperlink Induced Topic Search) • PageRank (Google)

Meta-search systems • Take the search engine to the document • Forward queries to many geographically distributed repositories • Each has its own search service • Consolidate their responses. • Advantages • Perform non-trivial query rewriting • Suit a single user query to many search engines with different query syntax • Surprisingly small overlap between crawls • Consolidating responses • Function goes beyond just eliminating duplicates • Search services do not provide standard ranks which can be combined meaningfully

Similarity search • Cluster hypothesis • Documents similar to relevant documents are also likely to be relevant • Handling “find similar” queries • Replication or duplication of pages • Mirroring of sites

Document similarity • Jaccard coefficient of similarity between document and • T(d) = set of tokens in document d • . • Symmetric, reflexive, not a metric • Forgives any number of occurrences and any permutations of the terms. • is a metric

Finding near duplicates algorithm

Use of web links • Web link: represent a relationship between the connected pages • The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links • web links are source of evidence but also source of noise • classical IR: citation-based IR • web track in TREC, 2000, TREC-9: Small Web task (2GB of web data); Large Web task (100GB of web data, 18.5 million documents)

Use of anchor text • represent referenced document • why? • provides more accurate and concise description than the page itself • (probably) contains more significant terms than the page itself • used by ‘WWW Worm’ - one of the first search engines 1994 • representation of images, programs, … • generate page descriptions from anchor text

Algorithms • Query independent page quality • global analysis • PageRank (Google): simulates a random walk across the web and computes the “score” of a page as the probability of reaching the page • Query dependent page quality • local analysis • HITS (Hyperlink Induced Topic Search): focusses on broad topic queries that are likely to be answered with too many pages • the more a page is pointed to by other pages, the more popular is the page • popular pages are more likely to include relevant information than non-popular pages

PageRank (1) • Designed by Brin and Page at Stanford University • Used to implement Google • Main idea: • a page has a high rank if the sum of the ranks of its in-links is high • in-link of page p: a link from a page to page p • out-link of a page p: a link from page p to a page • a high PageRank page has many in-links or few highly ranked in-links • Retrieval: use cosine product (content, feature, term weight) combined with PageRank value

PageRank (2) • Random Surfer Model : user randomly navigates • Initially the surfer is at a random page • At each step the surfer proceeds • to a randomly chosen Web page with probability d called the “damping factor” (e.g. probability of random jump = 0.2) • to a randomly chosen page linked to the current page with probability 1-d (e.g. probability of following a random outlink = 0.8) • Process modelled by Markov Chain • PageRank PR of a page a = probability that the surfer is at page a on a given time PR(a) = Kd + K(1-d) i=1,n PR(ai)/C(ai) d set by system a = page pointed by ai for i=1,n K normalisation factor C(ai) = number of outlinks of ai

HITS: Hypertext Induced Topic Search • Originated from Kleinberg, 1997 • Also referred to as the “The Connectivity Analysis Approach” • Broad topic queries produce large sets of retrieved results • abundance problem  too many relevant documents • new type of quality measure needed  distinguish the most “authoritative” pages  high-quality response to a broad query • HITS: for a certain topic, it identifies • good authorities • pages that contain relevant information (good sources of content) • good hubs • page that point to useful pages (good sources of links)

HITS (2) • Intuition • authority comes from inlinks • being a good hub comes from outlinks • better authority comes from inlinks from good hubs • being a better hub comes from outlinks to good authorities • Mutual reinforcement between hubs and authorities • a good authority page is pointed to by many hub pages • a good hub page point to many authority pages

HITS (3) • Set of pages S that are retrieved • sometimes k (e.g. k = 200) top-ranked pages • Set of pages T that point to or are pointed to by retrieved set of pages S • rank pages according to in_degree (number of in-links) - not effective • Set of pages T : • authoritative pages relevant to query should have a large in_degree • considerable overlap in the sets of pages that point to them  hubs

Algorithm for HITS (General Principle) • Computation of hub and authority value of a page through the iterative propagation of “authority weight” and “hub weight” • Initially all values equal to 1 • Authority weight of page x(p) • if p is pointed to by many pages with large y-values,then it should receive a large x-value x(p) = qip y(qi) • Hub weight of page y(p) • if p points to many pages with large x-values, then it should receive a large y-value y(p) = pqi x(qi) • After each computation (iteration), weights are normalised

Topic distillation • Process of finding ‘quality’ (authority - hub) Web documents related to a query topic, given an initial user information need. • Extensions of HITS • ARC (Automatic Resource Compilation) • distance-2 neighbourhood graph • anchor (& surrounding) text used in computation of hub and authority values • SALSA, etc. • Problems with HITS • mutual reinforcing relationship between hosts • automatically generated links • non relevant highly connected pages • topic drift: generalisation of the query topic

Difference between PageRank and HITS • The PageRank is computed for all web pages stored in the database and then prior to the query; HITS is performed on the set of retrieved web pages, and for each query. • HITS computes authorities and hubs; PageRank computes authorities only. • PageRank: non-trivial to compute, HITS: easy to compute, but real-time execution is hard • Implementation details of PageRank have been reported

Document and query management • Results • Usually screens of 10 pages • Clustering • URL, size, date, abstract, etc • Various sorting • Most similar documents options • Query refinement • Virtual collection vs. physical collection • document can change over time • document may be different to the one that has been indexed • broken link

Metasearch (1) • Problems of Web search engines: • limited coverage of the publicly indexable Web • index different overlapping sections of the Web • based on different IR models  different results to the same query  users do not have the time, knowledge to select the most appropriate search engines with regard to their information need • Possible solution: metasearch engines • Web server that sends query to • Several search engines, Web directories, Databases • Collect results • Unify (merge) them - Data fusion • Aim: better coverage, increase effectiveness

Metasearch (2) • Divided into phases • search engine selection • topic-dependent, past queries, network traffic, … • document selection • how many documents from each search engine? • merging algorithm • utilise rank positions, document retrieval scores, … Metasearcher URL Sources used MetaCrawler www.metacrawler.com 13 Dogpile www.dogpile.com 25 SavvySearch www.search.com > 1000

MetaCrawler

Browsing • Web directory • Catalog, yellow pages, subject categories • Many standard search engines provide subject categories now • Hierarchical taxonomy that classifies human knowledge Arts & Humanities Investing Automotive Kids & Family Business & Economy Life & Style Computers & Internet News Education People Employment Philosophy & Religion Entertainment & Leisure Politics Games Science & Technologies Government Social Science …

Yahoo! • Around one million classified pages • More than 14 country specific directories (Chinese, Danish, French, …) • Manual classification (has its problem) • Acyclic graphs • Search within the classified pages/taxonomy • Problem of coverage

Users and the web (1) • Queries on the web: average values Measure Average value Range Number of words 2.35 0 - 393 Number of operators 0.41 0 - 958 Repetitions of queries 3.97 1 - 1.5 million Queries per user session 2.02 1 - 173325 Screens per query 1.39 1 - 78496

Users and the web (2) • Some statistics • Main purpose: research, leisure, business, education, • products and services (e-commerce) • people and company names and home pages • factoids (from any one of a number of documents) • entire, broad documents • mp3, image, video, audio • 80% do not modify query • 85% look first screen only • 64% queries are unique • 25% users use single keywords (problem for polysemic words and synonyms) • 10% queries are empty

Users and the web (3) • Results of search engines are identical, independent of • user • context in which the user made the request • adding context information for improving search results  focus on the user need and answer it directly • explicit context • query + category • implicit context • based on documents edited or viewed by user • personalised search • previous requests and interests, user profile

Motivation • Problem: Query word could be ambiguous: • Eg: Query“Star” retrieves documents about astronomy, plants, animals etc. • Solution: Visualisation • Clustering document responses to queries along lines of different topics. • Problem 2: Manual construction of topic hierarchies and taxonomies • Solution: • Preliminary clustering of large samples of web documents. • Problem 3: Speeding up similarity search • Solution: • Restrict the search for documents similar to a query to most representative cluster(s).

Example Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)

Clustering • Task :Evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters. • Cluster Hypothesis:Given a `suitable‘ clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/tbelongs. • Similarity measures • Represent documents by TFIDF vectors • Distance between document vectors • Cosine of angle between document vectors

Clustering: Parameters • Similarity measure: (eg: cosine similarity) • Distance measure:(eg: eucledian distance) • Number “k” of clusters • Issues • Large number of noisy dimensions • Notion of noise is application dependent

Clustering: Formal specification • Partitioning Approaches • Bottom-up clustering • Top-down clustering • Geometric Embedding Approaches • Self-organization map • Multidimensional scaling • Latent semantic indexing • Generative models and probabilistic approaches • Single topic per document • Documents correspond to mixtures of multiple topics

Partitioning Approaches • Partition document collection into k clusters • Choices: • Minimize intra-cluster distance • Maximize intra-cluster semblance • If cluster representations are available • Minimize • Maximize • Soft clustering • d assigned to with `confidence’ • Find so as to minimize or maximize • Two ways to get partitions - bottom-up clustering and top-down clustering

Bottom-up clustering(HAC) • Initially G is a collection of singleton groups, each with one document • Repeat • Find ,  in G with max similarity measure, s() • Merge group  with group  • For each  keep track of best  • Use above info to plot the hierarchical merging process (DENDOGRAM) • To get desired number of clusters: cut across any level of the dendogram

Dendogram A dendogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure • Typically s() decreases with increasing number of merges • Self-Similarity • Average pair wise similarity between documents in  • = inter-document similarity measure (say cosine of tfidf vectors) • Other criteria: Maximium/Minimum pair wise similarity between documents in the clusters

Web Information Retrieval Evolution: Challenges, Search Engines, and Tools

Web Information Retrieval Evolution: Challenges, Search Engines, and Tools

Presentation Transcript

Information retrieval

Information Retrieval

Information Retrieval

Information Retrieval

Advanced Information- Retrieval Models

Text Information Retrieval and Applications – Advanced Topics

FYP0202 Advanced Audio Information Retrieval System

Information Retrieval

Advanced information retrieval

Information Retrieval

Information Retrieval

Information Retrieval

information retrieval

Information Retrieval