Chapter 2 : The Web and the Problem of Search • The size of the web, and how it is measured. • Search engine usage statistics. • The bow-tie structure of the web. • The small-world web. • Web information seeking strategies. • A taxonomy of web searches. • Web search versus Information Retrieval. • Differences between global and local search. • Differences between search and navigation.
Web size statistics • Number of accessible web pages – latest estimate, May 2005, 11.5 billion. • The deep (or hidden or invisible) web contains 400-550 times more information. • Coverage (i.e. the proportion of the web indexed) is crucial for search engines.
Measuring the size of the web • Capture-recapture method • SE1 is the number of pages indexed by the first search engine. • QSE2 is the number of pages returned by the second search engine for typical queries. • OVR is the number of pages returned by both search engines for those queries. • Estimate = (SE1 x QSE2)/OVR • Estimate of 64.81 million web sites as of June 2005.
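The capture-recapture estimate above is a single ratio, so a minimal sketch may help make the arithmetic concrete. The function name `estimate_web_size` and all the numbers in the usage lines are hypothetical, chosen only for illustration; they are not measurements from the slides.

```python
def estimate_web_size(se1: int, qse2: int, ovr: int) -> float:
    """Capture-recapture estimate: (SE1 x QSE2) / OVR.

    se1  -- pages indexed by the first search engine
    qse2 -- pages returned by the second engine for the sample queries
    ovr  -- pages returned by both engines for those queries
    """
    if ovr <= 0:
        # With no overlap the ratio is undefined, so refuse to guess.
        raise ValueError("overlap (OVR) must be positive")
    return se1 * qse2 / ovr


# Hypothetical figures: the first engine indexes 1000 pages, the second
# returns 200 pages for the sample queries, and 50 of those are also
# found in the first engine.
estimate = estimate_web_size(1000, 200, 50)
print(estimate)
```

The intuition is that OVR/QSE2 approximates the fraction of the web that the first engine covers, so dividing SE1 by that fraction scales its index size up to the whole web.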
Web usage statistics • Over 10% of the world’s population were online as of late 2004. • Number of broadband users is growing (over 50% of connected Americans use broadband). • Search engine usage as of June 2004: • Google (41.6%), Yahoo! (31.5%), MSN (27.4%), AOL (13.6%), Ask Jeeves (7%) • 200 million hits per day to Google (mid 2004).
Tabular Data versus Web Data Figure 2.1: A database table versus a web site
Structure of the web Figure 2.2: Map of the Internet (1998)
Structure of the web Figure 2.3: Web pages related to dcs.bbk.ac.uk (see www.touchgraph.com)
Structure of the web Figure 2.4: Bow-tie shape of the web
The small-world web • Over 75% of the time there is no directed path from one random web page to another. • When a directed path exists, its average length is 16 clicks. • When an undirected path exists, its average length is 7 clicks. • A short average path length between pairs of nodes is characteristic of a small-world network.
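The directed and undirected path statistics above can be illustrated on a toy link graph. The sketch below uses breadth-first search to count how many ordered page pairs are connected and the average path length among those that are. The graph `links` and the helper names are assumptions made up for this example; the figures it produces say nothing about the real web.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest click distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_stats(graph):
    """Return (fraction of ordered pairs with a path, average path length)."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    pairs = connected = total = 0
    for s in nodes:
        dist = bfs_distances(graph, s)
        for t in nodes:
            if s == t:
                continue
            pairs += 1
            if t in dist:
                connected += 1
                total += dist[t]
    average = total / connected if connected else float("nan")
    return connected / pairs, average


def undirected(graph):
    """Ignore link direction by adding every link in both directions."""
    sym = {}
    for u, targets in graph.items():
        sym.setdefault(u, set())
        for v in targets:
            sym[u].add(v)
            sym.setdefault(v, set()).add(u)
    return sym


# Hypothetical five-page site: a cycle a -> b -> c -> a, a page d that
# links in, and a page e that is only linked to.
links = {"a": ["b"], "b": ["c"], "c": ["a", "e"], "d": ["a"]}

directed_fraction, directed_average = path_stats(links)
undirected_fraction, undirected_average = path_stats(undirected(links))
print(directed_fraction, directed_average)
print(undirected_fraction, undirected_average)
```

Even on this tiny graph, following links in their given direction leaves some pairs unreachable, while ignoring direction connects every pair and shortens the average path, which mirrors the contrast between the 16-click and 7-click figures.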
Web information seeking strategies • Direct navigation • Enter the URL directly into the browser. • Navigation within a directory • Use a web portal as an entry point to the web. • Information seeking on the web is problematic and more users are turning to search engines.
Navigation using a search engine Figure 2.5: Information seeking
A taxonomy of web searches • Informational – acquire some information about a topic from web pages. • Navigational – find a site to start navigation from. • Transactional – perform some activity mediated by a web site.
Web search versus Information Retrieval • The scale of web search is far beyond that of traditional information retrieval. • The web is very dynamic. • The web contains an enormous amount of duplication. • The quality of web pages is not uniform. • The range of topics on the web is open. • The web is globally distributed. • Users' typical habits are different (short queries, inspecting only the top-10 results). • The web is hypertextual.
Information retrieval evaluation Figure 2.6: Recall versus precision
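For readers without the figure, a minimal sketch of the two measures may help. It uses the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved -- documents the search engine returned
    relevant  -- documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four results returned, five relevant documents
# exist, and two of the returned results are relevant.
precision, recall = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d3", "d5", "d6", "d7"],
)
print(precision, recall)
```

Returning more results tends to raise recall while lowering precision, which is the trade-off a recall versus precision plot shows.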
Differences between global and local search • Local search engines on web sites have a bad reputation. • Users often use a web search engine such as Google or Yahoo! to find information on web sites, rather than the local web site search engine. • Many companies do not invest in local search. • Content management is a problem. • Language may be a problem. • Information needs on web sites may be different.
Differences between search and navigation • Search – employing a search engine to find information. • Navigation (or surfing) – employing a link-following strategy to find information. • The web encourages a combination of search, navigation and browsing.