Web Science: Searching the web

Basic Terms
• Search engine
  • Software that finds information on the Internet or World Wide Web
• Web crawler
  • An automated program that browses the web and indexes and/or copies the pages it visits
  • Also known as bots, web spiders, or web robots
• Meta tag
  • Extra information that tags the HTML document
  • <meta name="keywords" content="HTML,CSS,XML,JavaScript">
• Hyperlink (link)
  • A reference to another web page

How do you evaluate a search engine?
• Time taken to return results
• Number of results
• Quality of results

How does a web crawler work?
1. Start at a webpage
2. Download the HTML content
3. Search for the HTML link tags <a href="URL"></a>
4. Repeat steps 2-3 for each of the links
5. When a website has been completely indexed, load and crawl other websites
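The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LinkParser`, `crawl`, and the in-memory `FAKE_WEB` site are hypothetical names, and the `fetch` argument stands in for a real HTTP download so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page downloaded."""
    seen = {start_url}
    queue = deque([start_url])           # start at a webpage
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)                # download the HTML content
        if html is None:                 # page could not be fetched
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)                # search for the <a href="URL"> tags
        for href in parser.links:
            link = urljoin(url, href)    # resolve relative links
            if link not in seen:         # repeat for each link not yet seen
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used in place of real network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/a">A</a> <a href="/missing">?</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, which the slide's steps leave implicit.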

Parallel Web Crawling
• Speed up your web crawling by running on multiple computers at the same time (i.e. parallel computing)
• How often should you crawl the entire Internet?
• How many copies of the Internet should you keep?
• What are the different ways to index a webpage?
  • Meta keywords
  • Content
  • Page rank (# links to page)
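As a rough illustration of the speed-up idea, the sketch below downloads a batch of pages concurrently. It assumes worker threads on one machine as a stand-in for multiple computers; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` replaces a real network download.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=4):
    """Fetch every URL concurrently; returns {url: result of fetch(url)}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch, urls)))


# Hypothetical stand-in for a network download.
def fake_fetch(url):
    return f"<html>content of {url}</html>"


frontier = [f"http://example.com/page{i}" for i in range(8)]
results = fetch_all(frontier, fake_fetch)
print(len(results))
```

A crawler spread across several computers would also have to divide the list of URLs between machines so that no page is fetched twice; that coordination is not shown here.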

Basic Search Engine Algorithm
• Crawl the Internet
• Save meta keywords for every page
• Save the content and popular words on the page
• When somebody needs to find something, search for matching keywords or content words
Problem:
• Nothing stops you from inserting your own keywords or content that do not relate to the page’s *actual* content
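A minimal sketch of this keyword-matching approach, assuming each crawled page has already been reduced to a list of meta keywords and a content string. The names `build_index`, `search`, and the two sample pages are hypothetical. The second page shows the problem noted above: its keywords have nothing to do with its content, yet it still matches.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose keywords or content contain it."""
    index = defaultdict(set)
    for url, page in pages.items():
        words = set(page["keywords"])
        words.update(re.findall(r"[a-z0-9]+", page["content"].lower()))
        for word in words:
            index[word.lower()].add(url)
    return index


def search(index, query):
    """Return the URLs that match every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


# Hypothetical crawled pages; the second one stuffs unrelated keywords.
PAGES = {
    "cats.example": {
        "keywords": ["cats", "pets"],
        "content": "All about cats and kittens",
    },
    "spam.example": {
        "keywords": ["cats", "pets", "news"],
        "content": "Buy cheap watches",
    },
}

index = build_index(PAGES)
print(sorted(search(index, "cats")))
```

Because the index trusts whatever the page declares about itself, a query for "cats" returns the watch seller alongside the genuine page.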

PageRank Algorithm
• Crawl the Internet
• Save the content and index the contents’ popular words
• Identify the links on the page
• Each link to an already indexed page increases the PageRank of that linked page
• When somebody needs to find something, search for matching keywords or content words, BUT rank the search results according to PageRank
Problem:
• Anyone can create a bunch of websites that all link to a single specific page to inflate its rank (http://en.wikipedia.org/wiki/Google_bomb)
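The ranking step above can be sketched as a simple inbound-link count. This follows the simplified description on the slide only; the published PageRank algorithm additionally weights each link by the rank of the page it comes from, which is not modeled here. The names `link_counts`, `rank_results`, and the sample link graph are hypothetical.

```python
from collections import Counter


def link_counts(links):
    """Count inbound links per page from a {source: [targets]} map."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each source once per target
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


def rank_results(matches, counts):
    """Order matching pages by inbound-link count, highest first."""
    return sorted(matches, key=lambda url: (-counts[url], url))


# Hypothetical link graph: which pages each page links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
matches = {"a.example", "b.example", "c.example"}
print(rank_results(matches, link_counts(LINKS)))

# The weakness from the slide: five throwaway pages that all link to one
# target are enough to push that target to the top.
bombed = dict(LINKS)
for i in range(5):
    bombed[f"farm{i}.example"] = ["b.example"]
print(rank_results(matches, link_counts(bombed)))
```

In the first ranking `c.example` leads with three inbound links; after the link farm is added, `b.example` overtakes it without any change to its content.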

Shallow Web vs. Deep Web
• Shallow web
  • Websites and content that are easily visible to “dumb search engines”
  • Content publicly links to other content
  • Shallow web content tends to be static content (unchanging)
• Deep web
  • Websites and content that tend to be dynamic and/or unlinked
    • Private web sites
    • Unlinked content
  • Smarter search engines can crawl the deep web

Search Engine Optimization (SEO)
• Meta keywords
  • Words that relate to your content
• Human-readable URLs
  • i.e. avoid complicated, dynamically created URLs
• Links to your page on other websites
• Page visits
• Others?
• White hat vs. black hat SEO
  • White hats are the good guys. When would they be used?
  • Black hats are the bad guys. When would they be used?

Search Engine Design
• Assumptions are key to design!
• Major problems in older search engines:
  • People gamed the search results
  • Results were not tailored to the user
• What assumptions does a typical search engine make now? (i.e. what factors influence search today?)
