Web Science: Searching the web
Basic Terms • Search engine • Software that finds information on the Internet or World Wide Web • Web crawler • An automated program that surfs the web and indexes and/or copies the website • Also known as bots, web spiders, web robots • Meta-tag • Extra information that tags the HTML document • <meta name="keywords" content="HTML,CSS,XML,JavaScript"> • HyperLink or Link • A reference/link to another web page
How do you evaluate a search engine? • Time taken to return results • Number of results • Quality of results
How does a web crawler work? • 1. Start at a webpage • 2. Download the HTML content • 3. Search for the HTML link tags <a href="URL"></a> • 4. Repeat steps 2-3 for each of the links • 5. When a website has been completely indexed, load and crawl other websites
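The crawl loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `FAKE_WEB`, the `example.test` URLs, and the `fetch` callback are hypothetical stand-ins for real HTTP downloads, and a real crawler would also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)              # step 2: download the HTML content
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):   # step 3: find the link tags
            if link not in seen:       # step 4: repeat for each new link
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home page and `/b` do here.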
Parallel Web Crawling • Speed up your web crawling by running on multiple computers at the same time (i.e. parallel computing) • How often should you crawl the entire Internet? • How many copies of the Internet should you keep? • What are the different ways to index a webpage? • Meta keywords • Content • Page rank (# links to page)
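The parallel-crawling idea can be illustrated on a single machine with a thread pool, assuming that several worker threads stand in for the multiple computers mentioned on the slide. `FAKE_WEB`, `fetch`, and `fetch_all` are hypothetical names for this sketch; a real deployment would distribute URLs across machines and download over HTTP.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": "<p>home</p>",
    "http://example.test/a": "<p>page a</p>",
    "http://example.test/b": "<p>page b</p>",
}


def fetch(url):
    """Stand-in for an HTTP download; returns None for unknown pages."""
    return FAKE_WEB.get(url)


def fetch_all(urls, max_workers=4):
    """Download many pages concurrently and return {url: html}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


results = fetch_all(["http://example.test/a", "http://example.test/b"])
print(results)
```

Downloads spend most of their time waiting on the network, which is why running many of them at once speeds up a crawl even before adding more machines.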
Basic Search Engine Algorithm • Crawl the Internet • Save meta keywords for every page • Save the content and popular words on the page • When somebody needs to find something, search for matching keywords or content words Problem: • Nothing stops you from inserting your own keywords or content that do not relate to the page’s *actual* content
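A small sketch of this basic algorithm, and of its weakness, might look like the following. It assumes each crawled page has already been reduced to a list of meta keywords plus its text content; `PAGES`, `build_index`, and `search` are hypothetical names, and for simplicity every content word is indexed rather than only the popular ones.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword and content word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, page in pages.items():
        for word in page["keywords"]:
            index[word.lower()].add(url)
        for word in tokenize(page["content"]):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that match every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawl output. The second page stuffs unrelated meta keywords.
PAGES = {
    "http://example.test/recipes": {
        "keywords": ["cooking", "recipes"],
        "content": "Easy pasta recipes for weeknight cooking",
    },
    "http://example.test/spam": {
        "keywords": ["cooking", "recipes", "pasta"],
        "content": "Buy cheap watches here",
    },
}

index = build_index(PAGES)
print(sorted(search(index, "pasta recipes")))
```

Both pages match the query "pasta recipes", even though the second page's content has nothing to do with pasta. That is the keyword-stuffing problem the slide describes.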
PageRank Algorithm • Crawl the Internet • Save the content and index the content's popular words • Identify the links on the page • Each link to an already indexed page increases the PageRank of that linked page • When somebody needs to find something, search for matching keywords or content words, BUT rank the search results according to PageRank Problem: • Somebody can create a bunch of websites that all link to a single specific page in order to inflate its rank, known as a Google bomb (http://en.wikipedia.org/wiki/Google_bomb)
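The ranking step can be sketched as a simple count of incoming links, which is the simplified version described on this slide. Note the assumption: the published PageRank algorithm goes further and weights each link by the rank of the page it comes from, which this sketch does not do. `LINKS`, `inlink_counts`, and `rank_results` are hypothetical names.

```python
from collections import Counter


def inlink_counts(links):
    """links maps each page to the list of pages it links to.

    Returns how many distinct other pages link to each page.
    """
    counts = Counter({url: 0 for url in links})
    for source, targets in links.items():
        for target in set(targets):    # count each linking page only once
            if target != source:       # ignore self-links
                counts[target] += 1
    return counts


def rank_results(matches, counts):
    """Order matching pages by descending in-link count, then by URL."""
    return sorted(matches, key=lambda url: (-counts[url], url))


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

honest = rank_results({"a", "b", "c"}, inlink_counts(LINKS))
print(honest)

# The "Google bomb" problem: five throwaway pages all link to "b".
farm = {f"farm{i}": ["b"] for i in range(5)}
bombed = rank_results({"a", "b", "c"}, inlink_counts({**LINKS, **farm}))
print(bombed)
```

In the honest graph, "c" ranks first because two pages link to it. After the link farm is added, "b" jumps to the top without its content changing at all, which is why counting raw links is easy to game.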
Shallow Web vs. Deep Web • Shallow web • Websites and content that are easily visible to “dumb search engines” • Content publicly links to other content • Shallow web content tends to be static content (unchanging) • Deep web • Websites and content that tend to be dynamic and/or unlinked • Private web sites • Unlinked content • Smarter search engines can crawl the deep web
Search Engine Optimization (SEO) • Meta keywords • Words that relate to your content • Human-readable URLs • i.e. avoid complicated dynamically created URLs • Links to your page on other websites • Page visits • Others? • White hat vs. black hat SEO • White hats are the good guys. When would they be used? • Black hats are the bad guys. When would they be used?
Search Engine Design • Assumptions are key to design! • Major problem in older search engines: • People gamed the search results • Results were not tailored to the user • What assumptions does a typical search engine make now? (i.e. what factors influence search today?)