
www.honeynetproject.ca Date: May 24, 2009 Prepared by: Serge Gorbunov ( serge@gserge.com )


Presentation Transcript


  1. Web Crawling www.honeynetproject.ca Date: May 24, 2009 Prepared by: Serge Gorbunov (serge@gserge.com)

  2. Content • What is web crawling? • What can you do with web crawlers? • Googlebot • Web crawling process • Simple architecture • Advanced architecture: Spider-Monkey • What is the hidden web, and why crawl it? • Questions to ask “before you crawl”

  3. What is web crawling? • The process of browsing the Internet/WWW to collect and index data, analyze it, and store it for future reference • Crawlers must be able to download many pages in a short period of time and keep already downloaded pages up to date • Crawlers are used by search engines, marketing companies, researchers and others

  4. What can you do with web crawlers? • “Download” the Internet • Quickly find important information for marketing purposes • Study societies, nations and groups of people • Analyze malware, spyware and “junk” on the Internet • Count the most frequently used words and letters on the Web

  5. Googlebot • Crawler used by Google to index and cache the Web • Step 1: Visit a number of pages -> extract all links -> visit all linked pages -> extract all links -> etc. • Step 2: An algorithm called PageRank assesses a page's importance by how many other Web pages link to it and by the importance of those linking pages

  6. Googlebot • The PageRank of a particular page is roughly based on the number of inbound links and on the PageRank of the pages providing those links • The publicly displayed PageRank is a score from 0 to 10
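The link-counting idea described above can be sketched as a short iterative computation. This is a minimal illustration on a made-up four-page graph, not Google's implementation: the damping factor of 0.85, the iteration count, and the `toy_web` graph are assumptions, and the raw scores it produces are probabilities that sum to 1, not the 0 to 10 value shown to users.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph "c" ends up with the highest score because it has the most inbound links, and "a" outranks "b" and "d" because its single inbound link comes from the important page "c".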

  7. Googlebot • http://facebook.com: PageRank 9/10 • http://honeynet.org: PageRank 6/10 • http://honeynetproject.ca: PageRank 3/10

  8. Web crawling process • Two important decisions must be made before starting to “crawl”: • 1) Find a starting point: a list of initial URLs (seeds) from which to start the search • Start from some known links • Use web search engines • 2) Determine a scope: how far the crawl should go • Maximum link hops to include (URLs reachable within a given number of links) • Transitive hops to include (URLs reachable within a given number of transitive hops)
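The two decisions above (seeds and scope) map directly onto the inputs of a breadth-first crawl loop. The following is a minimal sketch under stated assumptions: `FAKE_WEB` is a made-up in-memory stand-in for real pages so the example runs without network access, and `crawl`, `fetch` and `max_hops` are illustrative names, not part of any particular crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_hops):
    """Breadth-first crawl from the seeds, following at most max_hops links.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    Returns a dict mapping each visited URL to its hop distance from a seed.
    """
    visited = {}
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)
        if html is None:
            continue
        visited[url] = hops
        if hops == max_hops:
            continue  # scope limit reached: do not follow links from this page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            queue.append((urljoin(url, href), hops + 1))
    return visited


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": '<a href="/d">D</a>',
    "http://example.test/d": "",
}

result = crawl(["http://example.test/"], FAKE_WEB.get, max_hops=2)
```

With `max_hops=2` the page `/d` is never downloaded, because it sits three links away from the seed. In a real crawl, `fetch` would be replaced by an HTTP client.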

  9. Simple Architecture

  10. Simple Architecture • Queue: a list of pages to be processed/downloaded • Scheduler and revisiting policy: once the web has been “downloaded”, the stored content will most likely be out of date, so the revisiting policy must decide when to fetch pages again • Downloader: • Parallelization: downloading many pages in parallel • Serialization: downloading only one page at a time at the maximum speed
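The three components above can be sketched in a few lines. This is an illustration only, assuming a fixed revisit interval as the revisiting policy and a thread pool as the parallel downloader; `Frontier`, `fake_fetch` and the URLs are hypothetical names chosen for the example, and time is passed in explicitly so the sketch is deterministic.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Frontier:
    """Queue of (due_time, url) pairs; the earliest due URL comes out first."""

    def __init__(self, revisit_interval):
        self.revisit_interval = revisit_interval
        self._heap = []

    def add(self, url, due_time=0.0):
        heapq.heappush(self._heap, (due_time, url))

    def pop_due(self, now):
        """Remove and return every URL whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

    def reschedule(self, url, now):
        """Revisiting policy (assumed): fetch again after a fixed interval."""
        self.add(url, now + self.revisit_interval)


def download_serial(urls, fetch):
    """Serialization: one page at a time."""
    return {url: fetch(url) for url in urls}


def download_parallel(urls, fetch, workers=4):
    """Parallelization: several pages at once using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real HTTP download.
    return "content of " + url


frontier = Frontier(revisit_interval=3600.0)
for u in ["http://example.test/1", "http://example.test/2", "http://example.test/3"]:
    frontier.add(u)

batch = frontier.pop_due(now=0.0)
pages = download_parallel(batch, fake_fetch)
for u in batch:
    frontier.reschedule(u, now=0.0)

too_early = frontier.pop_due(now=1800.0)  # nothing is due yet
due_again = frontier.pop_due(now=3600.0)  # all three pages are due for a revisit
```

Both download functions return the same mapping; the parallel version simply overlaps the waiting time of several requests, which matters once `fake_fetch` is replaced by real network I/O.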

  11. Advanced Architecture Spider-Monkey

  12. Spider-Monkey Seeder • Generates a list of URLs • Method 1: Web search • Method 2: Extract URLs from spam emails • The monitoring seeder constantly reseeds previously found malicious content from the malware database over time
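Method 2 above, extracting URLs from spam emails, could look roughly like the following. This is a sketch using Python's standard `email` module and a deliberately simple regular expression; it is not the seeder's actual code, and `extract_urls`, `URL_PATTERN` and the sample message are assumptions made for the example.

```python
import re
from email import message_from_string

# Simple pattern (an assumption): a URL runs until whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(raw_email):
    """Return the unique URLs found in the text parts of an email, in order."""
    msg = message_from_string(raw_email)
    urls = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip attachments and multipart containers
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            if url not in urls:
                urls.append(url)
    return urls


# Hypothetical spam message used as input.
SAMPLE = (
    "From: sender@example.test\n"
    "Subject: You won\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Claim your prize at http://spam.example.test/win?id=1.\n"
    "Or visit https://other.example.test/offer now!\n"
    "Again: http://spam.example.test/win?id=1\n"
)

seeds = extract_urls(SAMPLE)
```

The resulting list can be handed to the crawler as its seed list. Duplicate URLs are dropped so that each one is queued once.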

  13. Spider-Monkey Web Crawling - Heritrix • Open source • Queues the generated URLs from the seeder • Stores the crawled contents on the file server while generating detailed log files • Multi-threaded design • Link extraction • Web and JMX interface

  14. Spider-Monkey Web Crawling - Heritrix

  15. Spider-Monkey Malware analysis • Scanner extracts the crawled content from ARC files • Analyzes the content with multiple antivirus (AV) scanners • Identified malware and malicious Web sites are stored in the malware directory • Information about the malicious content is stored in the database
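The analysis step above (scan each piece of content with several engines, then record what was found) can be illustrated as follows. Everything here is a stand-in: the two "scanners" are toy functions that match hard-coded byte patterns rather than real antivirus products, ARC extraction is skipped, and the `findings` table layout is an assumption, not the project's actual schema.

```python
import hashlib
import sqlite3


def scanner_a(content):
    # Toy stand-in for an AV engine: flags one hard-coded byte pattern.
    return "Test.Dropper" if b"EVIL_PAYLOAD" in content else None


def scanner_b(content):
    # Second toy engine: flags any iframe tag, case-insensitively.
    return "Test.Iframe" if b"<iframe" in content.lower() else None


SCANNERS = {"scanner_a": scanner_a, "scanner_b": scanner_b}


def analyze(db, url, content):
    """Run every scanner on the content and record each detection in the database."""
    digest = hashlib.sha256(content).hexdigest()
    detections = 0
    for name, scan in SCANNERS.items():
        verdict = scan(content)
        if verdict is not None:
            db.execute(
                "INSERT INTO findings (url, sha256, scanner, verdict) VALUES (?, ?, ?, ?)",
                (url, digest, name, verdict),
            )
            detections += 1
    db.commit()
    return detections


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE findings (url TEXT, sha256 TEXT, scanner TEXT, verdict TEXT)")

# Hypothetical crawled content, keyed by URL.
crawled = {
    "http://example.test/clean": b"<html>hello</html>",
    "http://example.test/bad": b"<html><IFRAME src=x></IFRAME>EVIL_PAYLOAD</html>",
}
counts = {url: analyze(db, url, body) for url, body in crawled.items()}
rows = db.execute(
    "SELECT url, scanner, verdict FROM findings ORDER BY scanner"
).fetchall()
```

Running several independent scanners over the same content means that a sample missed by one engine can still be caught by another, and storing the content hash lets repeated sightings of the same file be linked.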

  16. Crawling hidden web • Tapping into unexplored information • Improving user experience • Due to the heavy reliance of many Web users on search engines for locating information, search engines influence how the users perceive the Web • Users do not necessarily perceive what actually exists on the Web but what is indexed by search engines

  17. Hidden Web database model • Textual database • Site that mainly contains plain-text documents • Simple search interface where users type a list of keywords in a single search box • Structured database • Multi-attribute relational data • Multi-attribute search interfaces

  18. Textual crawler • The crawler has to generate a query and issue it to the Web site • Download the result index page and follow the links to download the actual pages • Everything comes down to which queries are submitted to the search interface • Some studies suggest that the hidden web is about 500 times larger than the public web
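The query, download and follow loop above can be sketched against a simulated site. The example assumes one simple query-selection strategy (reuse words found in already downloaded documents as the next queries); `HIDDEN_DOCS`, `search`, `fetch` and `crawl_hidden` are hypothetical names, and a real crawler would submit the site's search form over HTTP instead.

```python
import re

# Hypothetical hidden database that is reachable only through a search box.
HIDDEN_DOCS = {
    "doc1": "web crawler honeypot",
    "doc2": "honeypot malware analysis",
    "doc3": "malware database search",
    "doc4": "cooking recipes",
}


def search(keyword):
    """Stand-in for the site's search interface: IDs of documents containing the keyword."""
    return sorted(d for d, text in HIDDEN_DOCS.items() if keyword in text.split())


def fetch(doc_id):
    """Stand-in for following a result link to the actual page."""
    return HIDDEN_DOCS[doc_id]


def crawl_hidden(seed_keyword, max_queries):
    """Issue queries, download new results, and harvest their words as future queries."""
    downloaded = {}
    issued = []
    candidates = [seed_keyword]
    while candidates and len(issued) < max_queries:
        query = candidates.pop(0)
        if query in issued:
            continue
        issued.append(query)
        for doc_id in search(query):  # the result index page
            if doc_id in downloaded:
                continue
            downloaded[doc_id] = fetch(doc_id)  # the actual page
            for word in re.findall(r"\w+", downloaded[doc_id]):
                if word not in issued and word not in candidates:
                    candidates.append(word)
    return downloaded, issued


downloaded, issued = crawl_hidden("honeypot", max_queries=10)
```

Starting from the keyword "honeypot", the crawl reaches three of the four documents. The fourth shares no vocabulary with the others, so no generated query ever returns it, which shows why coverage depends entirely on the queries chosen.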

  19. Questions to ask “before you crawl” • What information are you looking for? • What sites to crawl? • What content to crawl? • How to extract links from the crawled content? • What crawling performance is necessary? • What data to store, where, and in what format? • How to analyze the data?

  20. Let’s crawl

  21. References • http://oak.cs.ucla.edu/~cho/research/crawl.html (web crawling research project) • http://www.prchecker.info/check_page_rank.php (page rank checker) • http://www.wisegeek.com/what-is-a-web-crawler.htm • http://monkeyspider.sourceforge.net/Diploma-Thesis-Ali-Ikinci.pdf
