
Web Crawler




  1. Workshop, 3rd Semester Web Crawler

  2. What is a Web Crawler?
  • A crawler can be part of:
  • a search engine, where every page is searched for a search string
  • a spam agent that visits all sites and collects email addresses
  • a site map creation application

  3. Goal for workshop
  • Develop a web crawler that takes a URI as input and visits all pages at the identified site.
  • The web crawler must search HTML documents for links and build a graph containing documents as vertices (nodes) and links as edges. (A sketch of this crawl loop follows below.)
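
To make the goal concrete, here is a minimal sketch of the crawl loop in C# (the workshop's .NET setting): a breadth-first traversal from the start URI. The names CrawlerSketch, Crawl, download, extractLinks, and siteRoot are illustrative assumptions, not part of the workshop material; SiteGraph refers to the adjacency-list class sketched under the Graphs slide below.

    using System;
    using System.Collections.Generic;

    public static class CrawlerSketch
    {
        // Breadth-first crawl from startUri, restricted to URIs under siteRoot.
        // 'download' and 'extractLinks' are placeholders for the HTTP and HTML
        // packages built during the week; SiteGraph is sketched under slide 5.
        public static SiteGraph Crawl(string startUri, string siteRoot,
                                      Func<string, string> download,
                                      Func<string, IEnumerable<string>> extractLinks)
        {
            var graph = new SiteGraph();
            var visited = new HashSet<string>();
            var pending = new Queue<string>();
            pending.Enqueue(startUri);

            while (pending.Count > 0)
            {
                string uri = pending.Dequeue();
                if (!visited.Add(uri)) continue;        // page already crawled

                // NOTE: relative links would need resolving against 'uri'
                // before use; that step is omitted in this sketch.
                foreach (string link in extractLinks(download(uri)))
                {
                    graph.AddEdge(uri, link);           // document -> link edge
                    if (link.StartsWith(siteRoot))      // stay within the site
                        pending.Enqueue(link);
                }
            }
            return graph;
        }
    }

Passing the download and link-extraction steps in as delegates keeps the crawler independent of the HTTP and HTML parts, which matches the package split shown on slide 10.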

  4. Subjects
  • Graphs
  • HTML
  • Simple parsing
  • HTTP and TCP
  • Design patterns

  5. Graphs
  • To represent the site you can use a directed graph (an adjacency-list sketch follows below):
  • HTML documents being represented by vertices (nodes)
  • links being represented by edges
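
One way to realise this is an adjacency-list representation: a dictionary mapping each page URI to the list of URIs it links to. The sketch below is illustrative; the class and member names (SiteGraph, AddEdge, Neighbours) are assumptions, not prescribed by the workshop.

    using System.Collections.Generic;
    using System.Linq;

    public class SiteGraph
    {
        // Adjacency lists: each vertex (a page URI) maps to the list of
        // URIs that page links to.
        private readonly Dictionary<string, List<string>> adjacency =
            new Dictionary<string, List<string>>();

        public void AddVertex(string uri)
        {
            if (!adjacency.ContainsKey(uri))
                adjacency[uri] = new List<string>();
        }

        // A directed edge: a link from document 'from' to document 'to'.
        public void AddEdge(string from, string to)
        {
            AddVertex(from);
            AddVertex(to);
            adjacency[from].Add(to);
        }

        public IEnumerable<string> Neighbours(string uri)
        {
            return adjacency.TryGetValue(uri, out var list)
                ? list
                : Enumerable.Empty<string>();
        }
    }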

  6. HTML
  • Every page the crawler visits is an HTML document.
  • The HTML document should be searched for link (anchor) tags:
  • <a href="http://www.prenhall.com">Prentice Hall</a>
  • Frames (frame sets) are to be treated as individual and independent HTML pages

  7. Simple parsing
  • The simple version searches the HTML documents for links by "brute force" string scanning.
  • Other techniques are worth investigating:
  • state machines
  • regular expressions (see the sketch below)
  • recursive descent
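
To make the regular-expression option concrete, here is one possible sketch. The single pattern below pulls the href attribute of anchor tags and the src attribute of frame tags, since slide 6 says frames count as independent pages. Treat it as an approximation: a regex is not a full HTML parser, and both the class name LinkExtractor and the exact pattern are assumptions.

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class LinkExtractor
    {
        // Matches the href of <a> tags and the src of <frame> tags;
        // deliberately forgiving about spacing and optional quotes.
        private static readonly Regex LinkPattern = new Regex(
            @"<(?:a\s[^>]*href|frame\s[^>]*src)\s*=\s*[""']?([^""'\s>]+)",
            RegexOptions.IgnoreCase);

        public static IEnumerable<string> ExtractLinks(string html)
        {
            foreach (Match m in LinkPattern.Matches(html))
                yield return m.Groups[1].Value;   // group 1 holds the URI
        }
    }

Running ExtractLinks on the slide 6 example, <a href="http://www.prenhall.com">Prentice Hall</a>, yields http://www.prenhall.com.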

  8. Communication
  • For communication, use the .NET Socket class (System.Net.Sockets). (A sketch follows below.)
  • HTTP and TCP
  • URI, URL, etc.
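
As a sketch of how HTTP rides on TCP, the fragment below opens a TCP connection via System.Net.Sockets (TcpClient wraps a TCP Socket), writes a bare HTTP/1.0 GET request, and reads back the raw response. Port 80, ASCII decoding, and the absence of redirect handling are simplifying assumptions, and HttpFetcher is an illustrative name.

    using System.Net.Sockets;
    using System.Text;

    public static class HttpFetcher
    {
        // Sends a minimal HTTP/1.0 GET over a TCP connection and returns the
        // raw response (status line, headers, blank line, body). Assumes
        // port 80 and leaves redirects and error handling to the caller.
        public static string Get(string host, string path)
        {
            using (var client = new TcpClient(host, 80))    // TCP connect
            using (var stream = client.GetStream())
            {
                string request = "GET " + path + " HTTP/1.0\r\n"
                               + "Host: " + host + "\r\n"
                               + "Connection: close\r\n\r\n";
                byte[] requestBytes = Encoding.ASCII.GetBytes(request);
                stream.Write(requestBytes, 0, requestBytes.Length);

                var response = new StringBuilder();
                var buffer = new byte[4096];
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    response.Append(Encoding.ASCII.GetString(buffer, 0, read));
                return response.ToString();
            }
        }
    }

For example, HttpFetcher.Get("public.noea.dk", "/fen/testsite/") would request the front page of the test site listed on slide 15, assuming that site is still served over plain HTTP on port 80.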

  9. Design patterns
  • Design patterns should be applied wherever appropriate

  10. The program
  • Overall structure:
  • Graph
  • HTML
  • HTTP
  • Crawler

  11. The Crawler

  12. The Graph Package

  13. The Communication Package

  14. The HTML Package

  15. Testsite • http://public.noea.dk/fen/testsite/

  16. Schedule for the Week
  • Day 1 – Monday: the .NET Socket class and the graph implementation. (PQC)
  • Day 2 – Tuesday: practical HTML parsing and coupling with the HTTP class; parsing techniques. (CHJO)
  • Day 3 – Wednesday: continue working on your own.
  • Day 4 – Thursday: putting it all together: the graph, HTML, and HTTP parts; the crawler algorithm. (PQC)
  • Day 5 – Friday: finishing up and preparing the presentation. (CHJO)
  • Presentation and evaluation: Thursday, April 29, 8.30 – 10.00

  17. Evaluation
  • Nothing in writing, but every team gives a presentation.
  • Results are presented as PowerPoint slides:
  • class model
  • design issues, etc.
  • A demo
  • Review and discussion of interesting code fragments
  • Approximately 20 minutes per team
  • Participation in the evaluation is mandatory.
  • Teams that do not attend the evaluation must hand in a report instead.
