This deck walks through building a web crawler for Assignment 3 using Websphinx and Webgraph: extending the provided crawler classes with customized link-traversal behavior, parsing HTML pages to extract data, and analyzing the resulting web graph to study link structure.
Websphinx & Webgraph Inf 141 Information Retrieval Winter 2008
Assignment 3
• See the course webpage for specifications
• Due Friday, Feb 8th
• Work in groups of 2–3 people
• Email with subject: Inf 141 Team Registration
• Train your group: each member must be able to run your architecture on their own for Assignment 04
• Quiz next Wednesday
Websphinx
• www.cs.cmu.edu/~rcm/websphinx/
• To write a crawler, extend the Crawler class and override shouldVisit() and visit() to create your own crawler
• visit(Page p): each downloaded page is passed to the crawler's visit() method for user-defined processing
• shouldVisit(Link l): callback for testing whether a link should be traversed
  • The default returns true for all links
  • Override it for other behaviors
• API docs: http://www.cs.cmu.edu/~rcm/websphinx/doc/index.html
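The two overrides above can be sketched as follows. This is a minimal illustration, assuming websphinx.jar is on the classpath; the same-host policy in shouldVisit() is a made-up example, not part of the assignment spec.

```java
import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class MyCrawler extends Crawler {

    // Illustrative policy: only traverse links on one host.
    // (The default implementation returns true for all links.)
    public boolean shouldVisit(Link l) {
        return l.getHost().endsWith("example.com");
    }

    // Called once for each downloaded page; put your
    // user-defined processing here.
    public void visit(Page p) {
        System.out.println(p.getURL());
    }
}
```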
Websphinx
• Create an array containing your seed set of links
• Look at the Link class
  • Represents a link to a web page
  • Can make a link from a string URL
  • Can make a link from a start tag and an end tag
• Look at the Page class
  • Mainly supports automatically parsed HTML pages
  • Parsing produces a list of tags, words, an HTML parse tree, and links
  • Pages can also be constructed directly
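Building the seed set might look like the sketch below, assuming the Websphinx API described on these slides (Link's string-URL constructor and the crawler's setRoots() method); the seed URLs themselves are placeholders.

```java
import java.net.MalformedURLException;
import websphinx.Crawler;
import websphinx.Link;

public class SeedSetup {
    public static void main(String[] args) throws MalformedURLException {
        // Build the seed array from string URLs (placeholder seeds).
        Link[] seeds = {
            new Link("http://www.ics.uci.edu/"),
            new Link("http://www.uci.edu/")
        };

        Crawler crawler = new Crawler();
        crawler.setRoots(seeds);   // start crawling from these pages
        // crawler.run();          // uncomment to actually start the crawl
    }
}
```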
Webgraph
• Webgraph is a framework for studying the web graph
• Use the ArrayListMutableGraph class
  • A mutable graph class based on IntArrayList
  • ArrayListMutableGraph(ImmutableGraph g) creates a new mutable graph copying a given immutable graph
• See the ImmutableGraph class
• API docs: http://webgraph.dsi.unimi.it/docs/
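A small end-to-end sketch of the classes above, assuming webgraph.jar and its dependencies are on the classpath; the three-node graph is invented sample data, not assignment input.

```java
import it.unimi.dsi.webgraph.ArrayListMutableGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;

public class GraphDemo {
    public static void main(String[] args) {
        // A toy 3-node graph with arcs 0 -> 1, 1 -> 2, 2 -> 0.
        int[][] arcs = { {0, 1}, {1, 2}, {2, 0} };
        ArrayListMutableGraph g = new ArrayListMutableGraph(3, arcs);

        // Freeze the mutable graph into an ImmutableGraph view for analysis.
        ImmutableGraph view = g.immutableView();
        System.out.println("nodes: " + view.numNodes());

        // Copy an immutable graph back into a mutable one,
        // as in ArrayListMutableGraph(ImmutableGraph g) above.
        ArrayListMutableGraph copy = new ArrayListMutableGraph(view);
        System.out.println("arcs: " + copy.numArcs());
    }
}
```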