Websphinx & Webgraph
Inf 141 Information Retrieval, Winter 2008
Assignment 3
• See the course webpage for specifications
• Due Friday, Feb 8th
• Work in groups of 2-3 people
• Email with subject: Inf 141 Team Registration
• Train your group
  • Each member of your group must be able to run your architecture on their own for Assignment 4.
• Quiz next Wednesday
Websphinx
• www.cs.cmu.edu/~rcm/websphinx/
• To write your own crawler, extend class Crawler and override shouldVisit() and visit().
• visit(): The page is passed to the crawler's visit() method for user-defined processing.
• shouldVisit(Link l): Callback for testing whether a link should be traversed.
  • The default returns true for all links.
  • Override it for other behaviors.
• Documentation: http://www.cs.cmu.edu/~rcm/websphinx/doc/index.html
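The callback pattern on this slide can be sketched without the library. The classes below are hypothetical stand-ins (MiniCrawler and SameHostCrawler are not WebSPHINX classes, and URLs are plain strings rather than Link objects); the crawl runs over a small in-memory map instead of the network. Only the idea is taken from the slide: the base class drives the traversal, shouldVisit() accepts every link by default, and a subclass overrides shouldVisit() and visit().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the callback pattern; this is NOT websphinx.Crawler.
abstract class MiniCrawler {
    // url -> outgoing links; a fake in-memory "web" so the sketch needs no network.
    private final Map<String, List<String>> web;

    MiniCrawler(Map<String, List<String>> web) { this.web = web; }

    // Callback: should this link be traversed? The default accepts every link.
    protected boolean shouldVisit(String url) { return true; }

    // Callback: user-defined processing for each page that is reached.
    protected abstract void visit(String url, List<String> outLinks);

    // Breadth-first traversal from the seed; shouldVisit is consulted before a link is followed.
    final void run(String seed) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            List<String> out = web.getOrDefault(url, List.of());
            visit(url, out);
            for (String next : out) {
                if (shouldVisit(next) && seen.add(next)) frontier.add(next);
            }
        }
    }
}

// A crawler that only follows links under one URL prefix and records what it visits.
class SameHostCrawler extends MiniCrawler {
    final List<String> visited = new ArrayList<>();
    private final String hostPrefix;

    SameHostCrawler(Map<String, List<String>> web, String hostPrefix) {
        super(web);
        this.hostPrefix = hostPrefix;
    }

    @Override protected boolean shouldVisit(String url) { return url.startsWith(hostPrefix); }

    @Override protected void visit(String url, List<String> outLinks) { visited.add(url); }
}
```

In this sketch the seed itself is always visited; the filter applies only to links discovered during the crawl. With the real library, the same two overrides would go on a subclass of Crawler, whose exact signatures should be checked against the documentation linked above.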
Websphinx
• Create an array consisting of your seed set of links
• Look at the Link class
  • A Link represents a link to a webpage
  • Make a Link from a string URL
  • Make a Link from a start tag and end tag
• Look at the Page class
  • Mainly supports automatically parsed HTML pages
  • Parsing produces a list of tags, words, an HTML parse tree, and links
  • Pages can also be constructed directly
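To make the two ideas on this slide concrete (a seed array built from string URLs, and a parsed page that yields links), here is a rough standard-library sketch. SeedAndLinkSketch is a hypothetical helper, not part of WebSPHINX, and its regular expression is a deliberately naive substitute for the real HTML parsing that the Page class performs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebSPHINX: it imitates on a small scale what
// Link (a URL made from a string) and Page (parsed HTML exposing its links) provide.
class SeedAndLinkSketch {
    // Matches href="..." or href='...' inside an <a ...> start tag (naive on purpose).
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Turn string URLs into an array of seeds.
    static URI[] seeds(String... urls) {
        URI[] out = new URI[urls.length];
        for (int i = 0; i < urls.length; i++) {
            out[i] = URI.create(urls[i]);
        }
        return out;
    }

    // Extract the links of a page, resolving relative hrefs against the page's own URL.
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(base.resolve(m.group(1)));
        }
        return links;
    }
}
```

A call such as SeedAndLinkSketch.seeds("http://www.cs.cmu.edu/~rcm/websphinx/") gives the seed array; extractLinks then plays the role that the parsed link list plays on a Page.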
Webgraph
• Webgraph is a framework for studying the web graph
• Use the ArrayListMutableGraph class
  • Mutable graph class based on IntArrayList
  • ArrayListMutableGraph(ImmutableGraph g) creates a new mutable graph copying a given immutable graph
• View the ImmutableGraph class
• Documentation: http://webgraph.dsi.unimi.it/docs/
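As a rough illustration of what a mutable, list-backed graph looks like, the sketch below uses only the standard library. MiniMutableGraph and UrlGraphBuilder are hypothetical names, not WebGraph classes, and boxed Integer lists stand in for IntArrayList. The URL-to-node-id mapping is an assumption about how crawled links might be recorded as arcs; it is not something the slide specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT it.unimi.dsi.webgraph.ArrayListMutableGraph:
// nodes are integers 0..n-1 and each node keeps a growable list of successors.
class MiniMutableGraph {
    private final List<List<Integer>> successors = new ArrayList<>();

    MiniMutableGraph() {}

    // Copy constructor: similar in spirit to building a mutable graph from an existing graph.
    MiniMutableGraph(MiniMutableGraph other) {
        for (List<Integer> s : other.successors) {
            successors.add(new ArrayList<>(s));
        }
    }

    // Adds a node and returns its id.
    int addNode() {
        successors.add(new ArrayList<>());
        return successors.size() - 1;
    }

    // Adds the arc x -> y unless it is already present.
    void addArc(int x, int y) {
        if (y < 0 || y >= successors.size()) {
            throw new IllegalArgumentException("no node " + y);
        }
        List<Integer> s = successors.get(x);
        if (!s.contains(y)) {
            s.add(y);
        }
    }

    int numNodes() { return successors.size(); }

    int outdegree(int x) { return successors.get(x).size(); }

    // Returns a copy so callers cannot modify the graph's internal lists.
    List<Integer> successors(int x) { return new ArrayList<>(successors.get(x)); }
}

// Assigns each URL a node id on first sight so that links can be stored as arcs.
class UrlGraphBuilder {
    final MiniMutableGraph graph = new MiniMutableGraph();
    private final Map<String, Integer> ids = new HashMap<>();

    int idOf(String url) { return ids.computeIfAbsent(url, u -> graph.addNode()); }

    void addLink(String from, String to) { graph.addArc(idOf(from), idOf(to)); }
}
```

A crawler's visit() callback could call addLink(pageUrl, linkUrl) for every link on a page; whether duplicate arcs should be dropped, as they are here, is a design choice to check against the assignment specification.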