Data collection with Web crawlers (Web-crawl graphs)
further experience:
• technical/technological
  • “treading lightly”
  • incremental versus batch crawling
  • HTTP headers
  • character sets and malformed headers/URLs
  • shallow/deep queries
• methodological
  • minimise modification/distortion of the data
  • maximise accessibility to the data
character sets and malformed headers/URLs
• cannot assume ASCII: WISER needs support for EU languages!
• characters are no longer bytes
• cannot assume that either HTTP headers or HTML URLs are well formed
• either may contain arbitrary characters
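The point above, that a crawler cannot trust the charset label or the bytes themselves, can be sketched as follows. This is an illustrative Python fragment, not code from the project: the function names are mine, and the Latin-1 fallback is one common defensive choice (every byte sequence decodes under Latin-1, so the crawler never crashes on a mislabelled page).

```python
from email.message import Message

def charset_from_content_type(content_type: str,
                              default: str = "iso-8859-1") -> str:
    """Parse the charset parameter out of a Content-Type header,
    tolerating malformed or missing parameters."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return charset if charset else default

def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body; fall back rather than crash when the
    declared charset is unknown or simply wrong."""
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset label: fall back to Latin-1,
        # which maps every byte to a character and so never raises.
        return body.decode("iso-8859-1")
```

The fallback deliberately favours "never lose the page" over "always decode correctly", which matches the goal of minimising distortion of collected data.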
blinker (Weblink crawler) development
blinker is a stable, parameterised link crawler based on standard software components.
• objectives
  • to identify problem issues in crawling, e.g. non-standard servers and malformed data
  • to demonstrate ethical crawling
  • to provide Web-crawl graphs
  • to compare the effect of varying crawling parameters
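blinker itself is not shown here, but the combination of objectives above (ethical crawling, "treading lightly", producing a Web-crawl graph) can be sketched as a small breadth-first crawler. This is my own illustrative sketch, not blinker's implementation: the class name, the injected `fetch` function, and the per-host delay parameter are all assumptions.

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlsplit

class PoliteCrawler:
    """Sketch of a parameterised link crawler: breadth-first,
    robots.txt-aware, with a per-host delay ("treading lightly")."""

    def __init__(self, delay: float = 1.0, max_pages: int = 100):
        self.delay = delay              # seconds between hits on one host
        self.max_pages = max_pages      # crawl-size parameter
        self.last_hit: dict[str, float] = {}
        self.robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self.graph: dict[str, list[str]] = {}   # page -> outgoing links

    def _allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        rp = self.robots.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True     # unreachable robots.txt
            self.robots[host] = rp
        return rp.can_fetch("blinker", url)

    def _throttle(self, host: str) -> None:
        wait = self.last_hit.get(host, 0.0) + self.delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_hit[host] = time.monotonic()

    def crawl(self, seed: str, fetch) -> dict[str, list[str]]:
        """fetch(url) -> list of href strings found on that page."""
        queue, seen = deque([seed]), {seed}
        while queue and len(self.graph) < self.max_pages:
            url = queue.popleft()
            if not self._allowed(url):
                continue
            self._throttle(urlsplit(url).netloc)
            links = [urljoin(url, h) for h in fetch(url)]
            self.graph[url] = links     # record an edge list per page
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return self.graph
```

Injecting `fetch` keeps the crawl logic separate from the HTTP layer, which also makes the parameter comparisons mentioned above (delay, crawl size, seed set) easy to run repeatably.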
shallow / deep queries
• the query-URL problem
  • query URLs are not necessarily dynamic
  • they are routinely collected by search-engine crawlers
  • they may lead to recursion, but recursion is not eliminated by ignoring them
• collecting shallow queries is a compromise
  • a shallow query is a query URL found on a Web page whose own URL is not a query URL
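The shallow-query rule above can be stated as a two-line predicate. The function names below are mine; the logic is exactly the compromise described: a query URL is collected only when the page linking to it is not itself a query URL, so chains of query URLs are cut off after one step.

```python
from urllib.parse import urlsplit

def is_query_url(url: str) -> bool:
    """A query URL carries a '?' query string."""
    return urlsplit(url).query != ""

def should_collect(link: str, source_page: str) -> bool:
    """Shallow-query rule: non-query links are always collected;
    a query URL is collected only when the page linking to it
    is not itself a query URL."""
    if not is_query_url(link):
        return True
    return not is_query_url(source_page)
```

For example, a link to `search?page=2` is collected from a static index page but skipped when found on `search?page=1`, which is where query-driven recursion would otherwise begin.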
(further) methodological goals
• minimise modification/distortion of the data
• maximise accessibility to the data
These are discussed next in more detail in the context of using XML to exchange Web-crawl graphs.
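As a rough idea of what exchanging a Web-crawl graph as XML might look like, here is a minimal sketch. The element and attribute names (`crawl-graph`, `page`, `link`, `url`, `href`) are illustrative assumptions, not the project's actual schema; the point is that a page-to-links mapping serialises naturally, with XML's entity escaping keeping arbitrary URL characters intact.

```python
import xml.etree.ElementTree as ET

def graph_to_xml(graph: dict[str, list[str]]) -> str:
    """Serialise a Web-crawl graph (page -> outgoing links) as XML.
    Element and attribute names here are illustrative only."""
    root = ET.Element("crawl-graph")
    for page, links in graph.items():
        node = ET.SubElement(root, "page", url=page)
        for link in links:
            ET.SubElement(node, "link", href=link)
    return ET.tostring(root, encoding="unicode")
```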