Data collection with Web crawlers (Web-crawl graphs)
further experience:
• technical/technological
  • “treading lightly”
  • incremental versus batch crawling
  • HTTP headers
  • character sets and malformed headers/URLs
  • shallow/deep queries
• methodological
  • minimise modification/distortion of the data
  • maximise accessibility to the data
character sets and malformed headers/URLs
• cannot assume ASCII: WISER needs support for EU languages!
• characters are no longer bytes
• cannot assume that either HTTP headers or HTML URLs are well formed
• either may contain arbitrary characters
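The point above, that a crawler cannot trust the charset label or the bytes themselves, can be sketched as follows. This is an illustrative Python fragment, not code from the project: the function names are mine, and the Latin-1 fallback is one common defensive choice (every byte sequence decodes under Latin-1, so the crawler never crashes on a mislabelled page).

```python
from email.message import Message

def charset_from_content_type(content_type: str,
                              default: str = "iso-8859-1") -> str:
    """Parse the charset parameter out of a Content-Type header,
    tolerating malformed or missing parameters."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return charset if charset else default

def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body; fall back rather than crash when the
    declared charset is unknown or simply wrong."""
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset label: fall back to Latin-1,
        # which maps every byte to a character and so never raises.
        return body.decode("iso-8859-1")
```

The fallback deliberately favours "never lose the page" over "always decode correctly", which matches the goal of minimising distortion of collected data.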
blinker (Weblink crawler) development
blinker is a stable, parameterised link crawler based on standard software components.
• objectives
  • to identify problem issues in crawling, e.g. non-standard servers and malformed data
  • to demonstrate ethical crawling
  • to provide Web-crawl graphs
  • to compare the effect of varying crawling parameters
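blinker itself is not shown here, but the combination of objectives above (ethical crawling, "treading lightly", producing a Web-crawl graph) can be sketched as a small breadth-first crawler. This is my own illustrative sketch, not blinker's implementation: the class name, the injected `fetch` function, and the per-host delay parameter are all assumptions.

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlsplit

class PoliteCrawler:
    """Sketch of a parameterised link crawler: breadth-first,
    robots.txt-aware, with a per-host delay ("treading lightly")."""

    def __init__(self, delay: float = 1.0, max_pages: int = 100):
        self.delay = delay              # seconds between hits on one host
        self.max_pages = max_pages      # crawl-size parameter
        self.last_hit: dict[str, float] = {}
        self.robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self.graph: dict[str, list[str]] = {}   # page -> outgoing links

    def _allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        rp = self.robots.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True     # unreachable robots.txt
            self.robots[host] = rp
        return rp.can_fetch("blinker", url)

    def _throttle(self, host: str) -> None:
        wait = self.last_hit.get(host, 0.0) + self.delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_hit[host] = time.monotonic()

    def crawl(self, seed: str, fetch) -> dict[str, list[str]]:
        """fetch(url) -> list of href strings found on that page."""
        queue, seen = deque([seed]), {seed}
        while queue and len(self.graph) < self.max_pages:
            url = queue.popleft()
            if not self._allowed(url):
                continue
            self._throttle(urlsplit(url).netloc)
            links = [urljoin(url, h) for h in fetch(url)]
            self.graph[url] = links     # record an edge list per page
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return self.graph
```

Injecting `fetch` keeps the crawl logic separate from the HTTP layer, which also makes the parameter comparisons mentioned above (delay, crawl size, seed set) easy to run repeatably.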
shallow / deep queries
• the query-URL problem
  • query URLs are not necessarily dynamic
  • they are routinely collected by search-engine crawlers
  • they may lead to recursion, but recursion is not eliminated by ignoring them
• collecting shallow queries is a compromise
  • a shallow query is a query URL found on a Web page whose own URL is not a query URL
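The shallow-query rule above can be stated as a two-line predicate. The function names below are mine; the logic is exactly the compromise described: a query URL is collected only when the page linking to it is not itself a query URL, so chains of query URLs are cut off after one step.

```python
from urllib.parse import urlsplit

def is_query_url(url: str) -> bool:
    """A query URL carries a '?' query string."""
    return urlsplit(url).query != ""

def should_collect(link: str, source_page: str) -> bool:
    """Shallow-query rule: non-query links are always collected;
    a query URL is collected only when the page linking to it
    is not itself a query URL."""
    if not is_query_url(link):
        return True
    return not is_query_url(source_page)
```

For example, a link to `search?page=2` is collected from a static index page but skipped when found on `search?page=1`, which is where query-driven recursion would otherwise begin.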
(further) methodological goals
• minimise modification/distortion of the data
• maximise accessibility to the data
These are discussed next in more detail in the context of using XML to exchange Web-crawl graphs.
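As a rough idea of what exchanging a Web-crawl graph as XML might look like, here is a minimal sketch. The element and attribute names (`crawl-graph`, `page`, `link`, `url`, `href`) are illustrative assumptions, not the project's actual schema; the point is that a page-to-links mapping serialises naturally, with XML's entity escaping keeping arbitrary URL characters intact.

```python
import xml.etree.ElementTree as ET

def graph_to_xml(graph: dict[str, list[str]]) -> str:
    """Serialise a Web-crawl graph (page -> outgoing links) as XML.
    Element and attribute names here are illustrative only."""
    root = ET.Element("crawl-graph")
    for page, links in graph.items():
        node = ET.SubElement(root, "page", url=page)
        for link in links:
            ET.SubElement(node, "link", href=link)
    return ET.tostring(root, encoding="unicode")
```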