Crawl RSS Kristinn Sigurðsson National and University Library of Iceland IIPC GA 2014 – Paris
The problem
• Certain sites change very frequently
• News sites especially
• While we can capture all the stories by visiting once per day, week, month, or even year, individual stories may have been modified several times in the meantime, and the front-page changes between visits will be missed
RSS feed advantages
• A change to the feed is highly likely to signify that an actual change has occurred
• A single RSS feed reports changes both to the presumed "front page" and to article or item pages
• RSS feeds are generally smaller (in bytes) than the front page (just the HTML) of a site
• Crawling the RSS feed frequently is therefore more likely to be tolerated
How it works 1/4
• On first load, all feed elements are loaded
• A feed element is uniquely identified by its
• URL
• Timestamp
• Each element, plus the front page, is visited
• Embeds are downloaded
• No further links are followed
• Strict controls need to be in place to halt scope leakage
• Each feed element should lead to a very finite number of URLs to crawl
• Basically, just fetch the minimal embeds; do not follow links
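The first visit can be sketched as follows. This is a minimal Python illustration, not the add-on's actual (Java) code; `FeedItem` and `initial_urls` are invented names, and feed items are simplified to URL-plus-timestamp pairs:

```python
from typing import NamedTuple

class FeedItem(NamedTuple):
    """A feed element, uniquely identified by its URL plus timestamp."""
    url: str
    timestamp: int  # e.g. the item's pubDate as seconds since the epoch

def initial_urls(front_page: str, items: list[FeedItem]) -> list[str]:
    """On first load, queue the front page and every distinct feed element.
    The crawler fetches only these URLs and their embeds; links are not
    followed, so each element yields a very finite number of URLs."""
    seen: set[tuple[str, int]] = set()
    urls = [front_page]
    for item in items:
        key = (item.url, item.timestamp)  # the element's unique identity
        if key not in seen:
            seen.add(key)
            urls.append(item.url)
    return urls
```

Deduplicating on the (URL, timestamp) pair rather than the URL alone is what later lets a re-published item with a newer timestamp be treated as changed content.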
How it works 2/4
• Once all the URLs generated by the initial feed elements have been crawled, the RSS feed may be revisited
• IF the minimum wait between visits has elapsed
• ELSE wait until the minimum time has elapsed
• The second visit will (probably) show many already-seen elements
• Identified by URL + timestamp
• If the feed is entirely unchanged, then the content hash will likely be unchanged
• If a URL has a new timestamp, it is probable that the content of the item has changed
• Only load items whose timestamp is more recent than the 'most recently seen' timestamp for each feed
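The two decisions on this slide, whether a revisit is allowed yet and which items are actually new, can be sketched like this (illustrative names, timestamps in milliseconds to match the status example later in the deck):

```python
def may_revisit(now_ms: int, last_emitted_ms: int, min_wait_ms: int) -> bool:
    """The feed may only be re-fetched once the minimum wait between
    visits has elapsed; otherwise the crawler holds."""
    return now_ms - last_emitted_ms >= min_wait_ms

def new_items(items: list[tuple[str, int]],
              most_recent_seen: int) -> list[tuple[str, int]]:
    """Keep only items whose timestamp is more recent than the feed's
    'most recently seen' timestamp; already-seen elements are skipped."""
    return [(url, ts) for url, ts in items if ts > most_recent_seen]
```

After a visit, `most_recent_seen` would be advanced to the newest timestamp observed, so each subsequent visit only yields genuinely newer items.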
How it works 3/4
• If there are changed or new elements
• Fetch the 'front page' URI and the URIs of changed and new elements
• If they match existing content hashes, they may be discarded; otherwise they are written to (W)ARCs
• Do not revisit embedded content that we have already crawled
• This massively reduces the amount of time it takes to complete each RSS visit
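The hash-based discard step can be illustrated as follows; SHA-1 is used here only as a plausible digest choice, and the function name is invented:

```python
import hashlib

def store_if_novel(content: bytes, seen_hashes: set[str]) -> bool:
    """Hash the fetched payload; if an identical hash has been seen
    before, the record can be discarded rather than written out."""
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate content: discard
    seen_hashes.add(digest)
    return True       # novel content: write to the (W)ARC
```

Combined with never re-fetching already-crawled embeds, this keeps each RSS visit down to the handful of URIs that actually changed.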
How it works 4/4
• Once visit 2 is over
• Check whether the minimum wait has elapsed,
• rinse,
• repeat
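The whole cycle then reduces to a simple loop. A deterministic sketch with an injected clock so it can be reasoned about without real sleeping; all names are illustrative:

```python
def next_emission(last_emitted_ms: int, min_wait_ms: int) -> int:
    """Earliest time the feeds may be emitted again (the
    'Earliest next feed emission' value in the status display)."""
    return last_emitted_ms + min_wait_ms

def crawl_cycle(visits: int, min_wait_ms: int, fetch_and_process, clock) -> None:
    """Rinse and repeat: after each visit, hold until the minimum
    wait between feed emissions has elapsed, then visit again."""
    last_emitted = clock.now()
    fetch_and_process()
    for _ in range(visits - 1):
        clock.sleep_until(next_emission(last_emitted, min_wait_ms))
        last_emitted = clock.now()
        fetch_and_process()
```

With the 600 000 ms minimum wait from the example below, visits would be spaced at least ten minutes apart.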
Sites
• Many sites have multiple feeds
• Sometimes items will appear in more than one feed at a time
• It is therefore possible to have multiple related feeds for one site
• Such feeds are always crawled jointly and duplicate items are discarded
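Joint crawling of a site's related feeds amounts to merging them with cross-feed deduplication, as in this sketch (again using simplified URL-plus-timestamp items and an invented function name):

```python
def merge_feeds(feeds: list[list[tuple[str, int]]]) -> list[tuple[str, int]]:
    """Crawl related feeds jointly: an item appearing in more than one
    feed (same URL + timestamp) is kept only on first sight."""
    seen: set[tuple[str, int]] = set()
    merged: list[tuple[str, int]] = []
    for feed in feeds:
        for item in feed:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

In the ruv.is example below, a story filed under both the general news feed and the domestic news feed would be fetched only once.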
Example
  RSS Site: ruv.is
  State: HOLD_FOR_FEED_EMIT
  Number of discovered items: 0
  Minimum wait between emitting feeds (ms): 600000
  Earliest next feed emission: Mon May 12 14:49:48 GMT 2014
  URLs being crawled: 0
  Feeds last emitted: Mon May 12 14:39:48 GMT 2014
  Feeds:
    Feed: http://www.ruv.is/rss/frettir
      Most recent seen: Mon May 12 14:24:34 GMT 2014
      http://www.ruv.is/
    Feed: http://www.ruv.is/rss/erlent
      Most recent seen: Mon May 12 14:11:50 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/erlent
    Feed: http://www.ruv.is/rss/sport
      Most recent seen: Sun May 11 22:55:17 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/ithrottir
    Feed: http://www.ruv.is/rss/innlent
      Most recent seen: Mon May 12 14:24:34 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/innlent
Configuration
• Either via Heritrix's CXML
• Or using the database interface
• Maintaining the DB is outside the scope of the add-on
• Easy to add new configuration handlers
Crawl RSS - Heritrix 3 add-on
• Available on GitHub:
• https://github.com/Landsbokasafn/crawlrss
• Requires Heritrix 3.1.2 or newer
• Stable, but still technically in 'beta'
• In use at NULI for almost a year now
• First news sites
• Now also select blogs and government sites