Crawl Operators’ Workshop
Roger G. Coram

Topics
• ExternalGeoLocationDecideRule
• Sheets
• IpAddressSetDecideRule

ExternalGeoLocationDecideRule
• Legal Deposit legislation passed in April 2013.
• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:
  • 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
    “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

Geolocation
• ExternalGeoLocationDecideRule requires:
  • A list of ISO 3166-1 alpha-2 country codes to be included in the crawl (GB, FR, DE, etc.).
  • An implementation of ExternalGeoLookupInterface.

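The two requirements above can be pictured with a small Python sketch. It is not the Heritrix implementation: `make_table_lookup`, `geo_decision`, and the address-to-country table are hypothetical names and made-up data (the networks are reserved documentation ranges), and returning "NONE" for a non-match is an assumption modelled on the usual DecideRule convention of expressing no opinion.

```python
import ipaddress

def make_table_lookup(table):
    """Build a toy lookup from a {CIDR: country code} table.

    Hypothetical stand-in for a real geo database such as GeoLite2.
    """
    networks = [(ipaddress.ip_network(cidr), code) for cidr, code in table.items()]

    def lookup(ip):
        addr = ipaddress.ip_address(ip)
        for network, code in networks:
            if addr in network:
                return code
        return None  # address not covered by the table

    return lookup

def geo_decision(ip, lookup, country_codes):
    """Return "ACCEPT" if the IP resolves to a wanted country, else "NONE"."""
    code = lookup(ip)
    if code is not None and code.upper() in country_codes:
        return "ACCEPT"
    return "NONE"  # assumed: no opinion, later rules decide

# Illustrative data only: these mappings are invented for the example.
EXAMPLE_LOOKUP = make_table_lookup({
    "192.0.2.0/24": "GB",
    "198.51.100.0/24": "US",
})

if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.1"):
        print(ip, geo_decision(ip, EXAMPLE_LOOKUP, {"GB"}))
```

Keeping the lookup behind a plain callable mirrors the split on the slide: the rule only needs a country code back, so the database behind it can be swapped freely.
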
ExternalGeoLookupInterface
• Our implementation is based on MaxMind’s GeoLite2 database.
• Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
• Only ~30MB; can be held in memory.

Configuration example: crawler-beans.cxml

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>

<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

Results
• Short test crawl (1,000,000 seeds) produced:
  • 89,500,755 URLs in total.
  • 26,072 non-UK URLs which would not otherwise have been in scope.
  • 137 distinct hosts.

IP-based Sheets
“Hi,
“I'm a senior system administrator for Webfusion / 123-reg.
“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”
• Large number of hosts on a single machine.
• Need a way to reduce the load on a specific IP address.

Sheets
• “Sheets provide the ability to replace default settings on a per domain basis.”
• Allow you to change any value on any named bean for a specific set of URLs.
• Actually quite flexible:
  • SurtPrefixesSheetAssociation:
    • Applied by matching SURT prefixes.
  • DecideRuledSheetAssociation:
    • Applied by a series of DecideRules.
    • IpAddressSetDecideRule

Configuration example 1: crawler-beans.cxml

<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>

<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>

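To give a feel for what the extraPolite values mean in practice, here is a rough Python sketch of how such politeness settings are commonly understood to combine. The formula is an assumption, not taken from the slides: the wait is the last fetch duration scaled by delayFactor, clamped between minDelayMs and maxDelayMs, and raised to a robots.txt Crawl-delay if one is present, honoured only up to respectCrawlDelayUpToSeconds. The function name `politeness_delay_ms` is hypothetical.

```python
def politeness_delay_ms(fetch_duration_ms,
                        delay_factor=8.0,
                        min_delay_ms=10000,
                        max_delay_ms=60000,
                        robots_crawl_delay_s=None,
                        respect_crawl_delay_up_to_s=60):
    """Estimate the wait before recontacting a server (assumed semantics)."""
    # Scale the last fetch duration, then clamp to the configured bounds.
    delay = int(delay_factor * fetch_duration_ms)
    delay = max(min_delay_ms, min(delay, max_delay_ms))

    # Honour a robots.txt Crawl-delay, but only up to the configured cap.
    if robots_crawl_delay_s is not None and robots_crawl_delay_s > 0:
        honoured_ms = min(robots_crawl_delay_s, respect_crawl_delay_up_to_s) * 1000
        delay = max(delay, honoured_ms)
    return delay

if __name__ == "__main__":
    for duration in (500, 3000, 10000):
        print(duration, "ms fetch ->", politeness_delay_ms(duration), "ms wait")
```

Under these assumptions a fast 500 ms response still waits the 10 second minimum, while a slow 10 second response is capped at 60 seconds rather than waiting 80.
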
Configuration example 2: crawler-beans.cxml

<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>

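The effect of associating sheets with an IP address can be sketched in Python as an overlay of settings. This is only a conceptual model, not how Heritrix resolves sheets internally: `effective_settings` is a hypothetical helper, and the baseline values in `DEFAULTS` are illustrative placeholders rather than figures from the slides. The sheet contents and the IP address are the ones from the configuration examples.

```python
# Illustrative baseline values only (placeholders, not from the slides).
DEFAULTS = {
    "disposition.delayFactor": 5.0,
    "disposition.minDelayMs": 3000,
    "disposition.maxDelayMs": 30000,
}

# Sheet contents as given in the configuration examples.
SHEETS = {
    "extraPolite": {
        "disposition.delayFactor": 8.0,
        "disposition.minDelayMs": 10000,
        "disposition.maxDelayMs": 60000,
        "disposition.respectCrawlDelayUpToSeconds": 60,
    },
    "crawlLimited": {
        "quotaEnforcer.serverMaxFetchResponses": 25,
    },
}

# Each association pairs a set of server IPs with the sheets to apply.
ASSOCIATIONS = [
    ({"81.21.76.62"}, ["extraPolite", "crawlLimited"]),
]

def effective_settings(server_ip, defaults, sheets, associations):
    """Overlay every sheet whose IP set contains server_ip onto the defaults."""
    settings = dict(defaults)  # copy so the defaults stay untouched
    for ip_set, sheet_names in associations:
        if server_ip in ip_set:
            for name in sheet_names:
                settings.update(sheets[name])
    return settings

if __name__ == "__main__":
    print(effective_settings("81.21.76.62", DEFAULTS, SHEETS, ASSOCIATIONS))
    print(effective_settings("192.0.2.1", DEFAULTS, SHEETS, ASSOCIATIONS))
```

Every host that resolves to the listed address picks up both sheets at once, which is what makes an IP-keyed rule useful when one machine serves a large number of domains.
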
Thank you
GitHub: https://github.com/ukwa/bl-heritrix-modules
MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/