Crawl operators workshop • IIPC GA 2014 – Paris • Kristinn Sigurðsson
Scope • A sequence of DecideRules • All rules are processed • Each rule will either be a match or not • If not a match, the rule will PASS (have no effect) • If it matches, it will either • ACCEPT (means that the URI should be ruled in scope) • REJECT (means that the URI should be ruled out of scope) • Last rule that does not PASS "wins" (see the sketch below)
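To see the sequence in action, the H3 scripting console can evaluate a URI against the configured scope. A minimal sketch, assuming the default bean name "scope" from the stock profile; note that a CrawlURI built this way has an empty hop path, so hop-based rules treat it like a seed-level URI.
  // Scripting console (BeanShell) sketch: print the scope's decision for a URI.
  scope = appCtx.getBean("scope");   // default scope bean name in crawler-beans.cxml
  uuri = org.archive.net.UURIFactory.getInstance("http://example.com/page.html");
  curi = new org.archive.modules.CrawlURI(uuri);
  rawOut.println(scope.decisionFor(curi));  // ACCEPT or REJECT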
Common example • REJECT RejectDecideRule • Default position: Not in scope • Nothing gets through unless some rule explicitly decides it should • ACCEPT SurtPrefixedDecideRule • Rules items in scope based on their domain • Often uses the seeds list as a source for allowed domains • REJECT TooManyHopsDecideRule • Throw out items that are too far from the seeds • ACCEPT TransclusionDecideRule • Get embeds on domains that are otherwise not in scope • REJECT MatchesListRegexDecideRule • Filter out known bad actors • Regular expressions used to match problem URIs (see the sketch after this list) • This filter can also be configured to ACCEPT, but it is rarely used that way in scoping • ACCEPT PrerequisiteAcceptDecideRule • Regardless of anything else, we still want to get any prerequisites (dns, robots.txt)
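As a sketch of the "known bad actors" filter, a rule like the one below could be added from the scripting console (more commonly it would be defined directly in the job's CXML). The bean name "scope", the regexes, and the insertion point are assumptions; inserting before the last rule keeps PrerequisiteAcceptDecideRule as the final word.
  // Scripting console (BeanShell) sketch: add a REJECT regex filter to the scope.
  rule = new org.archive.modules.deciderules.MatchesListRegexDecideRule();
  rule.setDecision(org.archive.modules.deciderules.DecideResult.REJECT);
  rule.setListLogicalOr(true);  // reject if ANY of the regexes match
  rule.setRegexList(java.util.Arrays.asList(
      java.util.regex.Pattern.compile(".*/calendar\\?.*"),       // hypothetical crawler trap
      java.util.regex.Pattern.compile(".*[;&]jsessionid=.*")));  // hypothetical session junk
  rules = appCtx.getBean("scope").getRules();
  rules.add(rules.size() - 1, rule);  // before the final PrerequisiteAcceptDecideRule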
HopsPathMatchesRegexDecideRule • For RSS crawling, it was really important to have a tightly controlled scope • Just the embeds • Still need to follow redirects though • Works like MatchesRegexDecideRule • Except the regular expression is applied to the "hoppath" from the seed • .R?((E{0,2})|XE?) • Allow one redirect, then up to two levels of embeds, or a speculative embed and then an embed • Can also be used to avoid excessive speculative embeds • ^[^X]*X[^X]*X[^X]*X[^X]*X.*$ • Allows a maximum of 4 speculative embeds on the hoppath (see the sketch after this list) • Can also use this with sheets/overrides to affect only select sites with known issues
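A sketch of the second use (capping speculative embeds), applied from the console under the same assumptions as before; the same rule can equally be defined in the job's CXML, or placed in a sheet so that it only affects selected sites.
  // Scripting console (BeanShell) sketch: reject URIs whose hoppath matches the
  // speculative-embed pattern discussed above.
  rule = new org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule();
  rule.setDecision(org.archive.modules.deciderules.DecideResult.REJECT);
  rule.setRegex(java.util.regex.Pattern.compile("^[^X]*X[^X]*X[^X]*X[^X]*X.*$"));
  rules = appCtx.getBean("scope").getRules();
  rules.add(rules.size() - 1, rule);  // keep PrerequisiteAcceptDecideRule last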
More rules • MatchesFilePatternDecideRule • Helpful utility version of MatchesRegexDecideRule • Has pre-compiled regular expressions that match common file types • PathologicalPathDecideRule • Rejects paths with more than X identical path segments • X is 2 by default (sketch follows) • ScriptedDecideRule • Allows the operator to specify arbitrary conditions expressed as a BeanShell script • BeanShell scripts are also used in the H3 scripting console • http://www.beanshell.org/ • Offers great flexibility • For regularly used actions it may be better to create a custom DecideRule
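A small console sketch for adjusting the PathologicalPathDecideRule threshold; it assumes the rule is present in the "scope" bean (as in the default profile) and that the property is named maxRepetitions.
  // Scripting console (BeanShell) sketch: allow up to 3 identical consecutive
  // path segments instead of the default 2.
  for (rule : appCtx.getBean("scope").getRules()) {
      if (rule instanceof org.archive.modules.deciderules.PathologicalPathDecideRule) {
          rule.setMaxRepetitions(3);
          rawOut.println("maxRepetitions is now " + rule.getMaxRepetitions());
      }
  }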
Keep in mind • Some DecideRules operate on content • E.g. ContentLengthDecideRule • These will not work for scoping, since the content has not been fetched yet when the scope is evaluated
Working with sheets
• Add a SURT to a sheet override:
  appCtx.getBean("sheetOverlaysManager")
      .addSurtAssociation("[SURT]", "[The sheet's bean name]");
• Add a rule to a sheet:
  appCtx.getBean("[Sheet ID]").map.put("[KEY]", "[VALUE]");
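Putting the two steps together, a hedged example: the sheet bean name "slowSheet" is hypothetical and must already be defined in the job's CXML, and the overlaid key must correspond to a property reachable from the job's beans.
  // Scripting console (BeanShell) sketch: apply a sheet override to one host.
  mgr = appCtx.getBean("sheetOverlaysManager");
  mgr.addSurtAssociation("http://(com,example,", "slowSheet");
  // 10.0 is given as a Double rather than a String, since console puts bypass
  // the string-to-type conversion Spring performs for values written in CXML.
  appCtx.getBean("slowSheet").map.put("disposition.delayFactor", 10.0);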
Canonicalization
• Check the canonicalization of a URL:
  rawOut.println(appCtx.getBean("canonicalizationPolicy").canonicalize("URL"));
• Add a RegexRule canonicalization rule:
  org.archive.modules.canonicalize.RegexRule rule = new org.archive.modules.canonicalize.RegexRule();
  rule.setRegex(java.util.regex.Pattern.compile("regex"));
  rule.setFormat("format"); // Optional! Defaults to "$1"
  appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);
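For instance, continuing from the RegexRule created above, an illustrative (not authoritative) rule that strips a ;jsessionid path parameter during canonicalization; the regex and format are examples only.
  // Keep everything before and after the jsessionid parameter, drop the parameter itself.
  rule.setRegex(java.util.regex.Pattern.compile("^(.*?);jsessionid=[0-9A-Za-z]+(.*)$"));
  rule.setFormat("$1$2");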
Getting really advanced • Creating custom modules for Heritrix isn't that difficult, assuming a modest amount of Java knowledge • Use a recent version of Eclipse • Create a Maven project with a "provided" dependency on Heritrix
Custom modules cont. • Then set up a run configuration that uses the org.archive.crawler.Heritrix class from the dependency
Custom modules cont. • Then just write new modules in your project and wire them in via the CXML configuration as usual (see the sketch below) • CXML configuration • https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Jobs+and+Profiles
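To make that last step concrete, a minimal sketch of a custom DecideRule written against a recent Heritrix 3. The package, class, and property names are hypothetical; it extends PredicatedDecideRule, so the usual decision/enabled properties are inherited and only the evaluate() predicate needs to be implemented.
  package org.example.heritrix;  // hypothetical package in your own project

  import org.archive.modules.CrawlURI;
  import org.archive.modules.deciderules.PredicatedDecideRule;

  /**
   * Hypothetical custom rule: matches URIs whose string form contains a
   * configurable substring. Wire it into the scope (or a sheet) via CXML, e.g.
   *   <bean class="org.example.heritrix.ContainsStringDecideRule">
   *     <property name="substring" value="/print/" />
   *     <property name="decision" value="REJECT" />
   *   </bean>
   */
  public class ContainsStringDecideRule extends PredicatedDecideRule {

      private static final long serialVersionUID = 1L;

      protected String substring = "";
      public String getSubstring() { return substring; }
      public void setSubstring(String substring) { this.substring = substring; }

      @Override
      protected boolean evaluate(CrawlURI uri) {
          // Apply the configured decision when the URI contains the substring.
          return uri.getUURI().toString().contains(substring);
      }
  }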
Robots • How can you ignore robots.txt but still keep track of which URLs were excluded by it, without consulting the robots.txt file post-crawl? • PreconditionEnforcer • calculateRobotsOnly • Each URL that would have been excluded gets an annotation in the crawl.log (sketch follows)
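A minimal scripting-console sketch of switching this on; the bean name "preconditions" is the PreconditionEnforcer's id in the default profile and is an assumption here. The same property can of course be set in the job's CXML instead.
  // Scripting console (BeanShell) sketch: calculate robots.txt status without enforcing it.
  // URIs that robots.txt would have excluded are still fetched, but are annotated
  // in crawl.log so they can be identified without re-reading robots.txt post-crawl.
  appCtx.getBean("preconditions").setCalculateRobotsOnly(true);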