
Crawl operators workshop


Presentation Transcript


  1. Crawl operators workshop IIPC GA 2014 – Paris Kristinn Sigurðsson

  2. Scope
  • A sequence of DecideRules
  • All rules are processed
  • Each rule will either be a match or not
  • If not a match, the rule will PASS (have no effect)
  • If it matches, it will either:
    • ACCEPT (means that the URI should be ruled in scope)
    • REJECT (means that the URI should be ruled out of scope)
  • The last rule that does not PASS "wins" (see the sketch below)
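  To make that contract concrete, here is a minimal, self-contained Java sketch of the evaluation order. It is illustrative only, not Heritrix's actual implementation:

      import java.util.List;

      class ScopeSketch {
          enum Decision { ACCEPT, REJECT, PASS }

          interface Rule { Decision decide(String uri); }

          // Every rule runs; the last rule that does not PASS decides the outcome.
          static boolean inScope(List<Rule> rules, String uri) {
              Decision result = Decision.REJECT; // nothing is in scope by default
              for (Rule rule : rules) {
                  Decision d = rule.decide(uri);
                  if (d != Decision.PASS) {
                      result = d; // the last non-PASS rule "wins"
                  }
              }
              return result == Decision.ACCEPT;
          }
      }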

  3. Common example (a console sketch for inspecting the sequence follows)
  • REJECT RejectDecideRule
    • Default position: not in scope
    • Nothing gets through unless some rule explicitly decides it should
  • ACCEPT SurtPrefixedDecideRule
    • Rules items in scope based on their domain
    • Often uses the seeds list as a source for allowed domains
  • REJECT TooManyHopsDecideRule
    • Throws out items that are too far from the seeds
  • ACCEPT TransclusionDecideRule
    • Gets embeds on domains that are otherwise not in scope
  • REJECT MatchesListRegexDecideRule
    • Filters out known bad actors
    • Regular expressions are used to match problem URIs
    • This rule can also be configured to ACCEPT, but it is rarely used that way in scoping
  • ACCEPT PrerequisiteAcceptDecideRule
    • Regardless of anything else, we still want to fetch any prerequisites (dns:, robots.txt)
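  In the H3 scripting console you can inspect the rule sequence of a running job with a couple of lines of BeanShell. This sketch assumes the scope is defined with the default bean id "scope", as in the stock crawler-beans.cxml:

      seq = appCtx.getBean("scope");
      for (rule : seq.rules) {
          rawOut.println(rule);
      }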

  4. HopsPathMatchesRegexDecideRule
  • For RSS crawling, it was really important to have a tightly controlled scope
    • Just the embeds
    • Still need to follow redirects, though
  • Works like MatchesRegexDecideRule
    • Except the regular expression is applied to the "hop path" from the seed
  • .R?((E{0,2})|XE?)
    • Allows one redirect, then up to two levels of embeds, or a speculative embed and then an embed
  • Can also be used to avoid excessive speculative embeds
    • ^[^X]*X[^X]*X[^X]*X[^X]*X.*$
    • Allows a maximum of 4 speculative embeds on the hop path
  • Can also be used with sheets/overrides to affect only select sites with known issues
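  The patterns above can be sanity-checked with plain java.util.regex before deploying them. The hop paths below are made-up examples (L = link, R = redirect, E = embed, X = speculative embed):

      import java.util.regex.Pattern;

      class HopPathDemo {
          public static void main(String[] args) {
              Pattern rss = Pattern.compile(".R?((E{0,2})|XE?)");
              System.out.println(rss.matcher("LRE").matches()); // true: redirect, then embed
              System.out.println(rss.matcher("LXE").matches()); // true: speculative embed, then embed
              System.out.println(rss.matcher("LLL").matches()); // false: ordinary links are out

              // Matches any hop path containing four or more speculative embeds
              Pattern manyX = Pattern.compile("^[^X]*X[^X]*X[^X]*X[^X]*X.*$");
              System.out.println(manyX.matcher("LXEXLXLX").matches()); // true: 4 X hops
              System.out.println(manyX.matcher("LXEX").matches());     // false: only 2
          }
      }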

  5. More rules
  • MatchesFilePatternDecideRule
    • Helpful utility version of MatchesRegexDecideRule
    • Has pre-compiled regular expressions that match common filetypes
  • PathologicalPathDecideRule
    • Rejects paths with more than X identical path segments
    • X is 2 by default
  • ScriptedDecideRule
    • Allows the operator to specify arbitrary conditions, expressed as a BeanShell script (see the sketch below)
    • BeanShell scripts are also used in the H3 scripting console
    • http://www.beanshell.org/
    • Offers great flexibility
    • For regularly used actions it may be better to create a custom DecideRule
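  A minimal ScriptedDecideRule sketch in BeanShell. It assumes the candidate CrawlURI is bound into the script as "object" and that the script returns a DecideResult; verify both against the ScriptedDecideRule javadoc for your Heritrix version:

      import org.archive.modules.deciderules.DecideResult;

      // Reject session-id style URIs; otherwise leave the decision to other rules.
      if (object.toString().toLowerCase().contains("jsessionid")) {
          return DecideResult.REJECT;
      }
      return DecideResult.NONE; // equivalent of PASS: no effect on the decision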

  6. Keep in mind
  • Some DecideRules operate on content
    • E.g. ContentLengthDecideRule
  • These will not work for scoping, since scoping decisions are made before a URI is fetched and its content is known

  7. Getting more advanced

  8. Adding a regex to a decide rule
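  This can also be approximated from the scripting console by appending a pattern to an existing MatchesListRegexDecideRule. A sketch only: the bean id "listRegexFilter" and the regexList property name are assumptions, so check your crawler-beans.cxml and the javadoc for your version:

      rule = appCtx.getBean("listRegexFilter"); // assumed bean id
      rule.regexList.add(java.util.regex.Pattern.compile(".*/calendar/.*"));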

  9. Working with sheets
  • Add a SURT to a sheet override:

      appCtx.getBean("sheetOverlaysManager")
          .addSurtAssociation("[SURT]", "[The sheet's bean name]");

  • Add a rule to a sheet:

      appCtx.getBean("[Sheet ID]").map.put("[KEY]", "[VALUE]");
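  A concrete, hypothetical example: associate example.com with an already-defined sheet named "slowSheet" and have that sheet raise the politeness delay. The sheet name is invented, and the "disposition.delayFactor" key assumes the default bean id of the DispositionProcessor:

      appCtx.getBean("sheetOverlaysManager")
          .addSurtAssociation("http://(com,example,", "slowSheet");
      appCtx.getBean("slowSheet").map.put("disposition.delayFactor", "10");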

  10. Canonicalization
  • Check the canonicalization of a URL:

      rawOut.println(appCtx.getBean("canonicalizationPolicy")
          .canonicalize("URL"));

  • Add a RegexRule canonicalization rule:

      org.archive.modules.canonicalize.RegexRule rule =
          new org.archive.modules.canonicalize.RegexRule();
      rule.setRegex(java.util.regex.Pattern.compile("regex"));
      rule.setFormat("format"); // Optional! Defaults to "$1"
      appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);
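  For instance, a rule to strip ";jsessionid=..." session tokens might look like this; the regex and test URL are illustrative, and the bean names are the ones from the snippet above:

      rule = new org.archive.modules.canonicalize.RegexRule();
      rule.setRegex(java.util.regex.Pattern.compile(
          "^(.*?)(?:;jsessionid=[0-9A-Za-z]+)(.*)$"));
      rule.setFormat("$1$2");
      appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);

      // Verify the result:
      rawOut.println(appCtx.getBean("canonicalizationPolicy")
          .canonicalize("http://example.com/cart;jsessionid=0A1B2C3D4E5F?item=1"));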

  11. Getting really advanced
  • Creating custom modules for Heritrix isn't that difficult, assuming a modest amount of Java knowledge
  • Use a recent version of Eclipse
  • Create a Maven project with a "provided" dependency on Heritrix
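  As a sketch of how small such a module can be, here is a hypothetical rule built on PredicatedDecideRule (Heritrix 3.x; the class, package, and property are invented for illustration). PredicatedDecideRule applies its configured decision, ACCEPT unless overridden, when evaluate() returns true, and PASSes otherwise:

      package org.example.modules; // hypothetical package

      import org.archive.modules.CrawlURI;
      import org.archive.modules.deciderules.PredicatedDecideRule;

      // Matches URIs whose query string exceeds a configurable length;
      // wire it with decision=REJECT to rule such URIs out of scope.
      public class QueryStringLengthDecideRule extends PredicatedDecideRule {
          private static final long serialVersionUID = 1L;

          int maxQueryLength = 200;
          public int getMaxQueryLength() { return maxQueryLength; }
          public void setMaxQueryLength(int m) { maxQueryLength = m; }

          @Override
          protected boolean evaluate(CrawlURI curi) {
              String uri = curi.toString();
              int q = uri.indexOf('?');
              return q >= 0 && uri.length() - q - 1 > maxQueryLength;
          }
      }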

  12. Custom modules cont. • Then set up a run configuration that uses the org.archive.crawler.Heritrix class from the dependency

  13. Custom modules cont.
  • Then just write new modules in your project and wire them in via the CXML configuration as usual (see the sketch below)
  • CXML configuration
    • https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Jobs+and+Profiles
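  Wiring the hypothetical rule from slide 11 into the scope is then a few lines of CXML in crawler-beans.cxml (standard Spring bean syntax; the surrounding rules are elided):

      <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
        <property name="rules">
          <list>
            <!-- ... the usual scoping rules ... -->
            <bean class="org.example.modules.QueryStringLengthDecideRule">
              <property name="decision" value="REJECT" />
              <property name="maxQueryLength" value="200" />
            </bean>
          </list>
        </property>
      </bean>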

  14. Robots
  • How can you ignore robots.txt but still keep track of which URLs were excluded by it?
    • Without consulting the robots.txt file post-crawl
  • PreconditionEnforcer
    • calculateRobotsOnly
  • Each URL that would have been excluded gets an annotation in the crawl.log
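  From the scripting console this is a one-liner; the sketch assumes the PreconditionEnforcer keeps its default bean id "preconditions" from the stock crawler-beans.cxml (the same property can of course be set directly in the CXML instead):

      appCtx.getBean("preconditions").setCalculateRobotsOnly(true);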
