
Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot

User:Bináris, Hungarian Wikipedia & Pywikipedia developer team. Wikimania 2012. From Budapest.

Presentation Transcript


  1. Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot. User:Bináris, Hungarian Wikipedia & Pywikipedia developer team. Wikimania 2012. From Budapest.

  2. Useful links: [[meta:User:Bináris]]. Check it now on your laptop to follow along.

  3. What is this about? My spellchecker underlined "occurence". Wiktionary: "Noun. occurence: Common misspelling of occurrence." A search in English Wikipedia: Results 1–20 of 333,623 for "occurence". Does this include every erroneous form?

  4. We are talking about the Pywikipedia bot framework: replace.py and fixes.py. This works on every MediaWiki installation!

  5. Some ideas • Spelling corrections • Linking and unlinking • Mass change of section titles • Enforcing naming conventions • Replacing templates • Replacing template parameters • Placing templates • Correcting link errors

  6. Decisions • Command line parameters or fix? • Searching in live wiki or in dump? • Search & replace in one run or separately? • Simple text replacements or regular expressions? • Manual or automatic running?

  7. The two-pass model of replacement • Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic. • Optionally uploading the list to your wiki (line numbers help with cleanup). • Making the actual replacements. Faster (or very fast) and attended.
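The two passes can be sketched in plain Python. This is only an illustration under stated assumptions: `collect_candidates`, `apply_replacements` and the in-memory `pages` dictionary are hypothetical stand-ins, not Pywikibot functions. With the real bot, the first pass writes its list through -save / -savenew, and the second pass is attended rather than applied blindly as it is here.

```python
import re
import tempfile
from pathlib import Path


def collect_candidates(pages, pattern, list_path):
    """Pass 1: write the titles of pages whose text matches, one per line."""
    regex = re.compile(pattern)
    titles = [title for title, text in pages.items() if regex.search(text)]
    Path(list_path).write_text("\n".join(titles), encoding="utf-8")
    return titles


def apply_replacements(pages, pattern, repl, list_path):
    """Pass 2: revisit only the listed pages and replace there."""
    regex = re.compile(pattern)
    titles = Path(list_path).read_text(encoding="utf-8").splitlines()
    for title in titles:
        if title in pages:  # the list may have been cleaned by hand
            pages[title] = regex.sub(repl, pages[title])
    return pages


# Hypothetical stand-in for wiki pages (title -> wikitext).
pages = {
    "Page A": "A rare occurence.",
    "Page B": "Nothing to fix here.",
    "Page C": "Another occurence of the typo.",
}

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "candidates.txt"
    found = collect_candidates(pages, r"\boccurence\b", list_file)
    fixed = apply_replacements(pages, r"\boccurence\b", "occurrence", list_file)

print(found)
print(fixed["Page A"])
```

The point of the split is that the slow scan touches every page once, while the second run only opens pages already known to contain a hit.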

  8. Decisions • Command line parameters or fix? • Searching in live wiki or in dump? • Search & replace in one run or separately? • Simple text replacements or regular expressions? • Manual or automatic running?

  9. What is a fix? • A fix contains a replacement task. • See the links on my Meta page for description & examples

  10. The magic of regular expressions

  11. Decisions • Command line parameters or fix? • Searching in live wiki or in dump? • Search & replace in one run or separately? • Simple text replacements or regular expressions? • Manual or automatic running?

  12. Regular expressions • color → colour: this is concrete and accidental (and uninteresting :-P) • What about changing [[január 4]]. to [[január 4.]] and [[január 4]]-én to [[január 4.|január 4]]-én? (For all dates, of course.) • Or July 13, 2012 and 13 July 2012 to 2012-07-13, and 7/13/2012 to 2012-07-13 (ISO 8601) within tables? • Or color, Color, c/Colorful, c/Colorfulness to colour… (but not Colorado and colorectal cancer)? Note! Colorful (film) and (manga) and CSS colors go to exceptions! (Why? Sure? How to decide?)
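Two of the tasks above can be sketched with Python's `re` module. This is a minimal illustration, not the fix used in the talk: the names `COLOUR`, `DATE_DOT`, `DATE_SUFFIX` and `apply_all` are made up here, and the page-level exceptions mentioned on the slide (Colorful (film), CSS colors) are not handled.

```python
import re

# color -> colour, keeping the case of the first letter. The negative
# lookahead leaves Colorado and colorectal alone.
COLOUR = (re.compile(r"\b([Cc])olor(?!ado|ectal)"), r"\1olour")

HU_MONTHS = ("január|február|március|április|május|június|július|"
             "augusztus|szeptember|október|november|december")

# [[január 4]]. -> [[január 4.]]  (the full stop moves inside the link)
DATE_DOT = (re.compile(r"\[\[((?:%s) \d{1,2})\]\]\." % HU_MONTHS), r"[[\1.]]")

# [[január 4]]-én -> [[január 4.|január 4]]-én  (piped link before a suffix)
DATE_SUFFIX = (re.compile(r"\[\[((?:%s) \d{1,2})\]\]-" % HU_MONTHS),
               r"[[\1.|\1]]-")


def apply_all(text, rules):
    """Apply each (compiled regex, replacement) pair in order."""
    for regex, repl in rules:
        text = regex.sub(repl, text)
    return text


print(apply_all("Colorful colors in Colorado", [COLOUR]))
print(apply_all("[[január 4]]. és [[március 15]]-én", [DATE_DOT, DATE_SUFFIX]))
```

One pattern covers every date because the month names sit in a single alternation and the day is matched as one or two digits.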

  13. Regular expressions • Regular expressions form a simple programming language that searches for patterns and replaces them with patterns. • Learn them; they are worth it! Another dimension of efficiency.

  14. Example: search for a date such as July 13, 2012 (a regex-like analysis) • A month name (possibly in lower case or abbreviated as Jul) • Zero, one or more spaces • 1…9 OR 0 followed by 1…9 OR 1 or 2 followed by 0…9 OR 3 followed by 0 or 1 • An optional comma • Zero, one or more spaces (at least one if there is no comma) • At most four digits (are 1- and 2-digit years worth it?)
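The analysis above translates almost line by line into a Python regex. The sketch below is one possible reading, with assumptions labelled in the comments: English month names only, three-letter abbreviations, and years of three or four digits. `DATE_RE` and `find_iso_dates` are names invented for this example.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name[:3]: number for number, name in enumerate(MONTHS, 1)}

# Full names come first so that "July" is preferred over "Jul".
MONTH_ALT = "|".join(MONTHS + [name[:3] for name in MONTHS])

DATE_RE = re.compile(
    r"\b(?P<month>" + MONTH_ALT + r")"   # month name or abbreviation
    r"\s*"                               # zero or more spaces
    r"(?P<day>[12][0-9]|3[01]|0?[1-9])"  # 1..31, optional leading zero
    r"(?:,\s*|\s+)"                      # comma, or at least one space
    r"(?P<year>\d{3,4})\b",              # assumption: 3- or 4-digit years
    re.IGNORECASE,
)


def find_iso_dates(text):
    """Return every matched date in ISO 8601 form (YYYY-MM-DD)."""
    return [
        "%04d-%02d-%02d" % (
            int(m.group("year")),
            MONTH_NUMBER[m.group("month")[:3].lower()],
            int(m.group("day")),
        )
        for m in DATE_RE.finditer(text)
    ]


print(find_iso_dates("July 13, 2012; july 13,2012; Jul 3 2012; July 2012"))
```

The two-digit alternatives for the day come before the one-digit one so that "13" is not read as day 1. "July 2012" is rejected because no separator follows the candidate day.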

  15. First theorem: The more hits and the more precise matching you want, the more complex the regex will be. (Do you want to find july? Do you want to find July 13,2012? Do you want to find Jul 13, 2012?)

  16. Example: agents (search & replace)
      'replacements': [
          (ur'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', ur'\1-ügynök'),
          (ur'(FBI|CIA|KGB|MI ?\d\]\]) [üÜ]gynök(?!e)', ur'\1-ügynök'),
      ],
      • An agency (MI followed by an optional space and a digit) • A space • Ügynök OR ügynök, but NOT ügynöke (hyphen prohibited) • Second line: a linked agency • Result: a hyphenated, lower-case agent (= ügynök in Hungarian) • NB it was preceded by some searches! Not all agencies are here.
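The same replacement pair can be tried outside the bot in Python 3, where the `ur''` prefix no longer exists. This is a hedged adaptation, not the original fix: `hyphenate_agents` is a made-up helper, and the second pattern groups the alternation so that the closing `]]` applies to every agency, which is an assumption about the intent of "a linked agency".

```python
import re

REPLACEMENTS = [
    # Plain agency name, a space, then ügynök/Ügynök (but not ügynöke).
    (r"(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)", r"\1-ügynök"),
    # Linked agency: the inner group makes ]] follow any of the names.
    (r"((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)", r"\1-ügynök"),
]


def hyphenate_agents(text):
    """Apply both replacements in order and return the new text."""
    for pattern, repl in REPLACEMENTS:
        text = re.sub(pattern, repl, text)
    return text


for sample in ["FBI ügynök", "MI 6 Ügynök", "[[KGB]] ügynök", "CIA ügynöke"]:
    print(sample, "->", hyphenate_agents(sample))
```

The negative lookahead `(?!e)` is what keeps "CIA ügynöke" untouched, as the slide requires.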

  17. Example: exceptions with regexes
      BaseExceptions = {
          'inside-tags': [
              'hyperlink',
              'interwiki',
          ],
          'text-contains': [
              ur'(?i)(\{\{szinnyei|\{\{pallas\}|\{\{fényes\}|\{\{vályi\}|Vályi András|Fényes Elek|\{\{sicc\})',
          ],
          'inside': [
              r'\{\{DEFAULTSORT:.*?\}\}',  # DEFAULTSORT deliberately contains words without diacritics.
              ur'<ref name.*?>',
              # All kinds of citation templates:
              ur'(?is)\{\{cite.*?\}\}',  # All the cite-whatever templates (there is not always a space)
              ur'(?is)\{\{cit(lib|per).*?\}\}',  # CitLib and CitPer (a space is not guaranteed, it may be |)
              ur'(?is)\{\{citation .*?\}\}',
          ],
          'title': [
              ur'\d{4} a jogalkotásban',  # "<year> in legislation" articles
          ],
      }
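To see what an 'inside' exception does conceptually, here is a small standalone sketch. It is not how Pywikibot implements exceptions; `replace_outside` is a hypothetical helper that skips any match lying entirely within a span matched by one of the exception regexes.

```python
import re


def replace_outside(text, pattern, repl, inside_exceptions):
    """Replace matches of pattern unless they sit inside an excepted span."""
    protected = []
    for exception in inside_exceptions:
        protected.extend(m.span() for m in re.finditer(exception, text))

    def substitute(match):
        for start, end in protected:
            if start <= match.start() and match.end() <= end:
                return match.group(0)  # leave the original text untouched
        return match.expand(repl)

    return re.sub(pattern, substitute, text)


exceptions = [
    r"\{\{DEFAULTSORT:.*?\}\}",
    r"(?is)\{\{cite.*?\}\}",
]
wikitext = "{{DEFAULTSORT:Color}} The color red. {{cite web|title=Color}}"
print(replace_outside(wikitext, r"\b([Cc])olor\b", r"\1olour", exceptions))
```

Only the occurrence in running text changes; the sort key and the citation title stay as they were.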

  18. What is to be excepted? • Keywords

  19. Advanced level • Fixes and functions – own Python functions

  20. Workflow

  21. Simple replacement tasks • Find an idea • Create the replacement • Find a good selector (search*, category…) • Do the work with two fingers (y/enter, then /enter) (asynchronous save!) • Imagine that this slide and the next one are a flowchart. *Unfortunately, no regexes in the MediaWiki search engine

  22. Advanced replacement tasks • Find an idea • Create the first version of the replacement • Test it as usual in software development • Watch it working during collection • Create a test page with purposeful errors • Take care of [[link]]ed & [[link|piped]] versions! • Found false positives? Missing replacements? Is it too slow? Are the previous problems solved as far as possible? Refine your regexes and/or exceptions • Press Ctrl+C, and da capo al fine • If the fix is good enough, begin the work. • Maintain fixes & exceptions continuously

  23. Decisions • Command line parameters or fix? • Searching in live wiki or in dump? • Search & replace in one run or separately? • Simple text replacements or regular expressions? • Manual or automatic running?

  24. Why manually? • Color as CSS property • % next to a number – may be an operation • Misspelled word – may be an example in a linguistic article or a quotation • RESPONSIBILITY!

  25. Second theorem: Spelling corrections must be done manually. Period.

  26. Semiautomatic running • Ingredients: • A replacement task that runs almost always correctly • One or more pizzas (depending on running time) (possibly a bottle of beer, if you like it) • Your favourite music • Stable knowledge of where your Pause button is

  27. Errors • False positives • Conflicts (originating from false positives) • Missed matches • Simply bad replacement expression • Slow fix • Inappropriate automatic running • Unnecessary changes because of fatigue • Unnecessary changes because of incompetence: change the bot owner!

  28. Third theorem The more hits you want, the more conflicts you get. This is the game. Find the balance.

  29. Speed

  30. Speed • Complex fixes may run slower • Exceptions make it slower • Lookbehinds make it slower • Recursive run and allowoverlap are definitely slow (risk of infinite loop!) • A fix will be slow if the beginning of the expression matches far more often than the trailing part (see examples in fixes.py)

  31. Speed Fast replacements take the titles from • -search • -cat & al • -links • -transcludes • -file etc.

  32. The two-pass model of replacement • Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic. • Optionally uploading the list to your wiki. • Making the actual replacements. Faster (or very fast) and attended.

  33. Decisions • Command line parameters or fix? • Searching in live wiki or in dump? • Search & replace in one run or separately? • Simple text replacements or regular expressions? • Manual or automatic running?

  34. Efficiency

  35. What does it mean? • Find as many occurrences as possible (even if agglutinated) • Find as few false positives as possible • Face as few correction conflicts as possible • Always give the appropriate replacement • Let the bot work quickly; don’t wait in front of the screen

  36. Keys to efficiency • If you find a very efficient replacement (close to 100%), run it separately before the others in the same package: you will have fewer conflicts (but you may collect them together). • Packages that are too big may run slowly and have a greater chance of causing correction conflicts. Sometimes it is worth splitting them into smaller parts. • Packages that are too small waste more dead time during preparation and execution. Sometimes it is worth combining them. • How to decide then? Just watch.

  37. Keys to efficiency • Use exceptions when appropriate. They will decrease false positives as well as correction conflicts. E.g.: • Cite book, cite web, cite anything templates • URLs, image names (even as template parameters and gallery images!) • Templates marking pages out of your scope (old authors in Hungarian Wikipedia whose quotations contain old-style spelling) • Titles marking pages out of your scope (year numbers in law in Hungarian Wikipedia) • …and first of all: improve your regexes continuously!

  38. Keys to efficiency • Once you have found a false positive, save it for later use: -saveexc / -saveexcnew. Then insert these titles into your exceptions. • Run searches before/during creation of a fix. • Don’t deal with tasks that are not worth a bot! • Use the two-pass model and the dump whenever possible!

  39. An ugly example: I have a fix to correct short and long i (i/í). Argentína has an í, but the unaccented form often occurs in English and Spanish titles → no regex for it, title exceptions must be used → separate fix. But they may be collected together.

  40. A less ugly example • replace.py ásnéven "ás néven" -search:másnéven -ns:0 -summary:"Helyesírás javítása kézi botszerkesztéssel: más néven" (the summary means "Spelling correction by manual bot editing: más néven") • Live demo

  41. Character encoding problems • Keep your files in UTF-8, and don’t use Windows Notepad. • E.g. setting in Notepad++:

  42. Character encoding problems • If it doesn’t work on the command line, write a fix. • If you can’t solve it with a fix, use URL encoding:
      replace.py -catr:Венгрия . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always
      replace.py -catr:%D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always
      • Live demo • You may store this in a script (import replace.py). • This is how page collections are made.
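The percent-encoded form of a category name does not have to be typed by hand. As a small sketch using only Python's standard library (the variable names are illustrative), `urllib.parse.quote` encodes the UTF-8 bytes of the title:

```python
from urllib.parse import quote, unquote

category = "Венгрия"  # the Russian category name used above

encoded = quote(category)  # percent-encodes the UTF-8 bytes
print(encoded)             # %D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F

# Decoding restores the original title, so nothing is lost on the way.
print(unquote(encoded) == category)
```

The encoded string contains only ASCII characters, which is why it survives a console whose code page cannot represent Cyrillic.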

  43. Page collections

  44. The two-pass model of replacement • Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic. • Optionally uploading the list to your wiki. • Making the actual replacements. Faster (or very fast) and attended.

  45. A simple idea • Gathering candidates (texts that may need replacing) into a file. Relatively slow and automatic. • Uploading the list to your wiki (this is the result!) • Nothing. You are done.

  46. Some ideas for page collections • Scheme: some existing/missing text • Articles related to Hungary in other Wikipedias (see above for ruwiki) • The Redlist Project for animals and plants • Articles with a {{commons}} template, but without any image • …let your imagination run free!

  47. Useful links [[meta:User:Bináris]] Thank you for your attention!

  48. PS – some thoughts months later • Lookahead is faster than recursion or overlapping. • If a function is called for each match, that makes the bot run really slowly. • In such cases a separate "fellow fix" without a function call is useful for faster searching.
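Why a lookahead can stand in for a recursive or overlapping run is easy to show with plain Python regexes. This illustrates general regex behaviour only, not replace.py internals, and `repeat_until_stable` is a helper made up for the example. A pattern that consumes the character after the match misses every second hit and has to be rerun, whereas a lookahead only peeks at that character and finishes in one pass.

```python
import re

text = "1,2,3,4"

# Consuming pattern: the digit after the comma is swallowed by the match,
# so the next comma has no digit left in front of it on this pass.
consuming = re.compile(r"(\d),(\d)")
one_pass = consuming.sub(r"\1, \2", text)


def repeat_until_stable(regex, repl, s):
    """Re-run the substitution until nothing changes; count changing passes."""
    passes = 0
    while True:
        new = regex.sub(repl, s)
        if new == s:
            return s, passes
        s, passes = new, passes + 1


repeated, passes = repeat_until_stable(consuming, r"\1, \2", text)

# Lookahead pattern: the following digit is checked but not consumed.
lookahead = re.compile(r"(\d),(?=\d)")
single = lookahead.sub(r"\1, ", text)

print(one_pass)          # 1, 2,3, 4
print(repeated, passes)  # 1, 2, 3, 4 2
print(single)            # 1, 2, 3, 4
```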
