LIANZA ITSIG webinar series
Text Editing
Tools, tips, tricks
Kim Shepherd
k.shepherd@auckland.ac.nz
Digital Development Team
The University of Auckland Library
Summary
• General (large) text files
  • We manage and manipulate text data daily
  • It’s tedious and time-consuming
  • Find & Replace is too limited and dangerous
  • We know there must be a better way...
• Tabular data files (e.g. spreadsheets)
  • We work with these all the time, usually in Excel
  • What tools can help us clean messy data?
Topics
• Regular Expressions
• Text Editors
• Operating on lines, not entire files
• Google Refine
Regular Expressions
/^\s+[a-zA-Z0-9](?:\W+)/
Regular Expressions
• A way to describe a set of strings and capture parts of them
• Originated in old UNIX/POSIX tools
• Now used all over the place
• Test your regexes out on the web:
  • http://gskinner.com/RegExr/
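As a minimal sketch of "describe a set of strings and capture parts of them", the Python snippet below uses the standard `re` module. The sample strings and the helper names `flip_name` and `replace_word` are invented for illustration and do not come from the talk.

```python
import re

# Matches "Surname, Firstname" and captures each part in its own group.
name_pattern = re.compile(r"^(\w+),\s*(\w+)$")

def flip_name(text):
    """Rewrite 'Surname, Firstname' as 'Firstname Surname'; leave other text alone."""
    return name_pattern.sub(r"\2 \1", text)

def replace_word(text, old, new):
    """Replace `old` only where it stands as a whole word (\\b = word boundary)."""
    return re.sub(r"\b" + re.escape(old) + r"\b", new, text)

flipped = flip_name("Shepherd, Kim")

# A plain find-and-replace also rewrites the start of "catalogue";
# the word-boundary regex does not.
naive = "cat catalogue".replace("cat", "dog")
careful = replace_word("cat catalogue", "cat", "dog")
```

Here `flipped` is `"Kim Shepherd"`, `naive` is `"dog dogalogue"` and `careful` is `"dog catalogue"`, which is one concrete way a blunt Find & Replace can damage data.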
Text Editors & Useful Languages
sed, grep, awk
Text Editors
• Word processors aren’t text editors
• Shop around, compare features
• My favourite: Vim (UNIX, Windows, Mac)
• Wikipedia comparison of editor features
• Wikipedia list of regex software
Useful Languages / Interpreters
• Perl
  • An old favourite, great for string manipulation
• Python
  • The cool kids tell me it’s better than Perl
• GREL
  • We’ll get to this later...
Line-by-line processing
while(<STDIN>) { .... }
Line-by-line processing
• Large files are large!
  • If they’re big on disk, they’ll be big in memory
• Lines are (usually!) small
  • Read a line
  • Do something with it
  • Output the modified line
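The read, transform, output loop above can be sketched in Python as follows. The transform used here (collapsing runs of whitespace) is only a placeholder assumption, and `process_line` and `process_stream` are hypothetical names rather than anything from the talk.

```python
def process_line(line):
    """Placeholder transform: collapse runs of whitespace and trim both ends."""
    return " ".join(line.split())

def process_stream(infile, outfile):
    """Handle one line at a time, so memory use tracks line length, not file size."""
    for line in infile:  # file objects yield lines lazily
        outfile.write(process_line(line) + "\n")
```

Calling `process_stream(sys.stdin, sys.stdout)` after `import sys` turns this into a filter that can sit in a shell pipeline, much like the Perl `while(<STDIN>)` idiom.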
Google Refine
• Cleans messy tabular data
• Easy faceting and filtering of columns/values
• Easy transformation of values
  • Google Refine Expression Language (GREL)
  • Extensive use of regular expressions and other standard string manipulation techniques
• Other features
  • Perform web service calls directly, reconcile row IDs
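To give a rough feel for faceting and value transformation, here is a Python analogue using only the standard library. It is not GREL and not how Google Refine is implemented; the sample data and the `clean` helper are assumptions made up for this sketch.

```python
import csv
import io
from collections import Counter

# Hypothetical messy data: one publisher typed three different ways.
raw = (
    "title,publisher\n"
    "Book A,  Penguin \n"
    "Book B,penguin\n"
    "Book C,Penguin\n"
    "Book D,Vintage\n"
)

def clean(value):
    """Trim, collapse internal whitespace, and normalise capitalisation."""
    return " ".join(value.split()).title()

rows = list(csv.DictReader(io.StringIO(raw)))

# "Facet" on the raw column: count each distinct value exactly as typed.
facet_before = Counter(row["publisher"] for row in rows)

# Apply the transform to every cell in the column, then facet again.
for row in rows:
    row["publisher"] = clean(row["publisher"])
facet_after = Counter(row["publisher"] for row in rows)
```

Before cleaning, the facet shows four distinct publisher values; afterwards it shows two, `Penguin` (3 rows) and `Vintage` (1 row). A faceting tool makes this kind of inconsistency visible and fixable without writing the loop by hand.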
Conclusion
• Our problems are solvable!
  • Regular expressions
  • Decent text editors for general/unformatted text
  • Google Refine for tabular data
• Contact me
  • Please feel free to contact me with questions, corrections or ideas
  • k.shepherd@auckland.ac.nz
  • Twitter: @kimshepherd
  • Google+: kim.shepherd@gmail.com