The Language of Citizen Science research group at the University of Portsmouth designed this system for downloading, preparing, and managing linguistic data from online Citizen Science forums such as Zooniverse. The method combines a UNIX-based approach to compiling corpora, automation with xdotool, and a management system on Google Drive. It is quick, flexible, and free, relying on open-source tools, and the resulting corpora can be analysed with a range of corpus software. It also encourages corpus linguists to gain coding skills and keeps an orderly record of progress.
‘Capturing the Zoo’: A system for downloading, preparing, and managing corpus data from online forums
John Williams | john.x.williams@port.ac.uk | @lexicoj0hn
Claudia Viggiano | claudia.viggiano@port.ac.uk | @thisiswater_
Who are we and what are we doing? • Language of Citizen Science (LOCS) research group, University of Portsmouth • ‘Citizen science’ (CS) = online collaboration between scientists and members of the public who volunteer to take part in research; the ‘crowdsourcing’ of scientific research • 7 researchers in different areas of linguistics linked to CS researchers in other departments (Economics, Cosmology) • Overarching research questions: what factors motivate volunteers to take part in CS projects, and what factors demotivate them? What part does language play in this? • Our task is to capture and interrogate linguistic data from online CS forums (Zooniverse)
Zooniverse • Umbrella site • Over 40 projects in different scientific domains • Each project includes a ‘Talk’ section (online discussion forum)
Corpus building • Our team was tasked with compiling corpora from the Talk sections for each of the Zooniverse projects • UNIX-based approach to downloading and cleaning the data (Linux) • Separate corpora for each forum thread • Some threads are short (micro corpora): 1 page, 1 reply • Some threads are very long: 7014 pages, 105200 replies • From these we can compile larger corpora: • A corpus for each project (macro corpus) • User-specific corpora • Themed corpora across projects (e.g. introductions, general chat threads)
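As an illustration of compiling a macro corpus from per-thread corpora, here is a minimal sketch in POSIX shell. The directory layout and the function name `build_macro_corpus` are assumptions made for the example, not the group's actual scripts.

```shell
#!/bin/sh
# Assumed (hypothetical) layout: one cleaned file per thread at
#   <corpus_root>/<project>/threads/<thread_id>.txt

# Concatenate every thread file of one project into a single macro corpus.
build_macro_corpus() {
    root=$1
    project=$2
    out="$root/$project/macro_corpus.txt"
    : > "$out"                              # start from an empty file
    for f in "$root/$project"/threads/*.txt; do
        [ -e "$f" ] || continue             # skip if there are no thread files yet
        cat "$f" >> "$out"
        printf '\n' >> "$out"               # blank line between threads
    done
    printf '%s\n' "$out"                    # report where the corpus was written
}
```

Calling `build_macro_corpus corpora galaxyzoo` would write `corpora/galaxyzoo/macro_corpus.txt`. Themed or user-specific corpora could be built the same way by changing which thread files are selected.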
Challenges for methodology • Different researchers with converging yet distinct research interests • Constantly increasing number of Zooniverse projects, threads and posts • No predictable system for numerically identifying threads in URLs • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ10066h4 • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ0001lf1 • Three different forum formats (type_0, type_1, type_2) • type_1 and type_2 are ‘enhanced’ by JavaScript: the content cannot be downloaded with simple UNIX commands, e.g. wget, lynx
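Because discussion IDs cannot be predicted, thread URLs have to be collected rather than generated. A script can still split a collected URL into its board and discussion IDs, for example to name the output file. The sketch below assumes the URL shape shown in the two examples above; the function name `thread_ids` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split a Talk thread URL into board and discussion IDs.
# Assumes the shape .../#/boards/<BOARD>/discussions/<DISCUSSION>
thread_ids() {
    url=$1
    case $url in
        */boards/*/discussions/*) ;;
        *) echo "unrecognised thread URL: $url" >&2; return 1 ;;
    esac
    rest=${url#*/boards/}                   # strip everything up to the board ID
    board=${rest%%/*}                       # text before the next slash
    discussion=${rest#*/discussions/}       # text after /discussions/
    printf '%s %s\n' "$board" "$discussion"
}
```

For the first example URL this prints the board ID `BGZ0000001` followed by the discussion ID `DGZ10066h4`.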
Forum formats • [Screenshots of the three forum formats: type_0, type_2, type_1]
Problem: JavaScript blocks content • [Screenshot: type_1 thread to download] • [Screenshot: output of lynx command] • However, manual copying and pasting works…
Solution: xdotool • http://www.semicomplete.com/projects/xdotool/ • Open-source package for Linux or Mac that simulates keystrokes and mouse movements • xdotool can therefore automate copying and pasting • Ctrl+A, Ctrl+C, Ctrl+V (select all, copy, paste) • Can be incorporated into UNIX shell scripts • Download data → clean up data → save to corpus .txt files • Clean up = removing boilerplate text and superfluous metadata, and tagging poster and timestamp information • Can open and close browser pages (indeed any windows) • Possible Windows equivalent: AutoHotkey • https://autohotkey.com/
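The copy-and-paste automation might look roughly like the sketch below. It is not the group's script, and several things are assumptions: the page is opened with `xdg-open` in a browser that is already running, the wait time is a guess, and `xclip` reads the clipboard in place of a simulated Ctrl+V. The boilerplate list is a hypothetical file of lines to discard. Poster and timestamp tagging is left out because it depends on each forum format.

```shell
#!/bin/sh
# Sketch only: browser choice, wait time, and the clipboard tool are assumptions.

# Run a command, or only print it when DRY_RUN=1 (to check the key sequence safely).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Open a thread page, select all, copy, and save the clipboard to a raw text file.
capture_page() {
    url=$1
    raw_file=$2
    wait_secs=${3:-10}              # time allowed for JavaScript to render the page

    run xdg-open "$url"             # assumes a browser is already running
    run sleep "$wait_secs"
    run xdotool key ctrl+a          # select all (sent to the focused window)
    run xdotool key ctrl+c          # copy
    run sleep 1                     # give the clipboard a moment to fill
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'xclip -selection clipboard -o > %s\n' "$raw_file"
    else
        xclip -selection clipboard -o > "$raw_file"   # stands in for Ctrl+V
    fi
    run xdotool key ctrl+w          # close the browser tab
}

# Drop every line that exactly matches a line in a boilerplate list (hypothetical file).
# Reads the raw page text on standard input.
strip_boilerplate() {
    boilerplate_list=$1
    grep -v -x -F -f "$boilerplate_list" || [ "$?" -eq 1 ]
}
```

A run for one page could then be `capture_page "$url" raw.txt && strip_boilerplate boilerplate.txt < raw.txt > thread.txt`. Setting `DRY_RUN=1` first prints the command sequence without touching the keyboard or clipboard, which is useful because xdotool sends keystrokes to whichever window has focus.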
Where does the script get its input from? • Management system on Google Drive spreadsheet • Shared with all team members, who can request threads for download • The selected cells can be used directly as input to the download program
How the whole system works • A team member requests one or more threads for download by entering details in the spreadsheet • The same or any other team member can select multiple rows and use them as input (arguments) to the download program by pasting them into a .txt file • The spreadsheet also serves as a record of progress in downloading corpora • The team member who downloads the threads uploads the resulting corpora to a central repository (Google Drive shared folder accessible to all team members) • Corpora can then be analysed using any corpus software (AntConc, WordSmith, SketchEngine etc.)
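Spreadsheet rows pasted into a text file arrive as tab-separated columns, one row per line, so a shell loop can read them directly. The sketch below assumes a hypothetical column order (project, thread URL, requester) and only prints what it would download; the function name `process_requests` is also an assumption.

```shell
#!/bin/sh
# Hypothetical column order in the pasted rows: project <TAB> thread URL <TAB> requester
process_requests() {
    requests=$1
    tab=$(printf '\t')
    # The extra test keeps a final row that has no trailing newline.
    # Note: consecutive tabs are treated as one, so empty cells shift the columns.
    while IFS="$tab" read -r project url requester || [ -n "$project" ]; do
        [ -n "$url" ] || continue           # skip blank or incomplete rows
        printf 'download project=%s url=%s requested_by=%s\n' \
            "$project" "$url" "$requester"
    done < "$requests"
}
```

In a real script the `printf` line would be replaced by the call that downloads and cleans the thread.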
Why this methodology? • Quick, consistent and flexible method for download • It is free, and uses only open-source tools • The scripts are adaptable: a good starting point for future projects • The management system (Google Drive) is orderly, easily accessible, and a useful record of progress • Compatible with both large and small corpora • Encourages corpus linguists to gently acquire some coding skills (cf. BAAL Symposium @ Aston, May 6th)