
The Games Corpus

This presentation by Agustín Gravano (Spoken Language Processing Group, Columbia University) covers the design, implementation, and annotation of the Games Corpus. The corpus consists of recordings of pairs of participants playing three collaborative computer games, collected to study how the down-stepped intonational contour relates to information status, syntactic position, and discourse position in spontaneous speech. The presentation also covers practical issues: subject recruitment, recording logistics, transcription and alignment methods, annotation tools, file formats, and back-ups.


Presentation Transcript


  1. The Games Corpus Design, implementation and annotation Agustín Gravano agus@cs.columbia.edu Spoken Language Processing Group Columbia University

  2. The Games Corpus • Design and Implementation • Annotation "The Games Corpus" - Agustín Gravano - Columbia University

  3. The Games Corpus • Design and Implementation • Annotation

  4. Experiment Design • Goal: Study the relation between the down-stepped contour and: • Information status • Syntactic position • Discourse position • Spontaneous speech • Both monologue and dialogue

  5. Experiment Design • Three computer games. • Two players, each on a different computer. • They collaborate to perform a common task. • Totally unrestricted speech.

  6. Cards Game #1 • Player 1 (Describer) • Player 2 (Searcher) • Short monologues • Vary frequency and order of occurrence of objects on the cards.

  7. Cards Game #2 • Player 1 (Describer) • Player 2 (Searcher) • Dialogue • Vary frequency and order of occurrence of objects on the cards.

  8. Objects Game • Player 1 (Describer) • Player 2 (Searcher) • Dialogue • Vary target and surrounding objects (subject and object position).

  9. Games Session • Repeat 3 times: Cards Game #1, then Cards Game #2 • Short break (optional) • Repeat 3 times: Objects Game • Each subject participated in 2 sessions. • 12 sessions in total.

  10. Subjects • Postings: • Columbia’s webpage for temporary job ads. • Craigslist • http://www.craigslist.org • Category: Gigs → Event gigs • Problem: • People are unreliable • ~50% did not show up, or cancelled on short notice.

  11. Subjects • Possible solutions: • Give precise instructions to e-mail ALL required info: • Name, native speaker?, hearing impairments?, etc. • Ask for a phone number. • Call them and explain why it is so important for us that they show up (or cancel with adequate notice). • Increase the pay after each session. • Example: $5, $10, $15 instead of $10, $10, $10.

  12. Recording • Sound-proof booth • 2 subjects + 1 or 2 confederates. • Head-mounted mics. • Digital Audio Tape (DAT): one channel per speaker. • Wav files • One mono file per speaker. • Sample rate: 48,000 Hz • Downsampled to 16,000 Hz (but kept original files!) • ~20 hours of speech → 2.8 GB (at 16 kHz)

  13. Logs • Log everything the subjects do to a text file. • Example:
        17:03:55:234 BEGIN_EXECUTION
        17:04:04:868 NEXT_TURN
        17:04:31:837 RESULTS 97 points awarded.
        17:04:38:426 NEXT_TURN
        17:05:03:873 RESULTS 92 points awarded.
        ...
      • Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
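
The log excerpt above can be split into per-turn spans with a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes one event per line, timestamps of the form HH:MM:SS:mmm, and that a task runs from a NEXT_TURN event to the following RESULTS event. The function names are hypothetical.

```python
from datetime import timedelta

# Sample taken from the slide; one event per line is an assumption.
SAMPLE_LOG = """\
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
"""


def parse_line(line):
    """Split one log line into (time of day, event name, free-text detail)."""
    stamp, _, rest = line.strip().partition(" ")
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    time = timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)
    event, _, detail = rest.partition(" ")
    return time, event, detail


def split_into_tasks(log_text):
    """Return one (start, end) pair per NEXT_TURN ... RESULTS span (assumed task unit)."""
    tasks = []
    start = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        time, event, _ = parse_line(line)
        if event == "NEXT_TURN":
            start = time
        elif event == "RESULTS" and start is not None:
            tasks.append((start, time))
            start = None
    return tasks


tasks = split_into_tasks(SAMPLE_LOG)
for start, end in tasks:
    print(start, end, (end - start).total_seconds())
```

The resulting time spans could then be used to cut the session recordings or annotation tiers into smaller units.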

  14. The Games Corpus • Design and Implementation • Annotation

  15. Speech Processing Tools • Praat • http://www.praat.org • WaveSurfer • http://www.speech.kth.se/wavesurfer • Transcriber • http://trans.sourceforge.net

  16. Orthographic Tier - Method 1

  17. Orthographic Tier - Method 1 • Problems • Very stressful • Time consuming • Separate transcription from alignment.

  18. Orthographic Tier - Method 2 • Transcribe chunks using a web interface.

  19. Orthographic Tier - Method 2 • Transcribe chunks using a web interface. • Align each chunk automatically. • Concatenate all chunks. • Correct the alignment by hand using Praat, WaveSurfer or similar.
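
The concatenation step can be sketched as follows. This is an illustration only, under the assumption that each chunk's alignment gives word times relative to the start of that chunk and that the chunk's offset within the session is known. The function name and the sample values are hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one session-level word list.

    chunks: list of (chunk_offset_seconds, [(start, end, word), ...]),
    where start and end are relative to the beginning of the chunk.
    """
    merged = []
    for offset, words in sorted(chunks, key=lambda chunk: chunk[0]):
        for start, end, word in words:
            # Shift chunk-relative times to session time; round to milliseconds.
            merged.append((round(offset + start, 3), round(offset + end, 3), word))
    return merged


# Made-up example data: two chunks starting at 0.0 s and 5.0 s.
chunks = [
    (5.0, [(0.20, 0.70, "lion")]),
    (0.0, [(0.10, 0.45, "okay"), (0.45, 0.90, "the")]),
]
session = concatenate_chunks(chunks)
for start, end, word in session:
    print(start, end, word)
```

The merged list is what would then be corrected by hand in Praat, WaveSurfer or a similar tool.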

  20. Orthographic Tier - Method 2 • Advantages • Transcription task is very comfortable. • Most of the alignment task is done automatically. Only fine-grained hand corrections are needed. • Problems • Overhead: chunking, automatic alignment, concatenation. • Error prone! Easy for humans to overlook errors in the automatic alignment.

  21. Orthographic Tier - Method 3 • Transcribe the whole file, using: • a regular audio player (e.g., Windows Media Player), and • a regular plain-text editor (e.g., Notepad). • Use WaveSurfer to align the words. • “Load text labels” function • Check out: • Spectrogram settings • Customizable shortcuts

  22. Orthographic Tier • Transcription guidelines • capital letters • abbreviations • disfluencies • mmhm, uhhuh, gotcha, etc. • Alignment guidelines • boundaries • http://www.cs.columbia.edu/~agus/games • username/password = speech/lions

  23. Too many cooks… • Concurrency problem • File locking webpage • Annotators lock a file before working on it, and release it when done.
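
The lock-then-release protocol can be illustrated with lock files on a shared disk. This is a sketch of the idea only, not the web page used in the project: it assumes a shared directory where a `.lock` file marks a file as taken, and relies on `os.open` with `O_CREAT | O_EXCL` so that only one annotator can create the lock. The function names and annotator names are hypothetical.

```python
import os
import tempfile


def acquire_lock(lock_dir, file_name, annotator):
    """Try to lock file_name for annotator; return True on success."""
    lock_path = os.path.join(lock_dir, file_name + ".lock")
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(annotator)
    return True


def release_lock(lock_dir, file_name, annotator):
    """Release the lock, but only if this annotator holds it."""
    lock_path = os.path.join(lock_dir, file_name + ".lock")
    try:
        with open(lock_path) as handle:
            owner = handle.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False
    os.remove(lock_path)
    return True


lock_dir = tempfile.mkdtemp()
first = acquire_lock(lock_dir, "s08.cards.3.B.words", "annotator_1")
second = acquire_lock(lock_dir, "s08.cards.3.B.words", "annotator_2")
released = release_lock(lock_dir, "s08.cards.3.B.words", "annotator_1")
third = acquire_lock(lock_dir, "s08.cards.3.B.words", "annotator_2")
print(first, second, released, third)
```

The second annotator is refused while the first holds the lock and succeeds once it has been released.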

  24. Annotation: Cue Words • okay, mmhm, uhhuh, right, etc. • Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. • Developed an ad-hoc application in Java. • Bad idea!!! Too long development time. • Instead, use Praat (or other general-purpose tool). • For simple, specific tasks, Praat is not difficult to learn. • Create a file with empty points at the middle point of the words that need to be labeled. • Annotators only label those words, safely ignoring the rest.
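
Generating the empty points can be sketched as below. The sketch assumes the aligned words are available as (start, end, word) tuples and writes simple "Time Value" lines rather than a Praat file, so a conversion step would still be needed for Praat. The cue-word set lists only the words named on the slide, and the function names are hypothetical.

```python
# Cue words named on the slide; the real list is longer ("etc.").
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}


def cue_word_points(word_intervals, cue_words=CUE_WORDS):
    """Return (midpoint, "") for every word that needs a cue-word label."""
    points = []
    for start, end, word in word_intervals:
        if word.lower() in cue_words:
            points.append(((start + end) / 2, ""))
    return points


def format_points(points):
    """Render points as 'Time Value' lines; empty labels leave only the time."""
    return "\n".join(f"{time:.3f} {label}".rstrip() for time, label in points)


# Made-up word alignment for illustration.
words = [(0.0, 0.4, "okay"), (0.4, 0.7, "the"), (0.7, 1.1, "lion"), (1.5, 1.9, "right")]
points = cue_word_points(words)
print(format_points(points))
```

Annotators then fill in a label at each point and can ignore every other word in the file.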

  25. Other Annotations • Turn switches • Smooth switches, interruptions, backchannels, etc. • The labeler received a Praat file with empty turns. • Prosody • ToBI Labeling Conventions: Tones and Break Indices. • Questions • Identification, form and function.

  26. Guidelines for Guidelines • Web based (password protected) • Highlight recent changes • Avoid long lists: categorize, trees.

  27. Files • games/data/session_NN/sNN.GAME.P.Y.ext • NN = 01..12 • GAME = {cards, objects} • P = 0..3 if GAME=cards, 0..1 if GAME=objects • Y = {A, B} • ext = {wav, words, tones, breaks, misc, turns, …}

  28. Files • Examples:
        games/data/session_08/
          s08.cards.3.B.wav
          s08.cards.3.B.words
          s08.cards.3.B.misc
          …
          s08.objects.1.A.wav
          s08.objects.1.A.words
          s08.objects.1.A.misc
          …
        games/data/session_11/…
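
A naming scheme like sNN.GAME.P.Y.ext is easy to validate and parse with a regular expression. The sketch below keeps the slide's field names (p, y) because the slides do not say what they stand for. The function name is hypothetical, and the range checks simply restate the values listed on the slide.

```python
import re

# Pattern for sNN.GAME.P.Y.ext as described on the slide.
FILE_NAME = re.compile(
    r"s(?P<session>\d{2})\.(?P<game>cards|objects)\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)"
)


def parse_file_name(path):
    """Return the fields of a corpus file name, or raise ValueError."""
    name = path.rsplit("/", 1)[-1]  # drop any directory part
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a corpus file name: {name}")
    info = match.groupdict()
    info["session"] = int(info["session"])
    info["p"] = int(info["p"])
    # Ranges from the slide: NN = 01..12, P = 0..3 (cards) or 0..1 (objects).
    max_p = 3 if info["game"] == "cards" else 1
    if not 1 <= info["session"] <= 12 or info["p"] > max_p:
        raise ValueError(f"field out of range in: {name}")
    return info


info = parse_file_name("games/data/session_08/s08.cards.3.B.wav")
print(info)
```

A check like this can be run over the whole data directory to catch misnamed files early.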

  29. Files Format • All files (except *.wav) are saved as plain text, in the WaveSurfer format: • Start End Value (for interval tiers) • Time Value (for point tiers) • Advantages • Human-readable. • Very easy to process. • Problems • Consistency • Rounding
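
The "very easy to process" point can be illustrated with a short parser for the two line layouts described above. This is a sketch that assumes whitespace-separated fields and one entry per line. The sample contents and function names are made up for illustration.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The value may contain spaces or be missing altogether.
        start, end, *value = line.strip().split(maxsplit=2)
        intervals.append((float(start), float(end), value[0] if value else ""))
    return intervals


def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    points = []
    for line in text.splitlines():
        if not line.strip():
            continue
        time, *value = line.strip().split(maxsplit=1)
        points.append((float(time), value[0] if value else ""))
    return points


# Hypothetical file contents.
words = parse_interval_tier("0.10 0.45 okay\n0.45 0.90 the\n")
tones = parse_point_tier("0.275 H*\n1.700\n")
print(words)
print(tones)
```

Because nothing in the format ties tiers to one another, consistency between files (for example, a tone point falling inside a word interval) has to be checked separately.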

  30. Files Format • Other formats: • XML • General-purpose mark-up language. • <TAG attribute="value"> … </TAG> • Solves problems like consistency and rounding. • Not human-readable, harder to process. • Praat • Not human-readable, hard to process. • Also has the consistency problem.

  31. Scripts • So far, we have needed dozens of Perl scripts. • Examples: • Convert between Praat and WaveSurfer formats. • Create a Praat file with empty CW labels, turns, etc. • Find typos, missing labels, and other errors. • Unify notation (e.g., “mm-hmm” → “mmhm”). • Check consistency of files. • …
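
Two of these script types, unifying notation and checking consistency, might look like the sketch below. It is written in Python rather than the Perl used in the project, and it is not the project's code. The variant table holds only the one pair given on the slide, the specific checks are assumptions about what "consistency" could cover, and the function names are hypothetical.

```python
# Only the pair from the slide; a real table would list more variants.
VARIANTS = {"mm-hmm": "mmhm"}


def unify_notation(intervals, variants=VARIANTS):
    """Replace known spelling variants with the agreed form."""
    return [(start, end, variants.get(word, word)) for start, end, word in intervals]


def check_consistency(intervals):
    """Return a list of human-readable problems found in an interval tier."""
    problems = []
    previous_end = None
    for index, (start, end, word) in enumerate(intervals):
        if end <= start:
            problems.append(f"interval {index}: end {end} is not after start {start}")
        if previous_end is not None and start < previous_end:
            problems.append(f"interval {index}: starts before previous end {previous_end}")
        if not word:
            problems.append(f"interval {index}: missing label")
        previous_end = end
    return problems


# Made-up tier with one overlap, one empty interval and one missing label.
tier = [(0.0, 0.4, "mm-hmm"), (0.3, 0.9, "the"), (0.9, 0.9, "")]
unified = unify_notation(tier)
problems = check_consistency(unified)
for problem in problems:
    print(problem)
```

Running checks like these after every annotation pass helps catch errors that are easy to overlook by eye.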

  32. Back-up! • Back up wav files only once (too heavy) in different places (DVD, 3+ computers). • Back up everything else (plain text: light) periodically, and automatically. • Configure “cron” to make a backup copy every 8 hours.
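
A periodic backup of the light, plain-text files could be done with a small script that cron runs every 8 hours. The sketch below is one possible approach, not the project's setup. The function name, archive name and script path are hypothetical, and the demo uses temporary directories in place of the real data.

```python
import os
import tarfile
import tempfile
import time

# A crontab entry such as the following would run a script every 8 hours
# (the script path is a placeholder):
#   0 */8 * * * python3 /path/to/backup_annotations.py


def backup_annotations(data_dir, backup_dir):
    """Archive every non-wav file under data_dir into a timestamped tar.gz."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(backup_dir, f"annotations-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if name.lower().endswith(".wav"):
                    continue  # audio is backed up once, separately
                full_path = os.path.join(root, name)
                archive.add(full_path, arcname=os.path.relpath(full_path, data_dir))
    return archive_path


# Demo on temporary directories with placeholder files.
data_dir = tempfile.mkdtemp()
backup_dir = tempfile.mkdtemp()
for name in ("s08.cards.3.B.wav", "s08.cards.3.B.words"):
    with open(os.path.join(data_dir, name), "w") as handle:
        handle.write("placeholder\n")
archive_path = backup_annotations(data_dir, backup_dir)
with tarfile.open(archive_path) as archive:
    archived = archive.getnames()
print(archived)
```

Keeping the backup directory outside the data directory avoids archiving earlier backups.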

  33. Timeline • Orthographic tier first! • [Figure: timeline showing design + implementation, orthographic tier, prosody (ToBI), cue words, turn switches]

