This project, undertaken by Agustín Gravano at Columbia University, focuses on the design, implementation, and annotation of "The Games Corpus." The corpus consists of recordings of participants playing three different computer games, allowing for the study of various aspects of spoken language processing and discourse. The corpus is meticulously annotated, enabling researchers to investigate the relationship between intonation patterns and information status, syntactic and discourse position, and more. The project also addresses challenges related to subject recruitment and recording logistics.
The Games Corpus Design, implementation and annotation Agustín Gravano agus@cs.columbia.edu Spoken Language Processing Group Columbia University
The Games Corpus • Design and Implementation • Annotation "The Games Corpus" - Agustín Gravano - Columbia University
Experiment Design
• Goal: Study the relation between the down-stepped contour and
  • Information status
  • Syntactic position
  • Discourse position
• Spontaneous speech
• Both monologue and dialogue
Experiment Design
• Three computer games.
• Two players, each on a different computer.
• They collaborate to perform a common task.
• Totally unrestricted speech.
Cards Game #1
Player 1 (Describer), Player 2 (Searcher)
• Short monologues
• Vary frequency and order of occurrence of objects on the cards.
Cards Game #2
Player 1 (Describer), Player 2 (Searcher)
• Dialogue
• Vary frequency and order of occurrence of objects on the cards.
Objects Game
Player 1 (Describer), Player 2 (Searcher)
• Dialogue
• Vary target and surrounding objects (subject and object position).
Games Session
• Repeat 3 times:
  • Cards Game #1
  • Cards Game #2
• Short break (optional)
• Repeat 3 times:
  • Objects Game
• Each subject participated in 2 sessions.
• 12 sessions
Subjects
• Postings:
  • Columbia’s webpage for temporary job ads.
  • Craigslist
    • http://www.craigslist.org
    • Category: Gigs → Event gigs
• Problem:
  • People are unreliable.
  • ~50% did not show up, or cancelled on short notice.
Subjects
• Possible solutions:
  • Give precise instructions to e-mail ALL required info:
    • Name, native speaker?, hearing impairments?, etc.
  • Ask for a phone number.
    • Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
  • Increase the pay after each session.
    • Example: $5, $10, $15 instead of $10, $10, $10.
Recording
• Sound-proof booth
• 2 subjects + 1 or 2 confederates.
• Head-mounted mics.
• Digital Audio Tape (DAT): one channel per speaker.
• Wav files
  • One mono file per speaker.
  • Sample rate: 48000 Hz
  • Downsampled to 16000 Hz (but kept original files!)
• ~20 hours of speech: 2.8 GB (at 16 kHz)
Logs
• Log everything the subjects do to a text file.
• Example:
    17:03:55:234 BEGIN_EXECUTION
    17:04:04:868 NEXT_TURN
    17:04:31:837 RESULTS 97 points awarded.
    17:04:38:426 NEXT_TURN
    17:05:03:873 RESULTS 92 points awarded.
    ...
• Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
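As an illustration of how such a log might be used to divide a session into tasks, here is a minimal Python sketch. It assumes one event per line in the layout shown in the example (`HH:MM:SS:mmm EVENT [text]`) and assumes that each NEXT_TURN opens a task that the following RESULTS closes; that pairing and the function names are hypothetical, not taken from the corpus tools.

```python
import re

# Assumed line layout, based on the example log: HH:MM:SS:mmm EVENT [free text]
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+(\S+)\s*(.*)$")

def parse_log(lines):
    """Return (seconds_since_midnight, event, detail) tuples; skip other lines."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            h, mi, s, ms = (int(g) for g in m.groups()[:4])
            seconds = h * 3600 + mi * 60 + s + ms / 1000
            events.append((seconds, m.group(5), m.group(6)))
    return events

def split_into_tasks(events):
    """Pair each NEXT_TURN with the next RESULTS event (an assumed convention)."""
    tasks, start = [], None
    for t, name, _detail in events:
        if name == "NEXT_TURN":
            start = t
        elif name == "RESULTS" and start is not None:
            tasks.append((start, t))
            start = None
    return tasks

sample = [
    "17:03:55:234 BEGIN_EXECUTION",
    "17:04:04:868 NEXT_TURN",
    "17:04:31:837 RESULTS 97 points awarded.",
    "17:04:38:426 NEXT_TURN",
    "17:05:03:873 RESULTS 92 points awarded.",
]
for start, end in split_into_tasks(parse_log(sample)):
    print(f"task lasted {end - start:.3f} s")
```

The times here are wall-clock times from the log; mapping them onto positions in the audio would additionally need the clock time at which the recording started, which the example does not show.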
The Games Corpus
• Design and Implementation
• Annotation
Speech Processing Tools
• Praat: http://www.praat.org
• WaveSurfer: http://www.speech.kth.se/wavesurfer
• Transcriber: http://trans.sourceforge.net
Orthographic Tier - Method 1
• Problems
  • Very stressful
  • Time consuming
• Separate transcription from alignment.
Orthographic Tier - Method 2
• Transcribe chunks using a web interface.
• Align each chunk automatically.
• Concatenate all chunks.
• Correct the alignment by hand using Praat, WaveSurfer or similar.
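The concatenation step in Method 2 can be sketched as follows. This is only an illustration: it assumes each chunk comes with its start time in the full recording and a list of word intervals timed relative to the chunk. The data layout and function names are hypothetical.

```python
def concatenate_chunks(chunks):
    """chunks: list of (chunk_start_seconds, [(start, end, word), ...]),
    with word times relative to the chunk. Returns file-level word intervals."""
    merged = []
    for offset, words in sorted(chunks, key=lambda c: c[0]):
        for start, end, word in words:
            # Shift chunk-relative times by the chunk's position in the file.
            merged.append((round(offset + start, 3), round(offset + end, 3), word))
    return merged

def find_overlaps(words):
    """Return adjacent word pairs whose intervals overlap (candidates for hand correction)."""
    return [(a, b) for a, b in zip(words, words[1:]) if b[0] < a[1]]

chunks = [
    (5.0, [(0.05, 0.50, "the"), (0.50, 1.10, "lion")]),
    (0.0, [(0.10, 0.42, "okay"), (0.42, 0.80, "so")]),
]
merged = concatenate_chunks(chunks)
print(merged)
print(find_overlaps(merged))
```

A check such as `find_overlaps` catches only mechanical problems. Misaligned words that do not overlap still need the manual pass in Praat or WaveSurfer.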
Orthographic Tier - Method 2
• Advantages
  • Transcription task is very comfortable.
  • Most of the alignment task is done automatically. Only fine-grained hand corrections are needed.
• Problems
  • Overhead: chunking, automatic alignment, concatenation.
  • Error prone! Easy for humans to overlook errors in the automatic alignment.
Orthographic Tier - Method 3
• Transcribe the whole file, using:
  • a regular audio player (e.g., Windows Media Player), and
  • a regular plain-text editor (e.g., Notepad).
• Use WaveSurfer to align the words.
  • “Load text labels” function
• Check out:
  • Spectrogram settings
  • Customizable shortcuts
Orthographic Tier
• Transcription guidelines
  • capital letters
  • abbreviations
  • disfluencies
  • mmhm, uhhuh, gotcha, etc.
• Alignment guidelines
  • boundaries
• http://www.cs.columbia.edu/~agus/games
  • username/password = speech/lions
Too many cooks…
• Concurrency problem
• File locking webpage
  • Annotators lock a file before working on it, and release it when done.
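The slides do not say how the locking webpage was implemented. As a rough sketch of the same lock-then-release idea, the following uses lock files on a shared file system. The `.lock` suffix, the function names, and the annotator names are assumptions made for illustration.

```python
import os
import tempfile

def acquire_lock(path, annotator):
    """Try to lock `path` for `annotator`; return True on success.
    O_CREAT | O_EXCL makes creation fail if the lock file already exists."""
    try:
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(annotator)  # record who holds the lock
    return True

def release_lock(path, annotator):
    """Release the lock only if `annotator` is the one holding it."""
    lock = path + ".lock"
    try:
        with open(lock) as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False
    os.remove(lock)
    return True

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "s08.cards.3.B.words")
print(acquire_lock(target, "alice"))  # True: nobody held the lock
print(acquire_lock(target, "bob"))    # False: alice still holds it
print(release_lock(target, "alice"))  # True: alice releases her own lock
```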
Annotation: Cue Words
• okay, mmhm, uhhuh, right, etc.
• Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.
• Developed an ad-hoc application in Java.
  • Bad idea!!! Development took too long.
• Instead, use Praat (or another general-purpose tool).
  • For simple, specific tasks, Praat is not difficult to learn.
  • Create a file with empty points at the midpoint of each word that needs to be labeled.
  • Annotators only label those words, safely ignoring the rest.
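The "empty points at the midpoint" idea can be sketched as below. The sketch assumes the orthographic tier is available as (start, end, word) tuples and uses only the four cue words named on the slide. Writing the points out in Praat's own file format is left out, and the function name is hypothetical.

```python
# Subset named on the slide; the full cue-word list is not given there.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}

def empty_cue_word_points(words, cue_words=CUE_WORDS):
    """words: (start, end, word) tuples from the orthographic tier.
    Returns (time, label) points with an empty label at each cue word's midpoint."""
    points = []
    for start, end, word in words:
        if word.lower() in cue_words:
            points.append((round((start + end) / 2, 3), ""))
    return points

words = [
    (0.10, 0.50, "okay"),
    (0.50, 0.90, "the"),
    (0.90, 1.30, "lion"),
    (1.30, 1.70, "right"),
]
print(empty_cue_word_points(words))
```

Annotators then fill in only the empty labels, so words that are not cue words never appear in their tier.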
Other Annotations
• Turn switches
  • Smooth switches, interruptions, backchannels, etc.
  • The labeler received a Praat file with empty turns.
• Prosody
  • ToBI Labeling Conventions: Tones and Break Indices.
• Questions
  • Identification, form and function.
Guidelines for Guidelines
• Web based (password protected)
• Highlight recent changes
• Avoid long lists: categorize, trees.
Files
• games/data/session_NN/sNN.GAME.P.Y.ext
  • NN = 01..12
  • GAME = {cards, objects}
  • P = 0..3 if GAME=cards, 0..1 if GAME=objects
  • Y = {A, B}
  • ext = {wav, words, tones, breaks, misc, turns, …}
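A naming scheme like this is easy to validate mechanically. The sketch below parses a file name into its fields and checks the ranges listed above. The function name and the dictionary keys are hypothetical, and the extension is accepted as any word because the slide's list of extensions is open-ended.

```python
import re

# Pattern follows the scheme sNN.GAME.P.Y.ext from the slide.
NAME = re.compile(r"^s(\d{2})\.(cards|objects)\.(\d)\.([AB])\.(\w+)$")

def parse_name(filename):
    """Split a corpus file name into its fields, raising ValueError if invalid."""
    m = NAME.match(filename)
    if not m:
        raise ValueError(f"unexpected file name: {filename}")
    session, game, p, y, ext = m.groups()
    session, p = int(session), int(p)
    max_p = 3 if game == "cards" else 1  # P = 0..3 for cards, 0..1 for objects
    if not 1 <= session <= 12 or p > max_p:
        raise ValueError(f"field out of range: {filename}")
    return {"session": session, "game": game, "p": p, "y": y, "ext": ext}

print(parse_name("s08.cards.3.B.wav"))
```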
Files
• Examples:
    games/data/session_08/
      s08.cards.3.B.wav
      s08.cards.3.B.words
      s08.cards.3.B.misc
      …
      s08.objects.1.A.wav
      s08.objects.1.A.words
      s08.objects.1.A.misc
      …
    games/data/session_11/…
Files Format
• All files (except *.wav) are saved as plain text, with the WaveSurfer format:
  • Start End Value (for interval tiers)
  • Time Value (for point tiers)
• Advantages
  • Human-readable.
  • Very easy to process.
• Problems
  • Consistency
  • Rounding
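To show why this format is "very easy to process", here is a minimal reader for both tier types. It assumes whitespace-separated columns, times in seconds, and no header lines. The sample contents are invented for illustration and are not taken from the corpus.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        value = parts[2].strip() if len(parts) == 3 else ""
        intervals.append((float(parts[0]), float(parts[1]), value))
    return intervals

def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    points = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        value = parts[1].strip() if len(parts) == 2 else ""
        points.append((float(parts[0]), value))
    return points

# Invented sample contents in the two layouts described above.
words = parse_interval_tier("0.100 0.420 okay\n0.420 0.800 so\n")
tones = parse_point_tier("0.260 H*\n0.610 L-L%\n")
print(words)
print(tones)
```

Because every tier lives in its own file and times are stored as rounded decimals, nothing in the format guarantees that a boundary in one tier matches the same boundary in another. This is where the consistency and rounding problems come from.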
Files Format
• Other formats:
  • XML
    • General-purpose mark-up language.
    • <TAG attribute="value"> … </TAG>
    • Solves problems like consistency and rounding.
    • Not human-readable, harder to process.
  • Praat
    • Not human-readable, hard to process.
    • Also has the consistency problem.
Scripts
• So far, we have needed dozens of Perl scripts.
• Examples:
  • Convert between Praat and WaveSurfer formats.
  • Create a Praat file with empty CW labels, turns, etc.
  • Find typos, missing labels, and other errors.
  • Unify notation (e.g., “mm-hmm” → “mmhm”).
  • Check consistency of files.
  • …
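The original scripts were written in Perl and are not shown. The Python sketch below only illustrates the kind of checks listed above: unifying notation and flagging inconsistent intervals. The second entry in the variant table, the tolerance value, and the function names are assumptions.

```python
# First pair is from the slide; the second is a hypothetical extra variant.
VARIANTS = {"mm-hmm": "mmhm", "uh-huh": "uhhuh"}

def unify(word):
    """Map known spelling variants to the canonical form; leave other words alone."""
    return VARIANTS.get(word.lower(), word)

def check_tier(intervals, tolerance=0.001):
    """Return (index, problem) pairs for bad durations, overlaps and missing labels."""
    problems = []
    prev_end = None
    for i, (start, end, value) in enumerate(intervals):
        if end <= start:
            problems.append((i, "empty or negative duration"))
        if prev_end is not None and start < prev_end - tolerance:
            problems.append((i, "overlaps previous interval"))
        if not value.strip():
            problems.append((i, "missing label"))
        prev_end = end
    return problems

tier = [(0.0, 0.4, "mm-hmm"), (0.35, 0.9, "okay"), (0.9, 1.2, "")]
print([unify(word) for _start, _end, word in tier])
print(check_tier(tier))
```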
Back-up!
• Back up wav files only once (too heavy), in different places (DVD, 3+ computers).
• Back up everything else (plain text: light) periodically and automatically.
  • Configure “cron” to make a backup copy every 8 hours.
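One possible shape for the periodic backup is a small script that archives everything except the wav files, run from cron. A crontab schedule of `0 */8 * * *` fires at 00:00, 08:00 and 16:00, which matches "every 8 hours". The script below is a sketch under that assumption, and the archive name and directory layout are hypothetical.

```python
import os
import tarfile
import tempfile
import time

def backup_text_files(data_dir, backup_dir):
    """Archive every non-wav file under data_dir into a timestamped .tar.gz.
    backup_dir should live outside data_dir so archives are not re-archived."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(backup_dir, f"annotations-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if not name.lower().endswith(".wav"):  # wav files are backed up once, separately
                    full = os.path.join(root, name)
                    tar.add(full, arcname=os.path.relpath(full, data_dir))
    return archive

# Small demonstration on a throwaway directory tree.
data = tempfile.mkdtemp()
os.makedirs(os.path.join(data, "session_08"))
for name in ("s08.cards.3.B.words", "s08.cards.3.B.wav"):
    with open(os.path.join(data, "session_08", name), "w") as f:
        f.write("0.10 0.42 okay\n")
archive = backup_text_files(data, tempfile.mkdtemp())
with tarfile.open(archive) as tar:
    print(tar.getnames())
```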
Timeline
• Orthographic tier first!
[Figure: timeline along a time axis showing design + implementation, orthographic tier, prosody (ToBI), cue words, turn switches]