The Games Corpus Design, implementation and annotation Agustín Gravano agus@cs.columbia.edu Spoken Language Processing Group Columbia University
The Games Corpus • Design and Implementation • Annotation "The Games Corpus" - Agustín Gravano - Columbia University
Experiment Design • Goal: Study the relation between the down-stepped contour and • Information status • Syntactic position • Discourse position • Spontaneous speech • Both monologue and dialogue
Experiment Design • Three computer games. • Two players, each on a different computer. • They collaborate to perform a common task. • Totally unrestricted speech.
Cards Game #1 Player 1 (Describer) Player 2 (Searcher) • Short monologues • Vary frequency and order of occurrence of objects on the cards.
Cards Game #2 Player 1 (Describer) Player 2 (Searcher) • Dialogue • Vary frequency and order of occurrence of objects on the cards.
Objects Game Player 1 (Describer) Player 2 (Searcher) • Dialogue • Vary target and surrounding objects (subject and object position).
Games Session • Repeat 3 times: • Cards Game #1 • Cards Game #2 • Short break (optional) • Repeat 3 times: • Objects Game • Each subject participated in 2 sessions. • 12 sessions
Subjects • Postings: • Columbia’s webpage for temporary job ads. • Craigslist • http://www.craigslist.org • Category: Gigs > Event gigs • Problem: • People are unreliable. • ~50% did not show up, or cancelled on short notice.
Subjects • Possible solutions: • Give precise instructions to e-mail ALL required info: • Name, native speaker?, hearing impairments?, etc. • Ask for a phone number. • Call them and explain why it is so important for us that they show up (or cancel with adequate notice). • Increase the pay after each session. • Example: $5, $10, $15 instead of $10, $10, $10.
Recording • Sound-proof booth • 2 subjects + 1 or 2 confederates. • Head-mounted mics. • Digital Audio Tape (DAT): one channel per speaker. • Wav files • One mono file per speaker. • Sample rate: 48000 Hz • Downsampled to 16000 Hz (but kept original files!) • ~20 hours of speech: 2.8 GB (at 16 kHz)
Logs • Log everything the subjects do to a text file. • Example: 17:03:55:234 BEGIN_EXECUTION 17:04:04:868 NEXT_TURN 17:04:31:837 RESULTS 97 points awarded. 17:04:38:426 NEXT_TURN 17:05:03:873 RESULTS 92 points awarded. ... • Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
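A minimal sketch of how such a log could be used to divide a session into smaller units. It assumes one event per line in the `HH:MM:SS:mmm EVENT detail` layout of the example, and it assumes (this is an interpretation, not something the slides state) that each `NEXT_TURN` is closed by the following `RESULTS` event. The function names are illustrative.

```python
from datetime import timedelta

def parse_timestamp(stamp):
    """Convert 'HH:MM:SS:mmm' into a timedelta since midnight."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)

def parse_log(lines):
    """Yield (time, event, detail) for each well-formed log line."""
    for line in lines:
        fields = line.split(maxsplit=2)
        if len(fields) < 2:
            continue  # skip blank lines and placeholders such as '...'
        detail = fields[2] if len(fields) > 2 else ""
        yield parse_timestamp(fields[0]), fields[1], detail

def split_into_turns(lines):
    """Pair each NEXT_TURN with the next RESULTS event (assumed pairing)."""
    turns = []
    turn_start = None
    for time, event, detail in parse_log(lines):
        if event == "NEXT_TURN":
            turn_start = time
        elif event == "RESULTS" and turn_start is not None:
            turns.append((turn_start, time, detail))
            turn_start = None
    return turns

LOG = """\
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
"""

for start, end, detail in split_into_turns(LOG.splitlines()):
    # Print the clock times, the duration in seconds, and the logged detail
    print(start, end, (end - start).total_seconds(), detail)
```

The timestamps are clock times, so mapping a turn onto the audio would also need the clock time at which the recording started, which this sketch does not model.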
The Games Corpus • Design and Implementation • Annotation
Speech Processing Tools • Praat • http://www.praat.org • WaveSurfer • http://www.speech.kth.se/wavesurfer • Transcriber • http://trans.sourceforge.net
Orthographic Tier - Method 1
Orthographic Tier - Method 1 • Problems • Very stressful • Time consuming • Separate transcription from alignment.
Orthographic Tier - Method 2 • Transcribe chunks using a web interface. • Align each chunk automatically. • Concatenate all chunks. • Correct the alignment by hand using Praat, WaveSurfer or similar.
Orthographic Tier - Method 2 • Advantages • Transcription task is very comfortable. • Most of the alignment task is done automatically. Only fine-grained hand corrections are needed. • Problems • Overhead: chunking, automatic alignment, concatenation. • Error prone! Easy for humans to overlook errors in the automatic alignment.
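The concatenation step of Method 2 can be sketched as follows. This assumes each chunk's automatic alignment is a list of `(start, end, word)` tuples timed relative to the start of the chunk, and that the chunk's start time within the full recording is known. Both the data layout and the function names are assumptions made for illustration. A small overlap check is included because the slides warn that alignment errors are easy to overlook.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one file-level word list.

    `chunks` is a list of (chunk_start, words) pairs, where `words`
    holds (start, end, word) tuples relative to the chunk start.
    """
    merged = []
    for chunk_start, words in sorted(chunks, key=lambda chunk: chunk[0]):
        for start, end, word in words:
            # Shift chunk-relative times to file-level times
            merged.append((round(chunk_start + start, 3),
                           round(chunk_start + end, 3),
                           word))
    return merged

def find_overlaps(words):
    """Return adjacent word pairs whose intervals overlap."""
    return [(a, b) for a, b in zip(words, words[1:]) if b[0] < a[1]]

chunks = [
    (5.0, [(0.05, 0.30, "the"), (0.30, 0.90, "lion")]),
    (0.0, [(0.10, 0.42, "okay"), (0.42, 0.80, "next")]),
]
merged = concatenate_chunks(chunks)
for start, end, word in merged:
    print(f"{start:.3f} {end:.3f} {word}")
print("overlaps:", find_overlaps(merged))
```

A check like `find_overlaps` only flags one class of error; the hand-correction pass described on the slide is still needed.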
Orthographic Tier - Method 3 • Transcribe the whole file, using: • a regular audio player (e.g., Windows Media Player), and • a regular plain-text editor (e.g., Notepad). • Use WaveSurfer to align the words. • “Load text labels” function • Check out: • Spectrogram settings • Customizable shortcuts
Orthographic Tier • Transcription guidelines • capital letters • abbreviations • disfluencies • mmhm, uhhuh, gotcha, etc. • Alignment guidelines • boundaries • http://www.cs.columbia.edu/~agus/games • username/password = speech/lions
Too many cooks… • Concurrency problem • File locking webpage • Annotators lock a file before working on it, and release it when done.
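The lock-then-release protocol can be sketched with lock files. This is not the webpage the slides describe, only an illustration of the same idea under the assumption that all annotators share one file system. The `.lock` suffix and the function names are hypothetical.

```python
import os
import tempfile

def acquire_lock(path, annotator):
    """Try to lock `path` for `annotator`; return True on success."""
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so checking and creating happen in a single atomic step.
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(annotator)
    return True

def lock_owner(path):
    """Return the name of the annotator holding the lock, or None."""
    try:
        with open(path + ".lock") as handle:
            return handle.read()
    except FileNotFoundError:
        return None

def release_lock(path, annotator):
    """Release the lock only if `annotator` is the one holding it."""
    if lock_owner(path) != annotator:
        return False
    os.remove(path + ".lock")
    return True

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "s08.cards.3.B.words")
    print(acquire_lock(target, "annotator1"))  # first request for the lock
    print(acquire_lock(target, "annotator2"))  # second request while it is held
    print(release_lock(target, "annotator1"))  # owner releases the lock
```

Recording the owner in the lock file makes it possible to show who is working on a file, and prevents one annotator from releasing another's lock by mistake.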
Annotation: Cue Words • okay, mmhm, uhhuh, right, etc. • Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. • Developed an ad-hoc application in Java. • Bad idea!!! Development took too long. • Instead, use Praat (or another general-purpose tool). • For simple, specific tasks, Praat is not difficult to learn. • Create a file with empty points at the midpoint of each word that needs to be labeled. • Annotators only label those words, safely ignoring the rest.
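A sketch of the "empty points" idea: given aligned words, emit one unlabeled point at the midpoint of every cue word. The cue-word set below contains only the four words named on the slide, and the output uses a simple `Time Value` point layout rather than Praat's own file format; converting to a Praat file is not shown. Function names are illustrative.

```python
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}  # only the words named on the slide

def cue_word_points(words, cue_words=CUE_WORDS):
    """Return (time, label) points at the midpoint of each cue word.

    `words` holds (start, end, word) tuples; labels start out empty
    so that annotators can fill them in.
    """
    points = []
    for start, end, word in words:
        if word.lower() in cue_words:
            points.append((round((start + end) / 2, 3), ""))
    return points

def format_point_tier(points):
    """Render points as 'Time Value' lines, one point per line."""
    return "\n".join(f"{time:.3f} {label}".rstrip() for time, label in points)

words = [
    (0.00, 0.40, "okay"),
    (0.40, 0.75, "the"),
    (0.75, 1.25, "lion"),
    (1.50, 2.00, "right"),
]
print(format_point_tier(cue_word_points(words)))
```

Because only cue words receive a point, annotators see exactly the items they need to label and nothing else.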
Other Annotations • Turn switches • Smooth switches, interruptions, backchannels, etc. • The labeler received a Praat file with empty turns. • Prosody • ToBI Labeling Conventions: Tones and Break Indices. • Questions • Identification, form and function.
Guidelines for Guidelines • Web based (password protected) • Highlight recent changes • Avoid long lists: categorize, trees.
Files • games/data/session_NN/sNN.GAME.P.Y.ext • NN= 01..12 • GAME = {cards, objects} • P = 0..3 if GAME=cards, 0..1 if GAME=objects • Y = {A, B} • ext = {wav, words, tones, breaks, misc, turns, …}
Files • Examples: games/data/session_08/s08.cards.3.B.wav s08.cards.3.B.words s08.cards.3.B.misc … s08.objects.1.A.wav s08.objects.1.A.words s08.objects.1.A.misc … games/data/session_11/…
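The naming scheme is regular enough to parse and validate mechanically. The sketch below encodes only the constraints listed on the slide (NN in 01..12, GAME in {cards, objects}, the game-dependent range of P, Y in {A, B}). The slides do not say what P and Y stand for, so the code keeps them as `p` and `y`; the function names are illustrative.

```python
import re

FILENAME = re.compile(
    r"^s(?P<session>\d{2})\.(?P<game>cards|objects)"
    r"\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)$"
)
MAX_P = {"cards": 3, "objects": 1}  # upper bound of P for each game

def parse_name(name):
    """Split 'sNN.GAME.P.Y.ext' into its fields, validating each one."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError(f"not a corpus file name: {name!r}")
    session, p = int(match["session"]), int(match["p"])
    if not 1 <= session <= 12:
        raise ValueError(f"session out of range in {name!r}")
    if p > MAX_P[match["game"]]:
        raise ValueError(f"P out of range for {match['game']} in {name!r}")
    return {"session": session, "game": match["game"], "p": p,
            "y": match["y"], "ext": match["ext"]}

def build_path(session, game, p, y, ext):
    """Build the full path for the given fields."""
    return (f"games/data/session_{session:02d}/"
            f"s{session:02d}.{game}.{p}.{y}.{ext}")

print(parse_name("s08.cards.3.B.wav"))
print(build_path(8, "objects", 1, "A", "words"))
```

A validator like this is one way to catch misnamed files before other scripts depend on them.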
Files Format • All files (except *.wav) are saved as plain text, with the WaveSurfer format: • Start End Value (for interval tiers) • Time Value (for point tiers) • Advantages • Human-readable. • Very easy to process. • Problems • Consistency • Rounding
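To illustrate the "very easy to process" point, here is a minimal reader for the two layouts. It assumes whitespace-separated fields, times in seconds, and a value that may contain spaces or be empty; these details are assumptions, since the slides give only the field order. Function names are illustrative.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(maxsplit=2)
        value = fields[2] if len(fields) > 2 else ""  # empty labels allowed
        intervals.append((float(fields[0]), float(fields[1]), value))
    return intervals

def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(maxsplit=1)
        value = fields[1] if len(fields) > 1 else ""
        points.append((float(fields[0]), value))
    return points

WORDS = """\
0.000 0.412 okay
0.412 0.730 the
0.730 1.100 lion
"""
for start, end, value in parse_interval_tier(WORDS):
    print(start, end, value)
```

The rounding problem is visible in this layout: the same boundary is stored twice (as one interval's end and the next one's start), so the two copies can drift apart when files are edited or converted.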
Files Format • Other formats: • XML • General-purpose mark-up language. • <TAG attribute="value"> … </TAG> • Solves problems like consistency and rounding. • Not human-readable, harder to process. • Praat • Not human-readable, hard to process. • Also has the consistency problem.
Scripts • So far, we have needed dozens of Perl scripts. • Examples: • Convert between Praat and WaveSurfer formats. • Create a Praat file with empty CW labels, turns, etc. • Find typos, missing labels, and other errors. • Unify notation (e.g., “mm-hmm” → “mmhm”). • Check consistency of files. • …
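Two of the script types listed above, sketched in Python rather than the Perl the slides mention. Only the `mm-hmm` to `mmhm` mapping comes from the slide; the `uh-huh` entry is an assumed analogue. The consistency check assumes that an interval tier should be contiguous (each interval starts where the previous one ends, within a small tolerance), which may not hold for every tier.

```python
VARIANTS = {
    "mm-hmm": "mmhm",   # from the slide
    "uh-huh": "uhhuh",  # assumed analogue, not stated in the slides
}

def unify(label, variants=VARIANTS):
    """Map a known spelling variant to its canonical form."""
    return variants.get(label.lower(), label)

def check_intervals(intervals, tolerance=0.0005):
    """Return a list of problems found in (start, end, value) intervals."""
    problems = []
    previous_end = None
    for index, (start, end, _value) in enumerate(intervals, start=1):
        if end <= start:
            problems.append(f"line {index}: end {end} is not after start {start}")
        if previous_end is not None and abs(start - previous_end) > tolerance:
            kind = "overlap" if start < previous_end else "gap"
            size = abs(start - previous_end)
            problems.append(f"line {index}: {kind} of {size:.3f}s with previous interval")
        previous_end = end
    return problems

tier = [(0.0, 0.4, "mm-hmm"), (0.35, 0.7, "the"), (0.7, 1.1, "lion")]
print([unify(value) for _, _, value in tier])
for problem in check_intervals(tier):
    print(problem)
```

The tolerance absorbs the rounding differences that the plain-text format can introduce, so that only real gaps and overlaps are reported.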
Back up! • Back up wav files only once (they are too large), but in several places (DVD, 3+ computers). • Back up everything else (plain text, so small) periodically and automatically. • Configure “cron” to make a backup copy every 8 hours.
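A sketch of a backup script that cron could run. It copies every file except `.wav` files into a timestamped folder, matching the split described above. The directory layout, the function name, and the script path in the comment are hypothetical; the backup folder is assumed to live outside the source tree.

```python
import shutil
from datetime import datetime
from pathlib import Path

# A crontab entry that runs a script at minute 0 of every 8th hour
# (the script path is hypothetical):
#   0 */8 * * * python3 /path/to/backup_annotations.py

def backup_text_files(source, backup_root, now=None):
    """Copy all non-wav files under `source` into a timestamped folder.

    Returns the folder that was created and the number of files copied.
    `backup_root` must not be inside `source`.
    """
    source, backup_root = Path(source), Path(backup_root)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = backup_root / stamp
    copied = 0
    for path in source.rglob("*"):
        if path.is_file() and path.suffix.lower() != ".wav":
            destination = target / path.relative_to(source)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)  # copy2 also keeps timestamps
            copied += 1
    return target, copied
```

Keeping each run in its own folder means an annotation file that was damaged can be recovered from an earlier copy, at the cost of disk space that stays small because the audio is excluded.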
Timeline • Orthographic tier first! • [Figure: timeline along a time axis, with entries for design + implementation, orthographic tier, prosody (ToBI), cue words, and turn switches]