Update on Transcription of Fisher Phase II Data
Owen Kimball, Chia-lin Kao, Tresi Arvizo, John Makhoul
Current Transcription Effort
• Transcribing 1400 hours of Fisher data, including
  • ~1560 calls from Phase I collection
  • ~6840 calls from Phase II (more recent) collection
• Phase II collection replaces original 40 topics with expanded set of 69 (http://www.ldc.upenn.edu/Fisher/new_topics.html)
• As before, WordWave transcribing, BBN post processing
• From the 1400 hours, LDC will hold back calls that include any speaker or phone number that overlaps with test sets that NIST has defined
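The hold-back rule in the last bullet can be read as a set-overlap test. The sketch below is only an illustration of that reading, not LDC's actual procedure: the `Call` record, its field names, and the `split_holdback` function are hypothetical, and the identifiers in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record for one call; field names are assumptions."""
    call_id: str
    speakers: frozenset  # speaker IDs on the call
    phones: frozenset    # phone numbers used on the call


def split_holdback(calls, test_speakers, test_phones):
    """Split calls into (released, held_back).

    A call is held back if any of its speakers or phone numbers
    also appears in the test-set speakers or phone numbers.
    """
    released, held_back = [], []
    for call in calls:
        overlaps = bool(call.speakers & test_speakers) or bool(call.phones & test_phones)
        (held_back if overlaps else released).append(call)
    return released, held_back
```

For example, with made-up identifiers, a call whose only link to the test set is a shared phone number is still held back:

```python
calls = [
    Call("c1", frozenset({"spk1", "spk2"}), frozenset({"555-0001", "555-0002"})),
    Call("c2", frozenset({"spk3", "spk4"}), frozenset({"555-0003", "555-0009"})),
]
released, held_back = split_holdback(calls, frozenset({"spk7"}), frozenset({"555-0009"}))
```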
New Transcription Guidelines
• Eliminated incorrect forms (e.g. British spellings) from dictionary used to filter transcripts
• Changes to Style Guide to clarify items that led to inconsistencies
  • Primarily to increase efficiency of manual post processing
• Added [BN] and [/BN] for sustained background noise
• Changes to punctuation guidelines to support better future Rich Transcription research
  • Clarification of double dash (“--”) for discontinuities
  • Ellipsis (“…”) to indicate continued speaking across interrupt
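As a rough illustration of dictionary-based filtering, the sketch below flags words that are missing from an approved word list, so that a removed form such as a British spelling would surface for manual review. This is not BBN's actual filter: the function name, the tokenization rule, and the tiny dictionary in the usage note are all assumptions.

```python
import re

# Bracketed markers such as [BN], [/BN], [LAUGH] are not words.
TAG_RE = re.compile(r"\[/?[A-Z]+\]")
# Lowercase words, allowing internal apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def out_of_dictionary(text, dictionary):
    """Return the sorted set of words in `text` not found in `dictionary`."""
    without_tags = TAG_RE.sub(" ", text)
    tokens = WORD_RE.findall(without_tags.lower())
    return sorted({token for token in tokens if token not in dictionary})
```

With a hypothetical dictionary that contains "color" but not "colour":

```python
dictionary = {"the", "color", "of", "noise"}
flagged = out_of_dictionary("[BN] the colour of noise [/BN]", dictionary)
```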
Sample Transcript, Revised Style Guide
R: Yeah. And then when you're reading it, you know, it's like, okay, um, you know, people -- people still view things different.
L: Right.
R: You know? We could be reading the same thing and -- and see it two different ways and ...
L: Oh, obviously.
R: ... he shouldn't have said that. [LAUGH] But -- and see I don't -- I don't get the newspaper at all. I just --
L: Yeah. Unfortunately I have to say I don't really either.
R: I don't -- ...
L: I used to.
R: ... I don't even have time to even sit down and ...
L: [LAUGH]
R: ... you know, really read a newspaper, you know? [LAUGH]
L: Right.
R: [SIGH] Everything has gotten to be so quick that you can't, you know --?
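The conventions visible in the sample (speaker labels "L:" and "R:", bracketed tags, "--" for discontinuities, and an ellipsis for speech that continues across an interruption) can be picked out mechanically. The following is a minimal sketch under those assumptions; `parse_turn` and the keys it returns are hypothetical names, not part of any WordWave or BBN tool.

```python
import re

# One turn per line: a channel label (L or R), a colon, then the text.
TURN_RE = re.compile(r"^(?P<speaker>[LR]):\s*(?P<text>.*)$")
# Bracketed markers such as [LAUGH], [SIGH], [BN], [/BN].
TAG_RE = re.compile(r"\[/?[A-Z]+\]")
# Accept both three ASCII dots and the single ellipsis character.
ELLIPSES = ("...", "\u2026")


def parse_turn(line):
    """Parse one transcript line into speaker, text, and style-guide markers."""
    match = TURN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a speaker turn: {line!r}")
    text = match.group("text")
    return {
        "speaker": match.group("speaker"),
        "text": text,
        "tags": TAG_RE.findall(text),
        "discontinuities": text.count("--"),
        # Leading ellipsis: this turn resumes the speaker's earlier turn.
        "continues_previous": text.startswith(ELLIPSES),
        # Trailing ellipsis: the speaker carries on after the interruption.
        "continued_later": text.endswith(ELLIPSES),
    }
```

Applied to the turn "R: I don't -- ..." from the sample, this reports one discontinuity and marks the turn as continued later.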
Current Status
• Sent 492 hours of processed transcripts to LDC on 12/2/04
  • LDC released 465 hours of this in Feb 05
• As of 3/15/05
  • 1288 hours (7734 conversations) received from WordWave
  • 1055 hours (6631 conversations) post processed by BBN
• WordWave is committed to finishing by end of March 05
• BBN has reserved EARS funding to finish post processing
• Will send to LDC as soon as all transcripts processed
  • Hoping for mid-April 05
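The 3/15/05 figures support a quick progress check. The sketch below uses only the numbers reported on this slide; the helper names are illustrative, and the derived values are simple arithmetic, not figures reported by BBN.

```python
# Figures reported as of 3/15/05.
RECEIVED_HOURS, RECEIVED_CONVERSATIONS = 1288, 7734    # received from WordWave
PROCESSED_HOURS, PROCESSED_CONVERSATIONS = 1055, 6631  # post processed by BBN


def percent(part, whole):
    """Share of `whole` covered by `part`, as a percentage to one decimal."""
    return round(100 * part / whole, 1)


def minutes_per_conversation(hours, conversations):
    """Average conversation length in minutes."""
    return hours * 60 / conversations


if __name__ == "__main__":
    print("Hours post processed (% of received):",
          percent(PROCESSED_HOURS, RECEIVED_HOURS))
    print("Conversations post processed (% of received):",
          percent(PROCESSED_CONVERSATIONS, RECEIVED_CONVERSATIONS))
    print("Received hours awaiting post processing:",
          RECEIVED_HOURS - PROCESSED_HOURS)
    print("Average minutes per received conversation:",
          round(minutes_per_conversation(RECEIVED_HOURS, RECEIVED_CONVERSATIONS), 2))
```

By this arithmetic, roughly 82% of the received hours had been post processed, 233 received hours were still waiting, and the average received conversation ran close to ten minutes.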