
Unlocking Speech Archives: Challenges and Solutions in Data Mining

Explore the challenges and solutions in working with vast collections of spoken language data. Discover how researchers navigate large audio corpora and make them accessible for analysis and exploration. Find out about the year-long "Digging into Data" project and the advancements in speech data research.



Presentation Transcript


  1. Mining a Year of Speech: a “Digging into Data” project. http://www.phon.ox.ac.uk/mining/

  2. Mining (a) Year(s) of Speech: a “Digging into Data” project. http://www.phon.ox.ac.uk/mining/

  3. John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau (Oxford University Phonetics Laboratory); Lou Burnard, Jonathan Robinson (The British Library)

  4. Mark Liberman, Jiahong Yuan, Chris Cieri (Phonetics Laboratory and Linguistic Data Consortium, University of Pennsylvania)

  5. With support from our “Digging into Data” competition funders, and with thanks for pump-priming support from the Oxford University John Fell Fund and from the British Library.

  6. The “Digging into Data” challenge “The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ... With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.”

  7. In “Mining a Year of Speech” we addressed the challenges of working with very large audio collections of spoken language.

  8. Challenges of very large audio collections of spoken language: How does a researcher find audio segments of interest? How do audio corpus providers mark them up to facilitate searching and browsing? How can very large-scale audio collections be made accessible?

  9. Challenges: amount of material. Storage: CD-quality audio, 635 MB/hour; uncompressed .wav files, 115 MB/hour = 2.8 GB/day = 85 GB/month = 1.02 TB/year; library/archive .wav files, 1 GB/hr (9 TB/yr). Spoken audio is roughly 250 times the size of its XML transcription.
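The storage figures above follow directly from sample rate and bit depth; a minimal sketch of the arithmetic, assuming 16-bit PCM (stereo at 44.1 kHz for CD quality, mono at 16 kHz for the 115 MB/hour .wav figure, which is the rate consistent with that number):

```python
def pcm_mb_per_hour(sample_rate, channels=1, bytes_per_sample=2):
    """Uncompressed PCM audio storage, in MB (10^6 bytes) per hour."""
    return sample_rate * channels * bytes_per_sample * 3600 / 1e6

cd_quality = pcm_mb_per_hour(44_100, channels=2)  # ~635 MB/hour
wav_16k = pcm_mb_per_hour(16_000)                 # ~115 MB/hour
year_tb = wav_16k * 24 * 365 / 1e6                # ~1.01 TB for a year of speech
```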

  10. Challenges. Storing 1.02 TB/year: not really a problem in the 21st century; a 1 TB (1000 GB) hard drive costs c. £65 (now £39.95!). Computing (distance measures, alignments, labels, etc.): multiprocessor cluster.

  11. Challenges: amount of material. Computing: distance measures, alignment of labels, searching and browsing. Just reading or copying 9 TB takes >1 day; download time: days or weeks.

  12. Challenges. To make large corpora practical, you need: a detailed index, so users can find the parts they need, and a way of using the index to access slices of the corpus, e.g. <w c5="AV0" hw="well" pos="ADV">Well </w>
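The `<w>` element shown is BNC-XML word markup: a CLAWS C5 tag (`c5`), headword (`hw`) and part of speech (`pos`). A minimal sketch of pulling such words into a searchable index with Python's standard library (the second word and the enclosing `<s>` element are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A two-word fragment in the BNC-XML style shown above.
fragment = ('<s n="1">'
            '<w c5="AV0" hw="well" pos="ADV">Well </w>'
            '<w c5="PNP" hw="i" pos="PRON">I </w>'
            '</s>')

sentence = ET.fromstring(fragment)
# Index each word as (headword, CLAWS tag, surface form).
index = [(w.get("hw"), w.get("c5"), w.text.strip())
         for w in sentence.iter("w")]
# index: [('well', 'AV0', 'Well'), ('i', 'PNP', 'I')]
```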

  13. Potential users: members of the public interested in specific bits of content; scientists with broader interests, e.g. law scholars and political scientists (text searches); phoneticians and speech engineers (retrieval based on pronunciation and sound).

  14. Searching audio: some kinds of questions you might ask 1. When did X say Y? For example, "find the video clip where George Bush said 'read my lips'." 2. Are there changes in dialects, or in their social status, that are tied to the new social media? 3. How do arguments work? For example, how do different people handle interruptions? 4. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and regions?

  15. Some large(ish) speech corpora: Switchboard corpus, 13 days of audio; Spoken Dutch Corpus, 1 month, but only a fraction is phonetically transcribed; Spoken Spanish, 4.6 days, orthographically transcribed; Buckeye Corpus (OSU), c. 2 days; Wellington Corpus of Spoken New Zealand English, c. 3 days transcribed; Digital Archive of Southern Speech (American).

  16. The “Year of Speech”: a grove of corpora, held at various sites with a common indexing scheme and search tools. US English material: 2,240 hrs of telephone conversations; 1,255 hrs of broadcast news; as-yet unpublished talk show conversations (1,000 hrs), Supreme Court oral arguments (5,000 hrs), political speeches and debates. British English: the spoken part of the British National Corpus, >7.4 million words of transcribed speech, recently digitized in collaboration with the British Library.

  17. How big is “big science”? Human genome: 3 GB. DASS audio sampler: 350 GB. Hubble space telescope: 0.5 TB/year. Year of Speech: >1 TB. Sloan digital sky survey: 16 TB. Beazley Archive of ancient artifacts: 20 TB. Large Hadron Collider: 15 PB/year = 2500 × Year of Speech.

  18. How big is “big science”? Human genome: 3 GB. DASS audio sampler: 350 GB. Hubble space telescope: 0.5 TB/year. Year of Speech: >1 TB. Sloan digital sky survey: 16 TB. Beazley Archive of ancient artifacts: 20 TB. Large Hadron Collider: 15 PB/year = 2500 × Year of Speech. (The humanities entries: DASS, the Year of Speech and the Beazley Archive.)

  19. Analogue audio in libraries. British Library: >1m disks and tapes, 5% digitized. Library of Congress Recorded Sound Reference Center: >2m items, including … International Storytelling Foundation: >8000 hrs of audio and video. European broadcast archives: >20m hrs (2,283 years; cf. the Large Hadron Collider). Formats: 75% on ¼” tape, 20% shellac and vinyl, 7% digital.

  20. Analogue audio in libraries. Worldwide: ~100m hours (11,415 yrs) of analogue audio, i.e. 4–5 Large Hadron Colliders! Cost of professional digitization: ~£20/$32 per tape (e.g. a C-90 cassette). Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening.

  21. Why so large? Lopsided sparsity. The top ten words (I, you, it, the, ’s, and, n’t, a, that, yeah) each occur on the order of 58,000 times; 12,400 words (23%) occur only once.
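The “occur only once” figure is the hapax legomenon rate; a toy sketch of how such a rate is computed from any token stream (the sample tokens below are illustrative, not BNC data):

```python
from collections import Counter

def hapax_rate(tokens):
    """Fraction of vocabulary types that occur exactly once."""
    counts = Counter(tokens)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)

# 5 types, of which 3 (mean, you, know) are hapaxes -> 0.6
rate = hapax_rate("well i mean well you know i i".split())
```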

  22. Why so large? Lopsided sparsity

  23. Why so large? Lopsided sparsity

  24. Lopsided sparsity and size Fox and Robles (2010): 22 examples of It's like-enactments [e.g. it's like 'mmmmmm'] in 10 hours of data

  25. Lopsided sparsity and size. Rare phonetic word-joins, tokens per ~10 million: I’m trying, 60; seem/n to, 310; alarng clock, 18; swimmim pool, 44; gettim paid, 19; weddim present, 15. (7 of the ‘swimming pool’ examples are from one family on one occasion.)

  26. Lopsided sparsity and size. Final -t/-d deletion candidates: just, 19,563 tokens; want, 5,221; left, 432; slammed, 6.

  27. A rule of thumb. To catch most English sounds, you need minutes of audio; the common words of English, a few hours; a typical person’s vocabulary, >100 hrs; pairs of common words, >1000 hrs; arbitrary word-pairs, >100 years.

  28. Rare and unique wonders aqualunging boringest chambermaiding de-grandfathered europeaney gronnies hoptastic lawnmowing mellies noseless punny regurgitate-arianism scunny smackerooney tooked weppings yak-chucker zombieness

  29. Not just repositories of words Specific phrases or constructions Particularities of people's voices and speaking habits Dog-directed speech Parrot-directed speech

  30. Language(?) in the wild A parrot Talking to a dog Try transcribing this! There’s gronnies lurking about

  31. Unusual voices Circumstances of use How is the 'voice' selected? Do men do it more than women? Young more than old? How do the speaker's and listener's brains produce, interpret or store “odd voice” pronunciations and strange intonations?

  32. Main problem in large corpora: finding needles in the haystack. To address that challenge, we think there are two “killer apps”: forced alignment, and data linking (open exposure of digital material, coupled with cross-searching).

  33. Collaboration, not collection. Several search interfaces (e.g. Oxford, the BL, Penn, perhaps Lancaster) query the LDC database or the BNC-XML database to retrieve time stamps, which in turn point into audio held elsewhere: spoken LDC recordings at various locations, and spoken BNC recordings on the BL sound server(s).

  34. Corpora in the Year of Speech. Spontaneous speech: spoken BNC (~1400 hrs), conversational telephone speech. Read text: LibriVox audio books. Also: broadcast news, US Supreme Court oral arguments, political discourse, oral history interviews, US vernacular dialects/sociolinguistic interviews.

  35. Practicalities. In order to be of much practical use, such very large corpora must be indexed at word and segment level, so all included speech corpora must have associated text transcriptions. We’re using the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files.
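A forced aligner's word-level output is what makes audio searchable: each word becomes a (file, start, end) record that an index can return as a playable slice. A toy sketch of that kind of index; the file names, timings and helper name here are hypothetical, not the project's actual schema:

```python
from collections import defaultdict

# Hypothetical word-level aligner output:
# (word, sound file, start seconds, end seconds).
alignments = [
    ("well", "tape042.wav", 1.20, 1.45),
    ("i",    "tape042.wav", 1.45, 1.52),
    ("well", "tape107.wav", 33.80, 34.02),
]

# Invert the alignments into a word -> occurrences index.
by_word = defaultdict(list)
for word, path, start, end in alignments:
    by_word[word].append((path, start, end))

def occurrences(word):
    """Every (file, start, end) slice where `word` was aligned."""
    return by_word.get(word, [])
```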

  36. Mining (indexing by forced alignment) x 21 million

  37. Mining (indexing by forced alignment)

  38. Mining (a needle in a haystack)

  39. Mining (a diamond in the rough)

  40. American and English Same set of acoustic models e.g. same [ɑ] for US “Bob” and UK “Ba(r)b” Pronunciation differences between different varieties were dealt with by listing multiple phonetic transcriptions

  41. Building a multi-dialect dictionary

  42. Building a multi-dialect dictionary

  43. Tools/user interfaces

  44. Issues we grappled with: funding logistics (US funding did not come through for 9 months); quality of transcriptions (errors); long untranscribed portions; large transcribed regions with no audio (lost in copying); problems with documentation and records.

  45. Issues we grappled with. Broadcast recordings may include untranscribed commercials; transcripts generally edit out dysfluencies; political speeches may extemporize, departing from the published script.

  46. Issues we grappled with. Some causes of difficulty in forced alignment: overlapping speakers; background noise/music/babble; transcription errors; variable signal loudness; reverberation; distortion; poor speaker vocal health/voice quality; unexpected accents.

  47. Issues we’re still grappling with. There are no standards for adding phonemic transcriptions and timing information to XML transcriptions, and many different schemes are possible. How to decide?

  48. Anonymization. The text transcriptions in the published BNC have already been anonymized, and some parts of the audio (e.g. COLT) have also been published. Full names, personal addresses and telephone numbers were replaced by <gap> tags; we use the location of all such tags to mute (silence) the corresponding portions of audio.
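The muting step can be sketched in a few lines; a toy illustration assuming the <gap> locations have already been converted into (start, end) times in seconds (the function name and plain-list samples are illustrative, not the project's code):

```python
def mute_gaps(samples, sample_rate, gaps):
    """Zero the samples inside each (start_s, end_s) span, the way
    audio under a transcription <gap> tag is silenced."""
    out = list(samples)
    for start_s, end_s in gaps:
        lo = max(int(start_s * sample_rate), 0)
        hi = min(int(end_s * sample_rate), len(out))
        for i in range(lo, hi):
            out[i] = 0
    return out

# 10 samples at 2 Hz; mute the span from 1.0 s to 3.0 s.
muted = mute_gaps([1] * 10, 2, [(1.0, 3.0)])
```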

  49. Intellectual property responsibilities All <gap>s must be checked to ensure accuracy This is a much bigger job than we had anticipated (>13,000 anonymization 'gaps') Checking the alignment of gaps is labour-intensive/slow Compounded by poor automatic alignments

  50. Rare and unique wonders aeriated bolshiness canoodling drownded even-stevens gakky kiddy-fied mindblank noggin pythonish re-snogged sameyness stripesey tuitioning watermanship yukkified
