Audient: An Acoustic Search Engine • Student: Ted Leath • Supervisor: Prof. Paul Mc Kevitt • School of Computing and Intelligent Systems, Faculty of Engineering, University of Ulster, Magee
Aims and Objectives • Development of Audient as a speech-centric, non-lexical search engine capable of handling multimodal queries for retrieving spoken audio information • Explore the efficacy of using standards-based phonogrammic streams as an internal data representation for storing, indexing, searching and retrieving spoken audio information • Compare the performance of optional compound strategies for the abstraction and refinement of standards-based phonogrammic streams • Design, implement, refine and test Audient • Demonstrate research results, comparing Audient with other existing system architectures
Literature Review • Information Retrieval • Automatic Speech Recognition and Spoken Document Retrieval • Current and previous research in SDR systems • Public access SDR systems • Commercial ASR and audio mining products • Sub-word based approaches to SDR • Transcripts, annotation and phonogrammic streams • Speech and non-speech audio
Information Retrieval • Typical Information Retrieval (IR) tasks involve the retrieval of relevant information items from various types of documents by matching a user request or query. • IR encompasses document media types containing different types of information like images, video and audio information in addition to text documents. Audio recordings of speech can be referred to as spoken documents.
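The matching of a query against documents described above can be sketched with a toy inverted index. This is only an illustration of the general IR idea for text; the document ids, sample texts and function names are invented for the example and are not part of Audient.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical mini-collection (assumption, for illustration only).
documents = {
    "doc1": "spoken audio archive of radio news",
    "doc2": "text archive of news articles",
    "doc3": "image collection",
}
index = build_index(documents)
results = search(index, "spoken news archive")
```

Here `results` ranks `doc1` ahead of `doc2` because it contains more of the query terms; documents with no matching term are not returned.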
Automatic Speech Recognition • ASR attempts to mimic the human capacity for recognising speech by enabling a computer to identify spoken words and/or sub-word units. Most current ASR systems are lexical in nature, and conceptually follow the processes of encoding and decoding introduced in the figure below: (adapted from Young et al., 2002)
Spoken Document Retrieval • A significant amount of research has been conducted in SDR, and performance evaluations like the Text REtrieval Conference (TREC) have encouraged development and the sharing of information. A diagram representing a typical TREC SDR process is reproduced below: (Garofolo et al., 2000)
SDR Systems • CMU Informedia I, Informedia II and Sphinx Projects (Hauptmann and Witbrock, 1997) • Video Mail Retrieval and Multimedia Document Retrieval projects (Jones et al., 1997, Spärck Jones et al., 2001) • SCAN (Choi et al., 1998 and Choi et al., 1999) • THISL and Abbot (Abberley et al., 1998, Abbot, 1999) • Taiscéalaí (Smeaton et al., 1998)
Public Access SDR Systems • SpeechBot (Quinn, 2000, Van Thong et al., 2001) • National Public Radio (NPR) Online (NPR, 2000, NPR Archives, 2004) • SpeechFind and The National Gallery of the Spoken Word (Hansen et al., 2004, Zhou and Hansen, 2002)
Commercial ASR and Audio Mining Products • BBN Rough ‘n’ Ready (Kubala et al., 1999) • Nexidia Fast-Talk and Convera RetrievalWare (Clements et al., 2001a, Clements et al., 2001b) • ScanSoft (Network Speech, 2004, Embedded Speech, 2004, MediaIndexer, 2004, NaturallySpeaking, 2005, AudioMining, 2005, Xmode, 2004) • Virage AudioLogger (Virage, 2004) • Nuance (Nuance, 2005) • AT&T SCANMail (Hirschberg et al., 2001 and SCANMail, 2003) • Microsoft Speech Server (MSS, 2005)
Sub-word Based Approaches to SDR • Wechsler (Wechsler, 1998) • Ng, K. (Ng, 2000) • Glavitsch and Schäuble (Glavitsch and Schäuble, 1992) • Ng, C. (Ng, 2001) • Other sub-word research efforts include Larson (2001) and Moreau et al. (2004)
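As a rough illustration of what a sub-word approach can look like, the sketch below scores a spoken document against a query by counting shared overlapping phoneme trigrams. It is a generic example, not the method of any of the works cited above; the ARPAbet-style phoneme sequences are hand-written assumptions and the function names are invented for the example.

```python
from collections import Counter


def phone_ngrams(phones, n=3):
    """Return the overlapping phoneme n-grams of a phoneme sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def ngram_overlap(query_phones, doc_phones, n=3):
    """Count the n-grams shared by query and document (multiset intersection)."""
    q = Counter(phone_ngrams(query_phones, n))
    d = Counter(phone_ngrams(doc_phones, n))
    return sum((q & d).values())


# Hand-written, stress-free phoneme sequences (assumptions for illustration).
doc_a = "SH IY S EH L Z S IY SH EH L Z".split()   # "she sells sea shells"
doc_b = "HH AA R D R AA K".split()                # "hard rock"
query = "S IY SH EH L Z".split()                  # "sea shells"

score_a = ngram_overlap(query, doc_a)
score_b = ngram_overlap(query, doc_b)
```

Because matching happens below the word level, the query can hit a document even when no word boundary or dictionary entry is available for it.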
Phonogrammic Streams Orthographical representations of phonemic streams. This abstraction is ancient, and partially inherent in the English alphabet. Egyptian hieroglyphs with semantic and phonetic value. Ref. http://www.omniglot.com/writing/egyptian.htm
Transcription • 1-best transcriptions (Fundamentals, 2005) • N-best transcriptions (Fundamentals, 2005) • Lattices or graphs • [Figure: example transcription "SILENCE HARD ROCK SILENCE"]
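The relationship between the three transcription forms can be sketched as follows: a lattice is a graph of competing word hypotheses, an N-best list is the N highest-scoring paths through it, and the 1-best transcription is the top path. The lattice contents and log-probability scores below are invented for illustration, and the exhaustive path enumeration is only suitable for toy graphs.

```python
import heapq

# Hypothetical lattice: node -> list of (next_node, word, log_probability).
lattice = {
    0: [(1, "HARD", -0.2), (1, "HEART", -1.6)],
    1: [(2, "ROCK", -0.3), (2, "ROOK", -1.2)],
    2: [],
}


def all_paths(lattice, node, final):
    """Yield (word_list, total_score) for every path from node to final."""
    if node == final:
        yield [], 0.0
        return
    for next_node, word, score in lattice[node]:
        for words, total in all_paths(lattice, next_node, final):
            yield [word] + words, score + total


def n_best(lattice, start, final, n):
    """Return the n highest-scoring paths, best first."""
    return heapq.nlargest(n, all_paths(lattice, start, final), key=lambda p: p[1])


hypotheses = n_best(lattice, 0, 2, 3)
one_best = hypotheses[0][0]
```

With these assumed scores, `one_best` is the word sequence HARD ROCK, and the N-best list keeps the lower-ranked alternatives that a 1-best transcript would discard.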
Annotation - Markup Languages and MPEG-7 • SSML • VoiceXML • SALT • XHTML+Voice profile • All of the above markup languages contain SSML as a subset • MPEG-7 and spoken content
Non-Speech Audio Retrieval • Humans process speech differently from non-speech acoustic information. • MELDEX • Musipedia (Melodyhound/Tuneserver) • Sonoda • Super MBox • MIRACLE • SMILE • Shazam • Name That Clip • The Humdrum Toolkit • Themefinder • Boogeebot • Muscle Fish
Project Proposal Audient Architecture
Audient Parrots • [Figure: functional diagram for an Audient Parrot] • Determining recognition differences: "She sells sea shells by the seashore." versus "She cells C shels bye the sea shore"
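The seashore example can be made concrete by comparing the two sentences at the word level and at the phoneme level. The sketch below is an assumption about how such a comparison might be done, not the Audient Parrot design: the pronunciation table is a small hand-written, ARPAbet-style dictionary (including an assumed pronunciation for the non-word "shels"), and a standard Levenshtein edit distance is used as the difference measure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical pronunciation table (assumption, for illustration only).
PRONUNCIATIONS = {
    "she": ["SH", "IY"], "sells": ["S", "EH", "L", "Z"], "cells": ["S", "EH", "L", "Z"],
    "sea": ["S", "IY"], "c": ["S", "IY"], "shells": ["SH", "EH", "L", "Z"],
    "shels": ["SH", "EH", "L", "Z"], "by": ["B", "AY"], "bye": ["B", "AY"],
    "the": ["DH", "AH"], "seashore": ["S", "IY", "SH", "AO", "R"],
    "shore": ["SH", "AO", "R"],
}


def tokenize(sentence):
    """Lower-case the sentence, drop full stops and split on whitespace."""
    return sentence.lower().replace(".", "").split()


def to_phones(words):
    """Concatenate the phonemes of each word into one phoneme stream."""
    return [phone for word in words for phone in PRONUNCIATIONS[word]]


reference = tokenize("She sells sea shells by the seashore.")
recognised = tokenize("She cells C shels bye the sea shore")

word_distance = edit_distance(reference, recognised)
phone_distance = edit_distance(to_phones(reference), to_phones(recognised))
```

Under these assumed pronunciations the two sentences differ in most of their words yet produce the same phoneme stream, which is the kind of gap between lexical and phonemic comparison the slide points to.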
Software Analysis • Hidden Markov Model Toolkit (HTK) • LVCSR and CSLU Toolkit • Sphinx-2, Sphinx-3, Sphinx-4 • TIMIT • Linux and C++ • Perl and PHP • Festival • The CMU Pronouncing Dictionary • SSML, VoiceXML, SALT and X+V • The Apache Web Server
Possible IR and Monitoring Applications • Indexing, search and retrieval of Internet audio files • Indexing, search and retrieval of broadcast media • Services for the blind • Library services • Surveillance and intelligence gathering • Voice mail • Audio mining and trend analysis (topic detection and tracking)
Possible Philosophical and Cognitive Research Applications • Artificial self-learning systems • Philosophical investigations of speech-centric versus text-centric methods • Research models for cognitive science and consciousness theories • Examination of behaviourist versus cognitive semantic recognition of speech
Conclusion • The introduction of standards-based phonogrammic streams as a fundamental internal data structure • Support for unconstrained multimodal queries • The development of new mimetic means for comparative evaluation and demonstration • The provision of contextual strategies for the refinement of phonogrammic streams • Movement of the man-machine boundary to allow more effective partitioning of tasks between the human and the machine portions of the system • Design, implementation and testing of the Audient acoustic search engine