Summarization of Broadcast News using Speaker Tracking

Sree Harsha Yella, Kishore Prahallad, Vasudeva Varma. LTRC, IIIT-Hyderabad.

Introduction • Summarization is the process of identifying the important information in a medium, by either extraction or abstraction, and presenting it to the user in the desired manner. • Speech summarization systems take a speech signal as input and produce a summary in either text or speech form. • Pipeline: speech signal → speech summarization → summary (text or speech)

Previous work • Broadly classified into two categories • Application of text summarization approaches such as MMR and LSA to ASR output (speech signal → ASR → text summarization → text summary) • Supervised systems using both lexical and acoustic features (speech signal → ASR → lexical features; speech signal → signal processing → acoustic features; both feature sets → classifier, trained on manual summaries → summary)

Previous work • Issues • Dependence on ASR output, and/or • Dependence on human reference summaries for training the classifier • The current work aims to summarize news without these dependencies, using anchor speaker tracking

Scope and Aim • Scope • Broadcast news featuring an anchor speaker with reporters and other speakers taking turns to present a news story • Aim of current summarizer • Generate extractive audio summaries that are indicative or informative

Motivation • Anchor speaker segments are precise, informative and well formed • They may form good candidates for extractive summarization • Key idea • Apply speaker tracking technology to track anchor speakers • Use the automatically identified anchor speaker segments for speech summarization

Dataset used • Global News podcast of BBC News • Daily report of news stories from around the world • Single anchor speaker per show • 10 shows in total, with 5 anchor speakers (3 male, 2 female)

Overview of system • Block diagram of summarization system: speech signal → feature extraction → anchor speaker tracking → post processing → final anchor segments → concatenation → summary

Speaker Tracking • Autoassociative neural network (AANN) • Identity mapping in the input space • Input vector is given as the desired output • Iterative training • 13 MFCCs for each speech frame • Initial 30 seconds of speech from the anchor speaker • Confidence measure (c[n]) based on mean squared error (e[n]) • c[n] = exp(-e[n]), where n is the frame number
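
The confidence computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual AANN form c[n] = exp(-e[n]), in which a low reconstruction error gives a confidence near 1, and it assumes the contour is smoothed with a simple moving average. The function names and the window length are hypothetical.

```python
import math

def frame_mse(features, reconstruction):
    """Mean squared error e[n] between one input frame (e.g. 13 MFCCs)
    and the network's reconstruction of that frame."""
    diffs = [(x - y) ** 2 for x, y in zip(features, reconstruction)]
    return sum(diffs) / len(diffs)

def confidence_scores(errors):
    """Map per-frame errors e[n] to confidences c[n] = exp(-e[n]) (assumed form)."""
    return [math.exp(-e) for e in errors]

def smooth(values, window=5):
    """Centered moving average; the window length is an illustrative assumption."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Frames that the anchor-trained network reconstructs well receive high confidence, so anchor regions appear as raised plateaus in the smoothed contour.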

Speaker Tracking • Smoothed confidence contour with anchor regions marked

Speaker Tracking • Iterative training • Smoothed confidence contour is divided into non-overlapping segments of 5 sec each • Threshold is computed as mean of the confidence scores of the training data • Confidence scores of each segment are compared against the threshold • Identified segments are added to training data • The process is repeated until the model converges
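
One pass of the segment selection step might look like the sketch below. It assumes that each segment is scored by the mean of its confidence values and accepted when that mean reaches the threshold; the slides do not state the exact comparison rule. `select_anchor_segments` and `frames_per_segment` are hypothetical names, and `frames_per_segment` would be 5 seconds times the frame rate.

```python
def select_anchor_segments(contour, train_scores, frames_per_segment):
    """Return (threshold, indices of segments accepted as anchor speech).

    contour            -- smoothed per-frame confidence scores for the show
    train_scores       -- confidence scores of the current training data
    frames_per_segment -- number of frames in one 5-second segment
    """
    # Threshold is the mean confidence over the training data.
    threshold = sum(train_scores) / len(train_scores)
    selected = []
    for start in range(0, len(contour), frames_per_segment):
        segment = contour[start:start + frames_per_segment]
        # Assumption: a segment is scored by its mean confidence.
        if sum(segment) / len(segment) >= threshold:
            selected.append(start // frames_per_segment)
    return threshold, selected
```

In the iterative scheme, the frames of the accepted segments would be added to the training data, the model retrained, and this selection repeated until the set of accepted segments stops changing.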

Performance of Speaker Tracking

Post processing • Missed segment identification • Unidentified segments having anchor speaker segments on either side • False alarm detection • Isolated segments without an anchor speaker segment in the neighbourhood of 10 sec • Final anchor speaker segments are obtained by adding the missed segments and removing the false alarms
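
The two post-processing rules can be sketched over a list of per-segment anchor decisions. The sketch makes several assumptions: segments are 5 seconds long, a missed segment is a single unlabelled segment with anchor segments immediately on both sides, and the 10-second neighbourhood extends to both sides of a segment. The function name is hypothetical.

```python
def post_process(is_anchor, segment_sec=5, neighbourhood_sec=10):
    """Add missed segments and remove false alarms.

    is_anchor -- list of booleans, one per segment, from speaker tracking
    """
    n = len(is_anchor)
    reach = neighbourhood_sec // segment_sec  # neighbourhood size in segments
    result = list(is_anchor)
    for i in range(n):
        left = is_anchor[i - 1] if i > 0 else False
        right = is_anchor[i + 1] if i < n - 1 else False
        if not is_anchor[i] and left and right:
            # Missed segment: anchor speech on either side.
            result[i] = True
        elif is_anchor[i]:
            neighbours = is_anchor[max(0, i - reach):i] + is_anchor[i + 1:i + 1 + reach]
            if not any(neighbours):
                # False alarm: no anchor segment within the neighbourhood.
                result[i] = False
    return result
```

Both rules are evaluated against the original decisions, so a segment recovered as missed does not by itself rescue a neighbouring false alarm.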

Summary construction • Concatenation with compression • Compression ratio (cr) • Summary length (Sl) = cr × Tl, where Tl is the total length of the news show • The number of news stories is approximated by the number of final anchor speaker regions (N) • Duration of each story in summary (D) • D = Sl/N • Concatenation of the initial D seconds of speech from each news story
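
As a worked example of the arithmetic, a 1800-second show at cr = 0.25 gives Sl = 450 seconds; with N = 3 anchor regions, D = 150 seconds per story. The sketch below computes the excerpt boundaries. The function name and the representation of stories by their start times are assumptions; clipping an excerpt at the start of the next story is also an assumption, not something stated on the slides.

```python
def summary_intervals(total_len_sec, cr, story_starts):
    """Return (start, end) times in seconds of the excerpts to concatenate.

    total_len_sec -- Tl, total length of the news show
    cr            -- compression ratio, e.g. 0.25
    story_starts  -- start time of each final anchor speaker region (N of them)
    """
    summary_len = cr * total_len_sec            # Sl = cr * Tl
    per_story = summary_len / len(story_starts)  # D = Sl / N
    intervals = []
    for i, start in enumerate(story_starts):
        # Assumption: an excerpt never runs past the next story or the show end.
        limit = story_starts[i + 1] if i + 1 < len(story_starts) else total_len_sec
        intervals.append((start, min(start + per_story, limit)))
    return intervals
```

Concatenating the audio in these intervals, in order, yields the summary.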

Evaluation • Two types of evaluation • ROUGE-based evaluation • Measures n-gram overlap between reference summaries and the automatic summary • Human evaluation • Evaluates the quality of the audio summaries

Evaluation • ROUGE-based evaluation • Automatic audio summaries are transcribed by hand into text • Model summaries are generated by humans at a 25% compression ratio
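
The n-gram overlap behind ROUGE-N can be illustrated with a simplified, single-reference sketch. It is not the official ROUGE toolkit: it uses whitespace tokenization and lowercasing only, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) of n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure
```

Recall is the share of reference n-grams found in the automatic summary, and precision is the share of automatic-summary n-grams found in the reference. A longer summary can therefore raise recall while lowering precision.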

Evaluation • Recall (solid line), Precision (dashed line), F-measure (dotted line) for various cr

Evaluation • Human evaluation • Question based • Questions of type what, where, who, when • Five undergraduate students listened to summaries at different compression ratios and answered the questions

Conclusions • Proposed a method to generate automatic audio summaries for broadcast news • Good overlap between reference summaries and automatic summaries • Audio summaries showed an increase in recall with increasing compression ratio, without much drop in precision • Human evaluation of the audio summaries also shows a similar trend

