WP4-22. Final Evaluation of Subtitle Generator
Vincent Vandeghinste, Pan Yi
CCL – KULeuven
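Example
Transcript: Het meest spectaculaire aan de daadwerkelijke start van de euro is dat er eigenlijk niets spectaculairs te melden valt. (English: "The most spectacular thing about the actual start of the euro is that there is actually nothing spectacular to report.")
Subtitle: Het meest spectaculaire aan de start van de euro was dat er niets spectaculairs te melden valt. (English: "The most spectacular thing about the start of the euro was that there is nothing spectacular to report.")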
Availability Calculator
• Pronunciation time of the input sentence => estimate of the number of characters available in the subtitle
• If the pronunciation time is unknown, estimate it from:
  • the number of syllables in the sentence
  • the average speaking rate for Dutch
Syllable Counter
• Rule-based
• Evaluated on the CGN lexicon combined with FREQ lists
• Estimated number of syllables compared with the number of syllables in the phonetic transcripts
• For 99.63% of all words in CGN, the number of syllables is estimated correctly
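The slides do not list the actual rules of the syllable counter. As a rough illustration of what a rule-based counter can look like, here is a minimal sketch that counts maximal vowel groups in a word. The single rule, the vowel inventory and the function names are assumptions made for illustration, not the authors' rule set, and this naive version will miscount words with vowel hiatus.

```python
import re

# Assumption: one syllable per maximal run of vowel letters.
# This is a deliberately naive stand-in for the real rule set.
VOWEL_GROUP = re.compile(r"[aeiouyàáâäèéêëìíîïòóôöùúûü]+")


def count_syllables(word: str) -> int:
    """Estimate the number of syllables in a single word."""
    return len(VOWEL_GROUP.findall(word.lower()))


def count_sentence_syllables(sentence: str) -> int:
    """Sum the per-word estimates over a whitespace-tokenised sentence."""
    return sum(count_syllables(word) for word in sentence.split())


if __name__ == "__main__":
    for word in ["euro", "straf", "voetbal", "gevangenisstraf"]:
        print(word, count_syllables(word))
```

An evaluation like the one on the slide would compare these estimates against the syllable counts found in phonetic transcripts.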
Availability Calculator
• When the pronunciation time is not given: estimate it
• Subtitles: 70 chars / 6 sec = 11.67 chars/sec
• If the number of chars > the number of available chars => compress the sentence
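The arithmetic of the availability calculator can be sketched as follows. The 70 characters per 6 seconds come from the slide; the function names, the rounding down, and the `syllables_per_sec` parameter are assumptions, since the slides give no value for the average Dutch speaking rate.

```python
import math

CHARS_PER_SUBTITLE = 70    # from the slide
SECONDS_PER_SUBTITLE = 6   # from the slide: 70 / 6 = 11.67 chars/sec


def available_chars(duration_sec: float) -> int:
    """Characters that fit in a subtitle shown for duration_sec seconds.

    Rounding down is an assumption; the slides do not say how
    fractional characters are handled.
    """
    return math.floor(duration_sec * CHARS_PER_SUBTITLE / SECONDS_PER_SUBTITLE)


def estimate_duration(n_syllables: int, syllables_per_sec: float) -> float:
    """Estimate pronunciation time when it is not given.

    syllables_per_sec is a hypothetical parameter standing in for the
    average speaking rate for Dutch.
    """
    return n_syllables / syllables_per_sec


def needs_compression(sentence: str, duration_sec: float) -> bool:
    """True when the sentence is longer than the available characters."""
    return len(sentence) > available_chars(duration_sec)


if __name__ == "__main__":
    print(available_chars(6))
    print(needs_compression("x" * 71, 6))
```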
Sentence Compressor
• Parallel Corpus
• Sentence Analysis
• Sentence Compression
• Evaluation
Parallel Corpus
• Sentence aligned
• Source & target corpus:
  • Tagging
  • Chunking
  • SSUB detection
  • Chunk alignment
Chunk Alignment
• Every 4-gram from the source chunk is compared with every 4-gram from the target chunk
• A = (m / (m + n)) · (L1 + L2) / 2
• If A > 0.315, the chunks are aligned
• F-value for NP/PP alignment is 95%
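The alignment decision can be written out as a small worked sketch. The slide does not define m, n, L1 or L2, so the code below treats them as plain inputs and only reproduces the formula and the 0.315 threshold; the function names and the handling of the m + n = 0 case are assumptions.

```python
ALIGN_THRESHOLD = 0.315  # threshold from the slide


def alignment_score(m: float, n: float, l1: float, l2: float) -> float:
    """A = (m / (m + n)) * (L1 + L2) / 2, with all four values supplied
    by the caller because the slide does not define them."""
    if m + n == 0:
        # Assumption: no evidence at all counts as a score of zero.
        return 0.0
    return (m / (m + n)) * (l1 + l2) / 2


def should_align(m: float, n: float, l1: float, l2: float,
                 threshold: float = ALIGN_THRESHOLD) -> bool:
    """Align two chunks only when the score exceeds the threshold."""
    return alignment_score(m, n, l1, l2) > threshold


if __name__ == "__main__":
    print(alignment_score(3, 1, 0.5, 0.5))
    print(should_align(3, 1, 0.5, 0.5))
```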
Sentence Analysis
• Tagging (TnT): accuracy = 96.2% (Oostdijk et al., 2002)
• Chunking
Sentence Analysis (2)
• SSUB detection
Sentence Compression
• Use of statistics
• Use of rules
• Word reduction
• Selection of the compressed sentence
Use of rules
• To avoid generating ungrammatical sentences
• Rules of the type: "For every NP, never remove the head noun"
• Rules are applied recursively
Word Reduction
• Example: replace gevangenisstraf ("prison sentence") by straf ("punishment")
• Counterexample: replace voetbal ("football") by bal ("ball")
• Makes use of the Wordbuilding module (WP2)
• Introduces a lot of errors: does it add accuracy?
• Better integration with the rest of the system should be possible
Selection of the Compressed Sentence
• All previous steps result in an ordered list of sentence alternatives
  • These are supposed to be grammatically correct
• Sentences are ordered by their probability
• The first (most probable) sentence with a length smaller than the available number of chars is chosen
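The selection step described above amounts to a first-fit search over the ranked alternatives. A minimal sketch, assuming the alternatives arrive as a list of strings already sorted from most to least probable; the function name and the `None` return value for the case where nothing fits are illustrative choices.

```python
from typing import List, Optional


def select_compressed(alternatives: List[str], max_chars: int) -> Optional[str]:
    """Return the first (most probable) alternative that fits.

    alternatives must already be ordered by descending probability.
    The strict "<" follows the slide's wording ("smaller than").
    """
    for sentence in alternatives:
        if len(sentence) < max_chars:
            return sentence
    # Nothing fits: no output is produced for this sentence.
    return None


if __name__ == "__main__":
    ranked = ["a" * 10, "a" * 5, "a" * 3]
    print(select_compressed(ranked, 6))
```

Returning `None` when no alternative is short enough corresponds to a case where the system produces no output.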
Subtitle Layout Generator
Actieve of gewezen voetballers zoals Ruud Gullit of Dennis Bergkamp moeten het stellen met nauwelijks anderhalf miljard . (English: "Active or former footballers such as Ruud Gullit or Dennis Bergkamp have to make do with barely one and a half billion.")
becomes
Actieve of gewezen voetballers zoals Ruud Gullit of Dennis Bergkamp moeten het stellen met nauwelijks anderhalf miljard .
Conclusion
• The system approach works very well:
  • if the sentence analysis is correct
  • if there are possible reductions (according to the rule set)
• A lot of No Output cases: the system cannot reduce the sentence
  • The sentence cannot be reduced (even by humans)
  • The rule set is too strict / wrong sentence analysis
  • The statistical info is not fine-grained enough
• Bad output:
  • Wrong sentence analysis (CONJ)
  • Wrong word reductions
Future
• Near future (within Atranos)
  • Better integration of word reduction
  • Combine the advantages of the CNTS approach and the CCL approach into one approach
• Far future (outside Atranos)
  • Better sentence analysis: a full parse is needed
  • More fine-grained analysis of the parallel corpus