Better Punctuation Prediction with Dynamic Conditional Random Fields Wei Lu and Hwee Tou Ng National University of Singapore
Talk Overview • Background • Related Work • Approaches • Previous approach: Hidden Event Language Model • Previous approach: Linear-Chain CRF • This work: Factorial CRF • Evaluation • Conclusion
Punctuation Prediction • Automatically insert punctuation symbols into transcribed speech utterances • Widely studied in the speech processing community • Example: >> Original speech utterance: you are quite welcome and by the way we may get other reservations so could you please call us as soon as you fix the date >> Punctuated (and cased) version: You are quite welcome . And by the way , we may get other reservations , so could you please call us as soon as you fix the date ?
Our Task Perform punctuation prediction for conversational speech texts without relying on prosodic features • Processing prosodic features requires access to the raw speech data, which may be unavailable • Tackles the problem from a text processing perspective
Related Work • With prosodic features • Kim and Woodland (2001): a decision tree framework • Christensen et al. (2001): finite state and multi-layer perceptron approaches • Huang and Zweig (2002): a maximum entropy-based approach • Liu et al. (2005): linear-chain conditional random fields • Without prosodic features • Beeferman et al. (1998): comma prediction with a trigram language model • Gravano et al. (2009): an n-gram based approach
Related Work (continued) • One well-known approach that does not exploit prosodic features • Stolcke et al. (1998) presented a hidden event language model • It treats boundary detection and punctuation insertion as an inter-word hidden event detection task • Widely used in many recent spoken language translation tasks as either a pre-processing (Wang et al., 2008) or post-processing (Kirchhoff and Yang, 2007) step
Hidden Event Language Model • HMM (Hidden Markov Model)-based approach • A joint distribution over words and inter-word events • Observations are the words, and word/event pairs are hidden states • Implemented in the SRILM toolkit (Stolcke, 2002) • Variant of this approach • Relocates/duplicates the ending punctuation symbol to be closer to the indicative words • Works well for predicting English question marks • where is the nearest bus stop ? • ? where is the nearest bus stop
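As a sketch, the relocation variant can be implemented as a simple preprocessing step on the training text (the function name and the choice of which symbols to move are assumptions, not details from the talk):

```python
def relocate_final_punct(tokens, symbols=("?", "!")):
    """Move a sentence-final ? or ! to the front of the sentence,
    so that an n-gram LM sees it adjacent to sentence-initial
    indicative words (e.g. "where", "would you")."""
    if tokens and tokens[-1] in symbols:
        return [tokens[-1]] + tokens[:-1]
    return list(tokens)

print(relocate_final_punct("where is the nearest bus stop ?".split()))
# → ['?', 'where', 'is', 'the', 'nearest', 'bus', 'stop']
```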
Linear-Chain CRF • Linear-chain conditional random fields (L-CRF): undirected graphical model for sequence labeling • Avoids the strong independence assumptions made by the hidden event language model • Capable of modeling dependencies with arbitrary, non-independent, overlapping features • [Diagram: word-layer tags Y1, Y2, Y3, …, Yn over the utterance X1, X2, X3, …, Xn]
An Example L-CRF • A linear-chain CRF assigns a single tag to each individual word at each time step • Tags: NONE, COMMA, PERIOD, QMARK, EMARK • Factorized features • Sentence: • no , please do not . would you save your questions for the end of my talk , when i ask for them ? • COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK • no please do not would you … my talk when … them
Features for L-CRF • Feature factorization (Sutton et al., 2007) • Product of a binary function on the assignment of the set of cliques at each time step, and a feature function defined solely on the observation sequence • Feature functions: n-gram (n = 1, 2, 3) occurrences within 5 words of the current word • Example: for the word “do”: • do@0, please@-1, would_you@[2,3], no_please_do@[-2,0] • COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK • no please do not would you … my talk when … them
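These feature functions can be sketched as follows. The exact window convention is an assumption; here an n-gram qualifies if it starts within 2 positions of the current word, which reproduces the features in the example above:

```python
def ngram_features(words, i, max_n=3, window=2):
    """N-gram (n = 1..3) occurrence features around position i,
    in the do@0 / would_you@[2,3] notation from the slide."""
    feats = []
    for n in range(1, max_n + 1):
        for start in range(i - window, i + window + 1):
            end = start + n - 1
            if start < 0 or end >= len(words):
                continue                       # n-gram falls off the utterance
            gram = "_".join(words[start:end + 1])
            if n == 1:
                feats.append(f"{gram}@{start - i}")
            else:
                feats.append(f"{gram}@[{start - i},{end - i}]")
    return feats

feats = ngram_features("no please do not would you save".split(), 2)  # word "do"
print(feats)
```

The returned list contains, among others, `do@0`, `please@-1`, `would_you@[2,3]`, and `no_please_do@[-2,0]`, matching the example.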
Problems with L-CRF • The long-range dependency between a punctuation symbol and its indicative words cannot be captured properly • For example: • no please do not would you save your questions for the end of my talk when i ask for them • It is hard for a linear-chain CRF to capture the long-range dependency between the ending question mark (?) and the initial phrase “would you”
Problems with L-CRF • What humans might do • Read the utterance: no please do not would you save your questions for the end of my talk when i ask for them • Segment it into sentences: no please do not | would you save your questions for the end of my talk when i ask for them • Punctuate each sentence: no , please do not . | would you save your questions for the end of my talk , when i ask for them ? • Sentence-level punctuation symbols (. ? !) are associated with the complete sentence, and therefore should be assigned at the sentence level
What Do We Want? • A model that jointly performs all the following three tasks together • Sentence boundary detection (or sentence segmentation) • Sentence type identification • Punctuation insertion
Factorial CRF • An instance of dynamic CRF • Two-layer factorial CRF (F-CRF) jointly annotates an observation sequence with two label sequences • Models the conditional probability of the label sequence pair <Y, Z> given the observation sequence X • [Diagram: sentence-layer tags Z1, Z2, Z3, …, Zn and word-layer tags Y1, Y2, Y3, …, Yn over the utterance X1, X2, X3, …, Xn]
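Following Sutton et al.’s (2007) formulation of dynamic CRFs, the two-layer factorial CRF has log-linear potentials on the within-chain edges (Y_{t-1}, Y_t) and (Z_{t-1}, Z_t) and the cotemporal edges (Y_t, Z_t); the following is a sketch of that standard form, not necessarily the paper’s exact parameterization:

```latex
p(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \prod_{t=1}^{n}
    \Psi(y_{t-1}, y_t, \mathbf{x}, t)\,
    \Phi(z_{t-1}, z_t, \mathbf{x}, t)\,
    \Omega(y_t, z_t, \mathbf{x}, t)
```

where each potential is an exponentiated weighted sum of features, e.g. \Psi(y_{t-1}, y_t, \mathbf{x}, t) = \exp\{\sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\}, and Z(\mathbf{x}) normalizes over all label sequence pairs.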
Example of F-CRF • Propose two sets of tags for this joint task • Word-layer: NONE, COMMA, PERIOD, QMARK, EMARK • Sentence-layer: DEBEG, DEIN, QNBEG, QNIN, EXBEG, EXIN • The same feature factorization and feature functions as in the L-CRF are used • DEBEG DEIN DEIN DEIN QNBEG QNIN … QNIN QNIN QNIN … QNIN • COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK • no please do not would you … my talk when … them
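Both tag layers can be read off directly from punctuated training text. A minimal sketch (the helper name and the whitespace tokenization are assumptions; the tag names follow the talk):

```python
WORD_TAG = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}
SENT_PREFIX = {".": "DE", "?": "QN", "!": "EX"}  # declarative / question / exclamatory

def make_tags(tokens):
    """Derive the word-layer and sentence-layer tag sequences
    from a punctuated, tokenized utterance."""
    words, word_tags, sent_tags = [], [], []
    sent_len = 0                                # words in the current sentence so far
    for tok in tokens:
        if tok in WORD_TAG:
            if word_tags:
                word_tags[-1] = WORD_TAG[tok]   # punctuation attaches to the preceding word
            if tok in SENT_PREFIX and sent_len: # sentence-final mark labels the whole sentence
                prefix = SENT_PREFIX[tok]
                sent_tags += [prefix + "BEG"] + [prefix + "IN"] * (sent_len - 1)
                sent_len = 0
        else:
            words.append(tok)
            word_tags.append("NONE")
            sent_len += 1
    return words, word_tags, sent_tags

words, wt, st = make_tags("no , please do not . would you save your questions ?".split())
print(wt)  # word-layer tags
print(st)  # sentence-layer tags
```

On this input the word layer comes out as COMMA NONE NONE PERIOD NONE NONE NONE NONE QMARK and the sentence layer as DEBEG DEIN DEIN DEIN QNBEG QNIN QNIN QNIN QNIN, as in the slide.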
Why Does it Work? • The sentence-layer tags are used for sentence segmentation and sentence type identification • The word-layer tags are used for punctuation insertion • Knowledge learned from the sentence layer can guide the word-layer tagging process • The two layers are jointly learned, providing evidence that influences each other’s tagging process • [no please do not]declarative sent. [would you save your questions for the end of my talk when i ask for them]question sent. ? QNBEG QNIN …
Evaluation Datasets • IWSLT 2009 BTEC and CT datasets • Consist of both English (EN) and Chinese (CN) texts • 90% used for training, 10% for testing
Experimental Setup • Designed extensive experiments for Hidden Event Language Model • Duplication vs. No duplication • Single-pass vs. Cascaded • Trigram vs. 5-gram • Conducted the following experiments • Accuracy on CRR texts (F1 measure) • Accuracy on ASR texts (F1 measure) • Translation performance with punctuated ASR texts (BLEU metric)
Punctuation Prediction: Evaluation Metrics • Precision = (# correctly predicted punctuation symbols) / (# predicted punctuation symbols) • Recall = (# correctly predicted punctuation symbols) / (# expected punctuation symbols) • F1 measure = 2 / (1/Precision + 1/Recall)
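A direct implementation of these metrics (a sketch: matching predictions to the reference as (position, symbol) pairs is one plausible scheme, not necessarily the one used in the evaluation):

```python
def precision_recall_f1(predicted, expected):
    """Precision, recall and F1 over sets of (position, symbol) pairs."""
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(expected) if expected else 0.0
    f1 = 2 / (1 / p + 1 / r) if p and r else 0.0   # harmonic mean of P and R
    return p, r, f1

# two of three predictions match the reference: P = R = F1 = 2/3
print(precision_recall_f1({(0, ","), (3, "."), (10, "?")},
                          {(0, ","), (3, "."), (10, "!")}))
```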
Punctuation Prediction Evaluation: Correctly Recognized Texts (I) • The “duplication” trick for hidden event language model is language specific • Unlike English, indicative words can appear anywhere in a Chinese sentence
Punctuation Prediction Evaluation: Correctly Recognized Texts (II) • Significant improvement over L-CRF (p<0.01) • Our approach is general: requires minimal linguistic knowledge, consistently performs well across different languages
Punctuation Prediction Evaluation: Automatically Recognized Texts • 504 Chinese utterances, and 498 English utterances • Recognition accuracy: 86% and 80% respectively • Significant improvement (p < 0.01)
Punctuation Prediction Evaluation: Translation Performance • This tells us how well the punctuated ASR outputs can be used for downstream NLP tasks • Use Berkeley aligner and Moses (lexicalized reordering) • Averaged BLEU-4 scores over 10 MERT runs with random initial parameters
Conclusion • We propose a novel approach for punctuation prediction without relying on prosodic features • Jointly performs punctuation prediction, sentence boundary detection, and sentence type identification • Performs better than the hidden event language model and a linear-chain CRF model • A general approach that consistently works well across different languages • Effective when incorporated with downstream NLP tasks