Corpus-based evaluation of prosodic phrase break prediction Claire Brierley and Eric Atwell School of Computing, University of Leeds Corpus Linguistics 2007, University of Birmingham
Prosody and prosodic phrase breaks
"In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned."
Terry Winograd, 1984
Punctuation is a way of 'annotating' phrase breaks in text, and is therefore one text-based feature used in automatic phrase break prediction.
The same passage with each punctuation mark replaced by a boundary marker:
In the popular mythology the computer is a mathematics machine | it is designed to do numerical calculations | Yet it is really a language machine | its fundamental power lies in its ability to manipulate linguistic tokens | symbols to which meaning has been assigned |
Terry Winograd, 1984
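A minimal sketch of how punctuation could be turned into boundary markers, as on the slide above. The set of punctuation marks treated as break cues and the function name `mark_breaks` are assumptions made for illustration, not part of the original talk.

```python
import re

# Punctuation treated as a break cue. This set is an assumption,
# chosen to cover the marks that occur in the Winograd quotation.
BREAK_CUE = re.compile(r"\s*(?:[.,:;!?]+|\s-\s)\s*")

def mark_breaks(text):
    """Replace each punctuation cue with a ' | ' boundary marker."""
    return BREAK_CUE.sub(" | ", text).strip()

quote = ("In the popular mythology the computer is a mathematics machine: "
         "it is designed to do numerical calculations. Yet it is really a "
         "language machine: its fundamental power lies in its ability to "
         "manipulate linguistic tokens - symbols to which meaning has been assigned.")

marked = mark_breaks(quote)
print(marked)
```

A real predictor would use punctuation as one feature among several rather than as a hard rule, since not every comma or dash corresponds to a spoken break.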
Positional syntactic features: n-grams
<NN><VBN><NP>
Once upon a time | there will be a little girl called Uncumber. | Uncumber will have a younger brother called Sulpice | and they will live with their parents | in a house in the middle of the woods. |
• upon a time = trigram where we expect a boundary next
• the middle of = trigram which might include a boundary
• live with = bigram which might include a boundary
• girl called = bigram where we might have a boundary next and which might also include a boundary…
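As a rough illustration of the n-gram features mentioned above, the sketch below extracts PoS-tag bigrams and trigrams from a short tagged sequence. The helper name `ngrams` and the tags assigned to "a" and "little" are assumptions for the example; only the <NN><VBN><NP> pattern appears on the slide.

```python
def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# (token, tag) pairs; the tags are illustrative assumptions.
tagged = [("a", "AT"), ("little", "JJ"), ("girl", "NN"),
          ("called", "VBN"), ("Uncumber", "NP")]
tags = [tag for _, tag in tagged]

tag_bigrams = ngrams(tags, 2)
tag_trigrams = ngrams(tags, 3)
print(tag_trigrams[-1])
```

Each n-gram can then be paired with a break or non-break label at the following juncture to form a training instance.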
Some top class phrase break models
There are 2 generic approaches:
• Deterministic or rule-based: chink chunk or CFP (Liberman & Church, 1992)
They will live | with their parents | in a house | in the middle | of the woods |
• Probabilistic or statistical: e.g. as used in Festival (CSTR) (Taylor & Black, 1998), 79% breaks-correct on MARSEC (Roach, P. et al, 1993)
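The chink chunk idea can be sketched in a few lines: a boundary is placed wherever a content word (chunk) is followed by a function word (chink). This is a simplified reading of the rule, not Liberman and Church's full algorithm, and the function-word list below is a hypothetical one that only covers the example sentence.

```python
# Hypothetical function-word (chink) list, just large enough for the example.
FUNCTION_WORDS = {"they", "will", "with", "their", "in", "a", "the", "of"}

def chink_chunk(tokens):
    """Insert '|' after a content word that is followed by a function word
    (or that ends the sentence)."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        is_content = tok.lower() not in FUNCTION_WORDS
        at_end = i + 1 == len(tokens)
        next_is_function = not at_end and tokens[i + 1].lower() in FUNCTION_WORDS
        if is_content and (next_is_function or at_end):
            out.append("|")
    return " ".join(out)

sentence = "They will live with their parents in a house in the middle of the woods"
phrased = chink_chunk(sentence.split())
print(phrased)
```

In practice the content/function decision would come from PoS tags rather than a word list, which is where a CFP tag per token becomes useful.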
Shallow or chunk parsing
Source: http://ironcreek.net/phpsyntaxtree/
[S [PP [IN In] [NP [AT the] [JJ popular] [NN mythology]]] [NP [AT the] [NN computer]] [VP [BEZ is] [NP [AT a] [NN mathematics] [NN machine.]]]]
In the popular mythology | the computer is a mathematics machine.
Chunk parse rule - using NLTK version 0.6:
parse.ChunkRule('<IN><IN|DT|DTI|AT|AP|CD|OD|PPO|PN|POSS|JJ|JJT|JJS|NP|NN|NNS>+', "<Chunk a preposition> <with sequences of other prepositions, determiners, numbers, certain pronouns, adjectives and nouns - and these can be in any order>")
The classification task
Task: to classify junctures between words
Train the model on "gold standard" speech corpus:
• training data: PoS tags + boundary tags
Test the model:
• unseen test set
• quantitative metrics: % boundaries correct? % insertion & deletion errors?
Model type: deterministic or probabilistic?
• break or non-break?
• rules or features?
Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries:
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Rule-based variant:
Given the state | of lawlessness | that exists | in Lebanon the uninformed outsider | might reasonably expect security | at Beirut airport | to be | amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 55.56%
Aix-MARSEC Corpus: annotated transcript of 1980s BBC news commentary
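The scores for the rule-based variant can be checked with a short sketch that treats both | and || as a break, records the index of the word before each break, and compares the two sets. The function names are made up for this example. Recall is the share of gold-standard breaks that the variant also has (5 of 6), and precision is the share of the variant's breaks that are in the gold standard (5 of 9).

```python
def break_positions(annotated):
    """Return the set of word indices that are followed by a boundary ('|' or '||')."""
    positions, n_words = set(), 0
    for tok in annotated.split():
        if set(tok) == {"|"}:
            positions.add(n_words - 1)  # boundary follows the previous word
        else:
            n_words += 1
    return positions

def recall_precision(gold, predicted):
    """Recall and precision of predicted breaks against gold-standard breaks."""
    g, p = break_positions(gold), break_positions(predicted)
    hits = len(g & p)
    return hits / len(g), hits / len(p)

gold = ("Given the state of lawlessness | that exists in Lebanon || the uninformed "
        "outsider might reasonably expect security | at Beirut airport || to be "
        "amongst the tightest in the world || but the opposite is true ||")
rule_based = ("Given the state | of lawlessness | that exists | in Lebanon the "
              "uninformed outsider | might reasonably expect security | at Beirut "
              "airport | to be | amongst the tightest in the world | but the "
              "opposite is true |")

recall, precision = recall_precision(gold, rule_based)
print(f"Recall = {recall:.2%}; Precision = {precision:.2%}")
```

Collapsing minor and major boundaries into a single break class is a simplification made here. A finer-grained evaluation could score the two boundary types separately.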
Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries:
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Intuitive prosodic phrasing:
Given the state of lawlessness that exists in Lebanon | the uninformed outsider | might reasonably expect | security | at Beirut airport | to be amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 71.43%
"..the very notion of evaluating a phrase-break model against a gold standard is problematic as long as the gold standard only represents one out of the space of all acceptable phrasings.." (Atterer and Klein, 2002)
Current work: developing a prosody lexicon
incoming corpus text
• already PoS-tagged
• format: list of tuples
• [..('gone', 'VBN'),..]
intersection with Python dictionary
• get some more tags
• e.g. CFP, stress pattern
• [..('gone', 'VBN', 'C', '1'),..]
• these tags are text-based features
Sources used:
• Computer-usable dictionary CUVPlus (Pedler, 2002) - incorporates C5 PoS tags
• Lexical stress patterns derived from CELEX2 database (Baayen et al, 1995) and Carnegie-Mellon Pronouncing dictionary (CMU, 1998)
Lexicon fields - and lookup
• Python dictionary syntax stores the above information as (key, value) pairs
{ ('cascades', 'NN2') : ['0', "k&'skeIdz", 'Kj%', 'NN2:1', '2', '01', 'C'],
  ('cascades', 'VVZ') : ['0', "k&'skeIdz", 'Ia%', 'VVZ:-1', '2', '01', 'C'] }
• Incoming corpus text - also in the form of (token, tag) tuples - can be matched against dictionary keys
• Thus intersection enables corpus text to accumulate additional values which have the potential to become features for machine learning tasks
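The lookup described above might be sketched as follows. The two entries are copied from the slide. The function name `enrich` is hypothetical, and it is an assumption here that the last field holds the content/function class and the second-to-last field holds the stress pattern.

```python
# Entries copied from the slide; keys are (token, C5 tag) tuples.
LEXICON = {
    ("cascades", "NN2"): ["0", "k&'skeIdz", "Kj%", "NN2:1", "2", "01", "C"],
    ("cascades", "VVZ"): ["0", "k&'skeIdz", "Ia%", "VVZ:-1", "2", "01", "C"],
}

def enrich(tagged, lexicon):
    """Extend each (token, tag) tuple with (class, stress) from the lexicon.

    Assumes the last field is the content/function class and the
    second-to-last is the stress pattern; unknown keys get None values.
    """
    enriched = []
    for token, tag in tagged:
        fields = lexicon.get((token.lower(), tag))
        if fields is None:
            enriched.append((token, tag, None, None))
        else:
            enriched.append((token, tag, fields[-1], fields[-2]))
    return enriched

result = enrich([("cascades", "NN2"), ("gone", "VBN")], LEXICON)
print(result)
```

Keying on the (token, tag) pair rather than the token alone lets the noun and verb readings of the same word form carry different field values.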
What I'd like to achieve
• Develop phrase break predictors representative of two generic approaches, rule-based and probabilistic, and compare their performance.
• Use the WEKA toolkit plus training data from the Aix-MARSEC corpus (Auran et al, 2004), which has linguistically sophisticated prosodic annotations, to explore a new mix of features for machine learning of phrase break prediction. This is where the prosody lexicon comes in.
• Develop a purpose-built corpus of different text genres and different annotation schemes to moderate the process of evaluating these phrase break models against one prosodic template.
• If I can develop a good model, then a possible contribution to the Aix-MARSEC project may be to enrich this gold standard by generating prosodic markup that offers an alternative to the corpus linguists' analysis. Outputs from the model would potentially represent legitimate phrasing strategies that vary from those already uncovered, and provide new prosodic templates for the evaluation of phrase break models.
Example problem - still working on it!
Input text: list of (token, tag) tuples
[.., ('that', 'CS'), ('individual', 'JJ'), ('willingness', 'NN'), ('to', 'TO'), ('pay', 'VB'), ('should', 'MD'), ('be', 'BE'), ('the', 'ATI'), ('main', 'JJB'), ('test', 'NN'), ('of', 'IN'), ('how', 'WRB'), ('resources', 'NNS'), ('are', 'BER'), ('used', 'VBN'), ('.', '.'), ..]
SEC: annotated transcript of Reith Lecture
• Input text is temporarily tagged with C5 for lexicon lookup
• Mapping C5 → LOB is usually a case of one-to-many
• However, C5 has separate tags for 'that' and 'of' - a case of many-to-one:
CJS (subordinating conjunction) or CJT (that) → CS
PRP (preposition) or PRF (of) → IN
• Need to resolve this to accomplish Python dictionary lookup (preferred option) or use a different lookup mechanism (hopefully not!)
• Problem compounded by the introduction of different PoS tag sets as a consequence of the planned composite test corpus
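One possible way to keep dictionary lookup working despite the many-to-one cases is to try each candidate C5 tag in turn. This is only a sketch of one option, not the authors' solution, which the slide describes as unresolved. The lexicon entries and field values below are placeholders, and the function name `lookup` is hypothetical.

```python
# LOB tags that correspond to more than one C5 tag, as listed on the slide.
LOB_TO_C5 = {"CS": ["CJS", "CJT"], "IN": ["PRP", "PRF"]}

# Hypothetical lexicon keys; the values are placeholders, not real CUVPlus fields.
LEXICON = {
    ("that", "CJT"): ["F"],
    ("of", "PRF"): ["F"],
    ("if", "CJS"): ["F"],
}

def lookup(token, lob_tag, lexicon):
    """Try each candidate C5 tag for a LOB tag; return (c5_tag, fields) or None.

    Tags without an entry in LOB_TO_C5 are tried unchanged, which is a
    simplification: a real mapping would cover the whole tag set.
    """
    for c5_tag in LOB_TO_C5.get(lob_tag, [lob_tag]):
        key = (token.lower(), c5_tag)
        if key in lexicon:
            return c5_tag, lexicon[key]
    return None

that_hit = lookup("that", "CS", LEXICON)
of_hit = lookup("of", "IN", LEXICON)
miss = lookup("pay", "VB", LEXICON)
print(that_hit, of_hit, miss)
```

This works only when at most one candidate key exists for a given token. A word form listed under both candidate tags would still need context to disambiguate.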