CS4705: Corpus Linguistics and Machine Learning Techniques
Review • What do we know about so far? • Words (stems and affixes, roots and templates,…) • Ngrams (simple word sequences) • POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
Some Additional Things We Could Find • Named Entities • Persons • Company Names • Locations • Dates
What useful things can we do with this knowledge? • Find sentence boundaries, abbreviations • Find Named Entities (person names, company names, telephone numbers, addresses,…) • Find topic boundaries and classify articles into topics • Identify a document’s author and their opinion on the topic, pro or con • Answer simple questions (factoids) • Do simple summarization/compression
But first, we need corpora… • Online collections of text and speech • Some examples • Brown Corpus • Wall Street Journal and AP News • ATIS, Broadcast News • TDT • Switchboard, CallHome • TRAINS, FM Radio, BDC Corpus • Canadian Hansards parallel corpus of French and English • And many private research collections
Next, we pose a question…the dependent variable • Binary questions: • Is this word followed by a sentence boundary or not? • A topic boundary? • Does this word begin a person name? End one? • Should this word or sentence be included in a summary? • Classification: • Is this document about medical issues? Politics? Religion? Sports? … • Predicting continuous variables: • How loud or high should this utterance be produced?
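One of the binary questions above, "is this word followed by a sentence boundary or not?", can be framed as follows: every token becomes one example with a yes/no label. The heuristic below is an illustrative stand-in, not the course's actual method, and the abbreviation list is an assumption for the example.

```python
# Posing a binary question: "is this token followed by a sentence boundary?"
# Each token becomes one training example with a yes/no dependent variable.
# The rule below is a toy heuristic for illustration only.

def is_sentence_boundary(token, next_token):
    """Binary dependent variable: does a sentence boundary follow `token`?"""
    if not token.endswith("."):
        return False
    # Abbreviations like "Dr." usually do not end a sentence.
    if token.rstrip(".").lower() in {"dr", "mr", "mrs", "etc", "e.g", "i.e"}:
        return False
    # A following capitalized word (or end of text) is weak evidence of a boundary.
    return next_token is None or next_token[:1].isupper()

tokens = ["Dr.", "Smith", "arrived.", "He", "sat", "down."]
labels = [is_sentence_boundary(t, tokens[i + 1] if i + 1 < len(tokens) else None)
          for i, t in enumerate(tokens)]
# labels → [False, False, True, False, False, True]
```

Note that "Dr." is correctly kept inside the sentence while "arrived." is marked as a boundary; a real system would learn such distinctions from labeled data rather than hard-code them.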
Finding a suitable corpus and preparing it for analysis • Which corpora can answer my question? • Do I need to get them labeled to do so? • Dividing the corpus into training and test corpora • To develop a model, we need a training corpus • overly narrow corpus: doesn't generalize • overly general corpus: doesn't reflect the task or domain • To demonstrate how general our model is, we need a test corpus to evaluate the model • Development test set vs. held-out test set • To evaluate our model we must choose an evaluation metric • Accuracy • Precision, recall, F-measure,… • Cross-validation
Then we build the model… • Identify the dependent variable: what do we want to predict or classify? • Does this word begin a person name? Is this word within a person name? • Is this document about sports? The weather? International news? ??? • Identify the independent variables: what features might help to predict the dependent variable? • What is this word's POS? What is the POS of the word before it? After it? • Is this word capitalized? Is it followed by a '.'? • Does 'hockey' appear in this document? • How far is this word from the beginning of its sentence? • Extract the values of each variable from the corpus by some automatic means
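The feature questions above translate directly into an extraction function: for each word position, compute the independent variables automatically. The feature names below are illustrative choices, not a fixed inventory.

```python
# Extracting independent variables (features) for one word, mirroring the
# questions on the slide: capitalization, a following period, neighboring
# words, and distance from the start of the sentence.

def extract_features(words, i):
    return {
        "word": words[i].lower(),
        "is_capitalized": words[i][:1].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",   # <s> = sentence start
        "next_is_period": i + 1 < len(words) and words[i + 1] == ".",
        "position_in_sentence": i,
    }

words = ["John", "called", "AT&T", "yesterday", "."]
feats = extract_features(words, 0)
# feats["is_capitalized"] → True; feats["prev_word"] → "<s>"
```

Each feature dictionary, paired with its gold label, becomes one row of training data for whatever learner is used downstream.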
An Example: Finding Caller Names in Voicemail (SCANMail) • Motivated by interviews, surveys and usage logs of heavy users: • Hard to scan new msgs to find those you need to deal with quickly • Hard to find the msg you want in the archive • Hard to locate the information you want in any msg • How could we help?
SCANMail Architecture [diagram: Caller → SCANMail → Subscriber]
Corpus Collection • Recordings collected from 138 AT&T Labs employees' mailboxes • 100 hours; 10K msgs; 2500 speakers • Gender balanced; 12% non-native speakers • Mean message duration 36.4 secs, median 30.0 secs • Hand-transcribed and annotated with caller ID, gender, age, entity demarcation (names, dates, telephone numbers) • Also recognized using an ASR engine
Transcription and Bracketing [ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she
would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
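The bracketed annotations in the transcript above follow a "[ Label: text ]" pattern, so the labeled entities can be recovered with a short regular expression. This is a sketch assuming flat (non-nested) labels; filler markers like "[ um ]" carry no colon and are skipped.

```python
import re

# Reading the bracketed annotations shown above: spans like "[ Date: tomorrow ]"
# mark entities in the hand-transcribed messages. Handles flat labels only.
ANNOTATION = re.compile(r"\[\s*(\w+):\s*(.*?)\s*\]")

def extract_annotations(transcript):
    """Return (label, text) pairs for each bracketed annotation."""
    return ANNOTATION.findall(transcript)

text = ("[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] "
        "there's some change [ Date: tomorrow ] with the nursery school")
# extract_annotations(text)
# → [("Greeting", "hi R"), ("CallerID", "it's me"), ("Date", "tomorrow")]
```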
SCANMail Demo http://www.avatarweb.com/scanmail/ Audix extension: demo Audix password: (null)
Information Extraction (Martin Jansche and Steve Abney) • Goals: extract key information from msgs to present in headers • Approach: • Supervised learning from transcripts (phone #'s, caller self-ids) • Combine machine learning techniques with simpler alternatives, e.g. hand-crafted rules • Two-stage approaches
Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)
Telephone Number Identification • Rules convert all numbers to standard digit format • Predict start of phone number with rules • This step over-generates • Prune with decision-tree classifier • Best features: • Position in msg • Lexical cues • Length of digit string • Performance: • .94 F on human-labeled transcripts • .95 F on ASR transcripts
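The two-stage shape of the slide above can be sketched as rule-based over-generation followed by pruning. A simple digit-string-length test stands in here for the decision-tree classifier; the regex and length thresholds are assumptions for illustration.

```python
import re

# Stage 1 (over-generates): find every maximal digit run, separators allowed,
# and normalize it to a bare digit string.
def candidate_digit_strings(text):
    return [re.sub(r"\D", "", m)
            for m in re.findall(r"[\d][\d\s\-().]*\d", text)]

# Stage 2 (prunes): keep candidates whose length matches common phone formats
# (7-digit local, 10- or 11-digit long-distance). A length test stands in for
# the decision-tree classifier described on the slide.
def prune(candidates):
    return [c for c in candidates if len(c) in (7, 10, 11)]

text = "call me back at 973 360-8000 extension 42 before 5"
# prune(candidate_digit_strings(text)) → ["9733608000"]
```

The over-generate-then-prune split lets cheap rules guarantee recall while the classifier restores precision, which is the design the slide describes.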
Caller Self-Identifications • Predict start of id with classifier • 97% of ids begin 1-7 words into msg • Then predict length of phrase • Majority are only 2-4 words long • Avoids relying on correct speech recognition for names • Best cues to end of phrase are a few common words • 'I', 'could', 'please' • No actual names: they over-fit the data • Performance: • .71 F on human-labeled transcripts • .70 F on ASR transcripts
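The two-stage idea above (predict the start, then the length) can be sketched as follows. The trigger words and window sizes echo the slide's statistics, but the scoring itself is a stand-in heuristic, not the actual classifier; note it uses no name features, consistent with the slide's point that names over-fit.

```python
# Two-stage self-identification sketch: guess where the caller's self-id
# starts (almost always within the first few words), then extend it until a
# common cue word signals the end.

END_CUES = {"i", "could", "please"}   # common end-of-phrase cues from the slide

def find_self_id(words, max_start=7, max_len=4):
    for i in range(min(max_start, len(words))):
        # Stand-in for the start classifier: self-ids tend to follow
        # phrases like "it's ..." or "this is ...".
        if words[i].lower() in {"it's", "is"}:
            start = i + 1
            end = start
            while (end < len(words) and end - start < max_len
                   and words[end].lower() not in END_CUES):
                end += 1
            return words[start:end]
    return []

words = "hi Bob this is Jane Smith could you call me back".split()
# find_self_id(words) → ["Jane", "Smith"]
```

Because the end is cued by a handful of common function words rather than by recognizing the name itself, the approach degrades gracefully on ASR output, matching the nearly identical .71/.70 F scores reported above.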