SEQUENCE PACKAGE ANALYSIS: A New Natural Language Understanding Method for Performing Data Mining of Help-Line Calls and Doctor-Patient Interviews. AMY NEUSTEIN, Ph.D., LINGUISTIC TECHNOLOGY SYSTEMS, lingtec@banet.net. PRESENTATION TO NLUCS Workshop at ICEIS, University of Portugal, April 13, 2004.
WHY DO WE NEED A NEW NATURAL LANGUAGE METHOD? 1) In the real world, speakers do not always use the “key” words that appear in the application vocabulary, which can lead to a poor word match between the user’s input and the application vocabulary. 2) Building a Statistical Language Model that accommodates the various ways users speak requires a large data corpus that is costly to assemble, and even then there is no guarantee that an accurate word match will be found.
APPLICATIONS OF SEQUENCE PACKAGE ANALYSIS: 1) An “add on” layer of intelligence to audio data mining programs used for recorded help-line calls to extract business intelligence data and to detect early warning signs of caller frustration. 2) An “add on” layer of intelligence for mining doctor-patient interviews to uncover important medical history data, often buried in the ambiguity of patient dialog.
How Does Sequence Package Analysis (SPA) Work? SPA provides a “filter” for the front end of a speech recognizer, using generic templates that can be deployed in many different applications and languages; SPA can be used with vector-based models that hold spaces and determine “global weighting” of lexical items. SPA parses NL dialog to locate a series of related turns that are discretely packaged as a sequence of conversational interaction. SPA locates entire sequence packages rather than isolated key words, operating on the principle that it is easier to find a generic sequence package in a dialog than specific keywords. That is, speakers are more likely to vary in their choice of keywords than in their conversational sequence patterns, making it more difficult for a speech application to represent a speaker’s wide range of word choices than to represent actual conversational sequence patterns.
METHODOLOGICAL BASIS OF SPA SPA draws mainly from the field of conversation analysis: the study of the orderly properties of interactive dialog that revolve around the turn-taking process and other sequentially based features that are part of that process, such as the production of recycled turn beginnings when there is an overlap with a prior turn. SPA focuses on social action and how human-machine and human-human dialog is accomplished as a situated, interactive event. The discourse structures are therefore analyzed for their social interactive value rather than solely for their grammatical discourse structure.
ALGORITHMIC DESIGN OF SPA SPA algorithms, which are currently under development, consist of sequences that are either small segments of dialog or large sequences that can potentially span the entire dialog. But regardless of the size of the sequence package, the purpose of SPA is to locate the indigenous patterns in the dialog that evolve as the dialog unfolds. By using SPA to parse Natural Language dialog, those features which are evolving and dynamic (e.g., early warning signs of caller frustration; or a patient’s concerns about an illness) can be detected by grammars that are flexible enough to recognize dynamic patterns.
THE HEURISTIC VALUE OF SPA 1. Building Application Vocabularies: The SPA method of parsing dialog allows the discovery of new words, to be added to the application vocabulary, by locating the generic sequence packages in which such words appear. 2. Gathering Business Intelligence and Medical History Data: By tracking the nature and frequency of sequence packages, the system can identify important business intelligence data and medical history data that would have ordinarily eluded the system.
VALIDATION OF SEQUENCE PACKAGE ANALYSIS Does the addition of SPA improve speech recognition capabilities? Hypothesis “A”: By adding an SPA filter to a speech recognizer to improve analysis of speech input, one can significantly streamline the corpus of data required to build a Statistical Language Model. Hypothesis “B”: By adding an SPA filter to a Statistical Language Model that contains the full spectrum of possible utterances (as opposed to a streamlined corpus of data), the SLM can better differentiate among multiple utterances accepted by the recognizer.
USING SPA IN THE CALL CENTER: MINING HELP-LINE CALLS FOR BUSINESS INTELLIGENCE DATA A caller needs a service call but, rather than using words in the application vocabulary such as “service call” or “technician,” this is what the frustrated caller says, either to the IVR-driven auto attendant at the help-line desk or to the human agent at the call center. Caller: “I really can’t do this myself. I can’t get this to work without someone coming here. I really don’t know what to do with this.”
Finding the Sequence Package in the Dialog Example The sequence package consists of a repeated use of pronouns (and similar unnamed referents), standing in place of nouns, in very close proximity: • a short, condensed complaint, referenced by pronouns (“I really can’t do this myself”) • the amplification of the source of the trouble (and the request for assistance) but with the frequent use of pronouns that have no stated subject/object referents (“I can’t get this to work without someone coming here”) • a recycling of the first part of the complaint with the same patterned use of pronouns in place of nouns (“I really don’t know what to do with this”)
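The three-part package above can be sketched in code. The following is only a minimal illustration, not the SPA algorithm itself (which the slides describe as still under development): it assumes a small hand-picked set of unnamed referents (`UNNAMED_REFERENTS`), a naive sentence splitter, and a hypothetical rule that three consecutive referent-bearing sentences form a package.

```python
import re

# Assumed list of unnamed referents; the slides do not enumerate one.
UNNAMED_REFERENTS = {"this", "that", "it", "someone", "something", "here", "there"}


def split_sentences(text):
    """Split on sentence-final punctuation (a deliberately naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def referent_count(sentence):
    """Count the tokens in a sentence that are unnamed referents."""
    normalized = sentence.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+", normalized)
    return sum(1 for token in tokens if token in UNNAMED_REFERENTS)


def find_pronoun_package(text, min_sentences=3):
    """Return the first run of at least `min_sentences` consecutive sentences
    that each contain an unnamed referent, or None if there is no such run."""
    run = []
    for sentence in split_sentences(text):
        if referent_count(sentence) > 0:
            run.append(sentence)
        else:
            if len(run) >= min_sentences:
                return run
            run = []
    return run if len(run) >= min_sentences else None


caller = ("I really can't do this myself. "
          "I can't get this to work without someone coming here. "
          "I really don't know what to do with this.")
package = find_pronoun_package(caller)
```

Under these assumptions the caller's three sentences are returned as one package, while an utterance that names its referents ("I need a technician.") yields no package. A real system would need a far richer notion of "unnamed referent" than a word list.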
FILTERING THE INPUT First, the SPA “filter” would direct the speech engine to the second part of the complaint utterance-- the amplification of the source of the trouble (and request for assistance): “I can’t get this to work without someone coming here” Second, rather than run the whole utterance through the SLM, only the second part of the complaint would be run through the SLM to find its closest statistical approximation. Third, once the closest word match is made to this second part of the complaint, the SLM would then add this “new” phrase to the application vocabulary.
MINING HELP-LINE CALLS FOR SIGNS OF CALLER FRUSTRATION • An SPA-driven mining program would look for conversational sequence patterns [instead of key words or changes in prosody] to detect signs of caller frustration. • While speakers vary widely in their choice of words and in their stress patterns [some speakers may increase their pitch when upset while others may not], their conversational sequence patterns, which are derived from the highly systematic properties that guide the production of talk, nevertheless remain consistent across a wide spectrum of callers.
Early Signs of Caller Frustration: Australian Help-Line Desk Caller: “I’ve installed Office 97 and…I was a bit stupid. I went into uninstall and um pulled off a whole stack of items off the uninstall and it was a very silly thing to do so now when I start up my computer I get a screen um which say um a black- a black and white screen which says never delete this item. It’s a message screen and every time I start up it comes up……[deleted text]……………………………………… Caller: “I’m wondering if I reinstall will I wipe out [my documents]” Agent: “Okay, well look I could certainly have a technician look at the problem for you; we do charge for are you aware of that?” Caller: “I’m just asking a question - I’m just wondering whether or not I should uninstall Microsoft Word?”
USING SPA TO LOCATE THE RELEVANT CONVERSATIONAL SEQUENCE PATTERNS Step One: Locate the pre-question phrases that preface reports of troubles and requests for assistance: “I’m wondering if” “I’m just asking a question” “I’m just wondering whether or not” Step Two: Quantify the number of times such pre-question phrases occur and their proximity to one another. Step Three: Determine whether they escalate or, alternatively, diminish.
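The three steps can be sketched as follows. This is a hedged illustration rather than the presenter's implementation: the regular expression covers only the three phrases quoted on the slide, proximity is measured as the gap in caller turns between phrase-bearing turns, and "escalation" is reduced to comparing the phrase count in the first and last phrase-bearing turns. The names `PRE_QUESTION` and `analyze_pre_questions` are hypothetical.

```python
import re

# Covers only the three pre-question phrases quoted on the slide.
PRE_QUESTION = re.compile(
    r"i'm just asking a question|i'm (?:just )?wondering (?:if|whether or not)"
)


def count_pre_questions(turn):
    """Step One: count pre-question phrases in a single caller turn."""
    normalized = turn.lower().replace("\u2019", "'")
    return len(PRE_QUESTION.findall(normalized))


def analyze_pre_questions(caller_turns):
    """Steps Two and Three: quantify frequency and proximity, then label the trend."""
    counts = [count_pre_questions(turn) for turn in caller_turns]
    hits = [i for i, count in enumerate(counts) if count > 0]
    # Proximity: number of caller turns between successive phrase-bearing turns.
    gaps = [later - earlier for earlier, later in zip(hits, hits[1:])]
    per_hit = [counts[i] for i in hits]
    if len(per_hit) < 2 or per_hit[-1] == per_hit[0]:
        trend = "steady"
    elif per_hit[-1] > per_hit[0]:
        trend = "escalating"
    else:
        trend = "diminishing"
    return {"counts": counts, "gaps": gaps, "trend": trend}


caller_turns = [
    "I've installed Office 97 and I was a bit stupid.",
    "I'm wondering if I reinstall will I wipe out my documents",
    "I'm just asking a question - I'm just wondering whether or not "
    "I should uninstall Microsoft Word?",
]
result = analyze_pre_questions(caller_turns)
```

On the Australian help-line excerpt this labels the trend as escalating, because the final caller turn combines two pre-question phrases where the earlier turn had one.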
ANALYSIS The caller to the Australian help-line began her report of the trouble as a long-winded narrative, but with the noticeable absence of a request for help. The caller later produced pre-question phrases when she made her request for help; however, these phrases began to escalate (by being combined with one another) just at the point where she began to show signs of frustration: “I’m just asking a question - I’m just wondering whether or not I should uninstall Microsoft Word?” As one can see, such conversational sequence patterns evolve within the dynamic flow of dialog. By applying an SPA approach, one can pinpoint these indigenous features of talk that evade standard speech recognizers.
MINING MEDICAL INTERVIEWS THE PROBLEM: • Patients often give very important medical history data about themselves and other family members at the wrong place in the medical encounter (such as at the very end of the medical interview or during a routine physical exam) when the doctor is less likely to be paying attention in that he has already gone over those areas with the patient. • When patients give medical information at the wrong place in the interview, the data can be lost because the doctor’s attention is now focused on other things.
MEDICAL INTERVIEWS The Solution: SPA locates specific conversational sequence patterns in which crucial medical history data is embedded. By locating those sequence package templates, important medical history data can be extracted--similar to the way business intelligence data can be extracted from help-line calls.
ILLUSTRATION Patient withholds vital family history data about osteosarcoma (bone cancer). Patient discloses this information at the point in the medical encounter (viz., during a brief medical exam) when discussions of family history data were no longer the main topic. Patient embeds this history data about bone cancer in the form of a narrative -- as if she were casually telling a “story” to a neighbor or friend-- presumably hoping that by downplaying its significance the doctor would give it much less attention than had she come out with it directly when queried about family illnesses.
DIALOG SAMPLE Patient: “I become terribly worried about my pain, which reminds me of the arthritic pain that my sister had, which turned out to be bone cancer, so I worry whenever I have pain because I don’t know if it is what she had.”
THE SEQUENCE PACKAGE TEMPLATE: A HIGH USAGE OF NARRATIVE PHRASES IN CLOSE PROXIMITY SEQUENCE PACKAGE DIVIDED INTO 4 PARTS: 1) a short, condensed and somewhat nonspecific concern preceded by a narrative phrase: “I become terribly worried about my pain” 2) an expansion of the concern, citing the troublesome datum (“bone cancer”), which is embedded within two narrative predicates: “which reminds me of the arthritic pain that my sister had, which turned out to be bone cancer”
SEQUENCE PACKAGE, CONT. 3) a recycling of the nonspecific concern preceded by a narrative phrase: “so I worry whenever I have pain” 4) a reference back to the expanded concern, but only with the use of pronouns that serve as anaphors: “because I don’t know if it is what she had”
EXTRACTING MEDICAL HISTORY DATA BY USING SPA The SPA “filter” would direct the speech engine to search for specific content material embedded within the two narrative predicates, appearing in the second part of the four-part sequence package (“which reminds me of…which turned out to be...”). By searching the sequence package templates, the mining program uncovers important family history data (arthritic pain, ultimately diagnosed as bone cancer) that the patient buried in the interview by using an informal narrative style, replete with anaphors and nonspecific referents, and by offering this family history data AFTER the physician had already completed his review of family medical history.
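A minimal sketch of this extraction step, assuming the recognizer has already produced a text transcript: one regular expression keyed to the two narrative predicates quoted on the slide pulls out the material embedded between and after them. The pattern and the names `NARRATIVE_PREDICATES` and `extract_history` are illustrative assumptions; a generic SPA template would have to cover many more narrative phrasings.

```python
import re

# Keyed to the two narrative predicates quoted on the slide; illustrative only.
NARRATIVE_PREDICATES = re.compile(
    r"which reminds me of (?P<reminder>.+?),? "
    r"which turned out to be (?P<outcome>[^,.;]+)",
    re.IGNORECASE,
)


def extract_history(utterance):
    """Return the content embedded in the two narrative predicates, or None."""
    match = NARRATIVE_PREDICATES.search(utterance)
    if match is None:
        return None
    return {
        "reminder": match.group("reminder").strip(),
        "outcome": match.group("outcome").strip(),
    }


patient = ("I become terribly worried about my pain, which reminds me of the "
           "arthritic pain that my sister had, which turned out to be bone cancer, "
           "so I worry whenever I have pain because I don't know if it is what she had.")
history = extract_history(patient)
```

For the dialog sample, `history` holds the reported symptom ("the arthritic pain that my sister had") and the eventual diagnosis ("bone cancer"), which are the two data points the slide identifies as buried family history.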
Mining Wiretapped Communications The following example shows how by applying an SPA approach to wiretapped dialog, one can flag important security information that is cleverly disguised by the suspects:
ILLUSTRATION Speaker “A” is trying to educate Speaker “B” about a new meeting place whose location is very important. Any confusion or misunderstanding about this meeting place could spoil the plans. But Speaker “A” is very clever: First, he stays away from buzz words (such as naming a bridge, a tunnel or a street). Second, he refrains from making any comments about how vital it is to get these instructions right.
Dialog Example Speaker “A”: Come to the intersection near Juniors? (the question mark shows an upward intonation) 0.2 - 0.5 second pause (speaker then pauses briefly) Speaker “B”: 1.2 second pause Speaker “A”: You know the thoroughfare with the big traffic light? Speaker “B”: Juniors, yeah.
THE SEQUENCE PACKAGE Speaker “A”: Come to the intersection near Juniors? 0.2-0.5 Speaker “B”: 1.2 seconds of silence • A noun referent (“Juniors”) with an upward intonation • A brief pause, giving the listener the chance to show recognition or ask for clarification. • Silence by the listener which indicates lack of understanding or confusion.
SEQUENCE PACKAGE CONT. Speaker “A”: You know the thoroughfare with the big traffic light? Speaker “B”: Juniors, yeah. • Speaker “A” produces a clarification of the noun referent (“Juniors”) (“You know the thoroughfare with...”) • Speaker “B” produces a repeat of the noun referent (“Juniors”), the source of the recognition trouble, followed by a recognitional marker (“Yeah”), which demonstrates to Speaker “A” that he has “corrected” the misunderstanding. But had he simply produced a recognitional marker (“yeah”) without mentioning the source of the trouble (“Juniors”), there would be no indication to the other speaker that he now recognizes the importance of this meeting place.
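The four-turn package described on these two slides can be sketched as a pattern over timed turns. This is an illustration under several assumptions that are not in the slides: turns arrive already segmented with silence durations, rising intonation is approximated by a trailing question mark in the transcript, a listener silence of at least 1.0 second counts as a sign of trouble, and the recognitional markers come from a short hypothetical list.

```python
import re
from dataclasses import dataclass

# Hypothetical marker list; the slide names only "yeah".
RECOGNITION_MARKERS = {"yeah", "yes", "right", "okay", "ok"}


@dataclass
class Turn:
    speaker: str
    text: str             # "" models a turn in which the speaker stays silent
    silence: float = 0.0  # seconds of silence when text is empty


def words(text):
    """Extract alphabetic word tokens from a transcript string."""
    return re.findall(r"[A-Za-z']+", text)


def find_clarified_referent(turns, min_silence=1.0):
    """Return the referent the listener repeats after a clarification, or None.

    Pattern: try-marked referent -> listener silence -> clarification by the
    same speaker -> listener repeats the referent plus a recognitional marker.
    """
    for a1, b1, a2, b2 in zip(turns, turns[1:], turns[2:], turns[3:]):
        try_marked = a1.text.rstrip().endswith("?")
        listener_silent = b1.text == "" and b1.silence >= min_silence
        same_pair = (a2.speaker == a1.speaker and b2.speaker == b1.speaker
                     and a1.speaker != b1.speaker)
        if not (try_marked and listener_silent and same_pair and a2.text):
            continue
        offered = {w.lower() for w in words(a1.text)}
        reply = words(b2.text)
        has_marker = any(w.lower() in RECOGNITION_MARKERS for w in reply)
        repeats = [w for w in reply
                   if w.lower() in offered and w.lower() not in RECOGNITION_MARKERS]
        if has_marker and repeats:
            return repeats[0]
    return None


dialog = [
    Turn("A", "Come to the intersection near Juniors?"),
    Turn("B", "", silence=1.2),
    Turn("A", "You know the thoroughfare with the big traffic light?"),
    Turn("B", "Juniors, yeah."),
]
referent = find_clarified_referent(dialog)
```

With the dialog example, the sketch flags "Juniors" as the referent whose recognition was repaired. If Speaker "B" had answered with a bare "Yeah.", no referent would be returned, mirroring the slide's point that the repeat is what signals recognition.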
CODA SPA provides a new NLU method for designing intelligent software packages that can serve as a “filter” for the front end of a speech recognizer. Since the SPA templates are generic, they can be deployed in many different applications and across many languages to do the following: 1) extract business intelligence data from call center recordings; 2) detect early warning signs of caller frustration in a help-line call; 3) uncover important medical history data buried in the medical interview; and 4) learn the plans and operations of suspected terrorists.