This project aims to develop a man-machine interface that combines analysts' reasoning and linguistic reasoning to extract relevant facts from reports. It explores a phrase-based approach to Natural Language Processing and the use of Conceptual Models for inference. The goal is to support analysts in quickly analyzing large amounts of structured data.
ACITA 12 demo outline v0
International Technology Alliance in Network & Information Sciences
Dr David Mott (IBM UK)
Supporting the "Analyst"
• The analyst does not have time to read all the reports
[Diagram labels: reports (doc27), Requirements, Assumptions, NLP, Analyst's Conceptual Model, CE Facts, Product, Linked data web, Reference data, Inference, Rationale, Query, other data, Uncertainty, Argumentation, CE Tools, Structured data]
Purpose
• To demonstrate our current capability in Fact Extraction using CE
• To demonstrate the cycle of creating new analyst concepts and inferring high-value information
• To explore how a man-machine interface may be built based on CE, rationale, and a mixture of analyst reasoning and linguistic reasoning
• [To explore potential benefits and costs of a phrase-based approach to NLP (as opposed to textual patterns).]
Approach - technology
[Diagram labels: Analyst's Conceptual Model (including rules), CE, Linked data web, Reference data, Product, other data]
Approach
Not just "write some linguistic rules and see what happens"
• Select a small set of sentences and see what could be inferred from them:
  • Conceptual model
  • Analyst rules to perform forensic analysis
  • Linguistic rules to support extraction of relevant facts from reports
• List what type of information would be needed to support this in the general case
  • Verbnet, …
• Design an approach to generalise this example
  • Drives the next year's research
• Run this general approach on all sentences and see what happens
  • Do we need to change the set of sample sentences?
Scenario - input
• Use SYNCOIN reports
  • These include call monitoring and reported texts of the conversations
• A SYNCOIN report may contain several individual sentences
'02/24/10 - RT: 2345hrs -- (Delayed report) - Cell phone call monitored on 02/23/10; ET: 0957hrs from an unidentified male (7001408055) in Rashid to an unidentified male (7678112233) in Amin-Habib. The call came immediately following an IED attack of U.S. convoy on Airport Road. The two parties were arguing, the caller stated: "Your team is a failure, we cannot operate this way." The recipient replied: "The materials must have been defective, the design was perfect."'
• Mixture of grammatical and informal text, including specific abbreviations
Scenario – Analyst’s task
• Initially we analyse text for information on conversations
  • Should be useful to find who talks to whom
• Then the analyst has an idea for a new concept:
  • “replay conversation”, where A talks to B, then B talks to C
  • Might be useful to track conversations and the flow of information?
• The analyst creates the concept and the system analyses SYNCOIN data
  • Some replay messages are detected and shown to the analyst for manual inspection
  • This also leads to an interesting “Rosetta Stone” effect of decoding some informal message codes (see below)
• Earlier reports now have increased forensic value
  • The system helps the analyst locate earlier reports that previously went unnoticed and decodes their meaning in the light of this knowledge
The “carpet code”
• Conversation1:
  • '02/24/10 - RT: 2345hrs -- (Delayed report) - Cell phone call monitored on 02/23/10; ET: 0957hrs from an unidentified male (7001408055) in Rashid to an unidentified male (7678112233) in Amin-Habib. The call came immediately following an IED attack of U.S. convoy on Airport Road. The two parties were arguing, the caller stated: "Your team is a failure, we cannot operate this way." The recipient replied: "The materials must have been defective, the design was perfect."'
• Conversation2:
  • '02/24/10 - Cell call is monitored between unknown caller (7678112233) in Amin to Amir Mahallati (7115452376) in Bayaa. The unidentified caller stated: "The team is a failure! The carpet doesn't match! The carpet maker needs to be replaced." The recipient said: "The measurements were perfect, the installers must have failed."'
• Common person/phone number: (7678112233)
• Carpet = IED
• Measurements = design
• The importance of an earlier message was missed at the time; in the light of the code it now increases:
  • '02/01/10 - ET: 0345hrs -- Cell phone call monitored between an unidentified male (7678112233) in Amin-Hibib, Iraq //MGRSCOORD: 38S ND 13 05// and an unidentified male (7115452376) in Bayaa //MGRSCOORD: 38S MB 38 81//. The caller stated: "Start buying carpets for the house like we discussed." The call lasted 10 seconds.'
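The "common person/phone number" link between the two conversations can be illustrated with a minimal sketch. It assumes that a phone number always appears as ten digits in parentheses, as in the reports quoted above; the function names are hypothetical and this is not the demo's CE-based implementation.

```python
import re

# Assumption: a phone number is written as 10 digits in parentheses,
# e.g. "(7678112233)", as in the SYNCOIN excerpts quoted on the slide.
PHONE_TAG = re.compile(r"\((\d{10})\)")


def phone_numbers(report: str) -> set:
    """Return the set of parenthesised 10-digit numbers found in a report."""
    return set(PHONE_TAG.findall(report))


def common_numbers(report_a: str, report_b: str) -> set:
    """Return the phone numbers that occur in both reports."""
    return phone_numbers(report_a) & phone_numbers(report_b)


conversation1 = ("Cell phone call monitored on 02/23/10; ET: 0957hrs from an "
                 "unidentified male (7001408055) in Rashid to an unidentified "
                 "male (7678112233) in Amin-Habib.")
conversation2 = ("Cell call is monitored between unknown caller (7678112233) "
                 "in Amin to Amir Mahallati (7115452376) in Bayaa.")

shared = common_numbers(conversation1, conversation2)
print(shared)
```

Matching on the raw number is only a first filter; whether a shared number really implies a shared person is the identity question raised later in the deck.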
Conceptual Model
• Call monitoring:
  • date …
• Call
  • Type: cell phone
  • sender, recipient, text, date, length
  • sequencing of utterances and assignment to speakers
  • importance of call
• Phone numbers
  • Location
  • association with people
• Replay conversation
  • call1, call2, middle person
  • similarity relationship between texts
• Code links – links between two words that express a code?
  • Suitable “expresses” are needed as well
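For illustration only, the call and replay-conversation concepts could be sketched as plain data types. The demo itself expresses the model in CE; the class and field names below are assumptions derived from the slide's attribute list, and the dates assume that report dates are written MM/DD/YY.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Utterance:
    speaker: str  # phone number of the speaker (assignment to speakers)
    text: str


@dataclass(frozen=True)
class Call:
    call_date: date
    sender: str                              # caller's phone number
    recipient: str                           # recipient's phone number
    utterances: Tuple[Utterance, ...] = ()   # kept in spoken order
    call_type: str = "cell phone"
    length_seconds: Optional[int] = None


@dataclass(frozen=True)
class ReplayConversation:
    call1: Call
    call2: Call

    @property
    def middle_person(self) -> str:
        # The middle person receives call1 and then places call2.
        return self.call1.recipient


call1 = Call(
    date(2010, 2, 23), "7001408055", "7678112233",
    (Utterance("7001408055", "Your team is a failure, we cannot operate this way."),
     Utterance("7678112233", "The materials must have been defective, the design was perfect.")),
)
call2 = Call(date(2010, 2, 24), "7678112233", "7115452376")
replay = ReplayConversation(call1, call2)
print(replay.middle_person)
```

Locations, call importance, text similarity and code links are left out of the sketch to keep it short.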
Preprocessing
• Principle: turn ungrammatical conventions into an equivalent, correct grammatical phrase that the parser can handle
  • e.g. remove “;”
• Specific patterns for “identity tagging”:
  • “( XXX )” -> “tagged as XXX”
• Dates, e.g.:
  • “02/23/10; ET: 0957hrs ” -> “, estimated time 02/23/10 t 0957hrs,” (the best form is still to be determined)
• MGRS
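The rewriting patterns above can be sketched with regular expressions. This is a minimal sketch, not the demo's preprocessor: it assumes that identity tags are purely numeric (so that "(Delayed report)" is left alone), it uses the slide's provisional date wording, and it does not attempt MGRS handling. The names `DATE_ET`, `ID_TAG` and `preprocess` are hypothetical.

```python
import re

# "02/23/10; ET: 0957hrs " -> ", estimated time 02/23/10 t 0957hrs, "
DATE_ET = re.compile(r"\s*(\d{2}/\d{2}/\d{2});\s*ET:\s*(\d{4})hrs\s*")

# "( XXX )" -> "tagged as XXX"; assumption: XXX is all digits (a phone number)
ID_TAG = re.compile(r"\(\s*(\d+)\s*\)")


def preprocess(text: str) -> str:
    """Rewrite report conventions into phrases a parser can handle."""
    # Dates first, because the date pattern relies on the ";" separator.
    text = DATE_ET.sub(r", estimated time \1 t \2hrs, ", text)
    text = ID_TAG.sub(r"tagged as \1", text)
    # Finally remove any remaining ";".
    return text.replace(";", "")


raw = ("Cell phone call monitored on 02/23/10; ET: 0957hrs from an "
       "unidentified male (7001408055) in Rashid")
cleaned = preprocess(raw)
print(cleaned)
```

The order of the substitutions matters: removing ";" first would stop the date pattern from matching.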
New general linguistic processing
• Handle passive sentences
  • Is/was/were + past participle = passive
• Identity tagging
  • Use “tagged as” patterns constructed in preprocessing to infer “sameas” links between things with the same tag
  • Are these valid inferences of identity from other info (e.g. cell no.)?
• Dialog context
  • Includes a set of entities already encountered
  • Assume all sentences in a SYNCOIN report are in the same dialog context, and that no dialog context carries across reports
  • All “stands for” things are added to the report dialog context
• Anaphoric references are de-referenced by searching in the dialog context:
  • “the call” (via the last thing of that type mentioned)
  • “it” (via the last thing mentioned)
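The dialog context and the tag-based identity inference can be sketched as follows. This is a simplified illustration under stated assumptions: entities carry a single type label, "the X" resolves to the most recent entity of type X, "it" resolves to the most recent entity of any type, and every shared tag yields a candidate "sameas" group. All class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    name: str
    kind: str                   # e.g. "call", "attack", "male"
    tag: Optional[str] = None   # identity tag from "tagged as XXX"


@dataclass
class DialogContext:
    """Entities encountered so far within one report."""
    entities: List[Entity] = field(default_factory=list)

    def mention(self, entity: Entity) -> Entity:
        self.entities.append(entity)
        return entity

    def resolve(self, phrase: str) -> Optional[Entity]:
        words = phrase.lower().split()
        if words == ["it"]:
            # "it": the last thing mentioned
            return self.entities[-1] if self.entities else None
        if len(words) == 2 and words[0] == "the":
            # "the call": the last thing of that type mentioned
            for entity in reversed(self.entities):
                if entity.kind == words[1]:
                    return entity
        return None


def same_as_links(entities: List[Entity]) -> Dict[str, List[str]]:
    """Group entity names that share an identity tag (candidate "sameas")."""
    by_tag = defaultdict(list)
    for entity in entities:
        if entity.tag is not None:
            by_tag[entity.tag].append(entity.name)
    return {tag: names for tag, names in by_tag.items() if len(names) > 1}


# One context per report; contexts are not shared across reports.
report1 = DialogContext()
call = report1.mention(Entity("call-1", "call"))
report1.mention(Entity("recipient-1", "male", tag="7678112233"))
attack = report1.mention(Entity("attack-1", "attack"))

report2 = DialogContext()
report2.mention(Entity("caller-2", "caller", tag="7678112233"))

links = same_as_links(report1.entities + report2.entities)
print(report1.resolve("The call").name, report1.resolve("it").name, links)
```

Anaphora stays local to a report, while the "sameas" grouping runs over entities from all reports; that is what lets a phone number connect separate conversations.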
Domain specific steps - system
• Generate call entities and call monitor situations
• (Optional) domain specific information to filter out unreasonable “same as” from identity tags
• Use additional sentences in the report, together with anaphoric reference via the dialog context, to add the text to the calls
• Rules to detect replay conversations
• (Optional) detect increased relevance of a replay conversation, e.g. “immediately following …”
• (Optional) check the text of replay conversations to detect any obvious similarities that increase the likelihood of a replay conversation
• (Optional) perform further inference on how information is passed across a network of people
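The replay rule (A talks to B, then B talks to C) can be sketched as a pairwise check over extracted calls. In the demo this would be an analyst rule over CE facts; the Python below is only an illustration. It assumes calls are compared by date alone (the time of day is ignored), that C must differ from A, and that report dates are written MM/DD/YY. The names are hypothetical.

```python
from datetime import date
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    call_date: date
    sender: str
    recipient: str


def find_replay_conversations(calls: List[Call]) -> List[Tuple[Call, Call]]:
    """Return (call1, call2) pairs where call1's recipient later places call2."""
    replays = []
    for first, second in permutations(calls, 2):
        if (first.recipient == second.sender              # B is the middle person
                and first.call_date <= second.call_date   # call1 is not later than call2
                and second.recipient != first.sender):    # C is not A (assumption)
            replays.append((first, second))
    return replays


calls = [
    Call(date(2010, 2, 1), "7678112233", "7115452376"),
    Call(date(2010, 2, 23), "7001408055", "7678112233"),
    Call(date(2010, 2, 24), "7678112233", "7115452376"),
]
replays = find_replay_conversations(calls)
for first, second in replays:
    print(first.sender, "->", first.recipient, "->", second.recipient)
```

On the three calls from the scenario, only the 02/23/10 and 02/24/10 calls form a replay pair; the 02/01/10 call gains relevance later through the code words rather than through this rule.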
Domain specific steps - GUI
• Display replay conversations to the user with the texts aligned where possible
• Allow the user to review the rationale for treating the conversation as a replay (linguistic and analyst reasoning)
  • Why were the calls, monitors and their participants constructed?
  • Why the “same as” inferences?
  • Why the text similarity?
  • Why is it a replay conversation?
  • Why is it high value?
• Allow the user to accept/deny replay conversations
• (Optional) Show links between people established by replay conversations
• Allow the user to add new code word links and establish new search criteria on reports
• Allow the user to review the significance of previous reports on the basis of the new code keywords
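The last two steps, adding code word links and re-reviewing earlier reports, can be sketched as a keyword search. This is a hedged illustration: the simple-plural matching and the function name `flag_reports` are assumptions, and a real system would work on the extracted facts rather than on raw text.

```python
import re
from typing import Dict, List, Tuple


def flag_reports(reports: List[str],
                 code_links: Dict[str, str]) -> List[Tuple[str, Dict[str, str]]]:
    """Return the reports that mention a known code word, with the decoded meanings."""
    flagged = []
    for report in reports:
        hits = {}
        for code_word, meaning in code_links.items():
            # Match the code word with an optional plural "s" ("carpet", "carpets").
            pattern = rf"\b{re.escape(code_word)}s?\b"
            if re.search(pattern, report, re.IGNORECASE):
                hits[code_word] = meaning
        if hits:
            flagged.append((report, hits))
    return flagged


# Code word links as decoded on the "carpet code" slide.
code_links = {"carpet": "IED", "measurements": "design"}

earlier_reports = [
    'The caller stated: "Start buying carpets for the house like we discussed." '
    'The call lasted 10 seconds.',
    'The caller stated: "Your team is a failure, we cannot operate this way."',
]
flagged = flag_reports(earlier_reports, code_links)
for report, hits in flagged:
    print(hits, "in:", report)
```

Only the 02/01/10 "carpets" report is flagged, which matches the slide's point that an earlier message increases in importance once the code is known.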