NLP: Grand Challenges
The Ultimate Goal – For computers to use NL as effectively as humans do…
Reading and writing text • Abstracting • Monitoring • Extraction into Databases
Interactive Dialogue: Natural, effective access to computer systems • Informal Speech Input and Output
Translation: Input and Output in Multiple Languages
The Past: Human Machine Interfaces • SHRDLU (Winograd, 1969) • A fragile demonstration of the fundamental vision • PUNDIT (Hirschman, Palmer, … 1989) • Information Extraction from Real World texts
SHRDLU: A demonstration proof
Person: PICK UP A BIG RED BLOCK.
Computer: OK. (does it)
Person: GRASP THE PYRAMID.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Computer: OK. (does it)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
The Past: Information Extraction PUNDIT (Palmer et al. 1987) Sample CASREP & automatic summary FAILURE OF ONE OF TWO SACS. UNIT HAD LOW OUTPUT AIR PRESSURE. RESULTED IN SLOW GAS TURBINE START. TROUBLESHOOTING REVEALED NORMAL SAC LUBE OIL PRESSURE AND TEMPERATURE. EROSION OF IMPELLOR BLADE TIP EVIDENT. CAUSE OF EROSION OF IMPELLOR BLADE UNDETERMINED. NEW SAC RECEIVED.
The Past: Crucial flaws in the paradigm These systems worked well, BUT • Usually, only for a small set of examples • Person-years of work to port to new applications and, often, to extend coverage on a single application • Very limited and inconsistent coverage of English
An Early Robust Statistical NLP Application • A Statistical Model For Etymology • Church, K.W. (1985) "Stress assignment in letter to sound rules for speech synthesis", Proceedings of the 23rd Annual Meeting (University of Chicago) [text to speech; phonetics] • Determining etymology is crucial for text-to-speech
An Early Robust Statistical NLP Application • Etymology can be determined reasonably accurately from statistics computed over letter sequences: trigrams!
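The letter-trigram idea can be sketched in a few lines. This is only an illustration under stated assumptions, not Church's actual model: the toy word lists, the add-one smoothing, and the function names (`letter_trigrams`, `train`, `log_score`, `guess_origin`) are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical toy training data: a few loanwords per language of origin.
TRAINING_WORDS = {
    "Italian": ["spaghetti", "piazza", "cappuccino", "gnocchi", "zucchini"],
    "Japanese": ["karate", "sushi", "tsunami", "kimono", "sakura"],
}

def letter_trigrams(word):
    """Return the letter trigrams of a word, padded with '#' boundary markers."""
    padded = "##" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words_by_origin):
    """Count letter trigrams separately for each language of origin."""
    return {
        origin: Counter(t for w in words for t in letter_trigrams(w))
        for origin, words in words_by_origin.items()
    }

def log_score(word, counts, vocab_size):
    """Add-one smoothed log-probability of the word's trigrams under one model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size))
        for t in letter_trigrams(word)
    )

def guess_origin(word, models):
    """Pick the origin whose trigram statistics give the word the highest score."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen trigrams
    return max(models, key=lambda o: log_score(word, models[o], vocab_size))

models = train(TRAINING_WORDS)
print(guess_origin("spaghetti", models))  # Italian
print(guess_origin("karate", models))     # Japanese
```

With realistic amounts of training data the same scoring can be applied to unseen words; with a toy list this small, guesses for new words should not be trusted.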
A Central Challenge: Extracting Meaning • Text or speech → ??Meaning Extractor?? → Meaning
Literal vs. Implicit Meaning
“The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea, a Pakistani government official said Monday… The transfers were made during the late 1980s and in the early and mid 1990s, and were motivated by "personal greed and ambition," an official said.”
• Q: Whose greed? Q: Whose ambition?
• Cognitive beings automatically combine literal meaning with world knowledge to see implicit meaning
• Understanding this involves inferring implicit meaning
• Recent NLP has focused on robust extraction of shallow, literal meaning
Levels of Representation Full Semantics Explicit Semantics Syntax Words Morphology
Word Unigram Representation
The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea, a Pakistani government official said Monday. Khan made the confession in a written statement submitted "a couple of days ago" to investigators probing allegations of nuclear proliferation by Pakistan, the official told The Associated Press on condition of anonymity. The transfers were made during the late 1980s and in the early and mid 1990s, and were motivated by "personal greed and ambition," the official said. The official said the transfers were not authorized by the government.
Word Bigram Representation
The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea, a Pakistani government official said Monday. Khan made the confession in a written statement submitted "a couple of days ago" to investigators probing allegations of nuclear proliferation by Pakistan, the official told The Associated Press on condition of anonymity. The transfers were made during the late 1980s and in the early and mid 1990s, and were motivated by "personal greed and ambition," the official said. The official said the transfers were not authorized by the government.
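As a rough sketch of what unigram and bigram representations look like in practice, the snippet below counts both for the last sentence of the passage. The regex tokenizer and the helper names (`tokenize`, `ngrams`) are simplifying assumptions for illustration; a real system would use a proper tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "The official said the transfers were not authorized by the government."
tokens = tokenize(sentence)

unigrams = Counter(ngrams(tokens, 1))  # bag of words: order is discarded
bigrams = Counter(ngrams(tokens, 2))   # adjacent pairs: keeps local order

print(unigrams[("the",)])             # 3
print(bigrams[("the", "official")])   # 1
```

The unigram counts lose all word order, while the bigram counts retain just enough local context to distinguish "the official" from "the government".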
Syntax Representation: Treebank
• TreeBank includes • Part of speech • Syntactic structure
[Figure: parse tree. Leaves: The • founder • of • Pakistan’s • nuclear program • Abdul Qadeer Khan • has • admitted • he • transferred • nuclear technology • to • Iran, • Libya, • and • North Korea. Constituent labels: NP, PP, VP, SBAR, S]
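Treebank annotation is commonly stored as bracketed trees that carry both pieces of information listed above. The sketch below reads one such bracketing and pulls out the part-of-speech layer. The bracketed string is a hand-written, simplified illustration for a fragment of the sentence, not the actual Treebank annotation, and `parse_bracketed` and `tagged_words` are hypothetical helper names.

```python
def parse_bracketed(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def tagged_words(tree):
    """Return (word, part-of-speech) pairs, assuming each word sits alone under its tag."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs

# Illustrative bracketing only; labels follow Penn Treebank conventions.
example = ("(S (NP (PRP he)) (VP (VBD transferred) "
           "(NP (JJ nuclear) (NN technology)) (PP (TO to) (NP (NNP Iran)))))")
tree = parse_bracketed(example)
print(tree[0])             # S
print(tagged_words(tree))
```

The nested tuples hold the syntactic structure, and `tagged_words` recovers the part-of-speech layer from the same annotation.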
1995: A breakthrough in parsing
10^6 words of Treebank Annotation + Machine Learning = Robust Parsers
[Figure: training sentences and their answers (Treebank trees) go into a Training Program, which produces Models; the Parser uses the Models to turn a new sentence, e.g. "The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea", into Trees]
• 1990 Best hand-built parsers: ~40-60% accuracy (guess)
• 1995+ Statistical parsers: ~90% accuracy
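To make the "Treebank annotation + machine learning = models" step concrete, here is a deliberately simplified sketch: it estimates the rule probabilities of a probabilistic context-free grammar by relative frequency from a tiny hand-made treebank. The statistical parsers the slide refers to used richer models than this; the two toy trees and the names `rules` and `estimate_pcfg` are assumptions made for the example.

```python
from collections import Counter

# Toy treebank (hypothetical). A tree is (label, word) for a tagged word,
# or (label, [subtrees]) for a larger constituent.
TREEBANK = [
    ("S", [("NP", [("PRP", "he")]),
           ("VP", [("VBD", "transferred"),
                   ("NP", [("NN", "technology")])])]),
    ("S", [("NP", [("DT", "the"), ("NN", "official")]),
           ("VP", [("VBD", "spoke")])]),
]

def rules(tree):
    """Yield every rewrite rule (lhs, rhs) used in a tree."""
    label, children = tree
    if isinstance(children, str):      # tag -> word
        yield (label, (children,))
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

probs = estimate_pcfg(TREEBANK)
print(probs[("S", ("NP", "VP"))])   # 1.0
print(probs[("VP", ("VBD",))])      # 0.5
```

The "training program" here is just counting; the resulting table of probabilities plays the role of the Models box, which a parser would then use to rank candidate trees for new sentences.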
Rich Linguistic Representations + Powerful Machine Learning = Robust, Effective NLP 1970s, ’80s: Focus on Linguistic Representations 1990s, early 2000s: Focus on Machine Learning Recently: New work combining the two