NLP • Natural language processing • Combines AI and linguistics • A component in HCI in that we want a more “human” ability to communicate with computers • Primarily revolves around • NLU – natural language understanding • NLG – natural language generation • But also encompasses • Automated summarization/classification of articles and information extraction • Machine translation • Question answering • Information retrieval • And to lesser extents, speech recognition and optical character recognition
Understanding Language • NLU is not merely a matter of mapping words to meanings; instead we also need • stemming/morphological segmentation • part of speech (POS) tagging (identifying the grammatical role of a given word) • syntactic parsing • word sense disambiguation (WSD) (identifying the proper meaning for the words in a sentence) • named entity recognition • identifying the underlying meaning of a sentence • applying context (previous sentences) to the understanding of the current sentence • resolving references within and across sentences • applying worldly knowledge (discourse, pragmatics) • representing the meaning in some useful/operable way
NLU Problems • Words are ambiguous (different grammatical roles, different meanings) • Sentences can be vague because of the use of references (“it happened again”) and the assumption of worldly knowledge (“can you answer the phone” is not meant as a yes/no question) • The same statement can have different meanings • “where is the water?” • to a plumber, we might be referring to a leak • to a chemist, we might be referring to pure water • to a thirsty person, we might be referring to potable water • An NLU reasoner may never be complete • new words are added, words take on new meanings, new expressions are created (e.g., “my bad”, “snap”) • There are many ways to convey one meaning
Fun Headlines • Hospitals are Sued by 7 Foot Doctors • Astronaut Takes Blame for Gas in Spacecraft • New Study of Obesity Looks for Larger Test Group • Chef Throws His Heart into Helping Feed Needy • Include your Children when Baking Cookies • Iraqi Head Seeks Arms • Juvenile Court to Try Shooting Defendant • Kids Make Nutritious Snacks • British Left Waffles on Falkland Islands • Red Tape Holds Up New Bridges • Clinton Wins on Budget, but More Lies Ahead • Ban on Nude Dancing on Governor’s Desk
Ways to Not Solve This Problem • Simple machine translation • we do not want to perform a one-to-one mapping of words in a sentence to components of a representation • this approach was tried in the 1960s with language translation from Russian to English • “the spirit is willing but the flesh is weak” → “the vodka is good but the meat is rotten” • “out of sight, out of mind” → “blind idiot” • Use dictionary meanings • we cannot derive a meaning by just combining the dictionary meanings of words together • similar to the above, concentrating on individual word translation or meaning is not the same as full statement understanding
What Is Needed to Solve the Problem • Since language is (so far) only used between humans, language use can take advantage of the large amounts of knowledge that any person might have • thus, to solve NLU, we need access to a great deal and a wide variety of knowledge • Language understanding includes recognizing many forms of patterns • combining prefix, suffix and root into words • identifying grammatical categories for words • identifying proper meanings for words • identifying references from previous messages • identifying worldly context (pragmatics) • Language use implies intention • we must also be able to identify the message's context, and communication is often intention-based • “do you know what time it is?” should not be answered with yes or no
Restricted Domains • Early attempts at NLU limited the dialog to a specific domain with a reduced vocabulary and syntactic structures • LUNAR – a front end to a database on lunar rocks • SABRE – reservation system (uses a speech recognition front end and a database backend) • SHRDLU – a blocks world system that permitted NLU input for commands and questions • what is sitting on the red block? • what shape is the blue block on the table? • place the green pyramid on the red brick • Aside from the reduced complexity of a limited vocabulary/syntax, we can also derive a useful representation for the domain • in general though, what representation do we use?
MARGIE, SAM & PAM • In the 70s, Roger Schank presented his Conceptual Dependency (CD) theory as an underlying representation for language • MARGIE would input words and build a structure from each sentence • this structure is composed almost entirely from keywords and not from syntax, pulling up case frames from a case grammar for a given word (we cover case grammars later) • the structure would give MARGIE a memory for question answering and prediction • SAM would map sentences onto underlying scripts for storage, to reason about typical actions and activities • PAM would take input from a story and store the input sentences as CDs to reason over the story plot and characters
NLU Through Mapping • The typical NLU solution is to map from primitive components of language up through worldly knowledge (similar to speech recognition mappings) • prosody – intonation/rhythm of an utterance • phonology – identifying and combining speech sounds into phonemes/syllables/words • morphology – breaking words into root, prefix and suffix • syntax – identifying grammatical roles of words and grammatical categories of clauses • semantics – applying or identifying meaning for each word, each phrase, the sentence, and beyond • discourse/pragmatics – taking into account references, types of speech, speech acts, beliefs, etc • world knowledge – understanding the statement within the context of the domain • the first two only apply to speech recognition • There are many approaches to each mapping
Morphology • In many languages, we can gain knowledge about a word by looking at the prefix and suffix attached to the root; for instance, in English: • an ‘s’ usually indicates plural, which suggests the word is a noun • adding ‘-ed’ makes a verb past tense, so words ending in ‘ed’ are often verbs • we add ‘-ing’ to verbs • we add prefixes such as de-, non-, im- or in- to words • Many other languages have similar clues to a word's POS through the prefix/suffix • Morphology by itself is insufficient to tag a word's POS; rather, it provides additional clues for both POS tagging and the semantic analysis of the word
Morphological Analysis • Two basic processes • stemming – breaking the word down to its root by simple removal of a prefix or suffix • this may not be easy, as some words contain letters that look like a prefix or suffix but are not, such as defense (de- is not a prefix here) as opposed to decrease or dehumidify • often used when the suffix/prefix is not needed, such as in a keyword search where only the root is desired • lemmatization – obtaining the root (known as the lemma) of the word through a more proper morphological analysis of the word (combined with knowledge of the vocabulary) • is, are, am → be • There are many approaches for both processes
Approaches • Dictionary lookup – store all word stems, prefixes and suffixes • most words have no more than 4 prefix/suffixes so our dictionary is not increased by more than a factor of 4 • Translate the vocabulary (including history of words) into a finite-state transducer • follow the FST to a terminal node based on matching letters • Hand-coded rules • which will combine stems + prefix/suffixes with the location of the word in a sentence (its POS, surrounding words) • Statistical approaches (trained using supervised learning)
Syntactic Analysis • Given a sentence, our first task is to determine the grammatical roles of each word in the sentence • alternatively, we want to identify whether the sentence is syntactically correct or incorrect • The process is one of parsing the sentence and breaking the components into categories and subcategories • “The big red ball bounced high” • break this into a noun phrase and a verb phrase, break the noun phrase into article, adjective(s), noun, etc • generate a parse tree of the parse • Syntactic parsing is computationally complex because words can take on multiple roles • we generally tackle this problem in a bottom-up manner (start with the words), but an alternative is top-down, where we start with the grammar and use it to generate the sentence • both forms will result in our parse tree
POS Tagging • Before we do a syntactic parse, we must identify the grammatical role (POS) of each word • As many words can take on multiple roles, we need to use some form of context to fully identify each word's POS • for instance, “can” has roles as • a noun (a container) • an auxiliary verb (as in “I can climb that mountain”) • a verb (as in “We canned the sliced peaches”) • it can also be part of a proper noun (the dance the can-can) • how about the sentence: “We can can the can” • we might generate all possible combinations of tags and select the best (most logical) grouping, or use some statistical or rule-based means
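As a quick illustration of how much context matters, NLTK's off-the-shelf tagger can be run on the sentence above; this sketch assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' resources are available, and the exact tags may vary by tagger version:

import nltk

tokens = nltk.word_tokenize("We can can the can")
print(nltk.pos_tag(tokens))
# Ideally each occurrence of "can" receives a different tag:
# modal (MD), verb (VB), and noun (NN); the tagger decides from context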
Rule-based POS • The oldest approach and possibly the one that will lead to the most accurate results • But also the approach that takes the most effort • Start with a dictionary • Specify rules based on an analysis of linguistic features of the given word • a rule will contain conditions to test surrounding words (context) to determine a word’s POS • some rules may be word-specific, others can be generic • example: if a word can be a noun or verb and follows “the” then select noun • Rules can also be modified or learned using supervised learning techniques
POS by Transformation (Brill tagging) • First, an initial POS selection is made for each word in the sentence (perhaps the most common role for a given word, or the first one in the list, or even a random one) • Second, transformational rules are applied to correct tags that fail some conditional test, such as “change from modal to noun if the previous tag is a determiner” • we can also pre-tag any word that has 100% certainty (e.g., a word that has only 1 grammatical role, or a word already tagged by a human) • Supervised learning can be used to derive or expand on the rules (otherwise, rules must be hand-coded as with the rule-based approach); unsupervised learning can also be applied, but this leads to reduced accuracy • aside from high accuracy, this approach does not risk overfitting the data as stochastic/HMM approaches might, and its output is easier to interpret than that of stochastic/HMM approaches
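A minimal sketch of the transformational idea, with a hypothetical hand-coded rule and a made-up most-frequent-tag dictionary; this illustrates the mechanism only and is not NLTK's actual BrillTagger:

# Hypothetical most-frequent tags used for the initial assignment
MOST_FREQUENT_TAG = {"the": "DET", "can": "MD", "we": "PRON"}

def correct(tags):
    # Apply transformation rules of the form:
    # change tag A to tag B when some contextual test holds
    for i in range(1, len(tags)):
        word, tag = tags[i]
        prev_tag = tags[i - 1][1]
        # Rule: change modal to noun if the previous tag is a determiner
        if tag == "MD" and prev_tag == "DET":
            tags[i] = (word, "NN")
    return tags

words = ["we", "can", "can", "the", "can"]
initial = [(w, MOST_FREQUENT_TAG.get(w, "NN")) for w in words]
print(correct(initial))
# Only the last "can" is corrected to NN here; a real Brill tagger
# would apply many learned rules in sequence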
Statistical POS • The most common POS approach is the HMM • the model consists of possible sequences of grammatical components for the given word sequence • e.g., Det – Adj – N – Aux – V – Prep – Det – Adj – N • A corpus is used to train the transition probabilities • we are interested in sequences longer than trigrams because grammatical context can carry well beyond just 3 words, but most HMM approaches are limited to bigrams and trigrams • Emission probabilities are the likelihood that a word will take on a particular role • these probabilities must be generated through supervised learning (a marked-up corpus) • although unsupervised learning may also provide reasonable probabilities using the E-M algorithm • Notice the independence assumption – the HMM does not take into account longer-range decisions such as two nouns in a row or 10 adjectives in a row
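A compact Viterbi decoder illustrates HMM decoding with bigram transitions; the transition and emission probabilities below are made up for illustration, whereas a real tagger would estimate them from a marked-up corpus:

TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}                 # P(tag at sentence start)
TRANS = {"N": {"N": 0.2, "V": 0.8},          # P(tag_i | tag_i-1)
         "V": {"N": 0.7, "V": 0.3}}
EMIT = {"N": {"mary": 0.9, "runs": 0.1},     # P(word | tag)
        "V": {"mary": 0.1, "runs": 0.9}}

def viterbi(words):
    # best[tag] = (probability, best tag sequence ending in tag)
    best = {t: (START[t] * EMIT[t].get(words[0], 1e-6), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            # choose the predecessor tag s that maximizes the path probability
            p, path = max((best[s][0] * TRANS[s][t], best[s][1]) for s in TAGS)
            new_best[t] = (p * EMIT[t].get(word, 1e-6), path + [t])
        best = new_best
    return max(best.values())

prob, tags = viterbi(["mary", "runs"])
print(tags, prob)   # expected: ['N', 'V'] with probability 0.4536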
Maximum Entropy POS • The drawbacks of the HMM (no context in terms of the history of POSs in the sentence, and only 2-3 word transition probabilities) can be resolved by adding a history • the ME approach adds context • Templates (much like the transformational approach with rules) are predefined, and discriminating features are learned • the approach is to compute the most probable path through the best-matching templates (maximizing the “entropy”) under some constraints • features for a template typically consider a window of up to 2 words behind and 2 words ahead of the current word
Other Supervised Learning Methods • SVMs – train an SVM for a given grammatical role, use a collection of SVMs and vote; resistant to overfitting, unlike stochastic approaches • NNs/Perceptrons – require less training data than HMMs and are computationally fast • Nearest-neighbor on trained classifiers • Fuzzy set taggers – use fuzzy membership functions for the POS of a given word and a series of rules to compute the most likely tags for the words • Ontologies – we can draw information from ontologies to provide clues (such as word origins or unusual uses of a word) – useful when our knowledge is incomplete or some words are unknown (not in the dictionary/rules/HMM)
Unsupervised POS • This is the most challenging approach because we must learn grammatical roles with little or no marked up data sets • But this is the most convenient because marking up a data set for supervised learning is very time consuming • One approach is to use a small marked up data set for initial training and then bootstrap up through unsupervised training that clusters around the concepts learned in the marked up data • Approaches include neural networks, rule induction, data mining-like clustering, decision trees and Bayesian approaches
Syntactic Parsing • With all words tagged, we then put together the sentence structure through parsing • even if POS tagging has selected one tag per word, we still have variability in the parse of the sentence • consider: The man saw the boy with a telescope • the prepositional phrase “with a telescope” could modify “saw” (how the man saw the boy) or “the boy” (he saw the boy who has or owns a telescope) • Put the block in the box on the table • does “in the box” modify “the block” (move the block that is in the box onto the table), or is the destination “the box on the table”? • As with POS tagging, there are many approaches to parsing • the result of parsing is the grouping of words into larger constituent groups; these groups are hierarchical, so this forms a parse tree
Parse Tree Example • A parse tree for a simple sentence is shown to the left • notice how the NP category can appear in multiple places • similarly, an NP or a VP might contain a PP, which itself will contain an NP • our parsing algorithm must accommodate this with recursion
Context-free Grammars (CFG) • A formal way to describe the syntactic forms of legal sentences in a language • a CFG is defined as G = (N, Σ, R, S), where S is the start symbol and R is a set of production rules that map nonterminal symbols (N) into other nonterminal symbols and terminal symbols (Σ) • rules are “rewrite rules” that rewrite a more general set of symbols into a more specific set • for instance, NP → Det Adj* Noun and Det → the | a(n) • A parse tree for a sentence denotes the mappings of the nonterminal symbols, through the rules selected, into nonterminal/terminal symbols • CFGs can be used to build parsers (to perform syntactic analysis and derive parse trees)
Parsing by Dynamic Programming • Also known as chart parsing, which can be top-down or bottom-up in nature depending on the order of “prediction” and “scan” • Parse left to right through the sentence by selecting a rule in the grammar that matches the current word's POS • Apply the rule and keep track of where we are with a dot (initial, middle, end/complete) • the chart is a data structure, a simple table that is filled in as processing occurs, using dynamic programming • The chart parsing algorithm consists of three parts: • prediction: select a rule whose LHS matches the current state; this triggers a new row in the chart • scan: advance the dot through the rule, matching it against the sentence to see if we are using an appropriate rule • completion: once we reach the end of a rule, we complete the given row and return recursively
Example • Simple example of the sentence “Mary runs” • Processing through the grammar: • S → . N V predict: N V • N → . mary predict: mary • N → mary . scanned: mary • S → N . V completed: N; predict: V • V → . runs predict: runs • V → runs . scanned: runs • S → N V . completed: V, completed: S • The chart: • S0: [($ → . S) start, (S → . Noun Verb) predictor] • S1: [(Noun → mary .) scanner, (S → Noun . Verb) completer] • S2: [(Verb → runs .) scanner, (S → Noun Verb .) completer, ($ → S .) completer]
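The same “Mary runs” example can be reproduced with NLTK's Earley chart parser, which prints a predictor/scanner/completer trace much like the chart above; this sketch assumes nltk is installed:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> N V
    N -> 'mary'
    V -> 'runs'
""")
parser = nltk.parse.EarleyChartParser(grammar, trace=1)
for tree in parser.parse(["mary", "runs"]):
    print(tree)   # (S (N mary) (V runs))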
Parsing by TNs • A transition network is a finite-state automaton whose edges are grammatical classifications • A recursive transition network (RTN) is the same, but its networks can invoke one another recursively • we use the RTN because of the recursive nature of natural languages • Given a grammar, we can automatically generate an RTN by just “unfolding” rules that have the same LHS non-terminal into a single graph • Use the RTN by starting with a sentence and following the edge that matches the grammatical role of the current word in our parse – this is bottom-up parsing • we have a successful parse if we reach a state that is a terminating state • since we traverse the RTN recursively, if we get stuck in a dead end, we have to backtrack and try another route
Example Grammar and RTN S → NP VP S → NP Aux VP NP → NP1 Adv | Adv NP1 NP1 → Det N | Det Adj N | Pron | that S N → Noun | Noun Rrel etc…
RTN Output • The parse tree below shows the decomposition of a sentence S (John hit the ball) into constituents, and those constituents into further constituents, until we reach the leaves (words) • the actual output of an RTN parser is a nested chain of constituents and words, generated from the recursive descent through the chart parsing or RTN [S [NP (N John)] [VP [V hit] [NP (Det the) (N ball)]]]
Augmented Transition Networks • The RTN only provides the constituent hierarchy, but while parsing we could potentially obtain useful information for semantic analysis • we can augment each of the RTN links with code that annotates constituent elements with more information, such as • is the NP plural? • what is the verb's tense? • what might a reference refer to? • We use objects to describe each word, where objects have additional variables to denote singular/plural, tense, root (lemma), possibly prefix/suffix information, etc (see the next slide) • This is an ATN, which makes the transition to semantic analysis somewhat easier
ATN Example • Each word is tagged by the ATN to include its part of speech (lowest level constituent) along with other information, perhaps obtained through morphological analysis
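A sketch of the kind of annotated word object an ATN might build; the field names here are hypothetical, since an actual ATN implementation would define its own registers:

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordConstituent:
    word: str                      # surface form as it appeared in the sentence
    pos: str                       # part of speech (lowest-level constituent)
    root: str                      # lemma obtained from morphological analysis
    number: Optional[str] = None   # "singular" / "plural" for nouns
    tense: Optional[str] = None    # e.g., "past" for verbs

# e.g., annotations for words of "John hit the ball":
hit = WordConstituent(word="hit", pos="V", root="hit", tense="past")
ball = WordConstituent(word="ball", pos="N", root="ball", number="singular")
print(hit, ball)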
Statistical Parsing • The parsing model consists of two parts • a generative component to generate the constituent layout for the sentence • this may create a parse tree or a dependency structure, where we annotate words with the dependencies between them (these are functional roles, which are not quite the same as grammatical roles) • an evaluative component to rank the output of the generative component • the evaluator may or may not provide probabilities, but all it needs to do is rank all of the outputs
Probabilistic CFGs • Here, the generator is a CFG as we had with the non-probabilistic approaches • The evaluator computes a likelihood for each CFG rule, and the probability of each possible parse of the sentence is merely the product of the probabilities of the rules applied to produce that parse • We need training data to acquire the probabilities for each rule (although unsupervised training is also possible, it is less accurate) • This approach assumes the independence of rules, which is not true, and so accuracy can suffer
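A minimal PCFG sketch in NLTK: each production carries a probability (made up here; normally estimated from a treebank), and the Viterbi parser returns the most probable parse:

import nltk

grammar = nltk.PCFG.fromstring("""
    S -> NP VP      [1.0]
    NP -> 'John'    [0.5]
    NP -> Det N     [0.5]
    VP -> V NP      [1.0]
    Det -> 'the'    [1.0]
    N -> 'ball'     [1.0]
    V -> 'hit'      [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("John hit the ball".split()):
    print(tree, tree.prob())   # parse probability = product of rule probabilities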
History-Based Model • Here, we compute the probability of a sequence of tags • This uses a history that includes the probability of a sequence of tags both before the given word and after • Example: Last week Marks bought Brooks • Note that this approach does not constrain the generative model to fit an actual grammar (that is, sentences do not have to be grammatically correct) • We will need a great deal of data to compute the needed probabilities, and since it is unlikely that we will have such a detailed set, we will have to smooth the data that we do have to fit this model
PCFG Transformations • The main drawback of PCFGs is not retaining a history • We can annotate a generated constituent by specifying which non-terminal led to this particular item, such as • NP^S (an NP produced by S → NP …) • NP^VP (an NP produced by VP → Aux V NP) • notice that we are not encoding the whole history like the previous approach, but we also need far fewer probabilities here (bigrams) • Another approach is to provide more finely detailed grammar rules (have more categories) whereby each rule only maps to 1 or 2 terminals/non-terminals on the right-hand side • again limiting the number of probabilities by using bi- and trigrams
Discriminative Models • Here we compute the probability of an entire parse as a whole, that is, P(parse | words) • a local model attempts to find the most probable parse by concentrating on the most probable parse of each word (thus, a local discrimination) • this uses a history-based approach like those we just discussed, except that we do not need exact (or precise) probabilities since we are only ranking our choices • a global model computes the probability of each parse and selects the largest one • this approach has the advantage of being able to incorporate any type of language feature we like, such as specialized rules that are not available in the dataset • the biggest disadvantage is its computational complexity • With this approach, we are not tying together the generative and evaluative models, so we can use any form of generative model
Semantic Analysis • Now that we have parsed the sentence, how do we ascribe a meaning to the sentence? • the first step is to determine the meaning of each word (WSD) • next, we attempt to combine the word meanings into some representation that conveys the meaning of the utterance • this second step is made easier if our target representation is a command such as a database query (as found in LUNAR) or an OS command (as found in DWIM – do what I mean) • in general though, this becomes very challenging • what form of representation should the sentence be stored in? • how do we disambiguate when words have multiple meanings? • how do we handle references to previous sentences? • what if the sentence should not be taken literally?
Word Sense Disambiguation • Even if we have a successful parse, a word might have multiple meanings that could alter our interpretation • Consider the word tank • a vat/container (noun) • a transparent receptacle for fish (noun) • a vehicle/weapon (noun) • a jail cell (noun) • to fill (as in a car tank) (verb) • to drink to the point of being drunk (verb) • to lose a game on purpose (verb) • We also have idiomatic meanings when applied with other words like tank top and think tank
Semantic Grammars • In a restricted domain with a restricted grammar, we might combine syntactic parsing with the words in the lexicon • this allows us not only to find the grammatical roles of the words but also their meanings • the RHS of our rules could be the target representations rather than an intermediate representation like a parse • S → I want to ACTION OBJECT | ACTION OBJECT | please ACTION OBJECT • ACTION → print | save | … • print → lp • OBJECT → filename | programname | … • filename → get_lexical_name( ) • This approach is not useful in the general NLU case
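A toy sketch in the spirit of the rules above, where “parsing” directly produces the OS command; the command mapping and simple-minded tokenization are illustrative assumptions:

# Hypothetical ACTION rules mapping domain verbs to OS commands
ACTIONS = {"print": "lp", "save": "cp"}

def interpret(sentence):
    words = sentence.lower().replace("please", "").split()
    # S -> (i want to)? ACTION OBJECT -- strip the optional preamble
    words = [w for w in words if w not in ("i", "want", "to")]
    action, obj = words[0], words[1]
    return f"{ACTIONS[action]} {obj}"

print(interpret("please print myfile.txt"))   # -> "lp myfile.txt"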
Word Sense Disambiguation • We need to seek clues in the use of the word to help figure out its word sense • Consider the word plant as a noun: manufacturing/processing versus life form • since both are nouns, knowing the POS is not enough • an adjective may or may not help: nuclear plant versus tropical plant – but if the nuclear plant is in the tropics, then we might be misled • on the other hand, knowing that one sense of the word means a living thing while the other sense is a structure (building) or is used in manufacturing or processing might help • How much of the sentence (and preceding sentences) might we need to examine before we obtain this word's sense?
Features for Word Sense Disambiguation • To determine a word's sense, we look at the word's POS, the surrounding words' POS, and what those words are • Statistical analysis can help tie a word to a meaning • “pesticide” immediately preceding plant indicates a processing/manufacturing plant, but “pesticide” anywhere else in the sentence would primarily indicate a life-form plant • The word “open” on either side of plant (within a few words) is equally probable for either sense of the word plant • the window size for comparison is usually the same sentence, although it has been shown that context up to 10,000 words away can still impact another word! • For “pen”, a trigram analysis might help; for instance, “in the pen” would suggest the child's structure, while “with the pen” would probably indicate the writing utensil
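A small sketch of extracting window and trigram features for a target word; the window size and feature layout are illustrative choices:

def wsd_features(tokens, target="plant", window=2):
    feats = []
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # surrounding words within +/- window, excluding the target itself
            context = [t for j, t in enumerate(tokens[lo:hi], lo) if j != i]
            # trigram ending at the target, e.g., "in the pen"
            trigram = " ".join(tokens[max(0, i - 2):i + 1])
            feats.append({"context": context, "trigram": trigram})
    return feats

print(wsd_features("the pesticide plant opened today".split()))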
Rule-based/Frame-based • We could encode for every word its possible senses by means of rules that help interpret the surrounding words • One early attempt (1975) was to provide a flowchart of rules for each of the 1815 words in the system's vocabulary, which included looking for clues among morphology, collocations, POS and exact word matches • Or we could annotate the words with frames that list expectations • e.g., plant: living thing: expect to see words about living things; structure: expect to see words about structures • we could annotate words with semantic subject codes (EC for economic/finance, AU for automotive)
Semantic Markers • One approach is through semantic markers • we affix such markers to nouns only and then look at verbs and other words to determine the proper sense • Example: I will meet you at the diamond • diamond can be • an abstract object (the geometric shape) • a physical object (a gem stone, usually small) • a location (a baseball diamond) • look for clues in the sentence that we are referring to an abstract object, physical object, or location • the phrase “at the” indicates a location • this could be erroneous; we might be talking about meeting up at the exhibit of a large diamond at some museum
Case Grammars • Rather than tying the semantics to the nouns of the sentence, we tie roles to the verb and then look to fill in those roles with the words in the sentence • for instance, does this verb have an agent? an object? an instrument? • to open: [Object (Instrument) (Agent)] • when something is opened, we expect to know what was opened (a door, a jar, a window, a bank vault) and possibly how it was opened (with a door knob, with a stick of dynamite) and possibly who opened it (the bank robber, the wind, etc) • semantic analysis becomes a problem of filling in the blanks – finding which word(s) in the sentence should fill the Object or Instrument or Agent roles
Case Grammar Roles • Agent – instigator of the action • Instrument – cause of the event or object used in the event (typically inanimate) • Dative – entity affected by the action (typically animate) • Factitive – object or being resulting from the event • Locative – place of the event • Source – place from which something moves • Goal – place to which something moves • Beneficiary – being on whose behalf the event occurred (typically animate) • Time – time the event occurred • Object – entity acted upon or that is changed • To kill: [agent instrument (object) (dative) {locative time}] • To run: [agent (locative) (time) (source) (goal)] • To want: [agent object (beneficiary)]
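A toy case-frame filler for the “to open” frame above; the role-candidate input is assumed to come from a prior syntactic/semantic analysis step, and the filling heuristics are illustrative:

# to open: [Object (Instrument) (Agent)] -- required vs. optional roles
CASE_FRAMES = {
    "open": {"required": ["object"], "optional": ["instrument", "agent"]},
}

def fill_frame(verb, candidates):
    # candidates: role -> word mappings proposed by earlier analysis
    frame = CASE_FRAMES[verb]
    filled = {}
    for role in frame["required"]:
        if role not in candidates:
            raise ValueError(f"'{verb}' needs a filler for role '{role}'")
        filled[role] = candidates[role]
    for role in frame["optional"]:
        if role in candidates:
            filled[role] = candidates[role]
    return filled

# "The robber opened the vault with dynamite"
print(fill_frame("open", {"agent": "robber", "object": "vault",
                          "instrument": "dynamite"}))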
Probabilistic WSD • All of the previous approaches required that humans create the rules/templates/annotations • this is both time consuming and impractical if we want to cover all of English or hand annotate tens of thousands of sentences • Instead, we could learn the rules or do probabilistic reasoning (e.g., HMM trained through bigrams or trigrams) • for a supervised form of learning, we will probably need hundreds to thousands of annotated sentences (we need enough data to handle the variability we will find in the differing word senses) • We can also use partially supervised or unsupervised approaches
Supervised WSD • Given a corpus where words are annotated with appropriate features, we could use • Decision trees and decision lists (a decision list is a binary function that seeks out features of a class in the data and returns whether the data fits the class or not; we could provide such a function for each word sense) • Naïve Bayesian classifiers • Nearest neighbor • SVMs (with boosting in many cases) • When there is plenty of data, simpler approaches (e.g., the naïve Bayesian classifier) are often successful; when there are highly discriminative features, decision trees/lists work well; when data is sparse, SVMs and boosting tend to work best
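A minimal supervised WSD sketch with scikit-learn: a naïve Bayesian classifier over bag-of-context-words features for the two senses of plant (the tiny training set is fabricated purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

contexts = ["the pesticide plant closed",              # factory sense
            "workers at the plant went on strike",     # factory sense
            "water the plant daily",                   # life-form sense
            "the tropical plant needs sunlight"]       # life-form sense
senses = ["factory", "factory", "lifeform", "lifeform"]

vec = CountVectorizer()
X = vec.fit_transform(contexts)          # bag-of-words context features
clf = MultinomialNB().fit(X, senses)

print(clf.predict(vec.transform(["the nuclear plant was inspected"])))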
Other Approaches • There may not be sufficient annotated data for supervised training • Lightly (or minimally) supervised training can be used, where we have an enumerated list of word-sense memberships (rather than fully annotated sentences) or where other sources of knowledge (e.g., WordNet) provide class information • a variation is known as iterative bootstrapping, where a small hand-annotated collection of data is used for training, and then untrained data is annotated via what has been learned to enlarge the training data, adding annotations/corrections as needed • Unsupervised clustering can also be applied to determine, for a given word, various possible senses of that word (what it doesn't do is necessarily define those senses)