Learning for Semantic Parsing of Natural Language. Raymond J. Mooney, Ruifang Ge, Rohit Kate, Yuk Wah Wong, John Zelle, Cynthia Thompson. December 19, 2005.
Syntactic Natural Language Learning • Most computational research in natural-language learning has addressed “low-level” syntactic processing. • Morphology (e.g. past-tense generation) • Part-of-speech tagging • Shallow syntactic parsing (chunking) • Syntactic parsing
Semantic Natural Language Learning • Learning for semantic analysis has been restricted to relatively “shallow” meaning representations. • Word sense disambiguation (e.g. SENSEVAL) • Semantic role assignment (determining agent, patient, instrument, etc., e.g. FrameNet, PropBank) • Information extraction
Semantic Parsing • A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: a logical form or meaning representation (MR). • For many applications, the desired output is immediately executable by another program. • Two application domains: • CLang: RoboCup Coach Language • GeoQuery: A Database Query Application
CLang: RoboCup Coach Language • In the RoboCup Coach competition, teams compete to coach simulated players. • The coaching instructions are given in a formal language called CLang. [Figure: a coach instructs players on a simulated soccer field; semantic parsing maps the coach's NL instruction to CLang.] NL: If the ball is in our penalty area, then all our players except player 4 should stay in our half. CLang: ((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))
GeoQuery: A Database Query Application • Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996] [Figure: semantic parsing maps the user's question to a database query.] NL: How many cities are there in the US? Query: answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))),A))
Learning Semantic Parsers • Manually programming robust semantic parsers is difficult due to the complexity of the task. • Semantic parsers can be learned automatically from sentences paired with their logical form. [Figure: NL→LF training examples are fed to a semantic-parser learner, which produces a semantic parser that maps natural language to logical form.]
Engineering Motivation • Most computational language-learning research strives for broad coverage while sacrificing depth. • “Scaling up by dumbing down” • Realistic semantic parsing currently entails domain dependence. • Domain-dependent natural-language interfaces have a large potential market. • Learning makes developing specific applications more tractable. • Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.
Cognitive Science Motivation • Most natural-language learning methods require supervised training data that is not available to a child. • General lack of negative feedback on grammar. • No POS-tagged or treebank data. • Assuming a child can infer the likely meaning of an utterance from context, NL→LF pairs are more cognitively plausible training data.
Our Semantic-Parser Learners • CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003) • Separates parser-learning and semantic-lexicon learning. • Learns a deterministic parser using ILP techniques. • COCKTAIL (Tang & Mooney, 2001) • Improved ILP algorithm for CHILL. • SILT (Kate, Wong & Mooney, 2005) • Learns symbolic transformation rules for mapping directly from NL to LF. • SCISSOR (Ge & Mooney, 2005) • Integrates semantic interpretation into Collins’ statistical syntactic parser. • WASP (Wong & Mooney, in preparation) • Uses syntax-based statistical machine translation methods. • KRISP (Kate & Mooney, in preparation) • Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.
SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations • Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000] • A statistical parser is used to generate a semantically augmented parse tree (SAPT) • Augment Collins’ head-driven model 2 to incorporate semantic labels • Translate SAPT into a complete formal meaning representation (MR) [Figure: SAPT for "our player 2 has the ball": S-bowner dominates NP-player (PRP$-team "our", NN-player "player", CD-unum "2") and VP-bowner (VB-bowner "has", NP-null (DT-null "the", NN-null "ball")).] MR: bowner(player(our,2))
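Overview of SCISSOR [Figure: Training: SAPT training examples are given to a learner, which produces the integrated semantic parser. Testing: the integrated semantic parser maps an NL sentence to an SAPT, and ComposeMR converts the SAPT into an MR.]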
SCISSOR SAPT Parser Implementation • Semantic labels added to Bikel’s (2004) open-source version of the Collins statistical parser. • Head-driven derivation of production rules augmented to also generate semantic labels. • Parameter estimates during training employ an augmented smoothing technique to account for the additional data sparsity created by semantic labels. • Parsing of test sentences to find the most probable SAPT is performed using a standard beam-search-constrained version of the CKY chart-parsing algorithm.
ComposeMR [Figure, step 1: the SAPT for "our player 2 has the ball" shown with semantic labels only: bowner at the root, player over (team "our", player "player", unum "2"), and bowner over (bowner "has", null over (null "the", null "ball")).]
ComposeMR [Figure, step 2: predicate labels are expanded into templates with open argument slots: bowner(_) and player(_,_).]
ComposeMR [Figure, step 3: slots are filled bottom-up according to the argument types player(team,unum) and bowner(player): the subject NP yields player(our,2) and the root yields bowner(player(our,2)).]
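The bottom-up composition pictured in the ComposeMR slides can be sketched in Python. This is a minimal illustration under stated assumptions, not the SCISSOR implementation: the `Node` and `MR` classes, the `ARITY` table, and the rule that a node's head meaning takes the other non-null child meanings as arguments in left-to-right order are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed predicate arities, read off the slide: bowner(player), player(team,unum).
ARITY = {"bowner": 1, "player": 2}

@dataclass
class MR:
    pred: str
    args: list  # each entry is None (open slot), a constant string, or another MR

    def __str__(self):
        rendered = ["_" if a is None else str(a) for a in self.args]
        return f"{self.pred}({','.join(rendered)})"

@dataclass
class Node:
    label: str                      # semantic label of the SAPT node
    word: Optional[str] = None      # set for leaves only
    children: List["Node"] = field(default_factory=list)

def compose(node):
    """Return the meaning of a SAPT node, or None for semantically empty nodes."""
    if not node.children:
        if node.label == "null":
            return None
        if node.label in ARITY:                      # predicate with open slots
            return MR(node.label, [None] * ARITY[node.label])
        return node.word                             # constant such as "our" or "2"
    meanings = [m for m in map(compose, node.children) if m is not None]
    # The head is the child meaning that carries the parent's own label.
    head = next((m for m in meanings
                 if isinstance(m, MR) and m.pred == node.label and None in m.args), None)
    if head is None:
        return meanings[0] if meanings else None
    for arg in meanings:
        if arg is not head and None in head.args:
            head.args[head.args.index(None)] = arg   # fill the next open slot
    return head

tree = Node("bowner", children=[
    Node("player", children=[Node("team", "our"), Node("player", "player"), Node("unum", "2")]),
    Node("bowner", children=[
        Node("bowner", "has"),
        Node("null", children=[Node("null", "the"), Node("null", "ball")]),
    ]),
])
result = str(compose(tree))
print(result)
```

On the example tree this prints the MR shown on the SCISSOR slide. A real system would also need to check argument types rather than relying on child order.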
WASP: A Machine Translation Approach to Semantic Parsing • Based on a semantic grammar of the natural language. • Uses machine translation techniques • Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005) • Word alignments (Brown et al., 1993; Och & Ney, 2003) • Hence the name: Word Alignment-based Semantic Parsing
Synchronous Context-Free Grammars (SCFG) • Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase • Generates a pair of strings in a single derivation
Compiling, Machine Translation, and Semantic Parsing • SCFG: formal language to formal language (compiling) • Alignment models: natural language to natural language (machine translation) • WASP: natural language to formal language (semantic parsing)
Context-Free Semantic Grammar • QUERY → What is CITY • CITY → the capital CITY • CITY → of STATE • STATE → Ohio [Figure: parse tree of "What is the capital of Ohio" under these productions.]
Productions of Synchronous Context-Free Grammars • Referred to as transformation rules in Kate, Wong & Mooney (2005) • QUERY → What is CITY / answer(CITY), where "What is CITY" is the pattern and "answer(CITY)" is the template
Synchronous Context-Free Grammars • QUERY → What is CITY / answer(CITY) • CITY → the capital CITY / capital(CITY) • CITY → of STATE / loc_2(STATE) • STATE → Ohio / stateid('ohio') [Figure: paired derivation trees rooted at QUERY, one yielding the NL string and the other the MR.] NL: What is the capital of Ohio MR: answer(capital(loc_2(stateid('ohio'))))
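To make the "pair of strings in a single derivation" idea concrete, here is a small Python sketch that applies the four rules from the slide to an NL string and an MR string in lockstep. It is a simplification: the `derive` and `rewrite` helpers are hypothetical, and linked non-terminals are matched simply by name, which works only because each rule here has at most one non-terminal on its right-hand side.

```python
# Each rule: (left-hand side, NL pattern tokens, MR template tokens).
RULES = [
    ("QUERY", ["What", "is", "CITY"], ["answer", "(", "CITY", ")"]),
    ("CITY", ["the", "capital", "CITY"], ["capital", "(", "CITY", ")"]),
    ("CITY", ["of", "STATE"], ["loc_2", "(", "STATE", ")"]),
    ("STATE", ["Ohio"], ["stateid", "(", "'ohio'", ")"]),
]

def rewrite(tokens, lhs, rhs):
    """Replace the leftmost occurrence of non-terminal lhs with rhs."""
    i = tokens.index(lhs)
    return tokens[:i] + rhs + tokens[i + 1:]

def derive(rules, start="QUERY"):
    """Apply the rules in order to both sides of the synchronous derivation."""
    nl, mr = [start], [start]
    for lhs, pattern, template in rules:
        nl = rewrite(nl, lhs, pattern)      # rewrite the NL side
        mr = rewrite(mr, lhs, template)     # rewrite the MR side with the same rule
    return " ".join(nl), "".join(mr)

sentence, meaning = derive(RULES)
print(sentence)
print(meaning)
```

Because both sides are rewritten by the same rule at every step, the sentence and its MR are produced together rather than one being translated from the other after the fact.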
Parsing Model of WASP • N (non-terminals) = {QUERY, CITY, STATE, …} • S (start symbol) = QUERY • Tm (MRL terminals) = {answer, capital, loc_2, (, ), …} • Tn (NL words) = {What, is, the, capital, of, Ohio, …} • L (lexicon) = {QUERY → What is CITY / answer(CITY); CITY → the capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio')} • λ (parameters of probabilistic model) = ?
Probabilistic Parsing Model • Derivation d1 of "capital of Ohio": CITY → capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio'), yielding capital(loc_2(stateid('ohio')))
Probabilistic Parsing Model • Derivation d2 of "capital of Ohio": CITY → capital CITY / capital(CITY); CITY → of RIVER / loc_2(RIVER); RIVER → Ohio / riverid('ohio'), yielding capital(loc_2(riverid('ohio')))
Probabilistic Parsing Model • Rule weights λ: CITY → capital CITY / capital(CITY): 0.5; CITY → of STATE / loc_2(STATE): 0.3; STATE → Ohio / stateid('ohio'): 0.5; CITY → of RIVER / loc_2(RIVER): 0.05; RIVER → Ohio / riverid('ohio'): 0.5 • Pr(d1 | capital of Ohio) = exp(0.5 + 0.3 + 0.5) / Z = exp(1.3) / Z • Pr(d2 | capital of Ohio) = exp(0.5 + 0.05 + 0.5) / Z = exp(1.05) / Z, where Z is the normalization constant
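The two scores on the slide, exp(1.3)/Z and exp(1.05)/Z, can be reproduced with a short Python sketch. Two assumptions are made here and are not stated in the slides: the per-rule weights are assigned as inferred from the totals (0.3 for the "of STATE" rule, 0.05 for the "of RIVER" rule, 0.5 for the rest), and Z sums over just these two derivations.

```python
import math
from collections import Counter

# Assumed rule weights (lambda), inferred from the totals 1.3 and 1.05 on the slide.
WEIGHTS = {
    "CITY -> capital CITY / capital(CITY)": 0.5,
    "CITY -> of STATE / loc_2(STATE)": 0.3,
    "STATE -> Ohio / stateid('ohio')": 0.5,
    "CITY -> of RIVER / loc_2(RIVER)": 0.05,
    "RIVER -> Ohio / riverid('ohio')": 0.5,
}

d1 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of STATE / loc_2(STATE)",
      "STATE -> Ohio / stateid('ohio')"]
d2 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of RIVER / loc_2(RIVER)",
      "RIVER -> Ohio / riverid('ohio')"]

def score(derivation):
    """Sum of lambda_i * f_i(d), where f_i(d) counts uses of rule i in d."""
    counts = Counter(derivation)
    return sum(WEIGHTS[rule] * n for rule, n in counts.items())

def derivation_probabilities(derivations):
    """Normalize exp(score) over the candidate derivations (assumed to be all of them)."""
    unnormalized = [math.exp(score(d)) for d in derivations]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

p1, p2 = derivation_probabilities([d1, d2])
print(round(score(d1), 2), round(score(d2), 2))
print(round(p1, 3), round(p2, 3))
```

Under these assumptions the state reading of "Ohio" is preferred over the river reading, which is the point the slide makes.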
Parsing Model of WASP • N (non-terminals) = {QUERY, CITY, STATE, …} • S (start symbol) = QUERY • Tm (MRL terminals) = {answer, capital, loc_2, (, ), …} • Tn (NL words) = {What, is, the, capital, of, Ohio, …} • L (lexicon) = {QUERY → What is CITY / answer(CITY); CITY → the capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio')} • λ (parameters of probabilistic model)
Overview of WASP [Figure: Training: the training set {(e, f)} and an unambiguous CFG of the MRL feed lexical acquisition, which produces the lexicon L; parameter estimation then yields a parsing model parameterized by λ. Testing: semantic parsing maps an input sentence e' to an output MR f'.]
Lexical Acquisition • Transformation rules are extracted from word alignments between an NL sentence, e, and its correct MR, f, for each training example, (e, f)
Word Alignments • A mapping from French words to their meanings expressed in English [Figure: word alignment between "Le programme a été mis en application" and "And the program has been implemented".]
Lexical Acquisition • Train a statistical word alignment model (IBM Model 5) on the training set • Obtain the most probable n-to-1 word alignments for each training example • Extract transformation rules from these word alignments • Lexicon L consists of all extracted transformation rules
Word Alignment for Semantic Parsing • How to introduce syntactic tokens such as parens? [Figure: the words of "The goalie should always stay in our half" aligned with the MR tokens "( ( true ) ( do our { 1 } ( pos ( half our ) ) ) )".]
Use of MRL Grammar [Figure: the words of "The goalie should always stay in our half" are aligned n-to-1 with the productions in the top-down, left-most derivation of the MR under an unambiguous CFG: RULE → (CONDITION DIRECTIVE); CONDITION → (true); DIRECTIVE → (do TEAM {UNUM} ACTION); TEAM → our; UNUM → 1; ACTION → (pos REGION); REGION → (half TEAM); TEAM → our.]
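The production sequence listed on this slide fully determines the MR, which is why words can be aligned with productions instead of raw MR tokens. The Python sketch below replays that top-down, left-most derivation; the `leftmost_derivation` helper and the whitespace tokenization of the right-hand sides are assumptions made for illustration.

```python
# Productions in the order they appear in the left-most derivation on the slide.
PRODUCTIONS = [
    ("RULE", "( CONDITION DIRECTIVE )"),
    ("CONDITION", "( true )"),
    ("DIRECTIVE", "( do TEAM { UNUM } ACTION )"),
    ("TEAM", "our"),
    ("UNUM", "1"),
    ("ACTION", "( pos REGION )"),
    ("REGION", "( half TEAM )"),
    ("TEAM", "our"),
]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}

def leftmost_derivation(productions, start="RULE"):
    """Expand the leftmost non-terminal with each production in turn."""
    tokens = [start]
    for lhs, rhs in productions:
        # Locate the leftmost non-terminal still present in the sentential form.
        i = next(k for k, tok in enumerate(tokens) if tok in NONTERMINALS)
        if tokens[i] != lhs:
            raise ValueError(f"expected to expand {tokens[i]}, got a rule for {lhs}")
        tokens[i:i + 1] = rhs.split()
    return " ".join(tokens)

mr = leftmost_derivation(PRODUCTIONS)
print(mr)
```

The output matches the tokenized MR on the preceding slide, parentheses and braces included, so the syntactic tokens never need to be aligned with words directly.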
Extracting Transformation Rules [Figure, step 1: in the alignment between "The goalie should always stay in our half" and the MRL productions, the word "our" is linked to the production TEAM → our.] Extracted rule: TEAM → our / our
Extracting Transformation Rules [Figure, step 2: "half" is linked to REGION → (half TEAM), whose TEAM child has already been covered, so REGION spans "our half" with MR (half our).] Extracted rule: REGION → TEAM half / (half TEAM)
Extracting Transformation Rules [Figure, step 3: "stay" and "in" are linked to ACTION → (pos REGION), so ACTION spans "stay in our half" with MR (pos (half our)).] Extracted rule: ACTION → stay in REGION / (pos REGION)
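Probabilistic Parsing Model • Based on a maximum-entropy model: $\Pr_\lambda(d \mid e) = \frac{1}{Z_\lambda(e)} \exp \sum_i \lambda_i f_i(d)$, where $Z_\lambda(e)$ is the normalization constant • Features $f_i(d)$ are the number of times each transformation rule is used in a derivation d • Output translation is the yield of the most probable derivation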
Parameter Estimation • Maximum conditional log-likelihood criterion • Since correct derivations are not included in training data, parameters λ* are learned in an unsupervised manner • EM algorithm combined with improved iterative scaling, where hidden variables are correct derivations (Riezler et al., 2000)
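One way to write the criterion described here, as a sketch rather than the exact formulation used in WASP: the notation below assumes that the probability of an MR f given a sentence e is obtained by summing over all derivations d whose MR yield is f, which is what makes the correct derivation a hidden variable.

```latex
\lambda^{*} = \arg\max_{\lambda} \sum_{(e,f)} \log \Pr\nolimits_{\lambda}(f \mid e),
\qquad
\Pr\nolimits_{\lambda}(f \mid e) = \sum_{d \,:\, \mathrm{yield}(d) = f} \Pr\nolimits_{\lambda}(d \mid e)
```

The outer sum runs over the training pairs (e, f). Because the inner sum ranges over unobserved derivations, the objective is optimized with an EM-style procedure rather than in closed form.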
KRISP: Kernel-based Robust Interpretation by Semantic Parsing • Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar • Productions of the MRL are treated like semantic concepts • An SVM classifier with a string subsequence kernel is trained for each production • These classifiers are used to compositionally build MRs of the sentences
Kernel Functions • A kernel K is a similarity function over a domain X which maps any two objects x, y in X to their similarity score K(x,y) • If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi,xj))ij is symmetric and positive-semidefinite, then the kernel function computes the dot-product of implicit feature vectors in some high-dimensional feature space • Machine learning algorithms which use the data only to compute similarity can be kernelized (e.g. Support Vector Machines, Nearest Neighbor, etc.)
String Subsequence Kernel • Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002] • All possible subsequences become the implicit feature vectors and the kernel computes their dot-products • s = “left side of our penalty area”, t = “our left penalty area” • Counting common subsequences u one at a time: u = left (1), u = our (2), u = penalty (3), u = area (4), u = left penalty (5), and so on • K(s,t) = 11
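The count K(s,t) = 11 can be checked by brute force. The Python sketch below enumerates every word-level subsequence of each string and takes the dot-product of the resulting count vectors. It is an unweighted simplification: the kernel of Lodhi et al. also down-weights subsequences with gaps, which is omitted here, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def subsequence_counts(words):
    """Map each non-empty (possibly non-contiguous) subsequence to its number of occurrences."""
    counts = Counter()
    for length in range(1, len(words) + 1):
        for positions in combinations(range(len(words)), length):
            counts[tuple(words[i] for i in positions)] += 1
    return counts

def subsequence_kernel(s, t):
    """Dot-product of the two implicit subsequence-count feature vectors."""
    counts_s = subsequence_counts(s.split())
    counts_t = subsequence_counts(t.split())
    return sum(n * counts_t[u] for u, n in counts_s.items())

s = "left side of our penalty area"
t = "our left penalty area"
k = subsequence_kernel(s, t)
common = sorted(" ".join(u) for u in
                set(subsequence_counts(s.split())) & set(subsequence_counts(t.split())))
print(k)
print(common)
```

Enumeration is exponential in the string length, so this is only useful for verifying small examples like the one on the slide.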
Normalized String Subsequence Kernel • Normalize the kernel (range [0,1]) to remove any bias due to different string lengths • Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel, where n is the subsequence length • Used for Text Categorization [Lodhi et al., 2002] and Information Extraction [Bunescu & Mooney, 2005b]
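A common way to normalize a kernel is to divide by the geometric mean of the two self-similarities, K(s,t) / sqrt(K(s,s) K(t,t)); the slides do not spell out the formula, so treat that choice as an assumption. The Python sketch below pairs it with a simple O(|s||t|) dynamic program for the unweighted count of common subsequences. This is not the gap-weighted, length-bounded algorithm of Lodhi et al., and the function names are hypothetical.

```python
import math

def count_common_subsequences(s, t):
    """Count pairs of matching non-empty subsequences of token lists s and t."""
    n, m = len(s), len(t)
    # c[i][j] counts matching subsequence pairs of s[:i] and t[:j], empty pair included.
    c = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i][j] = c[i - 1][j] + c[i][j - 1] - c[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                c[i][j] += c[i - 1][j - 1]   # pairs that end by matching s[i-1] with t[j-1]
    return c[n][m] - 1                        # drop the empty pair

def normalized_kernel(s, t):
    """Unweighted subsequence kernel scaled into [0, 1]."""
    s_tokens, t_tokens = s.split(), t.split()
    k_st = count_common_subsequences(s_tokens, t_tokens)
    k_ss = count_common_subsequences(s_tokens, s_tokens)
    k_tt = count_common_subsequences(t_tokens, t_tokens)
    return k_st / math.sqrt(k_ss * k_tt)

s = "left side of our penalty area"
t = "our left penalty area"
value = normalized_kernel(s, t)
print(round(value, 4))
```

With this normalization every string has similarity 1 with itself, so a long string no longer scores higher simply because it contains more subsequences.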
Support Vector Machines • SVMs are classifiers that learn linear separators that maximize the margin between the data and the classification boundary. • Kernels allow SVMs to learn non-linear separators by implicitly mapping data to a higher-dimensional feature space.
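Overview of KRISP [Figure: Training: given the MRL grammar and NL sentences with MRs, positive and negative examples are collected and used to train string-kernel-based SVM classifiers; the resulting semantic parser produces the best semantic derivations (correct and incorrect), from which further examples are collected. Testing: the semantic parser maps novel NL sentences to their best MRs.]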