Lightweight Natural Language Database Interfaces
Jun. 23, 2004
In-Su Kang*, Seung-Hoon Na, Jong-Hyeok Lee, Gijoo Yang
Dept. of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), Republic of Korea

Contents
• Motivations
  • Introduction to NLDBI
  • Issues & our concerns
  • Two motivations
  • Lightweight architecture
• Lightweight NLDBI
  • Domain adaptation
  • Question answering
• Conclusion

Introduction
• Natural Language DataBase Interfaces (NLDBI)
  • Access database data in natural language [Androutsopoulos, 1995]
• Main components
  • Natural language question → Analysis (uses linguistic knowledge) → Meaning representation → Translation (uses translation knowledge) → Database query → DBMS → Answer

Terminology
• Domain class
  • Refers to a table or a column
  • (e.g.) T_Customer, C_ID, C_Name
• Domain class instance
  • An individual column value
  • (e.g.) 1034, 1035, “Bill Clinton”, “Jimmy Carter”
• Class term
  • A lexical term referring to a domain class, such as “customer”
• Value term
  • A lexical term indicating a domain class instance, such as “Bill”

Table T_Customer
C_ID | C_Name
1034 | Bill Clinton
1035 | Jimmy Carter

NLDBI Issues & Our Concerns
• Process
  • Natural language understanding
    • Spoken language & meaning representation
    • Discourse analysis & dialogue model
  • Database query conversion (NL → DB)
    • Paraphrase problem: M-to-1
    • Translation ambiguity problem: 1-to-N
  • Natural language generation
    • Co-operative answering
• Knowledge management
  • Linguistic knowledge
  • Translation knowledge
    • Representation, acquisition
    • Domain transportability problem

Motivation 1
• Previous translation knowledge acquisition
  • Complex translation knowledge representation
  • Expensive expertise required (AI/NLP/DBMS/domain knowledge)
    • (e.g.) Devise conversion rules from parse trees to database query expressions
    • (e.g.) Define database relations for logical predicates
  • Difficulty in initial creation and scalable expansion
  • Causes the domain transportability problem
    • No general solution
    • Domain-tool methods are one solution, but they are tightly coupled with the underlying NLDBI systems
    • (e.g.) IRUS, CHAT-80, ASK, EUFID, TEAM, MASQUE, …
• Our proposal
  • Semi-automatic acquisition by simplifying translation knowledge structures

Motivation 2
• Translation ambiguity
  • Class term ambiguity
    • A class term refers to several domain classes
    • ‘address’ → TB_Customer.Address or TB_Employee.Address
  • Value term ambiguity
    • A value term refers to several domain class instances
    • ‘London’ → TB_Flight.departure or TB_Flight.arrival
• Resolution of translation ambiguity
  • So far, no systematic disambiguation scheme
  • We propose a noun translation technique based on an information retrieval framework

Contents
• Motivations
• Lightweight NLDBI
  • Domain adaptation
    • Semi-automatic acquisition of translation knowledge
    • Physical Entity-Relationship Schema (pER schema)
    • Translation knowledge structures
    • Translation knowledge construction
    • Examples
  • Question answering
• Conclusion

Semi-Automatic Acquisition
• Procedure
  • DB → Reverse engineering → Physical schema → Linguistic annotation by domain experts → pER schema → Automatic extraction → Initial translation knowledge
  • Performed within a DB modeling tool
• Input guidelines
  1. To each domain class, give a linguistic name (in the form of an NP).
  2. Write any linguistic descriptions (called domain sentences) about or among domain classes, in the form of simple sentences.
  3. In guideline 2, an NP referring to a domain class should be either its linguistic name defined in guideline 1, or the domain class itself.

Physical Entity-Relationship (pER) Schema
• pER schema = pER graph + pER description
• pER graph = a physical schema
  • Encodes structural constraints among DB objects
    • Property-of between an entity and its attributes
    • Semantic relationships among entities and/or attributes
• pER description = linguistic annotations on a pER graph
  • Bridge between DB objects and natural language expressions

Translation Knowledge Structures
• Class-referring information (for the paraphrase problem)
  • Class document for each domain class
    • Synonymous class terms and their concept codes
  • Value document for each column
    • All-length n-grams / pattern-based 2-grams generated from column data
• Class-constraining information (for the translation ambiguity problem)
  • Valency-based selection restrictions
    • Restrictions that domain verbs or case markers impose on domain classes
    • order: {T_Customer, T_Product, T_Order.Date}
    • from: {T_Flight.Departure}, to: {T_Flight.Arrival}
  • Collocation document for each domain class
    • Linguistic collocations of a domain class
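
As a rough illustration, the four structures on this slide can be held in plain dictionaries of sets. This is only a sketch: the class name `TranslationKnowledge` and its field names are hypothetical, and concept codes are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationKnowledge:
    """Hypothetical container for the four structures named on the slide."""
    # Class-referring information
    class_docs: dict = field(default_factory=dict)        # domain class -> class terms
    value_docs: dict = field(default_factory=dict)        # column -> indexed n-grams
    # Class-constraining information
    valency: dict = field(default_factory=dict)           # verb or case marker -> allowed classes
    collocation_docs: dict = field(default_factory=dict)  # domain class -> collocated words


tk = TranslationKnowledge()
tk.class_docs["T_Customer"] = {"customer"}
# Valency examples taken from the slide.
tk.valency["order"] = {"T_Customer", "T_Product", "T_Order.Date"}
tk.valency["from"] = {"T_Flight.Departure"}
tk.valency["to"] = {"T_Flight.Arrival"}
```

Keeping every structure as "key to set of terms or classes" is what lets each one be treated as a small document collection later on.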

Translation Knowledge Construction
• Inputs
  • NL description: linguistic names, domain sentences
  • DB column data
  • Concept hierarchy
• Processing
  • NL description → Syntactic analysis → Class term extraction → Class terms
  • DB column data → N-gram value indexing → Value terms
• Outputs
  • Class-referring information: class documents, value documents
  • Class-constraining information: valency-based selection restrictions, collocation documents

Semi-Automatic Acquisition: Physical DB Schema
• Physical DB schema for a university course domain
• Obtained by reverse engineering with a DB modeling tool

Semi-Automatic Acquisition: NL Descriptions
• Domain experts annotate NL descriptions on the physical schema

Semi-Automatic Acquisition: Initial Translation Knowledge
• Class-referring translation knowledge
  • Domain class: T2C2
  • Linguistic name: ‘Course name’
  • All column values (non-alphanumeric): Statistics, Algorithms
  • Class document T2C2C: ‘Course name’, ‘Name’
  • Value document T2C2V: Statistics, Algorithms

Semi-Automatic Acquisition: Initial Translation Knowledge
• Class-referring translation knowledge
  • Class and value documents are built from linguistic names and DB tuples
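
The ‘Course name’ example suggests that the class terms of a domain class are the word-level suffixes of its linguistic name (‘Course name’, ‘Name’). The sketch below assumes exactly that rule, which the slides do not state explicitly; the function name `class_terms` is hypothetical, and casing is kept as typed.

```python
def class_terms(linguistic_name):
    """Return every word-level suffix of a linguistic name, longest first.

    Assumption: a head-final noun phrase can be referred to by dropping
    leading modifiers, so each suffix is a candidate class term.
    """
    words = linguistic_name.split()
    return [" ".join(words[i:]) for i in range(len(words))]


course_terms = class_terms("Course name")
student_id_terms = class_terms("Student identification number")
```

Under this assumption a three-word name yields three class terms, so a short question noun such as "number" can still retrieve the domain class.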

Semi-Automatic Acquisition: Initial Translation Knowledge
• Class-constraining translation knowledge
  • Collocation document T1C3
    • Linguistic name of T1C3: ‘Entrance year’
    • Class document of T1: ‘Student’
    • Collocation document T1C3: Entrance, student
  • Valency-based selection restriction (from all domain sentences)
    • Domain sentence: “Students take courses in ‘T3C3’”
    • Take-student, Take-course, Take-(in) T3C3
    • Take: student, course, T3C3
    • After lookup in class documents, Take: T1, T2, T3C3

Semi-Automatic Acquisition: Initial Translation Knowledge
• Class-constraining translation knowledge
  • Collocation-based selection restriction
  • Valency-based selection restriction
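
The valency-based restriction for "take" can be derived mechanically once a parser has produced predicate-argument pairs from the domain sentences. The following is a minimal sketch under stated assumptions: the pairs are given already lemmatized, the class documents for T1 and T2 contain "student" and "course", and the function name `valency_restrictions` is hypothetical.

```python
def valency_restrictions(pa_pairs, class_docs):
    """Map each predicate to the set of domain classes its arguments refer to."""
    # Invert the class documents: class term -> domain classes it may denote.
    term_to_classes = {}
    for domain_class, terms in class_docs.items():
        for term in terms:
            term_to_classes.setdefault(term, set()).add(domain_class)

    restrictions = {}
    for predicate, argument in pa_pairs:
        if argument in class_docs:
            # The argument is already a domain class name, e.g. 'T3C3'.
            classes = {argument}
        else:
            classes = term_to_classes.get(argument, set())
        restrictions.setdefault(predicate, set()).update(classes)
    return restrictions


# Assumed class documents and parser output for
# "Students take courses in 'T3C3'".
class_docs = {"T1": {"student"}, "T2": {"course"}, "T3C3": set()}
pairs = [("take", "student"), ("take", "course"), ("take", "T3C3")]
restrictions = valency_restrictions(pairs, class_docs)
```

With these inputs the restriction for "take" is the set of T1, T2 and T3C3, matching the slide.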

Semi-Automatic Acquisition: Expansion of Initial Translation Knowledge
• Paraphrase expansion by WordNet
• Initial class document T1C3: Instructor, professor
• Extended class document T1C3 (the numeric suffix is the hypernym level relative to the original term)
  • Instructor, teacher0, educator1, pedagogue1, professional2, professional_person2, adult3, grownup3, person4, individual4, someone4, somebody4, mortal4, human4, soul4
  • Professor, academician1, academic1, faculty_member1, educator2, pedagogue2, professional3, professional_person3, adult4, grownup4, person5, individual5, someone5, somebody5, mortal5, human5, soul5
• Concept hierarchy: Person, … Adult, … Educator, … Instructor, …
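
The expansion above can be read as a walk up the hypernym chain, tagging each word with its level. The sketch below uses a small hand-written fragment of the hierarchy instead of the real WordNet database, so the `PARENT` table and the function name `expand` are illustrative assumptions only.

```python
# Toy fragment of a concept hierarchy: synset -> parent synset (hypernym).
PARENT = {
    ("instructor", "teacher"): ("educator", "pedagogue"),
    ("professor",): ("academician", "academic", "faculty_member"),
    ("academician", "academic", "faculty_member"): ("educator", "pedagogue"),
    ("educator", "pedagogue"): ("professional", "professional_person"),
    ("professional", "professional_person"): ("adult", "grownup"),
    ("adult", "grownup"): ("person", "individual"),
}


def expand(term):
    """Return (word, level) pairs: synonyms at level 0, hypernyms above."""
    # Find the synset containing the term; raises StopIteration if the term
    # is not a key synset of this toy table.
    synset = next(s for s in PARENT if term in s)
    expanded = [(word, 0) for word in synset if word != term]
    level = 0
    while synset in PARENT:
        synset = PARENT[synset]
        level += 1
        expanded.extend((word, level) for word in synset)
    return expanded


instructor_terms = expand("instructor")
professor_terms = expand("professor")
```

Storing the level next to each word makes it possible to prefer closer matches when a question noun hits several extended class documents.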

Contents
• Motivations
• Lightweight NLDBI
  • Domain adaptation
  • Question answering
    • Question analysis
    • Noun translation
      • Class retrieval
      • Class disambiguation
    • Query graph & SQL generation
• Conclusion

Question Analysis & Noun Translation
• Question analysis by parsing
  • A set of question nouns
    • Each noun has features: question focus, value operator, etc.
  • A set of predicate-argument (P-A) pairs
• Noun translation (or domain class tagging)
  • Given a question noun, find the most probable domain class
  • Class retrieval
    • Retrieve candidate domain classes for each question noun
    • Lexically or conceptually equivalent domain classes
  • Class disambiguation
    • Select the most likely domain class

Question Analysis & Noun Translation
• Question: “Show me the names of students who got A in statistics from 1999”

Class Retrieval
• Information Retrieval (IR) framework
  • Translation knowledge → a target document collection
    • Class/value/collocation documents, valency-based selection restrictions
  • A question noun → an IR query
    • Class term → a surface word form & concept codes
      • ‘customer’, ‘product’
    • Linguistic value term → all-length n-grams for Korean
      • ‘Bill’, ‘Bush’
    • Alphanumeric value term → pattern-based 2-grams
      • C1: 1-byte character, C2: 2-byte character, N: decimal, S: special character
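
To make the value-term side concrete, the sketch below indexes column values by all-length character n-grams and retrieves columns by exact lookup of the query term. It assumes character-level n-grams and a plain set-membership match; the names `all_length_ngrams` and `retrieve` are hypothetical, and the column name follows the T_Customer example from the Terminology slide.

```python
def all_length_ngrams(value):
    """Return every contiguous character substring of a column value."""
    return {
        value[i:j]
        for i in range(len(value))
        for j in range(i + 1, len(value) + 1)
    }


def retrieve(term, value_docs):
    """Return the columns whose value document contains the query term."""
    return {column for column, grams in value_docs.items() if term in grams}


# Build one value document per column from its stored values.
names = ["Bill Clinton", "Jimmy Carter"]
value_docs = {
    "T_Customer.C_Name": set().union(*(all_length_ngrams(v) for v in names)),
}
hits = retrieve("Bill", value_docs)
```

Indexing every substring costs space quadratic in the value length, which is presumably why the slides switch to pattern-based n-grams for alphanumeric codes.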

Class Disambiguation
• Definition of a class retrieval function
  • Notation: RC(t) is the set of domain classes retrieved from a document collection C using a query term t
  • Rref(t): retrieves from ref (the set of class/value documents)
  • Rval(t): retrieves from val (valency-based constraints)
    • Valency-based constraints are treated as documents
  • Rcol(t): retrieves from col (collocation documents)
• Class disambiguation by the Boolean retrieval model
  • Valency-based
    • Rref(t) ∩ Rval(head(t))
  • Collocation-based
    • Rref(t) ∩ Rcol(adjacent(t))

Class Retrieval & Class Disambiguation
Q: Show me the names of students who got A in statistics from 1999
• Question noun: ‘1999’
• Head verb of ‘1999’: ‘Get’
• Class/value documents → relevant domain classes {T1C3v, T3C3v} (value term ambiguity)
• Valency-based constraints → Get: {T1, T3, T3C4, T3C3}
• Disambiguation: Rref(‘1999’) ∩ Rval(head(‘1999’)) = {T3C3v}

Class Retrieval & Class Disambiguation
Q: Show me the names of students who got A in statistics from 1999
• Question noun: ‘Name’
• Adjacent word of ‘Name’: ‘Student’
• Class/value documents → relevant domain classes {T1C2c, T2C2c} (class term ambiguity)
• Collocation documents → collocation-based constraint {T1C1, T1C2, T1C3}
• Disambiguation: Rref(‘Name’) ∩ Rcol(adjacent(‘Name’)) = {T1C2c}
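
Both worked examples reduce to a set intersection between the retrieved candidates and the classes allowed by the constraint. The sketch below replays them with the sets hard-coded from the slides. One assumption is made: the trailing "c" or "v" on a candidate marks a class or value document and is ignored when comparing against the constraint. The helper names are hypothetical.

```python
def base(doc_id):
    """Strip the document-type suffix: 'T3C3v' -> 'T3C3', 'T1C2c' -> 'T1C2'."""
    return doc_id[:-1] if doc_id[-1] in "cv" else doc_id


def disambiguate(candidates, constraint):
    """Boolean AND: keep candidates whose domain class the constraint allows."""
    return {c for c in candidates if base(c) in constraint}


# Valency-based example: question noun '1999', head verb 'get'.
r_ref_1999 = {"T1C3v", "T3C3v"}
r_val_get = {"T1", "T3", "T3C4", "T3C3"}
year_class = disambiguate(r_ref_1999, r_val_get)

# Collocation-based example: question noun 'name', adjacent word 'student'.
r_ref_name = {"T1C2c", "T2C2c"}
r_col_student = {"T1C1", "T1C2", "T1C3"}
name_class = disambiguate(r_ref_name, r_col_student)
```

The slides do not say what happens when the intersection is empty or still holds several classes, so this sketch simply returns whatever set remains.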

Query Graph & SQL Generation
• Query graph
  • A minimal connected sub-graph
  • Each node is the disambiguated domain class of a question noun
  • The query graph is located in the physical schema graph using the method of Meng et al. (1999)
• SQL generation from a query graph
  • Entity nodes → SQL-FROM
  • Arcs between entity nodes → join operations in SQL-WHERE
  • From question analysis
    • Domain class having the question focus feature → SQL-SELECT
    • Domain class having the value operator feature → SQL-WHERE

Query Graph & SQL Generation
SELECT T1C2
FROM T1, T2, T3
WHERE T1.T1C1 = T3.T3C1
  AND T2.T2C1 = T3.T3C2
  AND T2C2 = 'Statistics'
  AND T3C3 = 'A'
  AND T3C4 >= 1999
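
The mapping from query graph to SQL clauses can be sketched as a small string builder. The version below is illustrative only: it takes the query graph as already-computed lists (it does not implement the graph search of Meng et al.), and the function names `sql_literal` and `build_sql` are hypothetical.

```python
def sql_literal(value):
    """Render a Python value as an SQL literal, doubling single quotes."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_sql(select_cols, tables, joins, conditions):
    """Assemble a SELECT statement from the parts of a query graph.

    select_cols: domain classes carrying the question focus feature
    tables:      entity nodes of the query graph
    joins:       arcs between entity nodes, as (left, right) column pairs
    conditions:  (column, operator, value) triples from value terms
    """
    where = [f"{left} = {right}" for left, right in joins]
    where += [f"{col} {op} {sql_literal(val)}" for col, op, val in conditions]
    sql = f"SELECT {', '.join(select_cols)} FROM {', '.join(tables)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


query = build_sql(
    select_cols=["T1C2"],
    tables=["T1", "T2", "T3"],
    joins=[("T1.T1C1", "T3.T3C1"), ("T2.T2C1", "T3.T3C2")],
    conditions=[("T2C2", "=", "Statistics"), ("T3C3", "=", "A"), ("T3C4", ">=", 1999)],
)
```

A production system would more likely pass the values as bound parameters instead of inlining them, but inlining keeps the correspondence with the slide's query easy to see.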

Conclusion
• Lightweight NLDBI
  • Domain adaptation (to deal with the paraphrase problem)
    • Simplification of translation knowledge in the form of documents
    • Semi-automatic construction of translation knowledge
    • Expansion of translation knowledge by dictionary
  • Question answering (to resolve translation ambiguities)
    • Noun translation technique based on an IR framework
      • Class retrieval
      • Class disambiguation

Semi-Automatic Acquisition: Initial Translation Knowledge
• Class-referring translation knowledge
  • Domain class: T1C1
  • Linguistic name: ‘Student identification number’
  • All column values (alphanumeric): 1999-0011, 2001-0027
  • Character classes: C1 = 1-byte char, C2 = 2-byte char, S = special char, N = decimal
  • Value pattern: n4s1n4
  • Class document T1C1C: ‘Student identification number’, ‘Identification number’, ‘Number’
  • Value document T1C1V (n-grams): n4s1n4, n4, s1, n4s1, s1n4
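
The value document for T1C1 can be reproduced by run-length encoding each value into character-class units and then indexing the whole pattern plus its 1-grams and 2-grams. The sketch below makes several assumptions: digits map to "n", ASCII letters to "c", and everything else to "s"; the 1-byte versus 2-byte distinction is omitted; the function names are hypothetical.

```python
from itertools import groupby


def char_class(ch):
    """Classify one character: n = decimal, c = ASCII letter, s = other."""
    if ch in "0123456789":
        return "n"
    if ch.isascii() and ch.isalpha():
        return "c"
    return "s"


def pattern_units(value):
    """Run-length encode a value: '1999-0011' -> ['n4', 's1', 'n4']."""
    return [f"{cls}{len(list(run))}" for cls, run in groupby(value, key=char_class)]


def pattern_ngrams(value, max_n=2):
    """Return the whole pattern plus all unit n-grams up to max_n."""
    units = pattern_units(value)
    grams = {"".join(units)}
    for n in range(1, max_n + 1):
        for i in range(len(units) - n + 1):
            grams.add("".join(units[i:i + n]))
    return grams


student_id_grams = pattern_ngrams("1999-0011")
```

Because both stored IDs share the pattern n4s1n4, a single small value document covers the whole column, regardless of how many rows it holds.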