Lexical Model v3.3. International Technology Alliance In Network & Information Sciences. Dr David Mott (IBM UK). Purpose. A comprehensive CE model of how language works, consistent with linguistic practice allows us to build fact extraction applications configured by CE
Lexical Model v3.3
International Technology Alliance In Network & Information Sciences
Dr David Mott (IBM UK)
Purpose
• A comprehensive CE model of how language works, consistent with linguistic practice
  • allows us to build fact extraction applications configured by CE
• Allows specification of a rich lexicon (or dictionary)
  • how words express concepts in the CE domain model
  • how different forms of words are related (eg singular/plural)
  • representing semantic information to guide parsing and handle ambiguities
• To be consistent in the future with existing lexical resources
• Allows us to use Cambridge technology and lexicon (ERG, LKB)
• To facilitate use of shallow and deep semantics in a consistent manner
• Can build “simple” lexical processing but using the same model as more complex lexical processing
Simple link between word and concept
“the show was a hit”
the word |hit| expresses the entity concept ‘successful event’
conceptualise a ~ successful event ~ A that is an event.
(the show p1) is a successful event.
This seems to be a simple basis for extracting facts from sentences (this is the way that the ACITA demonstration works).
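The direct word-to-concept link can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EXPRESSES` table and the `extract_fact` function are hypothetical names, not part of the CE tooling or of the ACITA demonstration.

```python
from typing import Optional

# Hypothetical toy lexicon: each written word maps directly to one CE concept.
EXPRESSES = {"hit": "successful event"}


def extract_fact(subject: str, word: str) -> Optional[str]:
    """Return a CE-style fact for '<subject> was a <word>', or None if the word is unknown."""
    concept = EXPRESSES.get(word)
    if concept is None:
        return None
    return f"({subject}) is a {concept}."


print(extract_fact("the show p1", "hit"))
```

Because the table holds exactly one concept per word, it cannot represent a word with several meanings.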
But this is ambiguous
“the show was a hit”
the word |hit| expresses the relation concept ‘attack’
the word |hit| expresses the entity concept ‘successful event’
conceptualise an ~ attack ~ A that is a situation.
conceptualise a ~ successful event ~ A that is an event.
(the show p1) is ??????.
This ambiguity could be solved by grammatical information (noun vs verb). We need a representation of grammatical information.
Grammatical Form
• Defines how a word behaves grammatically
• Represents:
  • the word
  • general part of speech, eg noun, verb
  • inflectional information, eg “hitting” is a form of “hit” (present participle, indicating something ongoing)
• The ID is derived from the word and the “Penn tag” indicating part of speech:
  • |hit_nn| where “nn” is the Penn tag for a singular noun
• Grammatical forms acting as different parts of speech are different:
  • the singular noun |hit_nn|
  • the present tense verb |hit_vbp|
• Grammatical form is related to the word form:
  • the grammatical form |hit_nn| is written as the word |hit|
• Different inflections are related:
  • the singular noun |hit_nn| is inflected to the plural noun |hits_nns|
Note: the type combines the general part of speech and the specific inflectional form.
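The ID convention described on this slide (written word plus lowercase Penn tag) can be sketched as follows. The `GrammaticalForm` class and the `INFLECTED_TO` table are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrammaticalForm:
    word: str      # the written word, e.g. "hit"
    penn_tag: str  # Penn Treebank tag, e.g. "NN"

    @property
    def form_id(self) -> str:
        # ID = written word + "_" + lowercase Penn tag, e.g. "hit_nn"
        return f"{self.word}_{self.penn_tag.lower()}"


hit_nn = GrammaticalForm("hit", "NN")
hit_vbp = GrammaticalForm("hit", "VBP")
hits_nns = GrammaticalForm("hits", "NNS")

# "is inflected to": the singular noun points at its plural.
INFLECTED_TO = {hit_nn: hits_nns}

print(hit_nn.form_id, hit_vbp.form_id, INFLECTED_TO[hit_nn].form_id)
```

The noun and the verb share the written word "hit" but have different IDs, which is what lets a part-of-speech tag tell them apart.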
Using grammatical forms
“the show was a hit”
the present tense singular verb |hit_vbz| is written as the word |hit|
the singular noun |hit_nn| is written as the word |hit|
the present tense singular verb |hit_vbz| expresses the relation concept ‘attack’
the singular noun |hit_nn| expresses the entity concept ‘successful event’
conceptualise an ~ attack ~ A that is a situation.
conceptualise a ~ successful event ~ A that is an event.
The parser can determine part of speech, so one link is ruled out, leading to disambiguation:
(the show p1) is a successful event.
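A minimal sketch of this disambiguation step, assuming the parser supplies a Penn tag for each word. The `LEXICON` table and `disambiguate` function are hypothetical, and the form IDs are copied from the slide.

```python
from typing import Optional, Tuple

# Hypothetical lexicon keyed by (written word, Penn tag).
# Values are (grammatical form ID, expressed concept), as on the slide.
LEXICON = {
    ("hit", "VBZ"): ("hit_vbz", "attack"),
    ("hit", "NN"): ("hit_nn", "successful event"),
}


def disambiguate(word: str, penn_tag: str) -> Optional[Tuple[str, str]]:
    """Pick the single grammatical form (and its concept) matching the parser's tag."""
    return LEXICON.get((word, penn_tag))


# In "the show was a hit" a parser would tag "hit" as a singular noun (NN).
print(disambiguate("hit", "NN"))
```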
But this is STILL ambiguous
“the show was a hit”
the present tense singular verb |hit_vbz| is written as the word |hit|
the singular noun |hit_nn| is written as the word |hit|
the present tense singular verb |hit_vbz| expresses the relation concept ‘attack’
the singular noun |hit_nn| expresses the entity concept ‘successful event’
the singular noun |hit_nn| expresses the entity concept ‘blow’
conceptualise an ~ attack ~ A that is a situation.
conceptualise a ~ successful event ~ A that is an event.
conceptualise a ~ blow ~ A that is a physical contact.
(the show p1) is ??????.
This ambiguity can only be solved by semantic information (different senses of the same noun). We need a representation of word sense information.
Word Sense
• Defines the sense of a grammatical form (and hence of the word)
• Specifies:
  • the grammatical form
  • general part of speech, eg noun, verb
  • a meaning in terms of the CE conceptual model
• The ID is derived from the word in uppercase, the general part of speech and a unique number:
  • |HIT_n_1| where “n” is for noun
• Each word sense is UNIQUE:
  • the noun sense |HIT_n_1|
  • the noun sense |HIT_n_2|
  • the verb sense |HIT_v_1|
• Word sense is related to grammatical form:
  • the grammatical form |hit_nn| is a form of the noun sense |HIT_n_1|
• Word sense is related to a SINGLE concept:
  • the noun sense |HIT_n_1| expresses the entity concept ‘successful event’
Note: traditional linguistics would define a word sense by predicate logic. But for us, CE is the logical formalism.
Using Word Senses
“the show was a hit”
the present tense singular verb |hit_vbz| is written as the word |hit|
the singular noun |hit_nn| is written as the word |hit|
the present tense singular verb |hit_vbz| is a form of the verb sense |HIT_v_1|
the verb sense |HIT_v_1| expresses the relation concept ‘attack’
the singular noun |hit_nn| is a form of the noun sense |HIT_n_1|
the singular noun |hit_nn| is a form of the noun sense |HIT_n_2|
the noun sense |HIT_n_1| expresses the entity concept ‘successful event’
the noun sense |HIT_n_2| expresses the entity concept ‘blow’
conceptualise an ~ attack ~ A that is a situation.
conceptualise a ~ successful event ~ A that is an event.
conceptualise a ~ blow ~ A that is a physical contact.
(the show p1) is ??????.
How is this to be disambiguated, between ‘blow’ and ‘successful event’? This is more complex and requires semantics.
Summary of basic concepts
(Diagram columns: form, grammar, meaning)
Building the Lexicon (1): same word, different forms
Building the Lexicon (2): ambiguous!
Stanford Parser Output
(S (NP (DT the) (NN man)) (VP (VBD slept)))
Figure: the CE rendering of the Stanford Parser output.
Note that the parser can determine part of speech, and hence grammatical form, but NOT the word sense, as this requires disambiguation of word sense and linking to the CE concept model.
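To show how part of speech (and hence a grammatical form ID) can be read off a bracketed parse like the one above, here is a small sketch. It treats the parse as plain text and does not call the Stanford Parser itself. The function name `tagged_words` is hypothetical.

```python
import re
from typing import List, Tuple


def tagged_words(tree: str) -> List[Tuple[str, str]]:
    """Return (word, Penn tag) pairs for the leaves of a bracketed parse."""
    # A leaf looks like "(TAG word)": a tag and a word, with no nested brackets.
    leaves = re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree)
    return [(word, tag) for tag, word in leaves]


parse = "(S (NP (DT the) (NN man)) (VP (VBD slept)))"
pairs = tagged_words(parse)

# Grammatical form IDs in the word + lowercase Penn tag convention.
form_ids = [f"{word}_{tag.lower()}" for word, tag in pairs]
print(pairs)
print(form_ids)
```

Nothing in the parse identifies a word sense, so this step stops at grammatical forms.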
Dave’s two phase parsing
(Diagram, two linked structures connected by “Rules”)
• Structural (parent/child/sibling): phrase node (S); phrase node (NP); phrase node (VP); word node (DT) “the”; word node (N) “man”; word node (VBD) “slept”
• Lexical (head/dependent): the sentence phrase sp1; the noun phrase np1; the verb phrase vp1; the determiner |the|; the noun |man|; the past tense verb |slept|
• “stands for” links from phrases to: the situation s1; the thing t1
Process of building meaning
• Syntactic parser, including POS tagging and lemmatisation
  • word: the word |hits|
• lookup in lexicon, with grammatical disambiguation (POS tagging)
  • grammatical form: the plural noun |hits_nns| has the word |hit| as lemma.
• 1-1 lookup in lexicon
  • word sense: the noun sense |HIT_n_1|
• Semantic disambiguation (context, phrase structures, constraints, avoiding inconsistencies)
  • CE concept: the entity concept hit
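The chain word, grammatical form, word sense, concept can be sketched as a sequence of table lookups. All table names and contents below are hypothetical. In particular, treating |hits_nns| as a form of both noun senses is an assumption made to show where semantic disambiguation is still needed.

```python
from typing import List, Tuple

# Hypothetical lexicon tables, one per step in the chain.
FORM_BY_WORD_AND_TAG = {("hits", "NNS"): "hits_nns", ("hit", "NN"): "hit_nn"}
LEMMA_OF_FORM = {"hits_nns": "hit", "hit_nn": "hit"}
SENSES_OF_FORM = {  # assumption: both noun forms share both noun senses
    "hits_nns": ["HIT_n_1", "HIT_n_2"],
    "hit_nn": ["HIT_n_1", "HIT_n_2"],
}
CONCEPT_OF_SENSE = {"HIT_n_1": "successful event", "HIT_n_2": "blow"}


def candidate_concepts(word: str, penn_tag: str) -> List[Tuple[str, str]]:
    """Follow word -> grammatical form -> word senses -> concepts.

    Returns every (sense, concept) candidate; choosing among them is the
    semantic disambiguation step, which is not attempted here.
    """
    form = FORM_BY_WORD_AND_TAG.get((word, penn_tag))
    if form is None:
        return []
    return [(sense, CONCEPT_OF_SENSE[sense]) for sense in SENSES_OF_FORM.get(form, [])]


print(LEMMA_OF_FORM["hits_nns"])
print(candidate_concepts("hits", "NNS"))
```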
Features
• Features may be added to grammatical forms and word senses:
  • “he” is singular and male
  • “loves” is present tense
• Features can:
  • guide the parsing and rule out inconsistent parses
  • help to disambiguate words
  • provide some additional information about what is being said
    • timing information, classifications of entities involved
• Important to represent features in our lexicon
Where do we place features?
• what other phrases can/must/must not occur in a sentence containing this verb?
• what roles and constraints must occur in the situation?
Lexv3 model
(Diagram. Key: type with sample instance name; “is a”; 1-1 relation or attribute; N-1; 1-N)
Types, with sample instances and annotations:
• lexicon: set of entries, organised with word (via citation form) and set of meanings
• meaning, eg concept hit
• word sense |HIT_n_1|: abstraction capturing a unique meaning
  • conceptual sense, structural sense
  • noun sense, pronoun sense, verb sense, determiner sense, adjective sense, adverb sense
• agentive verb, patient-object verb: symbol to capture semantic features that do not affect the form of the word, eg transitivity, required complements
• syntactic element, phrase
• word |hit|: no concept of grammar, just what is written
  • simple word |hit|: characters delimited by whitespace, no punctuation
  • compound word |hit man| (also |ice bucket|): has [] as word sequence
• grammatical form |hit_nns|: fundamental grammatical object to represent form; combines aspects of “written” (via word-or-words) and “pos”
• grammatical category: traditional categories of grammatical forms
  • conjunction |and_cc|, nominal, adverb |very_rb|, adjective |nice_jj|, preposition |on_pp|
  • noun, proper noun, common noun
  • pronoun, possessive pronoun, personal pronoun
  • verb, auxiliary verb, lexical verb
• inflectional form
  • base verb |hit_vb|, present verb, present third singular verb, present participle, past tense verb, past participle |hit_vbn|
  • singular noun |hit_nn|, plural noun |hits_nns|
• part of speech: the name of a (limited number) of parts of speech, eg “noun”, “verb”, “adjective”, “adverb”, “preposition”
Relations, with annotations:
• contains
• expresses
• has as headword
• has as inflected form
• has as lemma
• has as base form
• has as text, eg “hit”
• has as pos, eg “noun”
• is written as: not “is-a”, to separate the set of grammatical forms from the set of words
• is a form of
• is an inflection of: points to the base form of the grammatical form
• has as number, has as gender, has as person ...: to capture grammaticalised features that affect the form of the word, eg number, gender, person etc
• Some useful inferred relations
How do we disambiguate using semantics?
• Lexical semantics:
  • Words, especially verbs, must behave in certain ways in sentences
    • there must be number agreement between subject and verb: *we loves*
    • a verb may require an object: *I hit*
    • a verb may not permit an object: *I fall dogs*
• World semantics:
  • situations may require certain roles:
    • an “attack situation” must have a volitional agent
    • “Christian market attacked …” must be a telegraphic form of “Christian market was attacked”
  • specific rules may apply:
    • all military vehicles are maintained by the owning nation
      • “the tank was mended by the Polish student” suggests that “tank” cannot be an armoured vehicle, but may be a water tank
    • a person may not be married to two people
      • “John married Sue, he thinks his wife is beautiful” suggests that “his wife” refers to Sue
• These are all constraints on the interpretation of a sentence
  • lexical parses that fail these constraints are invalid
  • sets of extracted CE sentences that fail these constraints are inconsistent
• If there are alternative hypotheses for an interpretation, and all but one are ruled out, then the remaining one may be the correct interpretation.
THE major research topic?
Different Types of Semantics
• We assume that lexical semantics is specific to a word sense and should thus be represented at the word sense level:
  • the verb sense |HIT_v_1| “must be in a phrasal structure where there is a direct object”
• World semantics will be represented in situation roles etc:
  • the entity concept hit situation has an agent role and a patient role
  • we assume that the roles are mandatory
Constraints in BPP11 and BPP13
“Christian market attacked …” (rule out active voice?)
BPP11:
if ( there is a volitional situation named S that has the thing A as agent role ) and ( the thing A is a nonvolitional thing )
then ( there is an inconsistency named I that has the thing S as object ).
BPP13 (?):
when ( there is a situation named S that has the thing A as agent role )
then it must not be that ( the thing A is a nonvolitional thing ) and ( the situation S is a volitional situation ).
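The effect of the volitional-agent constraint on “Christian market attacked …” can be sketched as a filter over competing readings. This is not the CE rule engine: the `Thing` and `Situation` classes and the two hand-built hypotheses are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thing:
    name: str
    volitional: bool


@dataclass(frozen=True)
class Situation:
    name: str
    volitional: bool
    agent: Optional[Thing] = None  # None when no agent is stated


def is_inconsistent(s: Situation) -> bool:
    """A volitional situation whose agent is a nonvolitional thing is inconsistent."""
    return s.volitional and s.agent is not None and not s.agent.volitional


market = Thing("Christian market", volitional=False)

# Two hypothetical readings of the telegraphic sentence.
hypotheses = {
    "active": Situation("s1", volitional=True, agent=market),  # the market attacks
    "passive": Situation("s1", volitional=True, agent=None),   # the market was attacked
}

surviving = [name for name, s in hypotheses.items() if not is_inconsistent(s)]
print(surviving)
```

With the active reading ruled out, the passive reading is the only hypothesis left.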
Verb constraints
• The verb “eat” takes a direct object:
  • “the boy eats the biscuit”
• but may omit it:
  • “the boy eats”
  • semantically it is implied that something is eaten, but lexically it can be omitted
• The verb “hit” takes a direct object:
  • “the boy hits the dog”
• but the object cannot be omitted:
  • *the boy hits*
  • lexically this is not permitted
• (This is called valency)
How can we represent this?
Word Sense Categories
• Categories of word senses define the phrasal context in which they may occur:
  • agent-subject verb
    • there must be an NP in “subject” position, which becomes the agent of the situation
  • patient-object verb
    • there must be an NP in “object” position, which becomes the patient of the situation
  • optional-patient-object verb
    • there may be an NP in “object” position, and if so it becomes the patient of the situation
  • no-object verb
    • there must not be an NP in “object” position
• Example:
  • |EAT_v_1| is an agent-subject verb and an optional-patient-object verb: “I eat the food”, “I eat”
  • |HIT_v_1| is an agent-subject verb and a patient-object verb: “I hit the dog”, *I hit*
  • |SLEEP_v_1| is an agent-subject verb and a no-object verb: “I sleep”, *I sleep the dream*
  • |RUN_v_1| is an agent-subject verb and a no-object verb: “I run”, *I run the race*
  • |RUN_v_2| is an agent-subject verb and a patient-object verb: *I run*, “I run the race”
  • (|RUN_v_1| and |RUN_v_2| are different senses of the same verb)
(Diagram: active and passive S / NP / VP trees with subject and object positions; the passive uses “be” + past participle.)
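The category definitions above can be sketched as a check on whether a phrase has a subject NP and an object NP. The `CATEGORIES` table and `permits` function are hypothetical names, and sense IDs follow the uppercase style of the Word Sense slide.

```python
# Hypothetical table of word sense categories, taken from the examples above.
CATEGORIES = {
    "EAT_v_1": {"agent-subject", "optional-patient-object"},
    "HIT_v_1": {"agent-subject", "patient-object"},
    "SLEEP_v_1": {"agent-subject", "no-object"},
    "RUN_v_1": {"agent-subject", "no-object"},
    "RUN_v_2": {"agent-subject", "patient-object"},
}


def permits(sense: str, has_subject: bool, has_object: bool) -> bool:
    """Check a verb sense against the phrasal context it occurs in."""
    cats = CATEGORIES[sense]
    if "agent-subject" in cats and not has_subject:
        return False  # subject NP is required
    if "patient-object" in cats and not has_object:
        return False  # object NP is required
    if "no-object" in cats and has_object:
        return False  # object NP is forbidden
    return True       # an optional object imposes no constraint


# "I run the race": subject and object present, so only one sense of "run" fits.
print([s for s in ("RUN_v_1", "RUN_v_2") if permits(s, True, True)])
```

The same check that rejects *I hit* also selects between the two senses of “run”.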
Situation Frames?
the eating situation
  has the volitional thing A as agent and
  has the concrete thing B as patient role and
  has the statement that the thing A contains the thing B as result.
the hitting situation
  has the volitional thing A as agent and
  has the concrete thing B as patient role and
  has the statement that the thing A contacts the thing B as result.
Types of semantic knowledge
• The situation “frame” captures the essence of the meaning.
  • eating has a volitional actor and a concrete patient
  • after eating the actor contains the patient
• Encyclopedic knowledge may add further domain-specific constraints
  • eating poisonous food causes harm
  • not eating causes harm
  • the food must be small enough to fit in the mouth
Encyclopedic knowledge is almost unlimited; some could be represented in rules. This may be the limiting factor to automation: need for human collaboration.
Where are my lexemes?
• A lexeme can be derived from the basic relations:
  • all grammatical forms that have the same base form and the same sense (but we do not state WHICH sense)
A paradigm (verb):
the base verb |hit_vb| is a form of the verb sense |HIT_v_1|
the present third singular verb |hit_vbz| is an inflection of the base verb |hit_vb|
the past tense verb |hit_vbd| is an inflection of the base verb |hit_vb|
the past participle verb |hit_vbn| is an inflection of the base verb |hit_vb|
A paradigm (noun):
the singular noun |hit_nn| is a form of the noun sense |HIT_n_1|
the plural noun |hits_nns| is an inflection of the singular noun |hit_nn|
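Deriving a lexeme (paradigm) from the two basic relations can be sketched as a grouping step. The table names and the `paradigms` function are hypothetical, and only a subset of the forms on the slide is included.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical relation tables.
IS_A_FORM_OF = {"hit_vb": "HIT_v_1", "hit_nn": "HIT_n_1"}  # base form -> word sense
IS_AN_INFLECTION_OF = {                                     # inflected form -> base form
    "hit_vbd": "hit_vb",
    "hit_vbn": "hit_vb",
    "hits_nns": "hit_nn",
}


def paradigms() -> Dict[str, List[str]]:
    """Group each base form with its inflections, keyed by the base form's sense."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for base, sense in IS_A_FORM_OF.items():
        groups[sense].append(base)
    for form, base in IS_AN_INFLECTION_OF.items():
        groups[IS_A_FORM_OF[base]].append(form)
    return {sense: sorted(forms) for sense, forms in groups.items()}


print(paradigms())
```

Each group corresponds to one lexeme, which is why no separate lexeme object needs to be stored.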
Representing the “expresses” link
• For simpler NL processing we should allow the (ambiguous) relation:
  • the word X expresses the concept Y
• However in the new model, this link is not explicit but is justified by a more complex relationship between word, grammatical form, word sense and concept
  • we can define a rationale for each expresses relation, should it be necessary to be explicit
Rationalising the expresses link
the word |hit| expresses the entity concept hit
because
  the singular noun |hit_nn| is written as the word |hit| and
  is a form of the word sense |HIT_n_1| and
  the word sense |HIT_n_1| expresses the entity concept hit.
Note this is still ambiguous.
The “advanced” lexicon must define these facts.
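The “because” statement can be sketched as an inference that walks the relational path from word to concept and records the steps it used. The table names and the `expresses_with_rationale` function are hypothetical. A real implementation would presumably be a CE rule.

```python
from typing import List, Tuple

# Hypothetical "advanced" lexicon facts, as in the rationale above.
IS_WRITTEN_AS = {"hit_nn": "hit"}    # grammatical form -> written word
IS_A_FORM_OF = {"hit_nn": "HIT_n_1"}  # grammatical form -> word sense
SENSE_EXPRESSES = {"HIT_n_1": "hit"}  # word sense -> concept


def expresses_with_rationale(word: str) -> List[Tuple[str, List[str]]]:
    """Infer 'word expresses concept' links, each with the path that justifies it."""
    results = []
    for form, written in IS_WRITTEN_AS.items():
        if written != word:
            continue
        sense = IS_A_FORM_OF.get(form)
        concept = SENSE_EXPRESSES.get(sense)
        if concept is not None:
            # The rationale is the grammatical form and word sense on the path.
            results.append((concept, [form, sense]))
    return results


print(expresses_with_rationale("hit"))
```

If several forms or senses lead from the same word to different concepts, the function returns several links, so the simple relation remains ambiguous.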
Two complementary approaches to lexicon
• direct specification of the expresses link
• specification of the full set of relations
  • as specific rationale, a “because” statement based on the relational path from word to concept
  • this could be directly stated or inferred by a rule
For reference, here is the LKB “mrscomp” example TFS model. This has not yet been merged with the proposed lexical model.
MRSCOMP
(Diagram of the typed feature structure model; only the labels are recoverable.)
• Features: ORTH, GAP, SPR, ARGS, COMPS, INSTLOC, SEM, HEAD, LTOP, HOOK, INDEX, DIR, MOD, RELS, CAT, FORM, HCONS, ARG0, KEY, LBL, AGR, PRED, ARG1, LARG, HARG
• Types: syn-struc, string, handle, hook, semantics, index, modifier, pos, dir (pre, post), lex-item, lexeme, word, non-quant-lexeme, noun-lxm, thing, entity, ref-ind, nominal, noun, verb, det, noun-mod, noun-form, sing-noun, relation, arg1-relation, predsort, pernum (3sing, non-3sing), form (norm, fin, inf, pastp, presp), qeq