NL Processing and Fact Extraction
11th May 2013
International Technology Alliance in Network & Information Sciences
David Mott (IBM UK); Stephen Poteet, Ping Xue, Anne Kao (Boeing)
Summary
• This document summarises the state of the NL processing and fact extraction work as at 11th May 2013.
• Key areas addressed are significant extensions to the work demonstrated at ACITA:
  • extending the lexicon, based upon WordNet and VerbNet resources, which have been turned into CE
  • generalising the semantic rules, allowing automatic processing based upon the linguistic structures for verbs defined in VerbNet
  • extending the analysis of complex verb phrases, including auxiliary verbs (have/be etc), providing information about tense, aspect and active/passive voice
  • extending CE by use of “linguistic frames”, allowing the structure of CE to be changed by configuration
  • initial analysis of the LKB structures and how they might be represented in CE, including consideration of “complements” to verb phrases
References
• WordNet
  George A. Miller (1995). WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.
  Christiane Fellbaum (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
• VerbNet
  Karin Kipper, Anna Korhonen, Neville Ryant, Martha Palmer (2008). A Large-scale Classification of English Verbs. Language Resources and Evaluation Journal, 42(1), pp. 21-40, Springer Netherlands.
  Karin Kipper Schuler, Anna Korhonen, Susan W. Brown (2009). VerbNet overview, extensions, mappings and apps. Tutorial, NAACL-HLT 2009, Boulder, Colorado.
Progress since ACITA12
• More comprehensive common model of language
• Handling compound verb phrases:
  • “is being loved”
  • attempting to get tense, voice, aspect, case
  • constructing temporal relations between situations and utterance
• Using WordNet
  • defining lexicon of nouns and senses
  • adding in inflections
  • adding in links to Domain Specific concepts
• Using VerbNet
  • defining lexicon of verbs with typical grammatical patterns (NP V NP) and senses
  • adding in inflections
  • adding in links to Domain Specific concepts
• Extensions to CE
  • based on semantic chart parser and linguistic frames
• Exploring the LKB
  • based on linguistic frames
Architecture
(Architecture diagram. Components shown:)
• SYNCOIN Reports
• Message PreProcessor
• Stanford Parser
• Entity Extractor
• Situation Extractor
• CEStore
• Domain Specific Reasoning
• CE “Styliser”
• Analysts “Helper” (missing word-concept links)
• Resources: Reference (places, orgs), WordNet, VerbNet, Domain Noun Sense – Concept, Domain Verb Sense – Concept
Purpose
• A comprehensive CE model of how language works, consistent with linguistic practice
  • allows us to build fact extraction applications configured by CE
• Allows specification of a rich lexicon (or dictionary)
  • how words express concepts in the CE domain model
  • how different forms of words are related (eg singular/plural)
  • representing semantic information to guide parsing and handle ambiguities
• To be consistent in the future with existing lexical resources
  • allows us to use Cambridge technology and lexicon (ERG, LKB)
• To facilitate use of shallow and deep semantics in a consistent manner
  • can build “simple” lexical processing but using the same model as more complex lexical processing
Some Key Concepts in the Lexicon
(Diagram: four layers of the lexicon, linked by the relations “is written as”, “is an inflection of”, “is a form of” and “expresses”.)
• word (as written on the page): the word |hits|
• grammatical form (words seen as inflections, and parts of speech):
  • the plural noun |hits_NNS|, an inflection of the singular noun |hit_NN|
  • the present third singular verb |hits_VBZ|, an inflection of the base form verb |hit_VB|
• word sense (unique meanings): the noun sense |HIT_n_1|, the verb sense |HIT_v_8|
• CE concept (concepts): the entity concept ‘hit’, the entity concept ‘hit situation’
    conceptualise a ~ hit ~ H that is a force.
    conceptualise a ~ hit situation ~ S that is an attack.
Building Meaning
(Diagram: the lexicon layers for the word |hits|, annotated with two disambiguation steps.)
• Grammatical disambiguation (POS tagging): choosing the grammatical form of the word |hits|
  • the plural noun |hits_NNS|, an inflection of the singular noun |hit_NN|
  • the present third singular verb |hits_VBZ|, an inflection of the base form verb |hit_VB|
• Semantic disambiguation (context, phrase structures, constraints, avoiding inconsistencies): choosing the word sense and the CE concept it expresses
  • word senses: the noun sense |HIT_n_1|, the verb sense |HIT_v_8|
  • CE concepts: the entity concept ‘hit’, the entity concept ‘hit situation’
    conceptualise a ~ hit ~ H that is a force.
    conceptualise a ~ hit situation ~ S that is an attack.
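The path from a written word to a CE concept can be sketched as a chain of table lookups. This is only an illustration in Python, not the CE implementation: the table names and the function are hypothetical, and the pairing of each sense with a concept is an assumption, since the slide does not state which sense expresses which concept.

```python
# Hypothetical mini-lexicon; the sense-to-concept pairing is an assumption.
WRITTEN_FORMS = {"hits": ["hits_NNS", "hits_VBZ"]}            # word -> grammatical forms
INFLECTION_OF = {"hits_NNS": "hit_NN", "hits_VBZ": "hit_VB"}  # form -> base form
FORM_OF = {"hit_NN": ["HIT_n_1"], "hit_VB": ["HIT_v_8"]}      # base form -> word senses
EXPRESSES = {"HIT_n_1": "hit", "HIT_v_8": "hit situation"}    # word sense -> CE concept


def candidate_readings(word, pos_tag=None):
    """Return (grammatical form, word sense, CE concept) triples for a written word.

    If pos_tag is given (for example from a POS tagger), forms carrying a
    different tag are dropped: this is the grammatical disambiguation step.
    """
    readings = []
    for form in WRITTEN_FORMS.get(word, []):
        tag = form.rsplit("_", 1)[1]
        if pos_tag is not None and tag != pos_tag:
            continue
        base = INFLECTION_OF.get(form, form)
        for sense in FORM_OF.get(base, []):
            if sense in EXPRESSES:
                readings.append((form, sense, EXPRESSES[sense]))
    return readings


print(candidate_readings("hits"))                 # both readings remain
print(candidate_readings("hits", pos_tag="VBZ"))  # only the verb reading remains
```

Semantic disambiguation (choosing between senses of the same grammatical form) is not covered by this sketch.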
Other Concepts in the Lexical Model
• Parts of Speech
• Phrases
  • head/dependent
• Features
  • tense/aspect/voice
  • number/person
• Roles
  • agent/patient/theme/location…
• Selectional Restrictions (as rules)
• Ambiguities
  • assumption based reasoning
• Linguistic Frames
Information Flow
(Flow diagram. Labels shown:)
• parse tree: NPs, VPs, PPs
• noun dependency fragment; verb dependency fragment; verb chain
• wordnet conceptual model (“expresses”); verbnet conceptual model (“expresses”)
• reference entities; specific entities; typed entities
• situations and generic roles; specific roles and relations
• tenses; temporal relations
• dialog context; containment relations
Analysing Tense, Aspect, Voice, Mood, Case
• Tense: the relationship between a situation and the time of utterance
  • before, after, during
• Aspect: an indication of the time “profile”, eg:
  • an instant in time (punctual) vs a duration
  • completed or ongoing
• Combining Tense and Aspect:
  • past simple (tense = before, aspect = punctual or habitual)
  • he drank decaffeinated coffee
• Voice: active or passive
• Mood: should/must/would/can/could
• Case: singular/plural, male/female, 1st/2nd/3rd
• Negation?: “not loved”
Complex area! Just starting here.
Represent as features
    the verb phrase #1
      has the person category ‘first’ as feature and
      has the number category ‘singular’ as feature and
      has the tense category ‘past’ as feature and
      has the aspect category ‘progressive’ as feature
      …
Actually we will assign features to verb chains, not individual VPs; see later.
Detecting chains of nested VPs
“are conducting”
    ( there is a verb phrase chain named #27 that
      has '2' as length and
      has the verb phrase #101 as first phrase and
      has the verb phrase #102 as last phrase and
      has the present tense verb |are_VBP| as first item and
      has the gerund participle |conducting_VBG| as main verb ).
(Parse structure diagram:)
    … the verb phrase #101 has as head the present tense verb |are_VBP| and has as dependent the verb phrase #102
    the verb phrase #102 … has as head the gerund participle |conducting_VBG|
Assigning features to chains
    if ( the verb phrase chain F
         has '2' as length and
         has the verb |are_VBP| as first item and
         has the gerund participle GP as main verb )
    then
       ( the verb phrase chain F
         has the tense category 'present' as feature and
         has the aspect category 'progressive' as feature and
         has the voice category 'active' as feature ).
Note we cannot tell the person or number features, since the subject could be “you-sg/you-pl/they”.
Should we also assign features to all/some VPs in the chain?
Sample Rules
Find a 2 verb phrase chain:
    if ( the verb phrase P1 has the verb phrase P2 as dependent and has the verb V1 as head ) and
       ( the verb phrase P2 has the verb V2 as head ) and
       ( it is false that there is a verb phrase named P0 that has the verb phrase P1 as dependent ) and
       ( it is false that the verb phrase P2 has the verb phrase P3 as dependent )
    then
       ( there is a verb phrase chain named F that
         has '2' as length and
         has the verb phrase P1 as first phrase and
         has the verb phrase P2 as last phrase and
         has the verb V1 as first item and
         has the verb V2 as main verb ).
“has …ed” is active and past:
    if ( the verb phrase chain F
         has '2' as length and
         has the verb |has_VBZ| as first item and
         has the past participle PP as main verb )
    then
       ( the verb phrase chain F
         has the tense category 'past' as feature and
         has the voice category 'active' as feature ).
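The two kinds of rule above (find a chain, then assign features to it) can be sketched procedurally. This is a minimal Python illustration, not the CE rule engine: the `VerbPhrase` class and both functions are hypothetical, and it covers only the two patterns given on the slides (`are` + gerund participle, `has` + past participle), using the Penn Treebank tags VBG and VBN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerbPhrase:
    head: str                                # tagged verb, e.g. "are_VBP"
    dependent: Optional["VerbPhrase"] = None  # nested VP, if any


def find_chain(top: VerbPhrase):
    """Follow head -> dependent links from the outermost VP; return the tagged verbs."""
    items = []
    vp = top
    while vp is not None:
        items.append(vp.head)
        vp = vp.dependent
    return items


def chain_features(items):
    """Assign features to a chain of length 2, mirroring the two sample rules."""
    if len(items) != 2:
        return {}
    first, main = items
    main_tag = main.rsplit("_", 1)[1]
    if first == "are_VBP" and main_tag == "VBG":
        return {"tense": "present", "aspect": "progressive", "voice": "active"}
    if first == "has_VBZ" and main_tag == "VBN":
        return {"tense": "past", "voice": "active"}
    return {}


chain = find_chain(VerbPhrase("are_VBP", VerbPhrase("conducting_VBG")))
print(chain, chain_features(chain))
```

As on the slide, person and number are deliberately not assigned, because “are” alone does not determine them.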
Timing constraints in CE
• We already have a conceptualisation of time in CE from the CPM work.
• There is a set of temporal constraints that can be applied to temporal entities, based on Allen’s interval logic
• A situation is a type of temporal entity
(Diagram: intervals s1 and s2 laid out along a TIME axis for each constraint.)
    the situation s1 occurs after the situation s2
      (unfortunately this reverses the time layout; maybe “occurs before” is better)
    the situation s1 occurs immediately after the situation s2
    the situation s1 occurs within the situation s2
    the situation s1 is ended by the situation s2
    etc.
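Three of the constraints above can be sketched as checks on numeric intervals. This is a Python illustration under stated assumptions: the `Interval` type and function names are hypothetical, “occurs within” is taken to allow shared end points, and “is ended by” is left out because the slide does not pin down its exact meaning.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def occurs_after(s1: Interval, s2: Interval) -> bool:
    """s1 starts strictly after s2 has ended."""
    return s1.start > s2.end


def occurs_immediately_after(s1: Interval, s2: Interval) -> bool:
    """s1 starts exactly when s2 ends (Allen's 'met by')."""
    return s1.start == s2.end


def occurs_within(s1: Interval, s2: Interval) -> bool:
    """s1 lies inside s2; shared end points are allowed here (an assumption)."""
    return s2.start <= s1.start and s1.end <= s2.end


survey = Interval(0, 10)
utterance = Interval(4, 5)
print(occurs_within(utterance, survey))  # the utterance falls inside the survey
```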
Assign timing constraints to situations based on the utterance
Utterance: they are conducting surveys
    if ( the verb phrase VP has the tense category 'past' as feature and stands for the situation SIT ) and
       ( the situation SIT is referenced in the sentence S ) and
       ( the utterance UT utters the sentence S )
    then
       ( the utterance UT occurs after the situation SIT ).
(Diagram labels: TIME axis; conduct situation; utterance; occurs within; utters; stands for; is referenced in; verb phrase; sentence (is parsed from) “they are conducting surveys”; has as feature tense category ‘past’.)
This is oversimplified.
More complex case!
Utterance: he said the train would come
(Diagram along a TIME axis, in two parts.)
• Reported sentence: the utterance utters the sentence (is parsed from) “he said the train would come”; the verb phrase v1 stands for the say situation s1 and has the tam category ‘past simple’ as feature; the utterance occurs after s1
• Embedded sentence: a second utterance occurs simultaneously with s1 and utters the sentence (is parsed from) “the train will come”; the verb phrase v2 stands for the come situation s2 and has the tam category ‘future simple’ as feature; s2 occurs after that utterance
• MAGIC HERE: how do we get from “he said the train would come” to “the train will come”?
Timings are Useful
• Given the temporal relations we can now:
  • calculate ordering of events
  • calculate precise times of events
• Useful for:
  • story telling
  • “forensic” analysis
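Calculating an ordering of events from pairwise “occurs after” relations amounts to a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the function name and the example event names are hypothetical, and it handles only “occurs after”, not the other interval relations.

```python
from graphlib import TopologicalSorter


def order_situations(occurs_after):
    """Order situations from (later, earlier) pairs meaning 'later occurs after earlier'."""
    graph = {}
    for later, earlier in occurs_after:
        graph.setdefault(later, set()).add(earlier)  # earlier must come first
        graph.setdefault(earlier, set())
    return list(TopologicalSorter(graph).static_order())


relations = [("utterance", "conduct situation"), ("report", "utterance")]
print(order_situations(relations))
```

If the relations are contradictory (a cycle), `static_order` raises `graphlib.CycleError`, which could be used to flag inconsistent extracted facts.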
Verb/Noun dependency fragments
• Packaging up all of the grammatical “positions” from the parse tree into a single structure:
  • specifier
  • complement
  • modifier
• For nouns:
    the noun dependency fragment #1
      has the singular noun |lack_NN| as noun and
      has the prepositional phrase #2 as first complement phrase.
• For verbs:
    the verb dependency fragment #3
      has the gerund participle |conducting_VBG| as verb and
      has the noun phrase #208_NP as subject noun phrase and
      has the verb phrase #210_VP as verb phrase and
      has the noun phrase #211_NP as first object noun phrase and
      has the prepositional phrase #218_PP as first complement prepositional phrase.
Is it better to just have “first complement phrase” etc?
Using WordNet
• Generate list of nouns and their inflections:
    the plural noun |surveys_NNS| is an inflection of the singular noun |survey_NN|.
    the singular noun |survey_NN| is an inflection of the singular noun |survey_NN|.
• Create links to noun sense:
    the singular noun |survey_NN| is a form of the noun sense |SURVEY_N_1|.
• Link from noun sense to conceptual model must be done by user:
    the noun sense |SURVEY_N_1| expresses the entity concept ‘survey’.
• Analyst Helper can suggest links based on NPs with missing semantics
  • but which sense? could use the gloss to tell the user which is which
  • work done in the past to construct synsets, hyponyms etc, to help better suggestions via Analyst Helper
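Generating the lexicon sentences for a noun can be sketched as simple string templating. This Python function is hypothetical and takes the singular and plural forms as inputs rather than querying WordNet; the sense number is also supplied by the caller. It emits the first three sentence types from the slide and leaves the “expresses” link to the user, as the slide requires.

```python
def noun_lexicon_sentences(singular, plural, sense_number=1):
    """Build CE sentences linking a noun's inflections and its noun sense."""
    base = f"the singular noun |{singular}_NN|"
    sense = f"the noun sense |{singular.upper()}_N_{sense_number}|"
    return [
        f"the plural noun |{plural}_NNS| is an inflection of {base}.",
        f"{base} is an inflection of {base}.",
        f"{base} is a form of {sense}.",
    ]


for sentence in noun_lexicon_sentences("survey", "surveys"):
    print(sentence)
```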
Using VerbNet
• Generate list of verbs and their inflections:
    the past tense verb |conducted_VBD| is an inflection of the base form verb |conduct_VB|.
    …
• Create links to verb sense:
    the base form verb |conduct_VB| is a form of the verb sense |CONDUCT_V_1|.
• Link from verb sense to conceptual model must be done by user:
    the verb sense |CONDUCT_V_1| expresses the entity concept ‘conduct situation’.
• Analyst Helper can suggest links based on VPs with missing semantics
VerbNet for computing roles
• We can compute situation ROLES from information in VerbNet together with parse structure
• Use VerbNet to provide the grammatical patterns and roles for each verb (for each grammatical “position” define the type and the role):
    the verbnet frame 'hit-18.1_1'
      has 'NP V NP' as grammatical description and
      has the verbnet noun pattern 'NP' as specifier pattern and
      has the attribute concept 'agent role' as specifier role and
      has the verbnet noun pattern 'NP' as first complement pattern and
      has the attribute concept 'patient role' as first complement role.
• For example, the verb “hit” in one sense:
    the verb sense |HIT_V_1| is constrained by the verbnet frame ‘hit-18.1_1’.
• Use this to calculate the roles for things described in parse trees, based upon the verb.
Steps
• Grammatical patterns
    rules to turn specific parse tree patterns into “fragments”, eg a NP_V_NP fragment
• Finding the Verb Sense for a given verb
    the base form verb V expresses the verb sense VS
• VerbNet Frames
    the verbnet frame 'hit-18.1_1'
      has 'NP V NP' as grammatical description and
      has the verbnet noun pattern 'NP' as specifier pattern and
      has the attribute concept 'agent role' as specifier role and
      has the verbnet noun pattern 'NP' as first complement pattern and
      has the attribute concept 'patient role' as first complement role.
• Linking Verb Sense to VerbNet Frame
    the verb sense |HIT_V_1| is constrained by the verbnet frame ‘hit-18.1_1’.
• Mapping from Verb Sense to situation type
    the verb sense VS expresses the entity concept ‘XXX’
• Mapping to situation attributes and relations (just the template, to be overridden by domain user)
    the hit situation S
      agentifies the entity concept ‘hitter’ and
      patientifies the entity concept ‘thing hit’ and
      is viewed relationally as the relation concept ‘hits’.
RESULT
    the thing T1 hits the thing T2.
    the hit situation S1 has the thing T1 as hitter and has the thing T2 as thing hit.
“ifications” open to discussion!
In more detail
(Parse tree nodes: SP, NP, VP, V, NP; verb: hit.)
Fragment patterns in rules:
    the verb dependency fragment f1
      has the past tense verb |hit_VBD| as verb and
      has the noun phrase NP1 as subject noun phrase and
      has the verb phrase VP as verb phrase and
      has the noun phrase NP2 as first object noun phrase.
    the past tense verb |hit_VBD| is an inflection of the verb |hit_VB|.
    the verb |hit_VB| is a form of the verb sense |HIT_V_1|.
    the noun phrase NP1 stands for the thing T1.
    the noun phrase NP2 stands for the thing T2.
    the verb phrase VP stands for the situation S1.
    the verb sense |HIT_V_1| is constrained by the verbnet frame ‘hit-18.1_1’.
LEXICAL CURVE: matching the grammatical “positions” against the verbnet frame patterns
    the verbnet frame ‘hit-18.1_1’
      has the verbnet pattern ‘NP’ as specifier pattern and
      has ‘agent role’ as specifier role and
      has the verbnet noun pattern 'NP' as first complement pattern and
      has ‘patient role’ as first complement role.
    the situation S1 has the thing T1 as agent role and has the thing T2 as patient role.
    the noun sense |HIT_N_1| nominalises the verb sense |HIT_V_1|.
DOMAIN CURVE: specialising the roles into domain specific roles and relations
    the noun sense |HIT_N_1| expresses the entity concept ‘hit situation’.
    the hit situation S
      agentifies the entity concept ‘hitter’ and
      patientifies the entity concept ‘thing hit’ and
      is viewed relationally as the relation concept ‘hits’.
    the hit situation S1 has the thing T1 as hitter and has the thing T2 as thing hit.
    the thing T1 hits the thing T2.
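The lexical curve step, matching the grammatical “positions” of a fragment against a VerbNet frame to obtain generic roles, can be sketched as follows. This is a Python illustration with hypothetical data structures (a frame as a dictionary, a fragment as a mapping from position to phrase type and thing); it is not how the CE rules are executed.

```python
# The frame content follows the slide; the dictionary layout is an assumption.
HIT_FRAME = {
    "name": "hit-18.1_1",
    "description": "NP V NP",
    "specifier": ("NP", "agent role"),
    "first complement": ("NP", "patient role"),
}


def assign_roles(fragment, frame):
    """Match fragment positions against the frame; return role -> thing, or None.

    fragment maps a position name to (phrase type, thing the phrase stands for).
    """
    roles = {}
    for position in ("specifier", "first complement"):
        if position not in frame:
            continue
        pattern, role = frame[position]
        filler = fragment.get(position)
        if filler is None or filler[0] != pattern:
            return None  # the fragment does not fit this frame
        roles[role] = filler[1]
    return roles


fragment = {"specifier": ("NP", "T1"), "first complement": ("NP", "T2")}
print(assign_roles(fragment, HIT_FRAME))
```

Returning `None` for a non-matching fragment leaves room for trying alternative frames of the same verb sense.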
Sample Rules – Lexical Curve
Find a fragment with a subject from an active voice verb chain in the context of a sentence phrase:
    if ( the sentence phrase SP has the noun phrase NP as dependent and has the verb phrase VP as head ) and
       ( the verb phrase chain VBC
         has the verb phrase VP as first phrase and
         has the verb phrase VP1 as last phrase and
         has the verb V as main verb and
         has the voice category 'active' as feature )
    then
       ( there is a verb dependency fragment named F that
         has the verb V as verb and
         has the noun phrase NP as subject noun phrase and
         has the verb phrase VP1 as verb phrase and
         has the voice category 'active' as feature ).
For a given active verb fragment, get a role of a situation from the verb sense associated with the subject verb phrase:
    if ( the verb dependency fragment F
         has the noun phrase NP as subject noun phrase and
         has the verb phrase VP as verb phrase and
         has the voice category 'active' as feature ) and
       ( the noun phrase NP stands for the thing T ) and
       ( the verb phrase VP is associated with the verb sense VS and stands for the situation S ) and
       ( the verb sense VS is constrained by the verbnet frame VNF ) and
       ( the verbnet frame VNF has the verbnet pattern 'NP' as specifier pattern and has the value ROLE as specifier role )
    then
       ( the situation S has the thing T as #ROLE ).
Sample Rule – Domain Curve
Apply the domain specific role name to the agent of a situation:
    if ( the situation S is an #EC and has the thing A as agent role ) and
       ( the situation concept EC agentifies the attribute concept AC )
    then
       ( the situation S has the thing A as #AC ).
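The domain curve renames generic roles into domain specific ones and produces the relational view. A minimal Python sketch, assuming a hypothetical template table keyed by situation type (the ‘hit situation’ entries follow the slides):

```python
# Template for 'hit situation' as given on the slides; the table layout is an assumption.
DOMAIN_TEMPLATES = {
    "hit situation": {
        "agent role": "hitter",      # agentifies
        "patient role": "thing hit",  # patientifies
        "relation": "hits",          # is viewed relationally as
    },
}


def specialise(situation_type, generic_roles):
    """Rename generic roles to domain roles and build the relational CE sentence."""
    template = DOMAIN_TEMPLATES[situation_type]
    domain_roles = {
        template[role]: thing
        for role, thing in generic_roles.items()
        if role in template
    }
    relation = None
    if "agent role" in generic_roles and "patient role" in generic_roles:
        relation = (
            f"the thing {generic_roles['agent role']} {template['relation']} "
            f"the thing {generic_roles['patient role']}."
        )
    return domain_roles, relation


print(specialise("hit situation", {"agent role": "T1", "patient role": "T2"}))
```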
Issue
• Some verbs (eg “hit”) have different patterns in VerbNet:
  • some are grammatical alternatives:
    • John hit the dog
    • John hit the dog with the stick
  • some are under different senses:
    • John hit the dog (filed under “hit”)
    • John hit the hill (filed under “reach”)
• Do we:
  • just encode the basic ones for now
  • handle this using assumptions?
  • add selectional restrictions to remove/reduce alternatives?
    • eg agent of the sense “|HIT_V_1|” must be volitional
• Need to involve both the lexicon and the domain in applying selectional restrictions
• We can extract more from VerbNet, eg selectional restrictions, event semantics
This is the current approach, although other work has been reported to suggest how assumptions could be used.
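Using selectional restrictions to reduce alternative senses can be sketched as a filter over candidate senses. Everything in this Python example is hypothetical apart from the restriction that the agent of |HIT_V_1| must be volitional: the second sense, the entity properties and the function name are assumptions made for illustration.

```python
SENSE_RESTRICTIONS = {
    "HIT_V_1": {"agent role": "volitional"},  # from the slide
    "HIT_V_2": {},                            # hypothetical sense with no restrictions
}

# Assumed domain knowledge about entities.
ENTITY_PROPERTIES = {"John": {"volitional"}, "the storm": set()}


def plausible_senses(candidates, roles):
    """Keep the senses whose role restrictions are met by the role fillers."""
    kept = []
    for sense in candidates:
        restrictions = SENSE_RESTRICTIONS.get(sense, {})
        if all(
            required in ENTITY_PROPERTIES.get(roles.get(role), set())
            for role, required in restrictions.items()
        ):
            kept.append(sense)
    return kept


print(plausible_senses(["HIT_V_1", "HIT_V_2"], {"agent role": "the storm"}))
```

The split between `SENSE_RESTRICTIONS` and `ENTITY_PROPERTIES` reflects the point that both the lexicon and the domain are needed to apply a restriction.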
Sample Sentence
HTT are conducting surveys in Adhamiya to judge the level of support for Bath’est return.
    the organisation #64 known as |Htt| conducts the survey #59.
• the organisation #64 known as |Htt|: from reference entity
• conducts: from verbnet NP V NP, roles + domain
• the survey #59: from wordnet + domain
Basic Semantics - Prepositions
Too simplistic!
Not yet handled
• mutual exclusivity of alternatives
• full selectional restrictions
• full set of VerbNet:
  • grammatical patterns
  • selectional restrictions
  • event semantics
• more types of prepositional phrases
• adjectives
• relational clauses, subclauses
Domain Reasoning
• Using the facts extracted to perform reasoning tasks
• This section is provided to show the complete picture of the current work, but was taken from the ACITA12 paper
Domain Semantics
“Communications”
• SYNCOIN reports speak about monitoring communications between people, together with the things that were said:
    02/24/10 - Cell call is monitored between unknown caller (7678112233) in Amin to Amir Mahallati (7115452376) in Bayaa. The unidentified caller stated: "The team is a failure! The carpet doesn't match! The carpet maker needs to be replaced." The recipient said: "The measurements were perfect, the installers must have failed."

    conceptualise a ~ communication ~ C that
      has the agent A as ~ caller ~ and
      has the agent B as ~ recipient ~ and
      has the value D as ~ date ~ and
      has the value T as ~ time ~ and
      has the value V1 as ~ caller utterance ~ and
      has the value V2 as ~ recipient utterance ~.
    conceptualise the communication C
      ~ is from ~ the place FROM and
      ~ is to ~ the place TO.
Specific communications examples
“the thing being “done to” in a communications monitoring is actually a communication”
    if ( the situation S is a communications monitoring and has the thing T as patient role )
    then
       ( the communications monitoring S monitors the communication T ).
“the communication comes from the place where the caller is”
    if ( the communication C has the agent A as caller ) and
       ( the agent A is located in the place P )
    then
       ( the communication C is from the place P ).
Each of these is a simple step, but requires information derived from other steps:
• How do we know the caller agent in a communication?
• How do we know where the agent is located?
No mention of syntax or phrase structure, you don’t need to be a linguist!
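The way these two rules feed on facts derived elsewhere can be sketched as a small forward-chaining loop over (subject, relation, object) triples. This Python code is an illustration only: the triple encoding, the relation strings and the identifier `m1` are assumptions, and the CE Store's actual reasoning is not shown in the source.

```python
def infer(facts):
    """Apply the two communication rules repeatedly until no new facts appear."""
    facts = set(facts)
    while True:
        new = set()
        for subj, rel, obj in facts:
            # Rule 1: the patient of a communications monitoring is the communication monitored.
            if rel == "has as patient role" and (subj, "is a", "communications monitoring") in facts:
                new.add((subj, "monitors", obj))
            # Rule 2: the communication comes from the place where the caller is.
            if rel == "has as caller":
                for subj2, rel2, place in facts:
                    if subj2 == obj and rel2 == "is located in":
                        new.add((subj, "is from", place))
        if new <= facts:
            return facts
        facts |= new


facts = {
    ("m1", "is a", "communications monitoring"),
    ("m1", "has as patient role", "#3"),
    ("#3", "has as caller", "#5"),
    ("#5", "is located in", "Amin"),
}
for fact in sorted(infer(facts) - facts):
    print(fact)
```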
Example communication
    there is a call named '#3' that
      has the agent #5 as caller and
      has the agent #15 known as |Amir Mahallati| as recipient and
      has '02/24/10' as date and
      has 'The team is a failure! The carpet doesn\'t match! The carpet maker needs to be replaced.' as caller utterance and
      has 'The measurements were perfect, the installers must have failed.' as recipient utterance and
      is from the place #13 known as |Amin| and
      is to the place #21 known as |Bayaa|.
CE Store can display communications on maps, from CE extracted facts.
The analyst has a new idea!
Why don’t we look for “back to back communications with same middle man”? Are they passing on the same information to each other? Perhaps we can track linkages between groups?
(Diagram: first caller, First communication, middle man, Second communication, second recipient.)
Analyst creates a new concept:
    conceptualise a ~ replay conversation ~ RC that is a thing and
      has the agent A as ~ first caller ~ and
      has the agent B as ~ middle man ~ and
      has the agent C as ~ second recipient ~ and
      has the communication COM1 as ~ first communication ~ and
      has the communication COM2 as ~ second communication ~ and
      has the sequence ( the value ~ first communication ~ and the value ~ second communication ~ ) as identifier.
Analyst creates a new rule to define them:
    if ( there is a communication COM1 that has the agent CALLER1 as caller and has the agent RECIPIENT1 as recipient ) and
       ( there is a communication COM2 that has the agent CALLER2 as caller and has the agent RECIPIENT2 as recipient ) and
       ( the agent RECIPIENT1 is the same as the agent CALLER2 )
    then
       ( there is a replay conversation named RC that
         has the communication COM1 as first communication and
         has the communication COM2 as second communication and
         has the agent CALLER1 as first caller and
         has the agent RECIPIENT1 as middle man and
         has the agent RECIPIENT2 as second recipient ).
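The analyst's rule is essentially a join of communications on “recipient of the first is caller of the second”. A minimal Python sketch with hypothetical data and names; it adds one assumption not in the CE rule, namely that a communication is never paired with itself.

```python
def find_replay_conversations(communications):
    """Pair communications where the first recipient is the second caller.

    communications maps a name to {"caller": ..., "recipient": ...}.
    """
    found = []
    for name1, com1 in communications.items():
        for name2, com2 in communications.items():
            if name1 == name2:
                continue  # assumption: do not pair a communication with itself
            if com1["recipient"] == com2["caller"]:
                found.append({
                    "first communication": name1,
                    "second communication": name2,
                    "first caller": com1["caller"],
                    "middle man": com1["recipient"],
                    "second recipient": com2["recipient"],
                })
    return found


# Hypothetical communications between agents A, B and C.
calls = {
    "#3": {"caller": "A", "recipient": "B"},
    "#4": {"caller": "B", "recipient": "C"},
}
print(find_replay_conversations(calls))
```

The rule as written does not constrain the order in time of the two communications; a temporal condition could be added using the timing constraints described earlier.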
Exploring the analyst’s concepts
• “add the utterances of the caller and the recipient to see what they are saying”
  • Looks like some form of code: carpet = device?
• Now we can create the concept of a code with code words, and rules to infer information:
  • Communications contain code words
• More possibilities:
  • Showing the spread of codewords on a map
  • Use of codes to suggest organisational structure
Holistic View of Rules
Multiple skills of language, semantics, and domain specific reasoning are involved.
Extensions to CE
• Need to extend the CE language for greater expressibility:
  • adjectives
  • prepositions
  • more readable names in context
  • …
We can use “linguistic frames” to define CE extensions.
Investigating “Linguistic Frames”
• A linguistic frame is a CE-based structure that defines a step in the NL processing by specifying:
  • the syntax pattern
  • the preconditions/constraints
  • the resulting semantics
• By applying all of the steps to a sentence we can generate a parse tree and construct the semantics of the sentence:
  • semantics of one part is composed from the semantics of its subcomponents
  • using a chart parser
• This can be applied to parsing both CE and NL
  • can define CE extensions
(Diagram: a “language complexity” scale showing Basic CE, Extended CE and Natural Language.)
Structure of Linguistic Frame
    there is a linguistic frame named F that
      defines the <phrase type> S and
      has the sequence SEQ as syntax and
      has the statement that CE_STATEMENT as preconditions and
      has the statement that CE_STATEMENT as semantics and
      has the thing T as portal variable.
• defines: the frame finds an instance of this component in the sentence
• syntax: WHEN this sequence of words and phrases is present in the sentence
• preconditions: AND these statements are true
• semantics: IN WHICH CASE these statements define the semantics of the component
• portal variable: and the result is passed back up the tree through this variable
Defining Basic CE in linguistic frames!
    there is a linguistic frame named np1 that
      defines the noun phrase NP and
      has the sequence ( the determiner '|the_DT|' , the singular noun COMMON , and the proper noun PN ) as syntax and
      has the statement that
        ( the proper noun PN has the value NAME as written form ) and
        ( the singular noun COMMON is a form of the noun sense NS ) and
        ( the noun sense NS expresses the entity concept EC )
      as preconditions and
      has the statement that
        ( there is a thing named NAME that is an #EC )
      as semantics and
      has the thing NAME as portal variable.
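Applying the np1 frame to a tagged phrase can be sketched as: check the syntax, check the preconditions against the lexicon, then emit the semantics and the portal value. This Python function is a hand-coded stand-in for one frame, not the chart parser; the lexicon entries for “place” and the function name are assumptions.

```python
# Hypothetical lexicon entries; 'place' is an assumed entity concept.
FORM_OF = {"place_NN": "PLACE_N_1"}   # singular noun -> noun sense
EXPRESSES = {"PLACE_N_1": "place"}    # noun sense -> entity concept


def apply_np1(tokens):
    """Apply the np1 frame to [(written form, POS tag), ...].

    Returns (CE semantics sentence, portal value), or None if the frame does not apply.
    """
    if len(tokens) != 3:
        return None
    (w1, t1), (w2, t2), (w3, t3) = tokens
    # syntax: the determiner |the_DT|, a singular noun, a proper noun
    if (w1.lower(), t1) != ("the", "DT") or t2 != "NN" or t3 != "NNP":
        return None
    # preconditions: the noun has a sense, and the sense expresses an entity concept
    sense = FORM_OF.get(f"{w2}_{t2}")
    concept = EXPRESSES.get(sense)
    if concept is None:
        return None
    name = w3  # written form of the proper noun
    article = "an" if concept[0] in "aeiou" else "a"
    # semantics, with NAME as the portal variable passed back up the tree
    return f"there is a thing named {name} that is {article} {concept}.", name


print(apply_np1([("the", "DT"), ("place", "NN"), ("Bayaa", "NNP")]))
```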
Extensions to CE can be defined…
Intuition: we can represent adjectives as “XXX thing”, eg convert “John is red” to “John is a red thing”
    there is a linguistic frame named adjective_predicate that
      defines the verb phrase VP and
      has the sequence ( the present third singular verb '|is_VBZ|' , and the adjective ADJ ) as syntax and
      has the statement that
        ( the adjective ADJ has the value WF as written form ) and
        ( the value ECTERM = the value WF <> the constant ‘ thing’) and
        ( the entity concept EC has the value ECTERM as concept name )
      as preconditions and
      has the statement that
        ( the thing X is a #EC )
      as semantics and
      has the thing X as portal variable.
The ECTERM precondition is a clunky way to say that the new concept is called, for example, ‘red thing’.
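The adjective_predicate frame can be sketched in the same style: the concept name is built by appending “ thing” to the adjective's written form, and the frame applies only if a concept of that name exists. This Python function and the set of known concepts are hypothetical; JJ is the Penn Treebank tag for adjectives.

```python
def apply_adjective_predicate(tokens, known_concepts, subject="X"):
    """Apply the adjective_predicate frame to [(written form, POS tag), ...].

    Returns the CE semantics sentence, or None if the frame does not apply.
    """
    if len(tokens) != 2:
        return None
    (w1, t1), (w2, t2) = tokens
    # syntax: the verb |is_VBZ| followed by an adjective
    if (w1, t1) != ("is", "VBZ") or t2 != "JJ":
        return None
    # preconditions: ECTERM = WF <> ' thing', and a concept of that name exists
    concept_name = w2 + " thing"
    if concept_name not in known_concepts:
        return None
    return f"the thing {subject} is a {concept_name}."


print(apply_adjective_predicate([("is", "VBZ"), ("red", "JJ")], {"red thing"}, subject="John"))
```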