290 likes | 389 Views
BURC: B ootstrapping U sing R esearch C yc. By Kino Coursey. Introduction to the Problem. Goal: To extend Cyc’s knowledge base using “relationships implied to be possible, normal or commonplace in the world” Prior work with Cyc knowledge entry has been manually oriented
E N D
BURC: Bootstrapping Using ResearchCyc By Kino Coursey
Introduction to the Problem • Goal: To extend Cyc’s knowledge base using “relationships implied to be possible, normal or commonplace in the world” • Prior work with Cyc knowledge entry has been manually oriented • How will we collect commonsense without a body and manual labor…? • Read, Parse, Mine! • Proposal: Read text, Parse into a database, Extract relations between words, Propose hypothetical relations between concepts
Basic Analogy • The Shotgun approach to the Human Genome • Extract millions of fragments then knit them back together by finding commonalities • Will it work for the Human Menome?
What is Cyc? • “the world's largest and most complete general knowledge base and commonsense reasoning engine” • Started in mid 1980’s (“should take only 10 years….”) • Logic Based • LISP oriented • For WordNet users, each Concept ≈ Synset • Available from http://www.opencyc.org http://researchcyc.cyc.com • Big (ResearchCyc v0.8) • Constants 89,379 • Assertions 968,985 • Deduction 361,185 • Sample Collection Extents • EnglishWord 18,007 • Event 6,050 • PartiallyTangible 24,387 • Microtheory 1,688
Collection : Finger GAF Arg : 1 Mt : UniversalVocabularyMtisa : AnimalBodyPartType quotedIsa : DensoOntologyConstant genls : Digit-AnatomicalPart comment : "The collection of all digits of all Hands (q.v.). Fingers are (typically) flexibly jointed and are necessary to enabling the hand (and its owner) to perform grasping and manipulation actions." Mt : BaseKBdefiningMt : AnimalPhysiologyVocabularyMt Mt : AnimalPhysiologyMtproperPhysicalPartTypes : Fingernail Mt : WordNetMappingMt(synonymousExternalConceptFingerWordNet-Version2_0 "N05247839")(synonymousExternalConceptFingerWordNet-1997Version "N04312497") GAF Arg : 2 Mt : UniversalVocabularyMt(genlsLittleFingerFinger)(genlsIndexFingerFinger)(genlsThumbFinger)(genlsRingFingerFinger)(genlsMiddleFingerFinger) Mt : HumanActivitiesMt(bodyPartsUsed-TypeTypeTypingFinger) Mt : HumanSocialLifeMt(bodyPartsUsed-TypeTypePointingAFingerFinger) Example of what Cyc currently knows about fingers
Mt : AnimalPhysiologyMt -(conceptuallyRelatedFingernailFinger)(properPhysicalPartTypesHandFinger)(relationAllInstanceageFinger(YearsDuration 0 200))(relationAllInstancewidthOfObjectFinger(Meter 0.001 0.2))(relationAllInstanceheightOfObjectFinger(Meter 0.001 0.2))(relationAllInstancelengthOfObjectFinger(Meter 0.01 0.5))(relationAllInstancemassOfObjectFinger(Kilogram 0.001 1)) GAF Arg : 3 Mt : HumanPhysiologyMt(relationAllExistsanatomicalPartsHomoSapiensFinger) Mt : VertebratePhysiologyMt(relationAllExistsCountphysicalPartsHandFinger 5) Mt : UniversalVocabularyMt(relationAllOnlywornOnRing-JewelryFinger) Mt : AnimalPhysiologyMt(relationExistsAllphysicalPartsHandFinger) GAF Arg : 4 Mt : GeneralEnglishMt(denotationFinger-TheWordCountNoun 0 Finger) Example of what Cyc currently knows about fingers - 2
Bootstrapping with ResearchCyc • Cyc has vocabulary about objects in the world and relationships • Cyc could still use more common relationships • BURC uses what Cyc already has + lots of parsed text to create new Cyc entries for common relationships found in the text • Lenat’s Bootstrap Hypothesis: once Cyc reaches a certain level/scale it can help in its own development and start using NLP to augment its knowledge base • BURC should help test this hypothesis
The BURC Process From seeds…Hypothe-seed’s • Use the link grammar parser for bulk parsing of text, primarily narratives based in ‘worlds like ours’. Other text styles could be included. • Operates in two directions: • Forward from text to CycL • Backwards from existing CycL to the text to find new forward patterns
BURC Process - 2 • Load the link fragments into a database (1 and 2 link fragments), and compute frequency of fragment occurrences. The database will be in a SQL format so multiple queries can be formed dynamically. • Using Cyc knowledge as a starting point (the seeds), extract knowledge for use in Cyc: • Given a set of seed facts in Cyc, identify how those facts are represented as link fragments in the database • Generate conjectures as to new knowledge AND new knowledge extraction patterns using the fragment patterns.
BURC Process - 3 • Use Cyc knowledge directly to conjecture new statements: • Cyc has lexical knowledge, which can be used as templates against the DB to form new statements • For example, common adjectives applied to noun classes • Cyc knows “WhiteColor” and “Blouse” but does not know that white is a common blouse color, although it becomes apparent after reading some text • Optionally, gather supporting background statistics for hypothesis verification using other sources: • Perhaps Google desktop with a larger than fully parsed corpus • Perhaps check against answer extraction engines
KNEXT (KNowledge EXtraction from Text) • Deriving general world knowledge from texts and taxonomies: • http://www.cs.rochester.edu/~schubert/projects/world-knowledge-mining.html • Lenhart K. Schubert and Matthew Tong, "Extracting and evaluating general world knowledge from the Brown Corpus", Proc. of the HLT-NAACL Workshop on Text Meaning, May 31, 2003, Edmonton, Alberta, pp. 7-13. • System extracts commonsense relationships from text • Limited to the pre-parsed Penn Treebank • Generated 117,326 propositions (about 2 per sentence) • About 60% judged reasonable by any given judge
KNEXT (Example) (BLANCHE KNEW 0 SOMETHING MUST BE CAUSING STANLEY 'S NEW, STRANGE BEHAVIOR BUT SHE NEVER ONCE CONNECTED IT WITH KITTI WALKER.) A FEMALE-INDIVIDUAL MAY KNOW A PROPOSITION. SOMETHING MAY CAUSE A BEHAVIOR. A MALE-INDIVIDUAL MAY HAVE A BEHAVIOR. A BEHAVIOR CAN BE NEW. A BEHAVIOR CAN BE STRANGE. A FEMALE-INDIVIDUAL MAY CONNECT A THING-REFERRED-TO WITH A FEMALE-INDIVIDUAL. ((:I (:Q DET FEMALE-INDIVIDUAL) KNOW[V] (:Q DET PROPOS)) (:I (:F K SOMETHING[N]) CAUSE[V] (:Q THE BEHAVIOR[N])) (:I (:Q DET MALE-INDIVIDUAL) HAVE[V] (:Q DET BEHAVIOR[N])) (:I (:Q DET BEHAVIOR[N]) NEW[A]) (:I (:Q DET BEHAVIOR[N]) STRANGE[A]) (:I (:Q DET FEMALE-INDIVIDUAL) CONNECT[V] (:Q DET THING-REFERRED-TO) (:P WITH[P] (:Q DET FEMALE-INDIVIDUAL))))
Other Extraction Pattern Research • Towards Terascale Knowledge Acquisition (Pantel, Ravichandran and Hovy, 2004) • Learning Surface Text Patterns for a Question Answering System (Ravichandran & Hovy, 2002) • Defined Pattern Precision P = Ca/Co Ca = total number of patterns with answer term present Co = Total number of patterns with any term present • DIRT – Discovery of Inference Rules from Text (Lin & Pantel, 2001)
Other Lexical Knowledge Research • VerbOcean (Chklovski & Pantel): Collecting pairs and searching to verify relationships • Lexical Acquisition via Constraint Solving (Pedersen & Chen): Acquiring syntactic and semantic classification rules of unknown words for LGP • Information Extraction Using Link Grammar papers • Automatic Meaning Discovery Using Google
The General Backwards Model • Given some Cyc relation Pred(?X,?Y) • Create SQL search query • Lookup in Cyc lexical entries for X & Y LX, LY • Select * from LGPTable where Term1="<LX>" and Term3="<LY>“ • System returns records [LX | Link1 | Term2 | Link2 | LY] (Freq) • Generate new hypothetical extraction patterns • Select * from LGPTable where Link1="<L1>" and Link2="<L2>" and Term2="<T2>“ • [* L1 T2 L2 *] generate hypothetical record ( Pred |?S1|?S3 ) • Frequency information is propagated forward
The General Backwards Model - 2 • Optional: Search Cyc for ?PRED (X,Y) and use the set to form a local ambiguity class to reduce search labor and identify ambiguity. One rule multiple relations. • Stored as “SQLTemplate \ Pattern \ Pred1/Pred2/…/PRedN” • Need to explore (canidateBinaryPred ARG1 ARG2 RELN) • Optional: Form more specific patterns for Pred(X,_) and Pred(_,Y)
Update the LGParser’s CycL Rules • There are rules for translation of LGP output into CycL • If the frequency information warrants it then we can generate new LGP rules • Results in expanded parser precision <rule> <pattern>* {Link1} {Term2} {Link2}*</pattern> <define>?ITEM%r </define> <body>(#$is-node ?ITEM%r "%R")</body> <define>?ITEM%l </define> <body>(#$is-node ?ITEM%l "%R")</body> <body>({?PRED1} ?ITEM%l ?TERM%r)</body> ... <body>({?PREDN} ?ITEM%l ?TERM%r)</body> </rule>
Forward Mining Adjective Relations • There are 1941 GAF’s on adjSemTrans, the primary lexical adjective predicate • Find applicable fragments and use definitions: • “Select * from LGPTable Where NumLinks=1 and Link1='a' and Term1 like '%.a' and Term2 like '%.n‘ ” • Returns records [Term1.a | a | Term2.n] • Potentially test using either an internal or search engine based relevancy metric • Query Cyc for “(adjSemTrans <term1>-TheWord ?N RegularAdjFrame (?Pred :NOUN ?Val))” • Generate (plausiblePredValOFType <term2> <?Pred> <?Val>) • Possibly generate parsing rule
Mining Adjective Knowledge Example • “white blouse” as factoid • [white.a | a | blouse.n] • Potentially test using an internal or search engine relevancy metric [GC=70400] • (adjSemTrans White-TheWord 11 RegularAdjFrame (mainColorOfObject :NOUN WhiteColor)) • Hypothesis: (plausiblePredValueOfType Blouse mainColorOfObject WhiteColor)
Update the LGParser’s CycL Rules - 2 • There are rules for translation of LGP output into CycL • We can use the adjSemTrans data to generate new translation rules • Results in expanded parser precision <rule> <pattern> {Term1.a} a *</pattern> <define>?ITEM%r </define> <body>(#$is-node ?ITEM%r "%R")</body> <define>?ITEM%l </define> <body>({?PRED} ?ITEM%r {?VAL})</body> </rule> <rule> <pattern> white.a a *</pattern> <define>?ITEM%r </define> <body>(#$is-node ?ITEM%r "%R")</body> <define>?ITEM%l </define> <body>(mainColorOfObject ?ITEM%r WhiteColor)</body> </rule>
000010:(#$plausiblePredValueOfType #$Finger #$feelsSensation (#$PositiveAmountFn #$LevelOfSoreness)) 000037:(#$plausiblePredValueOfType #$Finger #$forceCapacity #$Strong) 000025:(#$plausiblePredValueOfType #$Finger #$forceCapacity #$Strong) 000025:(#$plausiblePredValueOfType #$Finger #$hardnessOfObject #$Hard) 000037:(#$plausiblePredValueOfType #$Finger #$hardnessOfObject (#$MediumToVeryHighAmountFn #$Hardness)) 000037:(#$plausiblePredValueOfType #$Finger #$hardnessOfObject (#$MediumToVeryHighAmountFn #$Hardness)) 000002:(#$plausiblePredValueOfType #$Finger #$hasEvaluativeQuantity (#$MediumToVeryHighAmountFn #$Goodness-Generic)) 000002:(#$plausiblePredValueOfType #$Finger #$hasPhysicalAttractiveness #$GoodLooking) 000047:(#$plausiblePredValueOfType #$Finger #$isa (#$LeftObjectOfPairFn :REPLACE)) 000015:(#$plausiblePredValueOfType #$Finger #$isa (#$RightObjectOfPairFn :REPLACE)) 000155:(#$plausiblePredValueOfType #$Finger #$lengthOfObject (#$RelativeGenericValueFn #$lengthOfObject :REPLACE #$highAmountOf)) 000155:(#$plausiblePredValueOfType #$Finger #$lengthOfObject (#$RelativeGenericValueFn #$lengthOfObject :REPLACE #$highToVeryHighAmountOf)) 000003:(#$plausiblePredValueOfType #$Finger #$mainColorOfObject #$BlackColor) 000010:(#$plausiblePredValueOfType #$Finger #$mainColorOfObject #$LightYellowishBrown-Color) 000010:(#$plausiblePredValueOfType #$Finger #$mainColorOfObject #$ModerateYellowishBrown-Color) 000010:(#$plausiblePredValueOfType #$Finger #$mainColorOfObject #$SunTan-FleshColor) 000002:(#$plausiblePredValueOfType #$Finger #$possessiveRelation #$SuddenChange) Mined Finger Descriptions
Mined Finger Descriptions 000006:(#$plausiblePredValueOfType #$Finger #$possessiveRelation (#$HighAmountFn #$Speed)) 000094:(#$plausiblePredValueOfType #$Finger #$rigidityOfObject (#$HighAmountFn #$Rigidity)) 000060:(#$plausiblePredValueOfType #$Finger #$sizeParameterOfObject (#$RelativeGenericValueFn #$sizeParameterOfObject :REPLACE #$highAmountOf)) 000052:(#$plausiblePredValueOfType #$Finger #$sizeParameterOfObject (#$RelativeGenericValueFn #$sizeParameterOfObject :REPLACE #$highToVeryHighAmountOf)) 000060:(#$plausiblePredValueOfType #$Finger #$sizeParameterOfObject (#$RelativeGenericValueFn #$sizeParameterOfObject :REPLACE #$highToVeryHighAmountOf)) 000285:(#$plausiblePredValueOfType #$Finger #$sizeParameterOfObject (#$RelativeGenericValueFn #$sizeParameterOfObject :REPLACE #$veryLowToLowAmountOf)) 000074:(#$plausiblePredValueOfType #$Finger #$sizeParameterOfObject (#$RelativeGenericValueFn #$sizeParameterOfObject :REPLACE #$veryLowToLowAmountOf)) 000029:(#$plausiblePredValueOfType #$Finger #$speedOfObject-Underspecified (#$LowAmountFn #$Speed)) 000138:(#$plausiblePredValueOfType #$Finger #$surfaceFeatureOfObj #$Slippery) 000074:(#$plausiblePredValueOfType #$Finger #$temperatureOfObject #$Warm) 000004:(#$plausiblePredValueOfType #$Finger #$textureOfObject #$Rough) 000168:(#$plausiblePredValueOfType #$Finger #$thicknessOfObject (#$RelativeGenericValueFn #$thicknessOfObject :REPLACE #$highAmountOf)) 000168:(#$plausiblePredValueOfType #$Finger #$thicknessOfObject (#$RelativeGenericValueFn #$thicknessOfObject :REPLACE #$highToVeryHighAmountOf)) 000182:(#$plausiblePredValueOfType #$Finger #$wetnessOfObject #$Wet)
Verb Semantic Filtering -1Discovering what a finger can do… • A similar process can be used finding information based on verb semantic parsing frames • For each potential <NOUNWORD>-<VERB> pair query Cyc to find basic relationships using the verb semantic templates (#$and (#$denotation <NOUNWORD> ?NOUNTYPE ?N ?CYCTERM) (#$wordForms ?WORD ?PRED ""<VERB>"") (#$speechPartPreds ?POS ?PRED) (#$semTransPredForPOS ?POS ?SEMTRANSPRED) (?SEMTRANSPRED ?WORD ?NUM ?FRAME ?TEMPLATE)) • Verify for each potential relationship (<SPRED> <VERTERM> <CYCTERM>) derivable from ?TEMPLATE that it makes sense in the ontology (#$and (#$arg1Isa <SPRED> ?VTYP) (#$arg2Isa <SPRED> ?CTYP) (#$genls <CYCTERM> ?CTYP) (#$genls <VERBTERM> ?VTYP) )
Verb Semantic Filtering -2Templates of Movement… (verbSemTransMove-TheWord 0 IntransitiveVerbFrame (and (isa :ACTION MovementEvent) (primaryObjectMoving :ACTION :SUBJECT))) (verbSemTransMove-TheWord 1 IntransitiveVerbFrame (and (isa :ACTION ChangeOfResidence) (performedBy :ACTION :SUBJECT))) (verbSemTransMove-TheWord 2 TransitiveNPFrame (and (isa :ACTION CausingAnotherObjectsTranslationalMotion) (objectActedOn :ACTION :OBJECT) (doneBy :ACTION :SUBJECT))) (arg1IsaperformedByAction) (arg2IsaperformedByAgent-Generic)
Verb Semantic Filtering - 3 • BURC can use Cyc’s knowledge of what things can perform what actions or have what attributes to filter out implausible relationships. (#$behaviorCapableOf #$Finger #$CausingAnotherObjectsTranslationalMotion #$doneBy) (#$behaviorCapableOf #$Finger #$ChangeOfResidence #$performedBy) (#$behaviorCapableOf #$Finger #$Inspecting #$performedBy) (#$behaviorCapableOf #$Finger #$Movement-TranslationEvent #$primaryObjectMoving) (#$behaviorCapableOf #$Finger #$MovementEvent #$primaryObjectMoving) (#$behaviorCapableOf #$Finger #$PushingAnObject #$providerOfMotiveForce) (#$behaviorCapableOf #$Finger #$Sliding-Generic #$objectMoving) (#$behaviorCapableOf #$Finger #$Sliding-Generic #$primaryObjectMoving) (#$behaviorCapableOf #$Finger #$Slipping #$objectMoving) (#$behaviorCapableOf #$Finger #$Slipping #$primaryObjectMoving) • Cyc can help in its own knowledge entry process. 62% of generated hypothesis were filtered out using semantic role filtering.
Other Direct Extraction Rules • Some “underspecified” patterns exist just based on the links • This could be used to extract ConceptNet like output directly from link records • Examples: • [<obj1>|ss|<act>.v|os|<obj2>] capableOf(<obj1>, “<act> <obj2>”) • [<act>.v |os|<obj>] CapableOfReveivingAction(<obj>,<act>) • [<obj>|s*|<act>.v] capableOf(<obj>,<act>)
Quest for Metrics • Percentage of hypothesis that make sense to a panel of judges • Percentages of hypothesis that are already known to Cyc • Percentage of hypothesis that are known in other knowledge sources (WordNet, Sumo/Milo, VerbOcean, MIT OpenMind…) • Number of hypothesis generated vs. number of records • What percentage of relations in Cyc can be found in the fragment pool • The Pattern Precision measure • Maybe compare against KNEXT but need to see if they return real numbers • Unfortunately we don’t know all possible knowledge (otherwise we wouldn’t be doing this), because if we did we could measure recall and precision. • Simple space estimate (2.3K binary predicates * 85K constants * 85K constants = 16.617500 T simple possibilities)
Desired Outputs • Version of link grammar for bulk reading and generating fragments • Database control program to queue texts, monitor their processing, and merge the fragment results • The database of fragments with fragment counts for some corpus • The hypothesis set generated by the system • Optionally an OpenMind / ConceptNet like set of commonsense factoids • Open enough that others could duplicate
Did any of that make sense? • Comments? • Questions? • Suggestions?