Discover the journey and insights behind building Cyc, a system that revolutionized natural language understanding, robotics, learning, and expert systems. Dr. Douglas B. Lenat shares valuable lessons and approaches to advancing knowledge engineering in this informative text. Learn about the evolution of Cyc from the roots of ELIZA to modern systems like ALICE, and delve into the complexities of natural language processing, logic, and data integration. Explore how ontology helps in combining diverse data sources efficiently and uncover the importance of commonsense knowledge in AI development.
CYC: Lessons Learned in Large-Scale Ontological Engineering • Dr. Douglas B. Lenat, 3721 Executive Center Drive, Suite 100, Austin, TX 78731 • Email: Lenat@cyc.com • Phone: (512) 342-4001 • Fax: (512) 342-4040 • 2 July 2005
What Led to Cyc? • The Need: 1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck” (NL understanding, speech understanding, robotics, learning, expert systems, search, …) • The Opportunity: 2. We know enough to do this; it is more an engineering task than a scientific research task. 3. The time was right (1984).
ELIZA (DOCTOR), 1965, Joe Weizenbaum, MIT • Carl Rogers-like reflection. Patient: “I swear a lot.” ELIZA: “How do you feel about the fact that you swear a lot?” • Counts on patient-to-doctor respect/awe/… Patient: “My dog’s mother died recently.” ELIZA: “Tell me more about your mother.” • What the program actually reacts to: “blah blah blah mother blah blah blah”
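The two behaviours on this slide (reflection and keyword triggering) can be sketched in a few lines of Python. This is only an illustration of the idea, not Weizenbaum's program: the tables `REFLECTIONS` and `KEYWORD_PROMPTS` and the function names are assumptions made up for this sketch.

```python
import re

# Pronoun swaps used to "reflect" a statement back (illustrative subset, assumed).
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

# Keywords that trigger a canned prompt, whatever the rest of the sentence says.
KEYWORD_PROMPTS = {"mother": "Tell me more about your mother."}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep only word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())


def reflect(text: str) -> str:
    """Swap first-person words for second-person ones, word by word."""
    return " ".join(REFLECTIONS.get(word, word) for word in tokenize(text))


def respond(statement: str) -> str:
    """Return a keyword prompt if a keyword occurs, otherwise a reflection."""
    words = set(tokenize(statement))
    for keyword, prompt in KEYWORD_PROMPTS.items():
        if keyword in words:
            return prompt
    return f"How do you feel about the fact that {reflect(statement)}?"


if __name__ == "__main__":
    print(respond("I swear a lot."))
    print(respond("My dog's mother died recently."))
```

The second call shows the failure mode the slide points at: the program sees only the keyword "mother", so it never notices that the sentence is about a dog.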
40 years after ELIZA: ALICE • JUDGE11 [9:26:0] I WILL ASK YOU SOME SIMPLE QUESTIONS AND GIVE SIMPLE COMMANDS • PROGRAM [9:26:01] Do you think your plan will succeed? • JUDGE11 [9:26:16] WHAT IS THE COLOR OF A BLUE CAR • PROGRAM [9:26:16] WHAT COLOR IS a blue car • JUDGE11 [9:26:35] WHERE IS SUE’S NOSE WHEN SUE IS IN HER HOUSE • PROGRAM [9:26:35] Where it belongs. Try searching the World Wide Web.
Is Natural Language a Good Representation? • + there’s a lot of it • + everyone understands it • + we can index and search it, rapidly, using keywords: Boolean combinations of keywords; synonyms, hyponyms, hypernyms, … of keywords • - there are a lot of different languages • - meanings vary (era, place, age group, …) • - often the analyst’s query requires finding and combining n pieces of data, by logic or by arithmetic • - can be inefficient
[Family tree diagram. Couples: Ann -- Andy; Carol -- Sam; Bob -- Sara -- John; Fred -- Jane.] Carol and Sam begat Sara and Fred. Fred and Jane begat Ethan, Elaine, and Edward. John and Sara begat Steven, Mary, and Seth. Ann and Andy begat Sue and Bob. But then Sara cleaved not to John and with Bob begat Joan. Is Edward an ancestor or descendant of Sue?
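Once the "begat" sentences are written down as structured facts, the question becomes a simple graph search. A minimal sketch, assuming the facts are transcribed as below; `BEGAT`, `descendants` and `relation` are names invented for this illustration.

```python
# (parents, children) pairs transcribed from the "begat" sentences on the slide.
BEGAT = [
    (("Carol", "Sam"), ["Sara", "Fred"]),
    (("Fred", "Jane"), ["Ethan", "Elaine", "Edward"]),
    (("John", "Sara"), ["Steven", "Mary", "Seth"]),
    (("Ann", "Andy"), ["Sue", "Bob"]),
    (("Sara", "Bob"), ["Joan"]),
]


def descendants(person: str) -> set[str]:
    """Collect every descendant of `person` by walking parent -> child links."""
    children: dict[str, set[str]] = {}
    for parents, kids in BEGAT:
        for parent in parents:
            children.setdefault(parent, set()).update(kids)
    found: set[str] = set()
    stack = [person]
    while stack:
        for kid in children.get(stack.pop(), ()):
            if kid not in found:
                found.add(kid)
                stack.append(kid)
    return found


def relation(a: str, b: str) -> str:
    """Say whether `a` is an ancestor of `b`, a descendant of `b`, or neither."""
    if b in descendants(a):
        return "ancestor"
    if a in descendants(b):
        return "descendant"
    return "neither"


if __name__ == "__main__":
    print("Edward vs. Sue:", relation("Edward", "Sue"))
    print("Ann vs. Joan:", relation("Ann", "Joan"))
```

Under this transcription Edward and Sue sit in unrelated branches, so the answer to the slide's question is "neither"; the point is that a keyword search over the text could not have established that.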
Five friends get together to play 5 doubles matches, with a different group of 4 players each time. The sums of the ages of the players for the different matches are 124, 128, 130, 136 and 142 years. What is the age of the youngest player? Equations: v+w+x+y = 124; v+w+x+z = 128; v+w+y+z = 130; v+x+y+z = 136; w+x+y+z = 142.
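The arithmetic can be worked mechanically. Each player sits out exactly one match, so adding the five equations counts every age four times; subtracting a match sum from the grand total then gives the age of whoever sat out. A short sketch (the function name `player_ages` is an assumption for this illustration):

```python
def player_ages(match_sums: list[int]) -> list[int]:
    """Recover individual ages from the age sums of the n matches.

    With n players and n matches of n-1 players each, every player appears
    in n-1 sums, so sum(match_sums) == (n-1) * (sum of all ages).
    """
    total, remainder = divmod(sum(match_sums), len(match_sums) - 1)
    if remainder:
        raise ValueError("match sums are inconsistent with whole-number ages")
    # The player who sat out a match has age total - (that match's sum).
    return sorted(total - s for s in match_sums)


if __name__ == "__main__":
    ages = player_ages([124, 128, 130, 136, 142])
    print("ages:", ages)
    print("youngest:", ages[0])
```

For the slide's numbers the grand total is 660 / 4 = 165, and the youngest player is the one who sat out the match with the largest sum: 165 - 142 = 23.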
Natural Language Understanding requires having lots of knowledge 1. The pen is in the box. The box is in the pen. 2. The police watched the demonstrators… …because they feared violence. …because they advocated violence. 3. Every American has a mother. Every American has a president.
Natural Language Understanding requires having lots of knowledge 4. Mary and Sue are sisters. Mary and Sue are mothers. 5. The White House announced today that... 6. John saw his brother skiing on TV. The fool… ...didn’t have a coat on! …didn’t recognize him!
Logically and Arithmetically Combining n Pieces of Info. An example: an analyst’s query posed as part of HPKB (1996) that Cyc answered. • Information from multiple sources • Knowledge about the domain in general • Commonsense knowledge about the real world
Logically and Arithmetically Combining n Pieces of Info. (the original dream of Arpanet, EDI, EDR, the Semantic Web, …) • Information from multiple sources • Knowledge about the domain in general • Commonsense knowledge about the real world • Ontology holds the key to doing this! BUT there are so many ways to “cut corners” and unwittingly fool oneself!
Query: “How different in age were Uday and Qusay Hussein?”
Sources in the diagram: FBI Most Wanted, DB4, CATS, NARCL, DB8, CDE, USGS, OFAC
DB4 (Sept. 9, 2003)
 SuspN         | YOB
---------------+------
 Qusay Hussein |
 Uday Hussein  | 1964
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann
--------+---------+-----
 Qusai  | Hussein | 30
 Odai   | Hussein |
Non-ontology-based methods for DB integration are quadratic
Ontology-Based Methods of DB Integration Can Scale Linearly
Sources in the diagram: FBI Most Wanted, DB4, CATS, NARCL, DB8, CDE, USGS, OFAC, each mapped to CYC; other labels in the diagram: you!, HAL
CONCEPTS: #$QusayHusseinAl-Takriti, #$UdaiHusseinAl-Takriti
RULES: (age ?PERSON (YearsDuration ?AGE)), (birthDate ?PERSON ?BIRTH-DATE)
DB4 (Sept. 9, 2003)
 SuspN         | YOB
---------------+------
 Qusay Hussein | 1966
 Uday Hussein  | 1964
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann
--------+---------+-----
 Qusai  | Hussein | 30
 Odai   | Hussein | 32
(…and, by the way, enables DB population/enrichment)
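A rough sketch of how the age query can be answered once both tables are mapped to shared concepts. Assumptions, all made for illustration: `ann` is read as an age in years as of the table's date, a person's birth year is approximated as (table year minus age), the name-variant table `SAME_PERSON` stands in for the ontology mapping, and each table is assumed to start with one known value per person pair.

```python
from datetime import date

# Name variants from both tables mapped to one shared concept (assumed mapping).
SAME_PERSON = {
    "Qusay Hussein": "#$QusayHusseinAl-Takriti",
    "Qusai Hussein": "#$QusayHusseinAl-Takriti",
    "Uday Hussein": "#$UdaiHusseinAl-Takriti",
    "Odai Hussein": "#$UdaiHusseinAl-Takriti",
}

# DB4 stores a year of birth; DB8 stores an age valid on the table's date.
DB4 = {"as_of": date(2003, 9, 9),
       "rows": [{"SuspN": "Qusay Hussein", "YOB": None},
                {"SuspN": "Uday Hussein", "YOB": 1964}]}
DB8 = {"as_of": date(1996, 12, 31),
       "rows": [{"Prenom": "Qusai", "Surnom": "Hussein", "ann": 30},
                {"Prenom": "Odai", "Surnom": "Hussein", "ann": None}]}


def birth_years() -> dict[str, int]:
    """Merge both sources into one concept -> birth-year table."""
    years: dict[str, int] = {}
    for row in DB4["rows"]:
        if row["YOB"] is not None:
            years[SAME_PERSON[row["SuspN"]]] = row["YOB"]
    for row in DB8["rows"]:
        if row["ann"] is not None:
            person = SAME_PERSON[f'{row["Prenom"]} {row["Surnom"]}']
            # Approximation: birth year = year of record - age in years.
            years.setdefault(person, DB8["as_of"].year - row["ann"])
    return years


def age_difference(a: str, b: str) -> int:
    """Absolute difference in birth years between two concepts."""
    years = birth_years()
    return abs(years[a] - years[b])


if __name__ == "__main__":
    print(birth_years())
    print(age_difference("#$QusayHusseinAl-Takriti", "#$UdaiHusseinAl-Takriti"))
```

Neither table alone answers the query; the merged table does, and the derived values are exactly what could be written back into the empty cells (the "DB population/enrichment" remark).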
A Solution that Scales Linearly
Sources in the diagram: FBI Most Wanted, DB4, CATS, NARCL, DB8, CDE, USGS, OFAC
DB4 (Sept. 9, 2003)
 SuspN         | YOB
---------------+------
 Qusay Hussein | 1966
 Uday Hussein  | 1964
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann
--------+---------+-----
 Qusai  | Hussein | 30
 Odai   | Hussein | 32
(…and, by the way, enables DB population/enrichment)
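The quadratic versus linear claim comes down to counting mappings. A small sketch, assuming one translator per unordered pair of sources in the pairwise approach and one mapping per source in the ontology-based approach (the function names are invented here):

```python
def pairwise_translators(n_sources: int) -> int:
    """Translators needed when every pair of sources is bridged directly."""
    return n_sources * (n_sources - 1) // 2


def ontology_mappings(n_sources: int) -> int:
    """Mappings needed when each source is mapped once to a shared ontology."""
    return n_sources


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, pairwise_translators(n), ontology_mappings(n))
```

For the eight sources named in the diagram that is 28 pairwise translators against 8 mappings, and the gap widens with every source added. If translators are one-directional, the pairwise count doubles.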
Query: “What major US cities are particularly vulnerable to an anthrax attack?” The answer is logically implied by data dispersed through several sources: USGS GNIS DB, AMVA KB, UN FAO DB, DTRA CATS DB, RAND R
“What major US cities are particularly vulnerable to an anthrax attack?” • “major US city”: ?C is a U.S. city with >1M population: (> (NumberOfInhabitantsFn ?C) 10^6) • “particularly vulnerable to an anthrax attack”: • the current ambient temperature at ?C is above freezing, and • ?C has more than 100 people for each hospital bed, and • the number of anthrax host animals near ?C exceeds 100k (don’t add #pullets and #chickens: pullets are already counted among the chickens)
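The decomposition above amounts to a conjunction of tests over facts that live in different sources. A sketch of that conjunction, assuming the facts have already been gathered into one record per city; the `City` fields and the placeholder cities are invented for illustration and carry no real data.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    population: int            # e.g. from a gazetteer
    temperature_c: float       # current ambient temperature
    hospital_beds: int
    host_animals_nearby: int   # anthrax host animals, counted without overlap


def is_major(city: City) -> bool:
    """'Major US city': more than one million inhabitants."""
    return city.population > 10**6


def is_vulnerable(city: City) -> bool:
    """All three conditions from the slide must hold."""
    return (city.temperature_c > 0
            and city.population > 100 * city.hospital_beds
            and city.host_animals_nearby > 100_000)


def answer(cities: list[City]) -> list[str]:
    """Names of cities that are both major and particularly vulnerable."""
    return [c.name for c in cities if is_major(c) and is_vulnerable(c)]


if __name__ == "__main__":
    # Placeholder records with made-up values, for illustration only.
    cities = [
        City("CityA", 1_200_000, 18.0, 9_000, 250_000),
        City("CityB", 1_500_000, -4.0, 9_000, 250_000),
        City("CityC", 400_000, 20.0, 2_000, 300_000),
    ]
    print(answer(cities))
```

Writing the bed test as `population > 100 * hospital_beds` avoids a division by zero for a city with no recorded beds.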
USGS GNIS DB
 state | name              | type  | county     | state_fips |
-------+-------------------+-------+------------+------------+
 TX    | Dallas            | ppl   | Dallas     | 48         |
 MN    | Hennepin County   | civil | Hennepin   | 27         |
 CA    | Sacramento County | civil | Sacramento | 6          |
 AZ    | Phoenix           | ppl   | Maricopa   | 4          |

 primary_lat | primary_long | elevation | population | status             |
-------------+--------------+-----------+------------+--------------------+
 32.78333    | -96.8        | 463       | 1022830    | BGN 1978 1959      |
 45.01667    | -93.45       | 0         | 1032431    |                    |
 38.46667    | -121.31667   | 0         | 1041219    |                    |
 33.44833    | -112.07333   | 1072      | 1048949    | BGN 1931 1900 1897 |
The Geographic Names Information System (GNIS) DB maintained by the US Geological Survey (USGS).
USGS GNIS DB • So how do we explain to our system that: • row 1 of that table is “about” the city of Dallas, TX • the population field of that table contains the number of inhabitants of the city that that row is “about” • here is exactly how to access tuples of that database • that access will be fast, accurate, recent, complete
USGS GNIS DB • the population field of that table contains the number of inhabitants of the city that that row is “about” • We provide the field encodings and decodings, some of which correspond to explicit fields like population, two-letter state codes, etc.:
(fieldDecoding Usgs-Gnis-LS ?x (TheFieldCalled "population") (numberOfInhabitants (TheReferentOfTheRow Usgs-Gnis) ?x))
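In effect, a field decoding says how to turn a cell value into an assertion about the entity the row describes. The sketch below imitates that in Python; it is not Cyc's SKSI machinery, and `FIELD_DECODINGS`, `REFERENTS` and `decode_row` are hypothetical names. The only term taken from the slides is #$CityOfDallasTexas.

```python
# Physical field name -> predicate it decodes to (mirrors the fieldDecoding above).
FIELD_DECODINGS = {"population": "numberOfInhabitants"}

# Stand-in for (TheReferentOfTheRow Usgs-Gnis): which entity a row is "about".
# Keyed on (state, name); a real mapping would be far richer (assumption).
REFERENTS = {("TX", "Dallas"): "#$CityOfDallasTexas"}


def decode_row(row: dict) -> list[tuple[str, str, object]]:
    """Turn one table row into (predicate, entity, value) assertions."""
    referent = REFERENTS[(row["state"], row["name"])]
    return [
        (predicate, referent, row[field])
        for field, predicate in FIELD_DECODINGS.items()
        if row.get(field) is not None
    ]


if __name__ == "__main__":
    dallas = {"state": "TX", "name": "Dallas", "type": "ppl",
              "county": "Dallas", "population": 1022830}
    print(decode_row(dallas))
```

The entity itself never appears in a column; it is implied by the existence of the row, which is why the decoding refers to it through a separate "referent of the row" lookup.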
USGS GNIS DB • row 1 of that table is “about” the city of Dallas, TX • We provide the field encodings and decodings. Some correspond to explicit fields like population; others correspond to entities whose existence is merely implied by the existence of that row in that table. In this case, the first row implies the existence of (and describes some specifics of) the geographic entity that is the real-world city of Dallas, Texas, which is represented in Cyc’s KB by the term #$CityOfDallasTexas. • There is a logical field name for that entity, (TheReferentOfTheRow Usgs-Gnis), even though it is only talked about by the explicit fields.
USGS GNIS DB • how to access tuples of that database • We provide all the information needed for a JDBC connection script • We assert, in the context (MappingMtFn Usgs-KS), all of these:
(passwordForSKS Usgs-KS "geografy")
(portNumberForSKS Usgs-KS 4032)
(serverOfSKS Usgs-KS "sksi.cyc.com")
(sqlProgramForSKS Usgs-KS PostgreSQL)
(structuredKnowledgeSourceName Usgs-KS "usgs")
(subProtocolForSKS Usgs-KS "postgresql")
(userNameForSKS Usgs-KS "sksi")
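These assertions carry what a JDBC connection string needs. A sketch of assembling one, assuming the standard PostgreSQL JDBC URL shape `jdbc:postgresql://host:port/database` and assuming (this is a guess, not stated on the slide) that the structured knowledge source name doubles as the database name. The dictionary simply restates the assertions; credentials are deliberately left out of the URL.

```python
# The slide's assertions restated as a plain mapping (predicate -> value).
ASSERTIONS = {
    "serverOfSKS": "sksi.cyc.com",
    "portNumberForSKS": 4032,
    "subProtocolForSKS": "postgresql",
    "structuredKnowledgeSourceName": "usgs",  # assumed to be the database name
    "userNameForSKS": "sksi",
}


def jdbc_url(assertions: dict) -> str:
    """Build a JDBC URL of the form jdbc:<subprotocol>://<host>:<port>/<db>."""
    return (
        f"jdbc:{assertions['subProtocolForSKS']}://"
        f"{assertions['serverOfSKS']}:{assertions['portNumberForSKS']}/"
        f"{assertions['structuredKnowledgeSourceName']}"
    )


if __name__ == "__main__":
    print(jdbc_url(ASSERTIONS))
```

The user name and password would be passed to the driver separately rather than embedded in the URL.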
USGS GNIS DB • that access will be fast, accurate, recent, complete • We provide meta-level assertions about the database, about each table of the database, about the completeness etc. of various kinds of data in the DB, etc. • We assert, in the context (MappingMtFn Usgs-KS):
(schemaCompleteExtentKnownForValueTypeInArg Usgs-Gnis-LS USCity numberOfInhabitants 1)
USGS GNIS DB • Cyc automatically gathers statistics like these, and uses them to order search:
(resultSetCardinality Usgs-Gnis-PS (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state")) TheEmptySet 60.0)
(resultSetCardinality Usgs-Gnis-PS (TheSet (PhysicalFieldFn Usgs-Gnis-PS "primary_long") (PhysicalFieldFn Usgs-Gnis-PS "primary_lat") (PhysicalFieldFn Usgs-Gnis-PS "name")) (TheSet (PhysicalFieldFn Usgs-Gnis-PS "county") (PhysicalFieldFn Usgs-Gnis-PS "state")) 530.36)
Semantic Knowledge Source Integration (SKSI) summary: structured sources • Some of the knowledge needed will generally be in the Cyc KB already • Some will reside in already-mapped sources: databases, web pages, simulators, etc. • For each needed new source, explain the meaning of its schema elements to Cyc • Write Cyc assertions to convey the meaning of each field, each polymorphism, each idiosyncratic entry code, plus meta-information: when this was created/updated, level of granularity, its sources, its degree of completeness, what it can do quickly, what it can do (slowly), how to access it, etc.
What Led to Cyc? • The Need: 1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck” (NL understanding, speech understanding, robotics, learning, expert systems, search, …) • The Opportunity: 2. We know enough to do this; it is more an engineering task than a scientific research task. 3. The time was right (1984).
How “general knowledge” helps search: find information by inference (+KB) • Query: “Someone smiling” • Caption: “A man helping his daughter take her first step”
How “general knowledge” helps search: find information by inference (+KB) • Query: “Show me pictures of strong and adventurous people” • Caption: “A man climbing a rock face”
How “general knowledge” helps search: find information by inference (+KB) • Query: “Outdoor explosions in terrorist events in Lebanon between 1990 and 2001” • Text document: “1993 pipe bombing on the patio of the Beirut Olive Garden”
How “general knowledge” + domain knowledge helps search: find information by inference (+KB) • Query: “Threats to low-flying US airliners in Lebanon” • Text document: “Hezballah buys ten SA-7’s.”
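The search examples share one pattern: the query and the caption have no words in common, and the match is found only after background rules expand what the caption implies. A toy sketch of that pattern follows. The rule table and the concept labels are invented for illustration; they are not Cyc assertions, and real inference is far more general than a one-step lookup.

```python
# Hypothetical background rules: what depicting an activity implies.
RULES = {
    "rock climbing": {"strong person", "adventurous person"},
    "helping one's child take a first step": {"smiling person"},
}

# Captions, each hand-annotated with the activity it describes (assumption).
CAPTIONS = {
    "A man climbing a rock face": {"rock climbing"},
    "A man helping his daughter take her first step":
        {"helping one's child take a first step"},
}


def implied(concepts: set[str]) -> set[str]:
    """The concepts themselves plus everything the rules derive from them."""
    result = set(concepts)
    for concept in concepts:
        result |= RULES.get(concept, set())
    return result


def search(query_concepts: set[str]) -> list[str]:
    """Captions whose implied concepts cover every concept in the query."""
    return [caption for caption, concepts in CAPTIONS.items()
            if query_concepts <= implied(concepts)]


if __name__ == "__main__":
    print(search({"strong person", "adventurous person"}))
    print(search({"smiling person"}))
```

A keyword index would return nothing for either query, since neither "strong" nor "smiling" occurs in any caption.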
Find and clean (consistency-check) information by inference (+KB) If Pat and Jan are married, their date of marriage should be the same; their address is likely to be the same; their genders are likely to differ; and so on.
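These expectations translate directly into checks over a pair of records. A minimal sketch, assuming each person is a dictionary with the hypothetical fields `marriage_date`, `address` and `gender`; the hard expectation ("should be the same") is reported as an error, the soft ones ("likely") as warnings.

```python
def marriage_issues(a: dict, b: dict) -> list[tuple[str, str]]:
    """Check two records of people asserted to be married to each other."""
    issues = []
    if a["marriage_date"] != b["marriage_date"]:
        issues.append(("error", "marriage dates differ"))
    if a["address"] != b["address"]:
        issues.append(("warning", "addresses differ"))
    if a["gender"] == b["gender"]:
        issues.append(("warning", "genders are the same"))
    return issues


if __name__ == "__main__":
    # Placeholder records with made-up values, for illustration only.
    pat = {"marriage_date": "1990-06-01", "address": "12 Elm St", "gender": "F"}
    jan = {"marriage_date": "1991-06-01", "address": "12 Elm St", "gender": "M"}
    print(marriage_issues(pat, jan))
```

A warning is a prompt to look again rather than a rejection, which matches the "likely" wording on the slide.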
What Led to Cyc? 1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck” (NL understanding, speech understanding, robotics, learning, expert systems, search, …) 2. We know enough to do this; it is more an engineering task than a scientific research task. 3. The time was right (1984).
Cyc is… Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world: • The typical bird has 1 beak, 1 heart, lots of feathers, … • Hearts are internal organs; feathers are external protrusions • Most vehicles are steered by an awake, sane, adult, … human • Tangible objects can’t be in 2 (disjoint) places at once • Badly injuring a child is much worse than killing a dog • Causes temporally precede (i.e., start before) their effects • A stabbing requires 2 cotemporal and proximate actors • etc.
Cyc is… Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world • Each of these represented in formal logic • Info. about a set of hundreds of thousands of terms • Language-independent [Diagram labels: the words EnglishWord-Pen, EnglishWord-Plume, FrenchWord-Plume, ArabicWordForWritingPen and the concepts WritingPen, Penitentiary, Corral, BirdFeather, Authoring, …]
Cyc is… Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world • Each of these represented in formal logic • Info. about a set of hundreds of thousands of terms • An inference engine that draws from those the same sorts of inferences that people would • Interfaces so the system can communicate with people, databases, spreadsheets, websites, etc.
[Architecture diagram. Components: Knowledge Users; Knowledge Authors; Other Applications; User Interface (with Natural Language Dialog); Knowledge Entry Tools; Cyc API; Cyc Reasoning Modules; Cyc Ontology & Knowledge Base; Interface to External Data Sources; External Data Sources: Data Bases, Web Pages, Text Sources, Other KBs]
Upper Ontology: EVENT TEMPORAL-THING PARTIALLY-TANGIBLE-THING
Core Theories: $\forall a, b:\ a \in \text{EVENT} \wedge b \in \text{EVENT} \wedge \text{causes}(a, b) \Rightarrow \text{precedes}(a, b)$
Domain-Specific Theories: $\forall m, a:\ m \in \text{MAMMAL} \wedge a \in \text{ANTHRAX} \Rightarrow \text{causes}(\text{exposed-to}(m, a), \text{infected-by}(m, a))$
Very specific information (some indirect, via SKSI): (ist FtLaudHolyCrossERCase#403921 (caused CutaneousAnthrax (SkinLesions Ahmed_al-Haznawit)))
Painful Evolution of our Representation from Frames & Slots to Contextualized HOL
First Order Predicate Calculus: unambiguous; enables mechanical reasoning
Every American has a president. $\exists y.\ \forall x.\ \text{Amer}(x) \Rightarrow \text{president}(x, y)$
Every American has a mother. $\forall x.\ \exists y.\ \text{Amer}(x) \Rightarrow \text{mother}(x, y)$
Higher Order Logic (nth-order predicate calculus): contexts, predicates as variables, nested modals, reflection, …
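The two "Every American has a …" sentences look alike in English but differ in quantifier order, and that difference can be checked mechanically on a small finite model. In the sketch below the people and relations are invented placeholders, and every x in the model is assumed to be an American, so the Amer(x) premise is always true.

```python
def exists_forall(xs, ys, rel) -> bool:
    """Is there one y that is related to every x?  (exists y. forall x.)"""
    return any(all(rel(x, y) for x in xs) for y in ys)


def forall_exists(xs, ys, rel) -> bool:
    """Does every x have some y, possibly a different one?  (forall x. exists y.)"""
    return all(any(rel(x, y) for y in ys) for x in xs)


# Toy model with placeholder individuals (not real data).
AMERICANS = ["a1", "a2"]
OTHERS = ["m1", "m2", "p"]
MOTHER_OF = {"a1": "m1", "a2": "m2"}     # each American has their own mother
PRESIDENT_OF = {"a1": "p", "a2": "p"}    # all Americans share one president


def mother(x, y) -> bool:
    return MOTHER_OF.get(x) == y


def president(x, y) -> bool:
    return PRESIDENT_OF.get(x) == y


if __name__ == "__main__":
    print("shared president:", exists_forall(AMERICANS, OTHERS, president))
    print("everyone has a mother:", forall_exists(AMERICANS, OTHERS, mother))
    print("shared mother:", exists_forall(AMERICANS, OTHERS, mother))
```

Only world knowledge tells a reader which scope each English sentence intends; the logical form makes the choice explicit.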
Cyc is not monolithic: the Knowledge Base is divided into thousands of contexts by granularity, topic, culture, geospatial place, time, … • Cyc is not committed to any one reasoning mechanism: the inference engine is a community of 720 “agents” that attack every problem and, recursively, every subproblem (subgoal). One of these 720 is a general theorem prover; the others have special-purpose data structures/algorithms to handle the most important, most common cases very fast.
Cyc is not monotonic: 98% of its content is marked as merely being usually true, so reasoning in Cyc is default reasoning (gather up all the pro/con arguments, and compare them). • Cyc is not committed to its own reasoning mechanisms: think of reasoning modules 721, 722, 723, … as being all manner of external databases, simulators, translators, …
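"Gather up all the pro/con arguments, and compare them" can be sketched as follows. The slide does not say how Cyc compares arguments, so the comparison rule here (the most specific argument wins, and a tie between opposing arguments leaves the question open) is an assumption chosen only to make the idea concrete; `Argument` and `decide` are invented names.

```python
from typing import NamedTuple, Optional


class Argument(NamedTuple):
    supports: bool     # True = argues for the conclusion, False = against
    specificity: int   # hypothetical ranking: higher = more specific rule
    reason: str


def decide(arguments: list[Argument]) -> Optional[bool]:
    """Compare pro and con arguments; return None when they cannot be ranked."""
    if not arguments:
        return None
    top = max(a.specificity for a in arguments)
    verdicts = {a.supports for a in arguments if a.specificity == top}
    return verdicts.pop() if len(verdicts) == 1 else None


if __name__ == "__main__":
    default = Argument(True, 1, "most vehicles are steered by a human")
    exception = Argument(False, 2, "this particular vehicle is remotely guided")
    print(decide([default]))             # default holds when nothing contradicts it
    print(decide([default, exception]))  # the more specific argument overrides it
```

Adding the exception changes the earlier conclusion, which is exactly what "not monotonic" means: new knowledge can retract what was previously inferred.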
Cyc Knowledge Base • Represented in: First Order Logic, Higher Order Logic, Context Logic, Micro-theories • Cyc contains: 15,000 Predicates; 300,000 Concepts; 3,200,000 Assertions
[Ontology diagram, from the most general terms at the top, through General Knowledge about Various Domains, down to Specific data, facts, and observations. Labels: Thing Intangible Thing Individual Sets Relations Spatial Thing Temporal Thing Partially Tangible Thing Space Time Paths Events Scripts Spatial Paths Logic Math Agents Physical Objects Borders Geometry Artifacts Living Things Organization Materials Parts Statics Actors Actions Movement Life Forms Plans Goals State Change Dynamics Organizational Actions Types of Organizations Ecology Human Beings Human Activities Physical Agents Natural Geography Organizational Plans Human Organizations Plants Human Anatomy & Physiology Nations Governments Geo-Politics Human Artifacts Political Geography Agent Organizations Business & Commerce Politics Warfare Animals Emotion Perception Belief Human Behavior & Actions Sports Recreation Entertainment Social Behavior Products Devices Conceptual Works Purchasing Shopping Professions Occupations Weather Law Vehicles Buildings Weapons Mechanical & Electrical Devices Software Literature Works of Art Social Relations, Culture Business, Military Organizations Earth & Solar System Social Activities Transportation & Logistics Travel Communication Everyday Living Language]