INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio LECTURE 10: Knowledge and The Social Web
‘CYC convinced the AI community that creating a commonsense knowledge base by hand is impossible’ (Massimo, Lecture 1). That may depend on how many people you put on to it!
THE SOCIAL WEB • Increasingly, the Web is becoming not just a way to facilitate information exchange or commercial transactions, but also a tool to facilitate socialization (Facebook, LinkedIn, etc.) • It is also a place where information can be collectively created
WIKIPEDIA The free encyclopedia that anyone can edit • Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. • Wikipedia's articles have been written collaboratively by volunteers around the world. • Almost all of its articles can be edited by anyone who can access the Wikipedia website. Source: http://en.wikipedia.org/wiki/Wikipedia
WIKIPEDIA • Wikipedia is: 1. domain independent • it has a large coverage 2. up-to-date • to process current information 3. multilingual • to process information in many languages
Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links • Other languages • Other wiki pages • To the web • Redirects • Disambiguates
Encyclopedic knowledge in coreference resolution [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ….. [The agency] said that because MCI's offer had expired AT&T couldn't continue to offer its discount plan.
Why Wikipedia may help address the encyclopedic knowledge problem http://en.wikipedia.org/wiki/FCC: The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
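One way to read the FCC example: the definite description "The agency" can be tentatively linked to "The FCC" because the head noun "agency" occurs in the defining sentence of the FCC article. The following is only a minimal sketch of that idea, assuming the defining sentence has already been retrieved; the function names `head_noun` and `supported_by_definition` are hypothetical, and the last-word head-noun guess is a deliberate simplification, not the method of any particular coreference system.

```python
import re

# Defining sentence of the FCC article, as quoted on the slide (shortened).
FCC_DEFINITION = (
    "The Federal Communications Commission (FCC) is an independent "
    "United States government agency, created, directed, and empowered "
    "by Congressional statute."
)


def head_noun(mention: str) -> str:
    """Crude head-noun guess: the last word of the mention, lowercased."""
    words = re.findall(r"[A-Za-z]+", mention)
    return words[-1].lower() if words else ""


def supported_by_definition(mention: str, definition: str) -> bool:
    """Return True if the mention's head noun occurs as a word in the definition."""
    head = head_noun(mention)
    tokens = {token.lower() for token in re.findall(r"[A-Za-z]+", definition)}
    return bool(head) and head in tokens


print(supported_by_definition("The agency", FCC_DEFINITION))
print(supported_by_definition("The company", FCC_DEFINITION))
```

A realistic system would treat this signal as one feature among many, since a shared noun alone does not establish coreference.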
Another interesting scenario A fresh mandate for [Mr Ahmadinejad] would, say his critics, consecrate the “revolution within a revolution” he has been trying to effect since his surprise electoral triumph in 2005. Best known to outsiders for his bellicose grandstanding, [the incumbent] is more familiar to Iranians as a radical and hyperactive populist who has used the tacit backing of his fellow conservative, Mr Khamenei, greatly to expand the powers of the presidency. Source: It could make a big difference, The Economist, Mar 19th 2009
Why Wikipedia may help address the encyclopedic knowledge problem
Wikipedia as Ontology • Unlike standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. • However, it is more… • Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia) • Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy*. • Up to date: current and emerging concepts are absorbed quickly. * Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.
Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept. • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus.
Wikipedia Article that describes the Concept Artificial intelligence
Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links.
AI is redirected to its equivalent concept Artificial Intelligence
Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category.
The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category. • Polysemous concepts are disambiguated by Disambiguation Pages.
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.
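The structural features listed on the preceding slides (one article per concept, redirects, categories, disambiguation pages) can be pictured as a few lookup tables. This is a toy sketch under stated assumptions: the tables are hand-written for illustration, the helper names `resolve` and `categories_of` are hypothetical, and the disambiguation entries are illustrative only, since the slide does not list them. A real system would populate such tables from a Wikipedia dump.

```python
# Redirects map alternative titles to the canonical article title.
REDIRECTS = {"AI": "Artificial intelligence"}

# Each article belongs to at least one category (values taken from the slide).
CATEGORIES = {
    "Artificial intelligence": [
        "Artificial intelligence",
        "Cybernetics",
        "Formal sciences",
        "Technology in society",
    ],
}

# Illustrative senses only; the slide does not enumerate them.
DISAMBIGUATION = {
    "Artificial intelligence (disambiguation)": [
        "Artificial intelligence",
        "A.I. Artificial Intelligence",
    ],
}


def resolve(title: str) -> str:
    """Follow redirects to the canonical title, guarding against cycles."""
    seen = set()
    while title in REDIRECTS and title not in seen:
        seen.add(title)
        title = REDIRECTS[title]
    return title


def categories_of(title: str) -> list[str]:
    """Return the categories of the concept that the title resolves to."""
    return CATEGORIES.get(resolve(title), [])


print(resolve("AI"))
print(categories_of("AI"))
```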
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA • Taxonomic information: category structure • Attributes: infobox, text
Deriving a taxonomy from Wikipedia (AAAI 2007) • Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007) • Induce a subsumption hierarchy
INFOBOXES • Collaborative content • Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848 ...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...
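Because infoboxes are semi-structured, their fields can be pulled out with fairly simple code. The sketch below is a minimal illustration and not DBpedia's actual extractor: it assumes a single, complete template, and the sample is a shortened version of the Poe infobox with a closing `}}` added for the example. The helpers `split_top_level` and `parse_infobox` are hypothetical names. The one subtlety is that `|` also appears inside nested templates and links, so the split has to track nesting depth.

```python
def split_top_level(body: str) -> list[str]:
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext: str) -> dict[str, str]:
    """Parse one {{Infobox ...}} template into a dict of field -> raw value."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a single {{...}} template")
    fields = split_top_level(text[2:-2])
    result = {"_template": fields[0].strip()}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # skip positional parameters without a key
            result[key.strip()] = value.strip()
    return result


# Shortened sample; the closing braces are added for this example.
SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| magnum_opus = The Raven
}}"""

info = parse_infobox(SAMPLE)
for key, value in info.items():
    print(key, "=>", value)
```

The values are left as raw wiki markup; turning `{{birth date|1809|1|19|mf=y}}` into a date or `[[United States|U.S.]]` into a link target would be a separate normalization step.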
DBPEDIA DBpedia.org is an effort to: • extract structured information from Wikipedia • make this information available on the Web under an open license • interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset • 1,600,000 concepts, including 58,000 persons, 70,000 places, 35,000 music albums, and 12,000 films • described by 91 million triples using 8,141 different properties • 557,000 links to pictures • 1,300,000 links to external web pages • 207,000 Wikipedia categories • 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
Extracting Infobox Data (RDF Representation):
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary";
dbpedia:altitude "1048";
dbpedia:population_city "988193";
dbpedia:population_metro "1079310";
mayor_name dbpedia:Dave_Bronconnier;
governing_body dbpedia:Calgary_City_Council;
...
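In the RDF data model, each extracted infobox field becomes a (subject, predicate, object) triple. As a rough sketch, the Calgary data can be held as plain Python tuples and queried by pattern matching; the `match` helper is hypothetical, property names are copied as they appear on the slide, and a real application would normally use an RDF library rather than bare tuples.

```python
from typing import Optional

Triple = tuple[str, str, str]

CALGARY = "http://dbpedia.org/resource/Calgary"

# Triples transcribed from the slide; all share the same subject.
TRIPLES: list[Triple] = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(
    triples: list[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return the triples that agree with every component that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "What is the altitude of Calgary?" expressed as a triple pattern.
for _, _, value in match(TRIPLES, s=CALGARY, p="dbpedia:altitude"):
    print(value)
```

Leaving a component as `None` plays the role of a variable, which is essentially what a SPARQL triple pattern does.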
SPARQL : • SPARQL is a query language for RDF. • RDF is a directed, labeled graph data format for representing information in the Web. • This specification defines the syntax and semantics of the SPARQL query language for RDF. • SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.
The DBpedia SPARQL Endpoint • http://dbpedia.org/sparql • hosted on an OpenLink Virtuoso server • can answer SPARQL queries such as: • Give me all sitcoms that are set in NYC • All tennis players from Moscow • All films by Quentin Tarantino • All German musicians who were born in Berlin in the 19th century
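As an illustration, the "all films by Quentin Tarantino" question might be phrased in SPARQL roughly as below. This is a sketch under assumptions: the property `dbo:director` and the resource name `dbr:Quentin_Tarantino` are assumed to match DBpedia's current vocabulary, which may differ from the dataset version described on these slides, and `build_request` is a hypothetical helper. The code only constructs the HTTP request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
LIMIT 20
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
print(request.full_url)
# To execute: pass `request` to urllib.request.urlopen and parse the JSON body.
```

Keeping request construction separate from execution makes the query easy to inspect or log before any network traffic happens.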
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION • Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts • Other initiatives: Citizen Science, Cognition and Language Laboratory, … • This has been taken advantage of in AI: • Open Mind Commonsense (Singh) (collecting facts) • Semantic Wikis www.phrasedetectives.com
WEB COLLABORATION PROJECTS • Open Mind Common Sense – Singh • Crater mapping (results) – Kanefsky • Learner / Learner2 / 1001 Paraphrases – Chklovski • FACTory – Cycorp • Hot or Not – 8 Days • ESP / Phetch / Verbosity / Peekaboom – von Ahn • Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE • A project started in 2000 by Push Singh to collect commonsense knowledge by taking advantage of people’s collaboration
FROM OPENMIND COMMONSENSE TO CONCEPTNET • ConceptNet (Havasi et al., 2009) is a semantic network extracted from OpenMind Commonsense assertions using simple heuristics
FROM OPENMIND COMMONSENSE FACTS TO CONCEPTNET • A lime is a very sour fruit • isa(lime,fruit) • property_of(lime,very_sour)
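The lime example suggests what a "simple heuristic" can look like: match a sentence against a surface pattern, treat the last word of the predicate as the class, and treat any preceding words as a property. The sketch below implements just that one pattern. It is an illustrative assumption, not ConceptNet's actual extraction rules, and the name `extract_assertions` is hypothetical.

```python
import re

# Pattern "A/An/The X is a/an/the <modifiers> Y", with an optional final period.
PATTERN = re.compile(
    r"^(?:a|an|the)\s+(\w+)\s+is\s+(?:a|an|the)\s+(.+?)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence: str) -> list[str]:
    """Turn one 'X is a ... Y' sentence into isa/property_of assertions."""
    found = PATTERN.match(sentence.strip())
    if not found:
        return []
    subject = found.group(1).lower()
    words = found.group(2).lower().split()
    head, modifiers = words[-1], words[:-1]
    assertions = [f"isa({subject},{head})"]
    if modifiers:
        assertions.append(f"property_of({subject},{'_'.join(modifiers)})")
    return assertions


print(extract_assertions("A lime is a very sour fruit"))
```

Sentences that do not fit the pattern simply yield no assertions, which is the usual trade-off of pattern-based extraction: high precision on the covered forms, no coverage elsewhere.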
GAMES WITH A PURPOSE • Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK • GWAP do not rely on altruism or financial incentives to entice people to perform certain actions • The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP • Games at www.gwap.com • ESP • Verbosity • TagATune • Other games • Peekaboom • Phetch
ESP • The first GWAP, developed by von Ahn and his group (2003 / 2004) • The problem: obtain accurate descriptions of images, to be used: • To train image search engines • To develop machine learning approaches to vision • The goal: label the majority of the images on the Web