DBpedia: A Nucleus for a Web of Open Data

Original presentation by Christian Bizer (Freie Universität Berlin), Sören Auer (Universität Leipzig), Georgi Kobilarov (Freie Universität Berlin), Jens Lehmann (Universität Leipzig), and Richard Cyganiak (Freie Universität Berlin). Edited by Sangkeun Lee.

DBpedia.org is an effort to: • extract structured information from Wikipedia • make this information available on the Web under an open license • interlink the DBpedia dataset with other datasets on the Web

Outline: 1. Extracting Structured Information from Wikipedia 2. The DBpedia Dataset 3. Accessing the DBpedia Dataset over the Web 4. Use Cases: • Improving Wikipedia Search • Royalty-Free Data Source for other Applications • Nucleus for the Emerging Web of Data

Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links • Other languages • Other wiki pages • To the web • Redirects • Disambiguates

Extracting Structured Information from Wikipedia
• Wikipedia consists of
  - 6.9 million articles
  - in 251 languages
  - monthly growth rate: 4%
• Wikipedia articles contain structured information
  - infoboxes, which use a template mechanism
  - images depicting the article's topic
  - categorization of the article
  - links to external webpages
  - intra-wiki links to other articles
  - inter-language links to articles about the same topic in different languages

Overview of the DBpedia Components

[Figure: DBpedia architecture diagram. Labels, listed from the clients down to the data source: Traditional Web Browser; Semantic Web Browsers; Web 2.0 Mashups; SNORQL Browser; Linked Data; Query Builder; SPARQL Endpoint; "published via"; Virtuoso; MySQL; "loaded into"; DBpedia datasets; Categories; Articles; Infobox; Extraction; Wikipedia Dumps; Article texts; DB tables.]

Wikitext Syntax:

Extracting Infobox Data (RDF Representation):
Wikipedia article: http://en.wikipedia.org/wiki/Calgary

<http://dbpedia.org/resource/Calgary>
    dbpedia:native_name "Calgary" ;
    dbpedia:altitude "1048" ;
    dbpedia:population_city "988193" ;
    dbpedia:population_metro "1079310" ;
    dbpedia:mayor_name dbpedia:Dave_Bronconnier ;
    dbpedia:governing_body dbpedia:Calgary_City_Council ;
    ...
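The step from infobox markup to triples can be illustrated with a minimal sketch. It is not the DBpedia extraction code: the `Infobox City` wikitext below is a hypothetical sample built from the Calgary values on this slide, and `extract_infobox_triples` is an illustrative name. The sketch assumes one `| name = value` field per line, treats a value that is a single wiki link as a resource, and treats everything else as a plain literal.

```python
import re

# Hypothetical infobox wikitext; the field values come from the Calgary slide.
SAMPLE_WIKITEXT = """{{Infobox City
| native_name = Calgary
| altitude = 1048
| population_city = 988193
| mayor_name = [[Dave Bronconnier]]
| governing_body = [[Calgary City Council]]
}}"""

# One "| name = value" field per line (an assumption; real infoboxes are messier).
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)
# A value consisting of exactly one wiki link, optionally with a "|label" part.
WIKILINK = re.compile(r"^\[\[([^\]|]+)(?:\|[^\]]*)?\]\]$")


def extract_infobox_triples(page_title, wikitext):
    """Return (subject, predicate, object) tuples for each infobox field."""
    subject = "dbpedia:" + page_title.replace(" ", "_")
    triples = []
    for name, raw in FIELD.findall(wikitext):
        link = WIKILINK.match(raw)
        if link:
            # Links to other articles become references to other resources.
            obj = "dbpedia:" + link.group(1).strip().replace(" ", "_")
        else:
            # Everything else is kept as a quoted literal.
            obj = '"' + raw + '"'
        triples.append((subject, "dbpedia:" + name, obj))
    return triples


for s, p, o in extract_infobox_triples("Calgary", SAMPLE_WIKITEXT):
    print(s, p, o)
```

Even this toy version hints at why extraction quality is a fair question: units, nested templates, and multi-valued fields all break the one-field-per-line assumption.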

Question: • How good is the extraction from the markup in Wiki pages?

• Short and long abstracts in 10 different languages
    dbpedia:Calgary
        dbpedia:abstract "Calgary is the largest ..."@en ;
        dbpedia:abstract "Calgary ist eine Stadt ..."@de .
• Categorization information
    dbpedia:Calgary
        skos:subject dbpedia:Category_Cities_in_Alberta ;
        skos:subject dbpedia:Host_cities_Olympic_Games .
• Links to the original Wikipedia articles, pictures, and relevant external web pages
    dbpedia:Calgary
        foaf:page <http://en.wikipedia.org/wiki/Calgary> ;
        dbpedia:wikipage-de <http://de.wikipedia.org/wiki/Calgary> ;
        foaf:depiction <http://upload.wikimedia.org/thumb/3/32> ;
        dbpedia:reference <http://www.calgary.ca> ;
        dbpedia:reference <http://www.tourismcalgary.com> .

DBpedia Basics: Structured information extracted from Wikipedia can serve as the basis for sophisticated queries against Wikipedia content. The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing the extracted information and for publishing it on the Web, and the SPARQL query language for querying this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

The DBpedia Dataset
• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories

Accessing the DBpedia Dataset over the Web 1. SPARQL Endpoint 2. Linked Data Interface 3. DB Dumps for Download

SPARQL: • SPARQL is a query language for RDF. • RDF is a directed, labeled graph data format for representing information on the Web. • The W3C SPARQL specification defines the syntax and semantics of the query language. • SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.
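To make the graph-matching idea concrete, here is a minimal sketch of how a set of triple patterns can be evaluated against RDF triples held in memory. It is a naive nested-loop join written for illustration, not how a SPARQL engine such as Virtuoso works. The Calgary triples come from these slides; the Edmonton triple and the function names are hypothetical.

```python
# RDF data as (subject, predicate, object) tuples.
TRIPLES = [
    ("dbpedia:Calgary", "skos:subject", "dbpedia:Category_Cities_in_Alberta"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    # Hypothetical extra triple, added so the join has something to filter out.
    ("dbpedia:Edmonton", "skos:subject", "dbpedia:Category_Cities_in_Alberta"),
]


def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def query(patterns, triples):
    """Evaluate a list of triple patterns by joining them one at a time."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# "Cities in Alberta together with their population"
RESULT = query(
    [
        ("?city", "skos:subject", "dbpedia:Category_Cities_in_Alberta"),
        ("?city", "dbpedia:population_city", "?population"),
    ],
    TRIPLES,
)
print(RESULT)
```

The shared variable `?city` is what joins the two patterns: Edmonton matches the first pattern but is dropped because no population triple exists for it in this sample.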

The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC.
  - All tennis players from Moscow?
  - All films by Quentin Tarantino?
  - All German musicians who were born in Berlin in the 19th century?

Interesting Example: finding everything Bart wrote on the blackboard in season 12 of The Simpsons
• The Wikipedia pages for the Simpsons episodes are the identified "things" that serve as the subjects of our RDF triples.
• The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".
• The episode's DBpedia page tells us that p:blackboard (written as dbpedia2:blackboard in the query) is the property name for the Wikipedia infobox "Chalkboard" field.

SELECT ?episode ?chalkboard_gag
WHERE {
    ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
    ?episode dbpedia2:blackboard ?chalkboard_gag
}
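A query like the chalkboard one is sent to the endpoint as an HTTP request. The sketch below only builds that request and does not send it. The `query` parameter follows the SPARQL protocol; the `dbpedia2:` prefix URI is an assumption (the slide does not define it), and asking for JSON results through the `Accept` header assumes the server honors content negotiation. `build_sparql_request` is an illustrative name.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag .
}
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    # urlencode percent-encodes the query, including the literal "%2C".
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the server picks the result format from the Accept header.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would execute the query; that step is left out here because the vocabulary and availability of the live endpoint may differ from what these slides describe.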

The Linked Data Interface: • A large body of information and knowledge is often already available in structured form, yet not accessible as such on the Web. • Integrating open data provides real value. It saves the time and effort of re-entering data that is already out there, and it leaves the data and its editing where they belong: at the origin. • Linked Data on the Web can be accessed using Semantic Web browsers, just as the traditional Web of documents is accessed using HTML browsers. • Semantic Web browsers enable users to navigate between different data sources by following RDF links. The same links allow the robots of Semantic Web search engines to crawl the Semantic Web.

The Linked Data Interface
• The project follows the Linked Data principles
  - All concepts are identified using Uniform Resource Identifier (URI) references. A URI is a compact string of characters used to identify or name a resource.
• The Linked Data interface can be used by
  - Semantic Web browsers, such as the DISCO Hyperdata Browser, the Tabulator Browser, and the OpenLink RDF Browser
  - Semantic Web crawlers, such as Zitgist (Zitgist LLC, USA), SWSE (DERI, Ireland), and Swoogle (UMBC, USA)
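As a rough illustration of what a Linked Data client does with these URIs, the sketch below maps an English Wikipedia article URL to the corresponding DBpedia resource URI, following the Calgary example in these slides, and prepares a request for RDF rather than HTML. The function names are hypothetical, and the use of the `Accept` header with `application/rdf+xml` is an assumption about how the server negotiates content. Nothing is sent over the network.

```python
from urllib.request import Request

WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_url):
    """Map an English Wikipedia article URL to its DBpedia resource URI."""
    if not wikipedia_url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError("not an English Wikipedia article URL: " + wikipedia_url)
    # The article name is carried over unchanged, as in the Calgary example.
    return DBPEDIA_PREFIX + wikipedia_url[len(WIKIPEDIA_PREFIX):]


def linked_data_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) a request that asks for an RDF description."""
    # Assumption: the server chooses RDF or HTML based on the Accept header.
    return Request(uri, headers={"Accept": media_type})


calgary = dbpedia_uri("http://en.wikipedia.org/wiki/Calgary")
rdf_request = linked_data_request(calgary)
print(rdf_request.full_url, rdf_request.get_header("Accept"))
```

A crawler would repeat this for every URI it finds in the returned RDF, which is the "following RDF links" behavior the slide attributes to Semantic Web browsers and crawlers.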

DBpedia Use Cases 1. Improving Wikipedia Search 2. Royalty-Free Data Source for other Applications 3. Nucleus for the Emerging Web of Data

Improving Wikipedia Search (Various Interfaces)

Query to find all web browser software at http://wikipedia.askw.org:

Improving Wikipedia Search

Royalty-Free Data Source for other Applications • DBpedia is published under the GNU Free Documentation License • Example use case: SPARQL-generated tables within webpages
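The "SPARQL-generated tables" use case can be sketched as a small function that turns query results into an HTML table. The input follows the standard SPARQL JSON results layout (`head.vars` and `results.bindings`); the sample data is hypothetical, reusing the Calgary population from earlier slides, and `results_to_html_table` is an illustrative name rather than part of any DBpedia tool.

```python
from html import escape

# Hypothetical result set in the SPARQL JSON results layout.
SAMPLE_RESULTS = {
    "head": {"vars": ["city", "population"]},
    "results": {
        "bindings": [
            {
                "city": {"type": "uri", "value": "http://dbpedia.org/resource/Calgary"},
                "population": {"type": "literal", "value": "988193"},
            },
            # A row with an unbound variable: "population" is simply absent.
            {"city": {"type": "uri", "value": "http://dbpedia.org/resource/A<B"}},
        ]
    },
}


def results_to_html_table(results):
    """Render SPARQL JSON results as a plain HTML table."""
    variables = results["head"]["vars"]
    rows = ["<tr>" + "".join(f"<th>{escape(v)}</th>" for v in variables) + "</tr>"]
    for binding in results["results"]["bindings"]:
        cells = []
        for v in variables:
            # Unbound variables produce an empty cell.
            value = binding.get(v, {}).get("value", "")
            # Escape values so data cannot inject markup into the page.
            cells.append(f"<td>{escape(value)}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


TABLE_HTML = results_to_html_table(SAMPLE_RESULTS)
print(TABLE_HTML)
```

Escaping matters here because the cell values come from community-edited Wikipedia content and should not be trusted as markup.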

Nucleus for the Emerging Web of Data • W3C SWEO Linking Open Data Project • Overall size of the dataset: over 1 billion RDF triples • Outbound RDF links within DBpedia: 75,000

Proposed Improvements: • Better data cleansing required • Improvement in the classification • Interlink DBpedia with more datasets • Improvement in the user interfaces • Performance • Scalability • More expressiveness

Discussion
• DBpedia is the first and largest source of structured data on the Internet covering topics of general knowledge.
• DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its user base of editors.
  - Which is the better approach?
• Can Freebase or DBpedia be a substitute for Wikipedia?
  - Freebase: not ideal, because we would then have two similar things, Wikipedia and Freebase
  - DBpedia: not ideal, because it extracts its data from a dump
• How can we interlink Freebase and DBpedia?
• What could be killer applications using DBpedia?
  - If there are some, fine
  - If there are none, do we really need a large body of general structured knowledge?
