Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis OC Working Group – 21.01.2014 Serge Tymaniuk
Overview • Introduction • Methodology • Results • Questions
Introduction • Written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1) • (1) Data and Web Science Group, University of Mannheim, Germany • (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands • Features: • Analysis of RDFa, Microdata, and Microformats adoption on the Web • Based on a large public Web crawl of 3 billion HTML pages • Aims at revealing the main topical areas of the published data and the different vocabularies used within each topical area • Examines structural richness (which properties are used to describe popular types of entities)
Web Crawl • Web crawl provided by the Common Crawl Foundation, available as ARC files on Amazon S3 • 3,005,626,093 unique HTML pages from 40.6 million pay-level domains (PLDs) • Crawling conducted between January and June 2012 • Compressed size of the corpus is 48 TB • The crawler relies on the PageRank algorithm to select pages
Data Extraction Process • Parsing framework is executed on Amazon EC2 • Relies on the Anything To Triples (Any23, http://any23.apache.org/) parsing library from Apache • The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
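To make the extraction step more concrete, here is a minimal sketch of how the three formats can be told apart by their markup. It is not the Any23 pipeline used in the study: it uses only Python's standard html.parser, the names FormatDetector, detect_formats, and MICROFORMAT_CLASSES are hypothetical, and the attribute and class-name checks are simplified heuristics that only flag the presence of a format without extracting any triples.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of Microformats root class names.
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hproduct"}


class FormatDetector(HTMLParser):
    """Records which structured-data formats appear in one HTML page."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        attr_map = dict(attrs)

        # Microdata marks items with itemscope / itemtype / itemprop.
        if names & {"itemscope", "itemtype", "itemprop"}:
            self.formats.add("microdata")

        # RDFa uses attributes such as typeof, property, and about
        # (simplified heuristic; real RDFa processing is more involved).
        if names & {"typeof", "property", "about"}:
            self.formats.add("rdfa")

        # Microformats reuse the class attribute with agreed-upon names.
        classes = set((attr_map.get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.formats.add("microformats")


def detect_formats(html):
    """Return the set of formats detected in an HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.formats


if __name__ == "__main__":
    page = '<div itemscope itemtype="http://schema.org/Product"></div>'
    print(detect_formats(page))
```

A full parser such as Any23 goes further and emits RDF triples per page; the sketch only shows why a single pass over the tags is enough to classify a page by format.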
Results: Overall picture • Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
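The two percentages can be checked with a short calculation. This sketch uses the exact page count from the Web Crawl slide and the rounded counts (369M, 2.29M, 40.6M) as given in the slides; the variable and function names are illustrative only.

```python
# Totals as reported in the slides.
pages_total = 3_005_626_093      # unique HTML pages in the corpus
pages_with_data = 369_000_000    # rounded figure ("369M")
plds_total = 40_600_000          # pay-level domains ("40.6M")
plds_with_data = 2_290_000       # rounded figure ("2.29M")


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"{page_share:.1f}% of pages contain structured data")
print(f"{pld_share:.2f}% of PLDs contain structured data")
```

The share of pages is more than twice the share of domains, which suggests that the domains publishing structured data contribute a disproportionately large number of pages to the crawl.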
Results: Deployment by FORMAT * PLDs – Pay-Level Domains (i.e. websites) * URLs – HTML pages
Results: Deployment by POPULARITY * According to Alexa Internet Inc. (AL) list of the most frequently visited websites
Results: Deployment on the same Website • 93.5% of all websites that contain structured data use only a single format
Results: Deployment of RDFa • Most frequently used RDFa classes [table] • Most frequently used properties co-occurring with the 4 most frequently used OGP classes [table] • Alexa top 100 websites that use RDFa: • IMDB • Microsoft News Portal • BBC
Results: Deployment of Microdata • Most frequently used Microdata classes [table] • Alexa top 100 websites that use Microdata: • eBay • Microsoft Corp. • Apple Inc.
Results: Deployment of Microformats • Most frequently used Microformats classes [table] • Alexa top 100 websites that use Microformats: • Wikipedia • Adobe • Taobao marketplace
Results: Topical Domains • Dominant Domains of the published data: • Persons and Organizations (by all 3 formats) • Blog- and CMS-related metadata (by RDFa and Microdata) • Navigational metadata (by RDFa and Microdata) • Product data (by all 3 formats) • Event data (by Microformats)
Results: Structural Richness • Only a small set of generic properties is used to describe entities: • Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases • Instances of the Schema.org class “Product” are described largely by name and description only • Additional extraction techniques have to be employed for a deeper understanding of the data
Sources Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf
Thank you for your attention! Questions?