October 17, 2012

Big Data Management: Storing and Querying the Semantic Web Artem Chebotko Department of Computer Science University of Texas – Pan American chebotkoa@utpa.edu http://faculty.utpa.edu/chebotkoa October 17, 2012

Background: Data Management • Data Base • File System • Legacy Database • Relational Database • Object-Oriented Database • XML Database • RDF Database • NoSQL Database

Background: Big Data • Big Data • Web-Scale Data • Many companies work at this level: Google, Yahoo!, LinkedIn, Facebook, Twitter, Amazon, Walmart, etc. • Many more companies will have to meet Big Data this decade • “As of 2012, about 2.5 exabytes of data are created each day, and the number is doubling every 40 months or so” • “Walmart collects more than 2.5 petabytes of data every hour” (Harvard Business Review, October 2012) • 1 EB = 1,000,000 TB; 1 PB = 1,000 TB
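To put the quoted figures on one scale, here is a minimal arithmetic sketch. It assumes decimal units as on the slide and a steady doubling every 40 months; the function name and the projection itself are illustrative and not taken from the HBR article.

```python
# Unit conversions stated on the slide (decimal units).
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

def daily_volume_eb(months_after_2012, base_eb_per_day=2.5, doubling_months=40):
    """Project daily data creation, assuming steady doubling every 40 months."""
    return base_eb_per_day * 2 ** (months_after_2012 / doubling_months)

# Walmart: 2.5 PB per hour, expressed per day and converted to exabytes.
walmart_pb_per_day = 2.5 * 24
walmart_eb_per_day = walmart_pb_per_day * TB_PER_PB / TB_PER_EB

print(daily_volume_eb(0))    # 2.5
print(daily_volume_eb(40))   # 5.0
print(daily_volume_eb(80))   # 10.0
print(walmart_pb_per_day)    # 60.0
```

Under these assumptions, Walmart's 60 PB per day is about 0.06 EB, a small fraction of the 2.5 EB created worldwide each day.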

Background: Big Data • What can you do with 1 PB of data? • Data Scientist: The Sexiest Job of the 21st Century (HBR, October 2012) • Data Management Skills • Programming Skills • Data Mining and Data Analysis Skills • Social Skills • Business Understanding

The Semantic Web – a neat, meaningful mate for the messy, unstructured Big Data

WWW and Semantic Web • World Wide Web – Web of Linked Documents: an enormous collection of information (Big Data) intended for people to share and use; keyword-based search • Semantic Web – Web of Data: an emerging vision to make information collected by the WWW processable by machines; computational knowledge-based search/answering

Motivating Example Web Search Example: Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009. This information is available in two different pages of my website welcome.html publications.html

Example (cont): traditional search Google search in Nov. 2009 finds 184 documents One of them mentions my name It is not displayed on the first page of the results It contains my name and affiliation, but no information about the DKE article Google search in Oct. 2010 finds ~19,800 documents Four of them mention my name and affiliation They are not displayed on the first page of the results No information about the DKE article 2011 & 2012: no noticeable improvement

Example (cont): traditional search What went wrong? Keyword-based search interprets my query as a list of syntactic words: professor, UTPA, data, article, publish, knowledge, engineering, 2009 It searches for a document that contains as many matching words as possible PageRank is “biased” towards keyword ‘UTPA’ Moreover, my two pieces of information are viewed as lists of syntactic words. The pieces are not linked!
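The failure mode can be illustrated with a toy keyword matcher. This is a minimal sketch, not how Google or PageRank works: the scoring function, the page texts, and the page unrelated.html are all invented for illustration.

```python
import re

def keywords(text):
    """Lowercase a text and split it into a set of bare words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query, document):
    """Count how many query words also occur in the document."""
    return len(keywords(query) & keywords(document))

query = "professor UTPA article Data Knowledge Engineering 2009"

# Hypothetical page contents, invented for illustration.
pages = {
    "welcome.html": "Artem Chebotko is a professor at UTPA",
    "publications.html": "Article in Data and Knowledge Engineering, 2009",
    "unrelated.html": "UTPA engineering professor publishes data article in 2009",
}

# Rank pages by the number of matching words, best first.
ranked = sorted(pages, key=lambda p: keyword_score(query, pages[p]), reverse=True)
for page in ranked:
    print(page, keyword_score(query, pages[page]))
```

The page that merely repeats the query words ranks first, while the two pages that together hold the answer each match only part of the query and are never combined.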

Example (cont): semantic search How can we do better? Encode the two pieces of information as machine-interpretable data Link them Express (automatically) the natural language query in a machine-friendly query language

Example (cont): encoding
<resource3> <type> <Journal>.
<resource3> <title> “Data & …”.
<resource3> <published> <resource4>.
<resource4> <type> <Article>.
<resource4> <title> “Semantics …”.
<resource4> <year> “2009”.
<resource4> <author> <resource5>.
<resource1> <type> <Professor>.
<resource1> <name> “Artem Chebotko”.
<resource1> <worksIn> <resource2>.
<resource2> <type> <University>.
<resource2> <name> “UTPA”.
<resource1> <sameAs> <resource5>.

Example (cont): linked data! Information from two sources was integrated
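One way to picture the integration step is to hold the triples as plain tuples and collapse resources connected by sameAs. This is a simplified sketch: the function merge_same_as is hypothetical, it treats the subject of each sameAs triple as the representative, and it assumes sameAs links form no cycles. The elided titles are kept exactly as they appear on the slide.

```python
# The example triples as (subject, predicate, object) tuples.
triples = {
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource5"),
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
    ("resource1", "sameAs", "resource5"),
}

def merge_same_as(triples):
    """Rewrite each resource to the representative of its sameAs group."""
    alias = {o: s for s, p, o in triples if p == "sameAs"}

    def canon(term):
        # Follow alias links until a representative is reached.
        while term in alias:
            term = alias[term]
        return term

    return {(canon(s), p, canon(o)) for s, p, o in triples if p != "sameAs"}

merged = merge_same_as(triples)
print(("resource4", "author", "resource1") in merged)  # True
```

After merging, the article's author and the UTPA professor are the same node, so a query can walk from the journal to the professor's name.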

Example (cont): query Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009.
SELECT ?name
WHERE {
?p <type> <Professor>.
?p <name> ?name.
?p <worksIn> ?u.
?u <type> <University>.
?u <name> “UTPA”.
?j <type> <Journal>.
?j <title> “Data & …”.
?j <published> ?a.
?a <type> <Article>.
?a <title> “Semantics …”.
?a <year> “2009”.
?a <author> ?p.
}
Result: ?name = “Artem Chebotko” This is the exact answer to our question
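How such a query finds its answer can be sketched with a tiny pattern matcher over the integrated triples. This is an illustrative toy, not a SPARQL engine: the functions match and evaluate are hypothetical, the data assumes the sameAs link has already been resolved (the article's author is resource1), and the elided titles are kept as on the slide.

```python
# Integrated data, with the sameAs link already resolved (assumption).
triples = [
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource1"),
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
]

def match(pattern, triple, binding):
    """Extend a binding so the pattern equals the triple; None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def evaluate(patterns, triples):
    """Evaluate triple patterns one by one, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings

query = [
    ("?p", "type", "Professor"), ("?p", "name", "?name"),
    ("?p", "worksIn", "?u"), ("?u", "type", "University"),
    ("?u", "name", "UTPA"), ("?j", "type", "Journal"),
    ("?j", "title", "Data & …"), ("?j", "published", "?a"),
    ("?a", "type", "Article"), ("?a", "title", "Semantics …"),
    ("?a", "year", "2009"), ("?a", "author", "?p"),
]

names = [b["?name"] for b in evaluate(query, triples)]
print(names)  # ['Artem Chebotko']
```

Each pattern narrows the set of variable bindings, and the last pattern ties the article's author back to the professor found by the first.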

Semantic Web Technologies

Semantic Web Current State Semantic search/indexing http://sindice.com/ Over 664 million Semantic Web documents as of today ~400 million Semantic Web documents in 2011 ~140 million Semantic Web documents in 2010 ~70 million in 2009

Semantic Web Current State (cont) Semantic Web datasets: DBPedia (~2 billion triples) US Census Data (>1 billion triples) UniProt (>600 million triples) BestBuy (>27 million triples) Semantic Web can potentially grow the size of Web (> 22 billion pages)

Linking Open Data Project (March 2009)

Linking Open Data Project (Sept 2010)

Linking Open Data Project (Sept 2011)

Semantic Web Data Management: Research at UTPA

Semantic Web Data Management: Research at UTPA Roadmap: Research Goals and Current Projects • S2ST • ProvBase • Future Directions

Research Goal and Current Projects Goal: efficient storage and querying of large Semantic Web data sets Projects: S2ST: Relational RDF Database Management System (RRDBMS) http://s2stproject.cs.panam.edu/ ProvBase: Semantic Web Database in the Cloud http://provbase.cs.panam.edu/

S2ST Overview http://s2stproject.cs.panam.edu/

S2ST Definition

S2ST Architecture

S2ST Main Functions Create logical schema User specifies a template for a database schema that will store RDF data Very flexible. Supports the following approaches to schema design: Generic Schema-aware Schema-oblivious Data-driven User-driven Hybrid

S2ST Main Functions (cont) Schema mapping Creates physical schema and database schema in an RDBMS. Data mapping Maps RDF triples into relational tuples and inserts them into the database Query mapping Maps SPARQL queries into SQL that can be evaluated by an RDBMS Most complex mapping
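The three mappings can be sketched end to end with SQLite. This is only an illustration under simple assumptions, not S2ST's implementation: it uses a single generic three-column table named triple (a hypothetical name), and the SQL is written by hand for one small query.

```python
import sqlite3

triples = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
]

conn = sqlite3.connect(":memory:")

# Schema mapping: one generic table that can hold any RDF triple.
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")

# Data mapping: each RDF triple becomes one relational tuple.
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# Query mapping, done by hand for:
#   SELECT ?name WHERE { ?p <type> <Professor>. ?p <name> ?name. }
# Each triple pattern becomes one alias of the table; the shared
# variable ?p becomes a join condition.
sql = """
SELECT t2.o AS name
FROM triple t1 JOIN triple t2 ON t1.s = t2.s
WHERE t1.p = 'type' AND t1.o = 'Professor' AND t2.p = 'name'
"""
rows = conn.execute(sql).fetchall()
print(rows)  # [('Artem Chebotko',)]
```

Even this two-pattern query needs a self-join, which hints at why query mapping is the most complex of the three.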

SPARQL-to-SQL Query Translation • Generic • Reusable • Semantics-preserving • Correct
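A generic translation can be sketched for the simplest case, a list of triple patterns over a single triple(s, p, o) table. The function bgp_to_sql and the table layout are assumptions made for illustration; this is not the S2ST algorithm and it does not cover OPTIONAL, FILTER, or other SPARQL features.

```python
import sqlite3

def bgp_to_sql(patterns, select):
    """Translate triple patterns into one SQL query over triple(s, p, o)."""
    first_use = {}                 # variable -> column where it first appears
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for col, term in zip("spo", pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_use:
                    # Repeated variable: equate with its first occurrence.
                    conditions.append(f"{ref} = {first_use[term]}")
                else:
                    first_use[term] = ref
            else:
                # Constant: pass the value as a bound parameter.
                conditions.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triple t{i}" for i in range(len(patterns)))
    columns = ", ".join(first_use[v] for v in select)
    where = " AND ".join(conditions) or "1 = 1"
    return f"SELECT {columns} FROM {tables} WHERE {where}", params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "name", "UTPA"),
])

patterns = [("?p", "type", "Professor"), ("?p", "name", "?name"),
            ("?p", "worksIn", "?u"), ("?u", "name", "UTPA")]
sql, params = bgp_to_sql(patterns, ["?name"])
rows = conn.execute(sql, params).fetchall()
print(rows)  # [('Artem Chebotko',)]
```

The same function works for any list of patterns, which is the sense of "generic" in this sketch; proving that a translation preserves SPARQL semantics is a separate and harder matter.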

S2ST Fact Sheet Next-generation relational RDF store Relational RDF Database Management System Supports user-driven schema design like in relational databases Supports semantics-preserving SPARQL-to-SQL query translation Supports generic schema, data and query mapping algorithms Supports ~20 RDBMS backends, including Oracle, DB2, PostgreSQL, MySQL, and SQL Server

S2ST Applications VIEW Scientific workflow provenance metadata management GEO-SEED Web services RDF data management

Future Directions Inference support Query optimization Data mapping algorithms Data browsing interface Distributed data management Testing and performance evaluation Data and query visualization Applications

ProvBase Overview http://provbase.cs.panam.edu/

ProvBase: Distributed RDF Provenance Database • Based on Hadoop (http://hadoop.apache.org) • Hadoop Wins Terabyte Sort Benchmark: One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. This is the first time that either a Java or an open source program has won.

ProvBase: Distributed RDF Provenance Database • Based on HBase (http://hbase.apache.org) • HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. • [Figure: sample Bigtable]
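The Bigtable data model is a sparse sorted map indexed by row key, column key, and timestamp. The toy class below sketches that model in memory; it is not the HBase API, and the class TinyTable, the column name rdf:worksIn, and the choice of storing one RDF subject per row are assumptions made for illustration, not ProvBase's actual layout.

```python
from collections import defaultdict

class TinyTable:
    """In-memory toy of a Bigtable-like map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row

table = TinyTable()
# Hypothetical layout: row = RDF subject, column = predicate, value = object.
table.put("resource1", "rdf:worksIn", "resource2", timestamp=1)
table.put("resource1", "rdf:name", "Artem Chebotko", timestamp=1)
table.put("resource2", "rdf:name", "UTPA", timestamp=1)
table.put("resource2", "rdf:name", "UTPA", timestamp=2)  # newer version

print(table.get("resource2", "rdf:name"))          # UTPA
print(list(table.scan("resource1", "resource2")))  # ['resource1']
```

Rows are sparse, so a subject only stores the columns it actually has, and keeping rows sorted by key makes range scans over neighboring subjects cheap.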

ProvBase Architecture

Future Directions SPARQL optional graph pattern support Inference support Query optimization GUI Testing and performance evaluation Data and query visualization Applications

Other Projects • Relational Algebra Toolkit (RAT) http://rat.cs.panam.edu • The University of Texas Provenance Benchmark (UTPB) http://faculty.utpa.edu/chebotkoa/utpb • Student Research Organizer • k-Nearest Keyword Search in RDF Graphs

Thank You! Questions? Artem Chebotko Department of Computer Science University of Texas – Pan American chebotkoa@utpa.edu http://faculty.utpa.edu/chebotkoa
