Achieving Efficient Access to Large Integrated Sets of Semantic Data in Web Applications

Pieter Bellekens, Kees van der Sluijs, William van Woensel, Sven Casteleyn, Geert-Jan Houben
ICWE 2008, 07/16/2008

Introduction
• Context
  • The Semantic Web (SW) is growing
  • Ever more SW datasets are available
  • Legacy data can usually be transformed into SW terms
  • A class of Web applications is emerging that wants to utilize the potential of these Web-sized SW datasets
• Practical application
  • iFanzy, a commercial personalized EPG (electronic program guide)
  • Uses large sets of integrated SW data
  • Runner-up application in the SW Challenge; showed that the class of large-scale SW applications needs more research

iFanzy
• Personalized EPG
• Multi-platform
  • Set-top box, focus on experience
  • Web, focus on interactivity
• Central server architecture
• Integrating data from various heterogeneous sources
  • TV grabbers, IMDb, movie trailers, etc.
• Context-sensitive recommendations
  • Recommendations based on the semantic structure of, e.g., the genre ontology, connections between programs, etc.

iFanzy Datasets (Live)
• Heterogeneous data sources
  • Online TV guides, XMLTV format: sample of 1,278,718, updated daily
  • Online movie databases, IMDb text dumps: currently 53,268,369 (full), 7,986,199 (trimmed) triples
  • Trailers from Videodetective.com (API)
  • Broadcast descriptions, BBC Backstage, TV-Anytime format (domain model): sample of 91,447, updated daily
• Various vocabularies and ontologies

iFanzy Datasets cont.

Converting TV Metadata in RDF/OWL
Input source 1:
  <program title="Match of the Day">
    <channel>BBC One</channel>
    <start>2008-03-09T19:45:00Z</start>
    <duration>PT01H15M00S</duration>
    <genre>sport</genre>
  </program>
Input source 2:
  <program channel="NED1">
    <source>http://foo.bar/</source>
    <title>Sportjournaal</title>
    <start>20080309184500</start>
    <end>20080309190000</end>
    <genre>sport nieuws</genre>
  </program>
Translation to TV-Anytime in RDF/OWL:
  <TVA:ProgramInformation ID="crid://foo.bar/0001">
    <hasTitle>Sportjournaal</hasTitle>
    <hasGenre rdf:resource="TVAGenres:3.1.1.9"/>
  </TVA:ProgramInformation>
  <TVA:Schedule ID="TVA:Schedule_0001">
    <serviceIDRef>NED1</serviceIDRef>
    <hasProgram crid="crid://foo.bar/0001"/>
    <startTime rdf:resource="TIME:TimeDesc_0001"/>
  </TVA:Schedule>
  <TIME:TimeDescription ID="TIME:TimeDesc_0001">
    <year>2008</year>
    <month>3</month>
    <day>9</day>
    <hour>18</hour>
    <minute>45</minute>
    <second>0</second>
  </TIME:TimeDescription>
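
The normalization step behind this translation can be sketched in Python. This is a minimal illustration, not the iFanzy converter: the function names (`time_description`, `normalize_source_1`, `normalize_source_2`) and the dictionary layout are assumptions made for the example. It only shows how two differently shaped inputs can be brought to one record with a decomposed start time.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SOURCE_1 = """<program title="Match of the Day">
  <channel>BBC One</channel>
  <start>2008-03-09T19:45:00Z</start>
  <duration>PT01H15M00S</duration>
  <genre>sport</genre>
</program>"""

SOURCE_2 = """<program channel="NED1">
  <source>http://foo.bar/</source>
  <title>Sportjournaal</title>
  <start>20080309184500</start>
  <end>20080309190000</end>
  <genre>sport nieuws</genre>
</program>"""


def time_description(moment):
    """Split a datetime into the fields used by the TimeDescription element."""
    return {"year": moment.year, "month": moment.month, "day": moment.day,
            "hour": moment.hour, "minute": moment.minute, "second": moment.second}


def normalize_source_1(xml_text):
    """Source 1 keeps the title in an attribute and uses ISO 8601 timestamps."""
    node = ET.fromstring(xml_text)
    start = datetime.strptime(node.findtext("start"), "%Y-%m-%dT%H:%M:%SZ")
    return {"title": node.get("title"), "channel": node.findtext("channel"),
            "genre": node.findtext("genre"), "start": time_description(start)}


def normalize_source_2(xml_text):
    """Source 2 keeps the channel in an attribute and uses compact timestamps."""
    node = ET.fromstring(xml_text)
    start = datetime.strptime(node.findtext("start"), "%Y%m%d%H%M%S")
    return {"title": node.findtext("title"), "channel": node.get("channel"),
            "genre": node.findtext("genre"), "start": time_description(start)}


if __name__ == "__main__":
    print(normalize_source_1(SOURCE_1))
    print(normalize_source_2(SOURCE_2))
```

Mapping the free-text genre ("sport nieuws") to a TV-Anytime term such as TVAGenres:3.1.1.9 is a separate alignment step and is not covered by this sketch.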

Converting Vocabularies in RDF/OWL
TV-Anytime genre terms:
  <Term termID="3.1">
    <Name xml:lang="en">NON-FICTION/INFORMATION</Name>
    <Term termID="3.1.1">
      <Name xml:lang="en">News</Name>
      <Term termID="3.1.1.9">
        <Name xml:lang="en">Sport News</Name>
        <Definition xml:lang="en">News of sports</Definition>
      </Term>
    </Term>
  </Term>
  <Term termID="3.2">
    <Name xml:lang="en">SPORTS</Name>
    <Term termID="3.2.1">
      <Name xml:lang="en">Athletics</Name>
      <Term termID="3.2.1.1">
        …
      </Term>
    </Term>
  </Term>
Translation of TV-Anytime genres to RDF/OWL using SKOS:
  <TVAGenres:genre ID="TVAGenres:3.1.1.9">
    <rdfs:label>Sport News</rdfs:label>
    <skos:broader rdf:resource="TVAGenres:3.1.1"/>
    <skos:related rdf:resource="TVAGenres:3.2"/>
  </TVAGenres:genre>
  <TVAGenres:genre ID="TVAGenres:3.2">
    <rdfs:label>Sport</rdfs:label>
    <skos:related rdf:resource="TVAGenres:3.1.1.9"/>
  </TVAGenres:genre>
  <TVAGenres:genre ID="TVAGenres:3.1.1">
    <rdfs:label>News</rdfs:label>
    <skos:narrower rdf:resource="TVAGenres:3.1.1.9"/>
    <skos:broader rdf:resource="TVAGenres:3.1"/>
  </TVAGenres:genre>
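
The hierarchical part of this translation (nested terms becoming skos:broader and skos:narrower links) can be sketched as a recursive walk. The function name `skos_triples`, the `<Terms>` wrapper element and the tuple representation of triples are assumptions for this example; the skos:related links come from the enrichment step and are not derived here.

```python
import xml.etree.ElementTree as ET

# A fragment of the nested term list, wrapped in a root element for parsing.
TERMS = """<Terms>
  <Term termID="3.1">
    <Name xml:lang="en">NON-FICTION/INFORMATION</Name>
    <Term termID="3.1.1">
      <Name xml:lang="en">News</Name>
      <Term termID="3.1.1.9">
        <Name xml:lang="en">Sport News</Name>
      </Term>
    </Term>
  </Term>
</Terms>"""


def skos_triples(element, parent_id=None):
    """Emit label, broader and narrower triples for every nested Term."""
    triples = []
    for term in element.findall("Term"):  # direct children only
        term_id = term.get("termID")
        triples.append((term_id, "rdfs:label", term.findtext("Name")))
        if parent_id is not None:
            triples.append((term_id, "skos:broader", parent_id))
            triples.append((parent_id, "skos:narrower", term_id))
        triples.extend(skos_triples(term, term_id))
    return triples


if __name__ == "__main__":
    for triple in skos_triples(ET.fromstring(TERMS)):
        print(triple)
```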

Aligning and Enriching Vocabularies
• Alignment of genre vocabularies
  • The content sources use several different genre vocabularies
  • XMLTV:documentaire -> TVA:"Documentary"
  • IMDB:Thriller -> TVA:"Thriller"
  • IMDB:Sci-Fi -> TVA:"Science Fiction"
• Semantic enrichment of the genre vocabulary
  • Via SKOS narrower, broader and related relations
  • News -skos:narrower-> Sport News => original term hierarchy
  • Sport News -skos:related-> Sport => partial label matches
  • Skating -skos:related-> 'Ice skating' => partial label matches
  • 'American Football' -skos:related-> Rugby => domain expert
• Enrichment of the user model
  • Import of a social network profile adds interests in programs, persons (actors, directors, ...), locations, etc.
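
A small Python sketch of the two mechanical steps above: a lookup table for genre alignment, and word overlap as one possible reading of "partial label matches". The names `GENRE_ALIGNMENT`, `align_genre` and `related_candidates` are hypothetical, and the word-overlap rule is an assumption; the slides do not say how partial matches were computed. Pairs already linked in the term hierarchy are skipped, and the output is only a list of candidates for review.

```python
# Alignment pairs taken from the examples above.
GENRE_ALIGNMENT = {
    ("XMLTV", "documentaire"): "Documentary",
    ("IMDB", "Thriller"): "Thriller",
    ("IMDB", "Sci-Fi"): "Science Fiction",
}


def align_genre(source, label):
    """Return the TV-Anytime genre label for a source genre, or None."""
    return GENRE_ALIGNMENT.get((source, label))


def label_words(label):
    """Lower-cased words of a label, used for the overlap test."""
    return set(label.lower().split())


def related_candidates(labels, already_linked=frozenset()):
    """Propose skos:related pairs for labels that share at least one word."""
    candidates = []
    ordered = sorted(labels)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if frozenset((first, second)) in already_linked:
                continue  # e.g. pairs already connected via broader/narrower
            if label_words(first) & label_words(second):
                candidates.append((first, second))
    return candidates


if __name__ == "__main__":
    hierarchy = {frozenset(("News", "Sport News"))}
    labels = ["News", "Sport News", "Sport", "Skating", "Ice skating", "Rugby"]
    print(align_genre("IMDB", "Sci-Fi"))
    print(related_candidates(labels, hierarchy))
```

Links such as 'American Football' to Rugby share no word and would still need a domain expert.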

Aligning and Enriching Vocabularies
• Semantic enrichment of TV metadata with IMDb movie descriptions
  • Programs are matched across sources
  • "Buono, il brutto, il cattivo, Il (1966)" is matched with "The Good, the Bad and the Ugly"
• Use part-of relations in a geographical hierarchy to relate locations in the different sources
  • "White Plains" -> "New York" -> "USA"
• Alignment of date/time descriptions to Time ontology concepts to allow temporal reasoning
  • "2006-01-01T12:00:00" -> <time:year>2006</time:year> <time:day>01</time:day> <time:hour>12</time:hour>
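
How part-of relations let locations from different sources be related can be sketched as a walk up the hierarchy. The `PART_OF` dictionary holds only the example chain from the slide; the function names `enclosing_locations` and `same_region` are hypothetical.

```python
# Part-of links from the example: White Plains is part of New York, which is part of the USA.
PART_OF = {"White Plains": "New York", "New York": "USA"}


def enclosing_locations(location, part_of=PART_OF):
    """Follow part-of links upward and return all enclosing locations."""
    chain = []
    seen = {location}
    while location in part_of:
        location = part_of[location]
        if location in seen:  # guard against accidental cycles in the data
            break
        seen.add(location)
        chain.append(location)
    return chain


def same_region(first, second, part_of=PART_OF):
    """True if the locations are equal or one lies inside the other."""
    return (first == second
            or second in enclosing_locations(first, part_of)
            or first in enclosing_locations(second, part_of))


if __name__ == "__main__":
    print(enclosing_locations("White Plains"))
    print(same_region("USA", "White Plains"))
```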

Using the Semantic Graph
• Recommendations are generated based on usage data, the RDF/OWL graph and behavior analysis
• Search functionality uses the graph to show connections between items
  • Showing semantically related content by following the relationships
• Interface visualization
  • Genres and locations can be browsed in the interface based on their relations to other concepts

Scalability & Performance Issues
• Large-scale SW applications face performance issues with current-day SW tools
  • Current RDF databases are not performance-mature
  • Especially for complex queries
  • Inference is time-consuming or space-intensive
  • RDF databases are generic; they do not use specific knowledge about the sources
• Target: efficient access to our data
  • Low latency: users expect a quick response from Web applications
  • Web 2.0 allows asynchronous updates
  • We need to be able to scale to thousands of users

Technologies and Strategies
• Technological choices
  • RDF database: Sesame (version 1 and 2)
  • Query language: SeRQL
• We looked at different data decomposition strategies
  • Vertical decomposition
  • Horizontal decomposition
• We applied several application-specific optimizations
  • Using a relational database where possible
  • Using a free-text search engine

Natural Solution: One big dataset
• All sources in one repository
• Pro:
  • Data is highly integrated
  • One query to get all data
• Con:
  • Maintenance can be hard
  • The bigger the store, the longer the query execution times (even for simple queries)
• Some typical iFanzy queries together with execution times:
  • Query1: All programs with genre 'drama' (or one of its subgenres)
  • Query2: All programs with genre 'drama' and a keyword in the program metadata (title, synopsis and keywords)
  • Query3: All programs with a keyword in the program metadata (title, synopsis and keywords)
  • Query4: All programs with genre 'drama' and a keyword in the program metadata or the person metadata (person name)

Decomposition
• Table X can be decomposed into x1, x2, …, xn
• Vertical decomposition (splitting properties)
  • The n query results from the decomposition need to be combined to find the final result
  • Building the final result gets more complicated as more tables are involved
• Horizontal decomposition (splitting instances)
  • The n query results from the decomposition need to be united via a UNION to find the final result
  • If the result set needs to be ordered, ordering can only be done after all queries have been executed

Vertical Decomposition
• Splitting the data sources based on properties
• Genres, Geo and Synonyms (WordNet) are split off
• Relations between sources are not broken, due to the uniqueness of URIs
• The result of one query is the input of the next in the query pipeline
  • E.g. synonyms found in WordNet are used to query the data
• Different strategies influence performance greatly (see table)
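
The query pipeline can be illustrated with two in-memory stand-ins for the split-off stores. Everything here is an assumption for the sake of the sketch: the toy data, the second program identifier, and the function names `query_synonyms`, `query_programs` and `pipeline`. In the real system each step would be a query against a separate repository.

```python
# Stand-in for the split-off synonym store (toy data, not WordNet output).
SYNONYMS = {"football": {"football", "soccer"}}

# Stand-in for the program store (toy data; the second identifier is made up).
PROGRAMS = [
    {"id": "crid://foo.bar/0001", "keywords": {"sport", "soccer"}},
    {"id": "crid://foo.bar/0002", "keywords": {"news", "weather"}},
]


def query_synonyms(keyword):
    """Step 1: look up synonyms; fall back to the keyword itself."""
    return SYNONYMS.get(keyword, {keyword})


def query_programs(keywords):
    """Step 2: find programs whose keywords overlap the given set."""
    return [p["id"] for p in PROGRAMS if p["keywords"] & keywords]


def pipeline(keyword):
    """The result of the first query is the input of the second."""
    return query_programs(query_synonyms(keyword))


if __name__ == "__main__":
    print(pipeline("football"))
```

Because both stores refer to the same identifiers, no join table is needed between them; this mirrors the point that unique URIs keep relations intact across the split.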

Horizontal Decomposition
• Splitting the data sources based on instances
• The BBC and XMLTV datasets (which have identical structures) are separated into two tables
• Joining the results is a simple UNION
• Retrieve from one source until enough results are found
• Queries to the split sources can be executed in parallel
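
A rough Python sketch of querying horizontally split sources in parallel and uniting the results. The lists `BBC` and `XMLTV` are toy stand-ins for the two stores, and `query_partition` and `query_all` are hypothetical names. Note that the ordering step runs only after every partition has answered.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the two identically structured stores.
BBC = [{"title": "Match of the Day", "start": "2008-03-09T19:45:00"}]
XMLTV = [{"title": "Sportjournaal", "start": "2008-03-09T18:45:00"}]


def query_partition(partition, predicate):
    """Run the same query against one partition."""
    return [row for row in partition if predicate(row)]


def query_all(partitions, predicate, limit=20):
    """Query all partitions in parallel, UNION the results, then order them."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = list(pool.map(lambda p: query_partition(p, predicate), partitions))
    merged = [row for part in parts for row in part]  # the UNION step
    merged.sort(key=lambda row: row["start"])         # only possible after all queries finish
    return merged[:limit]


if __name__ == "__main__":
    for row in query_all([BBC, XMLTV], lambda row: True):
        print(row["start"], row["title"])
```

When no ordering is required, the merge could instead stop as soon as one source has delivered enough rows, which is the "retrieve until enough results are found" strategy.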

Horizontal Decomposition cont.
• The biggest data source (the IMDb set) also accounts for the biggest delay in responsiveness
• Although it contains nearly one million movies, only a fraction are known to the general public
• Indicator: the more votes a movie has received, the better known it is
• Trimming the IMDb database based on the number of votes (see table)
• Keeping only movies with more than 500 votes resulted in 11,500 movies, or 7,986,199 triples, in the database
• Query time was also reduced considerably
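
The trimming step amounts to a threshold filter on the vote count, applied before the triples are loaded. A minimal sketch, assuming triples are plain tuples and votes are available per movie identifier; `trim_triples` and the toy data are hypothetical.

```python
def trim_triples(triples, votes, min_votes=500):
    """Keep only triples whose subject is a movie with more than min_votes votes."""
    kept = {movie for movie, count in votes.items() if count > min_votes}
    return [triple for triple in triples if triple[0] in kept]


if __name__ == "__main__":
    # Toy data: identifiers and vote counts are made up for the example.
    votes = {"movie:1": 12000, "movie:2": 40}
    triples = [
        ("movie:1", "title", "A well-known movie"),
        ("movie:2", "title", "An obscure movie"),
        ("movie:1", "year", "1966"),
    ]
    print(trim_triples(triples, votes))
```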

Reasoning optimization
• In RDF, we can reason over facts to deduce new facts
• Inference can be pre-calculated
  • -> More triples in the database
• Inference can be taken into account while querying
  • -> Much more complex queries
• Inference for sublocations ("California": 8877 sublocations)
• Inference for subgenres ("Action": 10 subgenres)
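
The trade-off between the two options can be sketched with the genre hierarchy. `descendants` does the work at query time; `materialize` pre-calculates the same closure as extra triples. The function names are hypothetical, the `NARROWER` dictionary holds only the three terms from the vocabulary example, and using skos:broaderTransitive for the stored closure is an assumption of this sketch.

```python
# Direct narrower links between the example genre terms.
NARROWER = {"3.1": ["3.1.1"], "3.1.1": ["3.1.1.9"]}


def descendants(term, narrower):
    """Query-time inference: collect all direct and indirect subterms."""
    found = set()
    stack = list(narrower.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(narrower.get(current, []))
    return found


def materialize(narrower):
    """Pre-calculated inference: store one extra triple per ancestor/descendant pair."""
    return sorted(
        (child, "skos:broaderTransitive", term)
        for term in narrower
        for child in descendants(term, narrower)
    )


if __name__ == "__main__":
    print(descendants("3.1", NARROWER))
    print(len(materialize(NARROWER)), "materialized triples")
```

Here two direct links become three stored triples; with 8877 sublocations under a single location, the growth of the store is correspondingly larger, while the query itself becomes a simple lookup.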

Further optimization
• Different types of databases
  • Some well-structured data repositories can be stored in relational databases
  • Different versions of Sesame, or different triple store back-ends for Sesame, can have a large impact on performance
• Using the LIMIT clause where applicable
  • In the interface, users usually browse results in chunks of 20, and more can be retrieved on request
  • If the result set needs to be ordered, this advantage is lost
• Keyword indices
  • Keeping indices over all strategic metadata fields helps when the user searches for keywords
  • Using a Lucene index also tolerates misspellings, while maintaining high speed
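
The keyword index and the chunked retrieval can be illustrated together. This sketch does not use Lucene: it builds a plain inverted index and uses `difflib.get_close_matches` from the Python standard library as a stand-in for fuzzy matching. `build_index`, `search`, the field names and the similarity cutoff are assumptions for the example.

```python
from collections import defaultdict
from difflib import get_close_matches


def build_index(programs, fields=("title", "synopsis")):
    """Inverted index: lower-cased word -> set of program identifiers."""
    index = defaultdict(set)
    for program_id, metadata in programs.items():
        for field in fields:
            for word in metadata.get(field, "").lower().split():
                index[word].add(program_id)
    return index


def search(index, keyword, offset=0, limit=20):
    """Return one chunk of hits, tolerating small misspellings in the keyword."""
    matches = get_close_matches(keyword.lower(), list(index), n=5, cutoff=0.8)
    hits = sorted(set().union(*(index[match] for match in matches)))
    return hits[offset:offset + limit]


if __name__ == "__main__":
    programs = {"crid://foo.bar/0001": {"title": "Sportjournaal"}}
    index = build_index(programs)
    print(search(index, "sportjournal"))  # misspelled on purpose
```

The `offset` and `limit` parameters mirror the chunks of 20 shown in the interface; the next chunk is fetched by raising `offset`.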

Conclusions
• General conclusions
  • The maturing SW provides lots of available data
  • Data integration and alignment are necessary; this involves some manual transformation, but it is surmountable
  • Many options to tune performance are possible by combining techniques
• iFanzy conclusions
  • A concrete product built with bleeding-edge technology, showing that it can be done
  • SW technologies have matured enough to become commercially interesting: iFanzy is starting to sell

Current and Future Work
• Benchmarking with many alternative SW data back-ends
  • OWLIM, Oracle, Jena, Mulgara, SWI-Prolog, etc.
• Parallelization using top-end servers
  • Scaling to thousands of users
• Further research on semantic recommendation algorithms
  • Combining SW techniques with collaborative filtering
• User tests and product refinement
  • First: 300 customers have been selected for user tests
  • User interviews and usage data as feedback to improve the application

Questions… http://www.ifanzy.nl
