QUASAR query language and system Luying Chen and Michael Benedikt (Computer Science Department, University of Oxford), Evgeny Kharlamov (KRDB Research Centre, Free University of Bozen-Bolzano)
QUASAR system [Quasar’12] • QUASAR system is about • QUerying • Annotations • Structure And • Reasoning • QUASAR is • a query answering system • to query annotated data • and exploit the structure of the data • together with logical reasoning over annotations
QUASAR system • QUASAR = Querying Annotations Structure And Reasoning • Annotations come from annotated data • What is this data? • What is the source of this data? • Structure is the data / documents' structure • Which documents? • Why are they annotated? • Reasoning over annotations to improve the quality of query answering • Why is reasoning possible? • Why is it beneficial?
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
Semantically annotated Web • Goal: • to nest semantics within existing content on web pages • to help search engines, crawlers and browsers find the right data • Person: • name • photo • URL • ... (diagram: text → annotated text)
Standards for semantic markup • Microformats • started in 2003 • small data islands within HTML pages • small set of fixed formats • hCard: people, companies, organizations, and places • XFN: relationships between people • hCalendar: calendaring and events • RDFa: Resource Description Framework in attributes • proposed in 2004, W3C recommendation • serialization format for embedding RDF data into HTML pages • can be used together with any vocabulary, e.g. FOAF • Microdata • alternative technique for embedding structured data • proposed in 2009, comes with HTML 5
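All three standards embed machine-readable statements directly in HTML. As a rough illustration of what extraction pipelines (e.g. WebDataCommons) do with such pages, here is a minimal Python sketch using the third-party `extruct` library to pull Microdata out of a page fragment; the HTML snippet and the call details are illustrative and not part of QUASAR.

```python
# Minimal sketch: extracting schema.org Microdata embedded in an HTML fragment.
# Assumes the third-party `extruct` library is installed; not part of QUASAR.
import extruct

html = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Ferdinand Magellan</span>
</div>
"""

# extruct returns a dict keyed by syntax name; each value lists the items found.
data = extruct.extract(html, syntaxes=["microdata"])
print(data["microdata"])  # the schema.org Person item with its 'name' property
```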
Is semantic markup important? • Schema.org initiative: • started in June 2011 • initiated by Bing, Google, Yahoo! • they propose: to mark up / annotate websites with metadata • they support: Microdata
Is semantic markup important? • Metadata by Schema.org: • Person • Organization • Event • Place • Product • ... • 200+ types
Who uses semantic markup? • Common Crawl foundation • goal: building and maintaining an open crawl of the Web • WebDataCommons.org project • goal: extracting Microformats, Microdata, RDFa from the Common Crawl corpus • Feb 2012: • processed 1.4 billion HTML pages of the CC corpus • 20.9 Terabytes of compressed data • this is a big fraction of the Web
Who uses semantic markup? • 1.4 billion HTML pages processed • 188 million of them contain structured data in Microformats, Microdata, RDFa [CB'12] • This data amounts to 3.2 billion RDF triples 13% of the HTML pages contain structured (meta) data
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
Automatic document annotation • There are more and more systems that do automatic text annotation • OpenCalais, Evri API, Alchemy API, Zemanta, ... • How they work: • intelligent processing of textual data • use of machine learning • use of natural language processing • ... Goal of annotation: transforming text and webpages into knowledge
What are annotated documents? Annotated document, screenshot from OpenCalais • An annotated document is • a sequence of tokens • with annotations overlaying them • Each annotation has • a span: start and end token • a type: • concept (e.g., Person), • sentiment (e.g., positive), etc. • a canonical name (e.g., Ferdinand Magellan) • a URI (e.g., a link to DBpedia) • an accuracy of recognition • ...
Annotated doc = bag of annotations Annotated document, Screenshot from OpenCalais • ABox statements + metadata
Types of Annotations • Concept annotations: • Person(Ferdinand Magellan) • Continent(Europe) • (n-ary) Relationship annotations, i.e., events and facts: • Person_Career(John II, King, Political, current) • General_Relation(Another expl., name, Magellan) • Person_Travel(Henry Hudson, Delaware, past) • Born_In(Magellan, Portugal) • Travel(Magellan, Spain, September Past) • Sentiment annotations: • Positive(the first in Europe) • Neutral(Mediterranean Sea) • Negative(died last year)
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
Issues with missing information • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • Triples are missing lots of information: • Who discovered the triple? -- reliability of triples • In which corpus / paragraph does it appear? -- coordinates of triples • What is the URI of an annotated object? -- disambiguation • .... We claim that this missing information is vital for answering queries
Issues with sets vs. (ordered) bags • Return places visited by Magellan • Available triples: • Magellan Type Person -- occurs 50 times (in the corpus) • Siberia Type Place -- occurs 2 times • Philippines Type Country -- occurs 240 times • Charles Type City -- occurs 1 time • A triple store is a set of triples • Every triple has the same "weight" or "importance" • document order and distance between triples is ignored • Some triples are in the triple set • due to annotator mistakes or • because they are noise. To avoid irrelevant triples: ordered bags, or triples with weights, are needed
Issues with joins • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • the triple (Magellan Visited Philippines) is absent • Correlations between triples are missing • How can we join triples? • Standard way: using values • It does not help in our case. Structure (same paragraph), names of annotations, etc. are a way to join triples
Issues with "being in the box" • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • How can we find out that (Country SubClassOf Place)? • The schema might not be available • How can we be sure that Charles is indeed a city? • annotators make mistakes. Using external sources of knowledge is the way to go: DBpedia, Yago, ...
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
QUASAR data model • Nested objects: • annotation • snippet • assertion • arg[i] • naive name (string) • canonical name (string) • a list of URIs (list of strings) • Strings: • corpus, document, paragraph, sentence number • predicate • annotator • annotation type
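A rough way to picture this nesting is the sketch below; the field names are chosen to match the attribute paths used in the example queries (a.snippet.paraNum, a.assertion.predicate, a.assertion.arg[i], ...), but the classes themselves are illustrative, not the actual QUASAR implementation.

```python
# Illustrative sketch of the nested QUASAR objects; not the actual implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Snippet:              # location of the annotation in the corpus
    corpus: str
    docNum: int
    paraNum: int
    sentNum: int

@dataclass
class Arg:                  # one argument of an assertion
    naive_name: str         # surface string as it appears in the text
    canonical_name: str     # normalized name, e.g. "Ferdinand Magellan"
    uris: List[str] = field(default_factory=list)   # e.g. links to DBpedia

@dataclass
class Assertion:            # the logical content of the annotation
    predicate: str          # e.g. "Person", "Country", "Person_Travel"
    arg: List[Arg] = field(default_factory=list)

@dataclass
class Annotation:           # top-level object queried in QUASAR
    snippet: Snippet
    assertion: Assertion
    annotator: str          # e.g. "OpenCalais"
    annotationType: str     # e.g. "entity", "event", "sentiment"
```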
Query answering over annotated docs • We want to retrieve annotations with specified • doc location • type • predicates • entities • sentiments • URIs • annotators • confidence • ... • QUASAR is an annotation-oriented query language • Return annotations about places visited by Magellan
Example QUASAR queries • Return annotations from the first two paragraphs of the corpus: SELECT a FROM explorer_corpus.Annotation a WHERE a.snippet.paraNum <= 2 • Return annotations found by OpenCalais: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotator = "OpenCalais"
Example QUASAR queries • Return event annotations: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotationType = "event" • Return annotations about persons: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.predicate = "Person" The same query in atom-based notation: ... WHERE a.assertion = Person(?x)
Example QUASAR queries • Return annotations about Magellan • Fuzzy match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] like "Magellan" • Exact match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] = "Magellan"
Example QUASAR queries • Which places did Magellan visit? • Return (annotations about) countries located in the same paragraph as (annotations with the assertion) Person(Magellan): SELECT a FROM explorer_corpus.Annotation a, explorer_corpus.Annotation b WHERE a.assertion = Country(?x) and a.snippet.docNum = b.snippet.docNum and a.snippet.paraNum = b.snippet.paraNum and b.assertion.predicate = "Person" and b.assertion.arg[0] like "Magellan"
QUASAR queries: general form • The QUASAR query language combines • SQL syntax with • object-oriented navigation • At the moment we support conjunctive queries only • Three clauses of queries: SELECT annotation | attribute of annotation (output annotations from one of the annotation sets, or an attribute of annotations) FROM annotation set a, ..., annotation set n (list of sets of annotations) WHERE conditions on annotations (filter on annotations: conditions on annotation attributes, joins of annotations)
Are we happy with the quality of answers? • There are too few expected answers • Country(Philippines) – where are the cities? • Place(Atlantic Ocean) – how can we avoid oceans? • How to find all relevant places visited by Magellan? a.assertion.predicate = "Country" | "Province" | "City" | ... • How to get rid of Oceans? a.assertion.predicate = not "Ocean" • Annotation vocabularies (concepts, roles, ...) are flat • Annotators cannot expand queries automatically => the user has to do it and write many or complex queries. • How can this be avoided? • We use ontologies to • address the "too few expected answers" problem • by expanding queries
TBox reasoning in QUASAR • Return all (explicit and implicit) places which are not Oceans: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = ?X(?y) [ontologyFilter: subClassOf(?X, "Place") and disjointWith(?X, "Ocean")] • There are many tools for TBox reasoning • Pellet, Racer, Jena, ... • We use Jena for TBox reasoning and support • subclass of • disjoint with • We allow users to upload ontologies for reasoning • For the demo we use the ontology of DBpedia extended with disjointness assertions
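To make the expansion step concrete, here is a small Python sketch of what such an ontology filter amounts to: take the subclasses of "Place" delivered by a TBox reasoner, keep only those known to be disjoint with "Ocean", and match annotation predicates against that set. QUASAR delegates this reasoning to Jena; the two dictionaries below merely stand in for the reasoner's output and are hypothetical.

```python
# Sketch of the ontologyFilter expansion; the class hierarchy and the
# disjointness facts below are hypothetical stand-ins for a TBox reasoner's output.
subclasses_of = {
    "Place": {"Place", "Country", "Province", "City", "Ocean"},
}
disjoint_with = {
    "Ocean": {"Country", "Province", "City"},   # classes known to be disjoint with Ocean
}

def admissible_predicates(super_class: str, excluded_class: str) -> set:
    """Subclasses of super_class that are disjoint with excluded_class."""
    candidates = subclasses_of.get(super_class, {super_class})
    return {c for c in candidates if c in disjoint_with.get(excluded_class, set())}

def matches(annotation_predicate: str) -> bool:
    return annotation_predicate in admissible_predicates("Place", "Ocean")

print(matches("City"))    # True  -> a City is a Place and is disjoint with Ocean
print(matches("Ocean"))   # False -> filtered out by the disjointness condition
```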
Are we happy with the quality of answers? • There are too many wrong answers: Country(John II) and Person(Strait of Magellan) • Annotators make errors • How can we do a semantic check on the results of annotators? • Knowledge bases (KBs) can be used to check the quality of answers • We use available knowledge bases to • address the "too many wrong answers" problem • by exploiting them as filters
Query answering over KBs in QUASAR • Return all places known by DBpedia to be populated: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Place(?y) [ontologyFilter: Populated(?y)] • Return all organizations known by DBpedia to be educational institutes located in Bolzano: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Organization(?y) [ontologyFilter: EducationInstitute(?y) and locatedIn(?y, "Bolzano")]
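As a rough illustration of using a public KB as a filter, the sketch below asks DBpedia whether an annotated entity is a populated place via a SPARQL ASK query. It uses the SPARQLWrapper library and takes dbo:PopulatedPlace as an assumed stand-in for the "Populated" filter above; QUASAR itself answers such checks through its own reasoning machinery rather than a live SPARQL endpoint.

```python
# Minimal sketch: filter an entity against DBpedia with a SPARQL ASK query.
# dbo:PopulatedPlace is an assumed stand-in for the "Populated" filter above.
from SPARQLWrapper import SPARQLWrapper, JSON

def is_populated_place(entity_uri: str) -> bool:
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"ASK {{ <{entity_uri}> a <http://dbpedia.org/ontology/PopulatedPlace> }}")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

# Keep only Place annotations whose DBpedia URI passes the check, e.g.:
print(is_populated_place("http://dbpedia.org/resource/Philippines"))  # expected: True
```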
Query answering over KBs in QUASAR • There are tools to support query answering over KBs • Quest [Quest] • Owlim [Owlim] • ... • QUASAR uses REQUIEM [RQ] and supports • conjunctive queries • over KBs with ontologies that have • subclass-of • disjointness • In QUASAR users can choose the KBs to be used for reasoning • For the demo we use DBpedia
QUASAR philosophy of query answering • QUASAR = Querying Annotations Structure and Reasoning • External ontologies are used to • do query expansion • increase number of answers • Annotated documents are used to • retrieve annotations, their attributes • filter annotations using • structure of documents • metadata of annotations • External KBs are used to • filter out wrong instances • reason over instances
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
Top-k • The answer set is comparable to the corpus size • annotators are able to discover a lot of entities • the same entity can be annotated with several annotations • combinations of several annotators make it even worse • Ranking is needed • Directions: • deterministic ranking based on • document structure (the closer to a reference point, the higher the ranking) • frequency (the higher the frequency of an annotation, the higher the ranking) • reliability of annotators • ... • probabilistic ranking based on statistics of mistakes
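A minimal sketch of such a deterministic ranking is given below: an answer is scored by how often its assertion occurs, how close it sits to a reference annotation (e.g. the Person(Magellan) match), and how reliable its annotator is assumed to be. The weights and reliability values are made up for illustration; this is not the QUASAR ranking function.

```python
# Sketch of a deterministic top-k ranking; reliability values are hypothetical.
ANNOTATOR_RELIABILITY = {"OpenCalais": 0.9, "Zemanta": 0.7}

def score(annotation, frequency: int, reference) -> float:
    distance = abs(annotation.snippet.paraNum - reference.snippet.paraNum)
    closeness = 1.0 / (1 + distance)            # closer to the reference point -> higher
    reliability = ANNOTATOR_RELIABILITY.get(annotation.annotator, 0.5)
    return frequency * closeness * reliability

# Top-k: sort candidate answers by score and keep the first k, e.g.
# ranked = sorted(candidates, key=lambda a: score(a, freq[a], ref), reverse=True)[:k]
```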
Visualization • What would be the best way to display answers? • to make the answers more intuitive • good aggregation mechanisms are needed due to large answer sets • The system does not support projections of annotations at the moment • How to visualize different projections? • User studies are needed
User feedback • A form of (indirect) feedback loop is currently present: • user asks a query • user observes the answer set • user refines the query • go to 1. • This is the standard feedback approach in search engines: • the user sends keywords to Google • the user refines the keywords if the result is not good • We want to incorporate direct feedback: • the user should be able to rate answers • good – bad • keep – dismiss • based on the feedback the system should adjust the answer set
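One way such direct feedback could adjust the answer set is sketched below: a "keep" vote boosts answers similar to the rated one, a "dismiss" vote demotes them. Treating "similar" as "same assertion predicate" and using a 0.25 step are purely illustrative choices, not part of QUASAR.

```python
# Illustrative sketch of feedback-driven re-ranking; the similarity notion
# (same predicate) and the 0.25 step are arbitrary choices for this example.
def apply_feedback(scored, rated, verdict: str):
    """scored: list of (annotation, score) pairs; rated: the annotation the user judged."""
    step = 0.25 if verdict == "keep" else -0.25
    adjusted = [
        (a, s + step if a.assertion.predicate == rated.assertion.predicate else s)
        for a, s in scored
    ]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```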
Probabilistic query answers • Currently the answers are deterministic • We want answers of the form: • (Magellan Visited Philippines) with probability 0.7 • We are working on a probabilistic model • it combines several annotators based on their reliability • the model is an annotator itself • it is a form of probabilistic transducer • it produces annotations with probabilities • probabilities are based on the aggregate opinion of other annotators
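A very simple way to picture such an aggregation is the weighted-vote sketch below: each base annotator votes on a candidate triple, votes are weighted by an assumed per-annotator reliability, and the normalized weight of the agreeing annotators is read as a confidence. This only illustrates the idea, not the probabilistic transducer itself; the reliability values are hypothetical.

```python
# Weighted-vote sketch of combining annotators; reliability values are hypothetical.
RELIABILITY = {"OpenCalais": 0.9, "Zemanta": 0.7, "AlchemyAPI": 0.6}

def combined_probability(votes: dict) -> float:
    """votes maps annotator name -> True/False (did it produce this triple?)."""
    total = sum(RELIABILITY[a] for a in votes)
    agree = sum(RELIABILITY[a] for a, v in votes.items() if v)
    return agree / total if total else 0.0

# (Magellan Visited Philippines) produced by two of three annotators:
print(combined_probability({"OpenCalais": True, "Zemanta": True, "AlchemyAPI": False}))
# -> 0.727..., roughly the 0.7 confidence used in the example above
```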
Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary
Summary • Annotations are an important reality of today's Web • 13% of crawled HTML pages contain them • any text can be easily annotated using automatic annotators • There is a need for query answering • techniques and • tools in order to leverage annotated data for intelligent information search • Current approaches to query answering over triple stores are not adequate, or at least hard to adopt directly • QUASAR's response to the problem: • a data model • a query language • a demo system
References • [CB'12] C. Bizer. Topology of the Web of Data. Joint keynote talk at LWDM 2012 and BEWEB 2012, EDBT workshops, Berlin, Germany, March 2012. • [Quest] http://obda.inf.unibz.it/protege-plugin/quest/quest.html • [Owlim] http://www.ontotext.com/owlim • [RQ] http://www.cs.ox.ac.uk/projects/requiem/index.html • [Quasar'12] L. Chen, M. Benedikt, and E. Kharlamov. QUASAR: Querying Annotation, Structure, and Reasoning. In Proc. of EDBT, Berlin, March 2012. Demonstration. • [AlchemyAPI] www.alchemyapi.com/api/entity/ • [EvriAPI] www.evri.com/ • [JenaAPI] jena.sourceforge.net • [OpenCalais] www.opencalais.com
References • [KIM'04] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff. Semantic annotation, indexing, and retrieval. J. Web Semantics, 2(1):49–79, 2004. • [Docqs'10] M. Zhou, T. Cheng, and K. C.-C. Chang. DoCQS: a prototype system for supporting data-oriented content query. In SIGMOD, 2010.