Presented By: Kiran Kancharlapalli DBMS - Topics 11 & 12
What is Semantic Interoperability? • Ability of computer systems to transmit data with unambiguous, shared meaning • Data must be made available between heterogeneous agents • Metadata must also be made available, allowing a software agent to learn how to interpret the data • Document Type Definition • XML-Schema • RDF Annotations • Requirement to enable machine computable logic, inferring and knowledge discovery between information systems. • Results in Semantic Web
How is it accomplished? • By adding data about the data (metadata), linking each data element to a controlled, shared vocabulary • The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system • Syntactic interoperability is a prerequisite for semantic interoperability • It refers to the packaging and transmission mechanisms for data
What is Semantic Web? • Will be able to provide justified answers to natural language questions • Current search engines provide lists of resources that are supposed to contain the answer • Knowledge rather than plain data would be retrieved i.e. data which is relevant to the user’s task • Social factors such as privacy and trust would also be taken into account
Benefits • Search can often be frustrating because of the limitations of keyword-based matching techniques. • Users frequently experience one of two problems: • either they get back no results or • too many irrelevant results. • The problem is that words can be synonymous (that is, two words have the same meaning) or polysemous (a single word has multiple meanings). • However, if the languages used to describe web pages were semantically interoperable, then the user could specify a query in the terminology that was most convenient, and be assured that the correct results were returned, regardless of how the data was expressed in the sources.
What are Ontologies? • Content theories about the kinds of objects possible in a specified domain • A representation vocabulary, specialized to some domain or subject matter • Provide potential terms for describing knowledge about the domain • Translating the terms in an ontology from, say, English to French does not change the ontology conceptually
What are Ontologies? • Designed for reuse across multiple applications and implementations
Motivation • select EMPDAT from PERSTAB where POS='mgmnt' • What does it mean? • PERSTAB is a table which lists employee data • What’s an employee? How is an employee different from a contractor? What if I want data on both? • Even if this information is available in English, a human has to read it
Motivation (cntd…) • "Parenthood is a more general relationship than motherhood." • "Mary is the mother of Bill." • "Who are Bill's parents?" • "Mary is the parent of Bill." • That fact is not stated anywhere, but it can be derived by a DAML application.
More formally stated, the statements (motherOf subProperty parentOf) (Mary motherOf Bill) • when stated in DAML, allow you to conclude (Mary parentOf Bill) • Java code or a stored procedure could do this sort of inference for facts in XML or SQL • But the DAML spec itself says the conclusion is true • In contrast, different Java code could reach a different conclusion
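To make the sub-property rule concrete, here is a minimal Python sketch of the inference step. It is exactly the kind of hand-written code the slide contrasts with DAML, so it only illustrates the rule and is not a DAML reasoner. The triple layout and the function name `infer_superproperties` are hypothetical, chosen for this example.

```python
# Toy illustration only: a hand-rolled rule for sub-property entailment.
# The triple layout and the name infer_superproperties are hypothetical.

def infer_superproperties(facts, subproperty_of):
    """Return the facts plus every triple entailed by the sub-property axioms."""
    inferred = set(facts)
    changed = True
    while changed:  # repeat until no new triple appears (handles chains of axioms)
        changed = False
        for subject, prop, obj in list(inferred):
            for sub, sup in subproperty_of:
                triple = (subject, sup, obj)
                if prop == sub and triple not in inferred:
                    inferred.add(triple)
                    changed = True
    return inferred


# (motherOf subProperty parentOf) and (Mary motherOf Bill), as on the slide
subproperty_of = {("motherOf", "parentOf")}
facts = {("Mary", "motherOf", "Bill")}

closure = infer_superproperties(facts, subproperty_of)
parents_of_bill = sorted(s for s, p, o in closure if p == "parentOf" and o == "Bill")
print(parents_of_bill)
```

The point of the slide still stands: here the conclusion holds only because this particular function was written that way, whereas in DAML the language specification itself licenses it.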
Everything is not a nail • Ontology is not always the right tool for the job • Face recognition, vehicle control systems etc – not the right applications for ontology
Many Ways to Use Ontology • As an information engineering tool • Create a database schema • Map the schema to an upper ontology • Use the ontology as a set of reminders for additional information that should be included • As more formal comments • Define an ontology that is used to create a DB or OO system • Use a theorem prover at design time to check for inconsistencies • For taxonomic reasoning • Do limited run-time inference in Prolog, a description logic, or even Java • For first order logical inference • Full-blown use of all the axioms at run time
Upper Ontology • An attempt to capture the most general and reusable terms and definitions
Motivation to capture Upper Ontology • Ontologies may have different names for the same things • type – a relation between a class and an instance • instance – a relation between a class and an instance • isa – a relation between a class and an instance • … • Ontologies may have the same name for different things, and no corresponding terms • before – a relation between two time points • before – a relation between two time intervals • Either use the same upper ontology, or at least map to a common upper ontology
Some Formal Upper Ontologies • DOLCE • Cyc • SUMO
Simple Methodology • Extract nouns and verbs from a source text • Find classes in SUMO for the nouns and verbs • Record a mapping as being either equal, subsuming or instance. • type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser • Create a subclass of SUMO if it's a subsuming mapping • Add properties to the subclass • reusing SUMO properties • extending SUMO properties by creating a &%subrelation of an existing property • Add English definition to the class • define constraints that express how the subclass is more specific than the superclass • Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously
High Level Distinctions • The first fundamental distinction is that between ‘Physical’ (things which have a position in space/time) and ‘Abstract’ (things which don’t)
High Level Distinctions • Partition of ‘Physical’ into ‘Objects’ and ‘Processes’
DBpedia: A Nucleus for a Web of Open Data • DBpedia.org is an effort to: • extract structured information from Wikipedia • make this information available on the Web under an open license • interlink the DBpedia dataset with other datasets on the Web
Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links • Other languages • Other wiki pages • To the web • Redirects • Disambiguates
Extracting Structured Information from Wikipedia • Wikipedia consists of • 6.9 million articles • in 251 languages • monthly growth rate: 4% • Wikipedia articles contain structured information • infoboxes, which use a template mechanism • images depicting the article’s topic • categorization of the article • links to external webpages • intra-wiki links to other articles • inter-language links to articles about the same topic in different languages
[Figure: DBpedia architecture diagram. Labels, top to bottom: Traditional Web Browser, Semantic Web Browsers, Web 2.0 Mashups, SNORQL Browser, Query Builder; Linked Data, SPARQL Endpoint; published via Virtuoso, MySQL; loaded into; DBpedia datasets (Categories, Articles, Infobox); Extraction; Wikipedia Dumps (Article texts, DB tables)]
DBpedia Basics • The structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content. • The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. • The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
The DBpedia Dataset • 1,600,000 concepts • including • 58,000 persons • 70,000 places • 35,000 music albums • 12,000 films • described by 91 million triples • using 8,141 different properties. • 557,000 links to pictures • 1,300,000 links to external web pages • 207,000 Wikipedia categories • 75,000 YAGO categories
Accessing the DBpedia Dataset over the Web 1. SPARQL Endpoint 2. Linked Data Interface 3. DB Dumps for Download
SPARQL • SPARQL is a query language for RDF. • RDF is a directed, labeled graph data format for representing information in the Web. • This specification defines the syntax and semantics of the SPARQL query language for RDF. • SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.
The DBpedia SPARQL Endpoint • http://dbpedia.org/sparql • hosted on an OpenLink Virtuoso server • can answer SPARQL queries like • Give me all sitcoms that are set in NYC • All tennis players from Moscow • All films by Quentin Tarantino • All German musicians who were born in Berlin in the 19th century
Example • To find everything Bart wrote on the blackboard in season 12 of The Simpsons: • The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples. • The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12". • The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field. • Query:
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
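As a rough sketch of how such a query could be sent to the endpoint, the Python below only builds the request URL and makes no network call. The `query` parameter name comes from the SPARQL protocol; the `format` parameter and its value are an assumption about the Virtuoso server, and the helper name `build_request_url` is hypothetical. The query also assumes that the `skos:` and `dbpedia2:` prefixes are predefined by the endpoint, and whether it still returns results today has not been checked.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given in the slides

# Assumes the endpoint predefines the skos: and dbpedia2: prefixes.
QUERY = """
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
"""


def build_request_url(endpoint, query, result_format="application/sparql-results+json"):
    """Build a GET URL for a SPARQL endpoint (hypothetical helper).

    'format' is assumed to be understood by the server; drop it if it is not.
    """
    params = {"query": query.strip(), "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
print(decoded_query == QUERY.strip())
```

The resulting URL could then be fetched with any HTTP client; that step is left out so the sketch stays runnable offline.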
Possible Improvements • Better data cleansing required. • Improvement in the classification. • Interlink DBpedia with more datasets. • Improvement in the user interfaces. • Performance • Scalability • More Expressiveness
Questions for Discussion • DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors. • Which is the better approach? • Can Freebase or DBpedia be a substitute for Wikipedia? • Freebase: a drawback is that we would have two similar resources, Wikipedia and Freebase • DBpedia: a drawback is that it extracts data from dumps • How can we interlink Freebase and DBpedia? • What could be killer applications using DBpedia? • If there are some, fine • If there are none, do we really need a large, general store of structured knowledge?
Uncertainty propagation • Every physical quantity has: • A value or size • Uncertainty (or ‘Error’) • Units • Without these three things, no physical quantity is complete. • When quoting your measured result, follow these simple rules: • Ex: A = 1.71 ± 0.01 m • Always include units! (But if the quantity is dimensionless, say so.) • Always quote the main value to the same number of decimal places as the uncertainty • Never quote the uncertainty to more than 1 or 2 significant figures (more would make no sense)
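The quoting rules above can be sketched as a small Python helper. This is one possible reading of the rules, not a standard routine: the function name `format_measurement` and its parameters are hypothetical, and for uncertainties of 10 or more it simply rounds to whole numbers rather than to powers of ten.

```python
import math


def format_measurement(value, uncertainty, unit="", sig_figs=1):
    """Quote value ± uncertainty with matching decimal places (hypothetical helper)."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit of the uncertainty, e.g. -2 for 0.0123
    exponent = math.floor(math.log10(uncertainty))
    # Decimal places needed to show the requested significant figures
    decimals = max(sig_figs - 1 - exponent, 0)
    rounded_uncertainty = round(uncertainty, decimals)
    rounded_value = round(value, decimals)
    text = f"{rounded_value:.{decimals}f} ± {rounded_uncertainty:.{decimals}f}"
    # A dimensionless quantity is passed with unit="" and gets no unit suffix
    return f"{text} {unit}".strip()


print(format_measurement(1.7132, 0.0123, "m"))
```

Passing `sig_figs=2` keeps two significant figures in the uncertainty, which the slide also allows.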
Terminology: ‘Uncertainty’ and ‘Error’ • The terms Uncertainty and Error are used interchangeably to describe a measured range of possible true values. • The meaning of the term Error is : • NOT the DIFFERENCE between your experimental result & that predicted by theory, or an accepted standard result ! • NOT a MISTAKE in the experimental procedure or analysis ! • Hence, the term Uncertainty is less ambiguous. Nevertheless, we still use terms like ‘propagation of errors’, ‘error bars’, ‘standard error’, etc. • The term “human error” is imprecise - avoid using this as an explanation of the source of error.
Error Propagation using Calculus: Functions of one variable • If the uncertainty in measured x is Δx, what is the uncertainty in a derived quantity z(x)? • Error propagation is just calculus – you do this formally in the “Data Handling” course • Basic principle is that, if (Δx)/x is small, then to first order: $\Delta z = \frac{dz}{dx}\,\Delta x$ • e.g., if $z = x^n$, then: $\Delta z = n x^{n-1}\,\Delta x$ • Hence, for this particular function, the percent (or fractional) error in z is: $\frac{\Delta z}{z} = n\,\frac{\Delta x}{x}$, or just n times the percent error in x
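Error Propagation using Calculus: Functions of more than one variable • Suppose the uncertainties in two measured quantities x and y are Δx and Δy; what is the uncertainty in some derived quantity z(x,y)? • For such functions of 2 variables we use partial differentiation: $\Delta z = \frac{\partial z}{\partial x}\,\Delta x + \frac{\partial z}{\partial y}\,\Delta y$ • But combining errors ALWAYS INCREASES total error, so make sure terms add with the same sign: $\Delta z = \left|\frac{\partial z}{\partial x}\right|\Delta x + \left|\frac{\partial z}{\partial y}\right|\Delta y$ • It is better to add in quadrature, i.e. “the root of the sum of the squares”: $\Delta z = \sqrt{\left(\frac{\partial z}{\partial x}\,\Delta x\right)^2 + \left(\frac{\partial z}{\partial y}\,\Delta y\right)^2}$ • We can almost always handle error propagation in this way by calculus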
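A numerical sketch of adding errors in quadrature, for readers who want to check a hand calculation. The partial derivatives are estimated by central differences, which is a choice made for this example and not something the slides prescribe; the function name `propagate_quadrature` and the `rel_step` parameter are hypothetical.

```python
import math


def propagate_quadrature(f, values, uncertainties, rel_step=1e-6):
    """Estimate the uncertainty of f(*values) by root-sum-of-squares (sketch).

    Partial derivatives are approximated numerically, so the result is only
    as good as the first-order assumption that each uncertainty is small.
    """
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, uncertainties)):
        h = rel_step * (abs(v) if v != 0 else 1.0)  # step scaled to the variable
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        partial = (f(*up) - f(*down)) / (2 * h)  # central difference for dz/dx_i
        total += (partial * dv) ** 2
    return math.sqrt(total)


def product(x, y):
    return x * y


def cube(x):
    return x ** 3


# z = x*y with x = 2.0 ± 0.1 and y = 3.0 ± 0.2
dz_product = propagate_quadrature(product, [2.0, 3.0], [0.1, 0.2])

# z = x^3 with x = 2.0 ± 0.02: fractional error should be 3 times that of x
dz_cube = propagate_quadrature(cube, [2.0], [0.02])
print(dz_product, dz_cube / cube(2.0))
```

With a single variable the quadrature sum reduces to the one-variable rule, so the same function covers both cases.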
Simplified Error Propagation: A short-cut avoiding calculus • Instead of differentiating ($\partial z/\partial x$, $\partial z/\partial y$, etc.), a simpler approach is also acceptable: 1. In the derived quantity z, replace x by x + Δx, say 2. Evaluate Δz in the approximation that Δx is small • Ex. 1: $z = x + a$, where a = constant • Ex. 2: $z = bx$, where b = constant • Ex. 3: $z = bx^2$, where b = constant
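Carrying the two steps through for the three examples gives the results below. This is a worked sketch, not taken from the slides, and it assumes b and x are positive so that no absolute values are needed.

```latex
% Ex. 1: z = x + a
\Delta z = (x + \Delta x + a) - (x + a) = \Delta x

% Ex. 2: z = bx
\Delta z = b(x + \Delta x) - bx = b\,\Delta x
\quad\Rightarrow\quad \frac{\Delta z}{z} = \frac{\Delta x}{x}

% Ex. 3: z = bx^2, dropping the (\Delta x)^2 term because \Delta x is small
\Delta z = b(x + \Delta x)^2 - bx^2 = 2bx\,\Delta x + b(\Delta x)^2 \approx 2bx\,\Delta x
\quad\Rightarrow\quad \frac{\Delta z}{z} \approx 2\,\frac{\Delta x}{x}
```

Ex. 3 agrees with the calculus result for a power law: the fractional error in z is n times the fractional error in x, here with n = 2.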
Synthetic Data • Any production data applicable to a given situation that are not obtained by direct measurement • Used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. • Often these particular aspects are personal information (e.g. name, home address, IP address, telephone number, social security number, credit card number)
Importance • Obtaining actual or real data sets could be difficult, and sometimes impossible due to impediments such as • Privacy issues • Image control • Logistics issues • Time • Cost • Protecting information confidentiality • Data cannot be traced back to an individual • Certain conditions may not be found in the original data
Importance (cntd.) • Used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment • By creating realistic behavior profiles of users and attackers • Ex: Intrusion Detection Systems are trained using Synthetic Data • Allow a baseline to be set • Ex: Researchers doing clinical trials generate synthetic data to aid in creating a baseline for future studies and testing • More or less realism could be exhibited according to the selected properties of the original data sets
Synthetic Data Generation • Mostly Scenario based • Evaluating Information Analytics Software • Matching Data Mining Patterns • Evaluate quality of extraction algorithms • Specific Algorithms and generators for a scenario or a set of (similar) scenarios • Patterns from data mining techniques could be used to generate synthetic data sets
Researchers frequently need to explore the effects of certain data characteristics on their models. • To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, synthetic data could be generated having one of several types of graph structure: • random graphs • independent and identically distributed (i.i.d.) connected components • lattice graphs having a ring structure • lattice graphs having a grid structure • forest fire graphs • cluster graphs with nodes arranged in separate clusters (cliques)
Synthetic data is generated with simple forms of realism by: • Domain sampling within a field • Preserving cardinality relationships • In all cases, the data generation process follows the same steps: • Generate the empty graph structure. • Generate attribute values based on user-supplied prior probabilities. • Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
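The two generation steps above can be sketched in Python for the ring-lattice case. This is a deliberately simplified illustration: each node's attribute is drawn independently from the prior, so it leaves out the collective assignment step described in the last bullet. The function names, the example labels, and the prior probabilities are all hypothetical.

```python
import random


def ring_lattice(n_nodes):
    """Step 1: empty graph structure; node i is linked to its two ring neighbours."""
    return {i: {(i - 1) % n_nodes, (i + 1) % n_nodes} for i in range(n_nodes)}


def assign_attributes(graph, priors, seed=0):
    """Step 2: draw one attribute value per node from user-supplied priors.

    Values are drawn independently here; a fuller generator would let a node's
    value depend on the values of its neighbours.
    """
    rng = random.Random(seed)  # seeded so the synthetic data is reproducible
    labels = list(priors)
    weights = [priors[label] for label in labels]
    return {node: rng.choices(labels, weights=weights, k=1)[0] for node in graph}


priors = {"fraud": 0.1, "legit": 0.9}  # illustrative prior probabilities
graph = ring_lattice(6)
attributes = assign_attributes(graph, priors)
print(graph[0], attributes)
```

Swapping `ring_lattice` for another structure generator (grid, random graph, cliques) would leave the attribute step unchanged, which mirrors the fixed two-step process on the slide.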
Data Quality • Some Definitions • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. • The totality of features and characteristics of data that bear on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. • Complete, standards-based, consistent, accurate and time-stamped.
Data Quality • Data are of high quality if • they are fit for their intended uses in operations, decision making and planning • they correctly represent the real-world construct to which they refer • As data volume increases • the question of internal consistency within data arises, regardless of fitness for use for any external purpose • e.g. a person's age and birth date may conflict within different parts of a database
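The age versus birth date example can be turned into a simple internal-consistency check. The sketch below assumes a hypothetical record layout with `id`, `age` and `birth_date` fields and a fixed reference date; real databases would need their own field names and rules.

```python
from datetime import date


def age_on(birth_date, today):
    """Age in whole years on the given reference date."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def find_age_conflicts(records, today):
    """Return ids of records whose stored age disagrees with the birth date."""
    return [r["id"] for r in records if r["age"] != age_on(r["birth_date"], today)]


# Hypothetical records; the second one is internally inconsistent
records = [
    {"id": 1, "age": 30, "birth_date": date(1990, 5, 17)},
    {"id": 2, "age": 41, "birth_date": date(1985, 1, 2)},
]

conflicts = find_age_conflicts(records, today=date(2020, 6, 1))
print(conflicts)
```

A check like this says nothing about which of the two fields is wrong, only that they cannot both be right.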
Data Attributes • Nearly 200 such attributes exist, with little agreement on their definitions or measures • Most common are • Accuracy • Correctness • Currency • Completeness • Relevance
Incorrect Data • Includes invalid and outdated information • Can originate from different data sources, as a result of data entry or of data migration and conversion projects • Total cost to the US economy due to data quality problems is over US$600 billion per annum