
Presented By: Kiran Kancharlapalli

Presented By: Kiran Kancharlapalli. DBMS - Topics 11 & 12. Semantic Interoperability. What is Semantic Interoperability? Ability of computer systems to transmit data with unambiguous, shared meaning. Data must be made available between heterogeneous agents.


Presentation Transcript


  1. Presented By: Kiran Kancharlapalli • DBMS - Topics 11 & 12

  2. Semantic Interoperability

  3. What is Semantic Interoperability? • Ability of computer systems to transmit data with unambiguous, shared meaning • Data must be made available between heterogeneous agents • Metadata must also be made available, allowing a software agent to learn how to interpret the data, e.g. via: • Document Type Definition • XML Schema • RDF Annotations • A requirement for enabling machine-computable logic, inferencing and knowledge discovery between information systems • Results in the Semantic Web

  4. How is it accomplished? • By adding data about the data (metadata), linking each data element to a controlled, shared vocabulary • The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system • Syntactic interoperability is a prerequisite for semantic interoperability • Syntactic interoperability refers to the packaging and transmission mechanisms for data

  5. What is Semantic Web? • Will be able to provide justified answers to natural language questions • Current search engines provide lists of resources that are supposed to contain the answer • Knowledge rather than plain data would be retrieved i.e. data which is relevant to the user’s task • Social factors such as privacy and trust would also be taken into account

  6. Benefits • Search can often be frustrating because of the limitations of keyword-based matching techniques. • Users frequently experience one of two problems: • they get back no results, or • they get back too many irrelevant results. • The problem is that words can be synonymous (two words have the same meaning) or polysemous (a single word has multiple meanings). • However, if the languages used to describe web pages were semantically interoperable, the user could specify a query in whatever terminology was most convenient and be assured that the correct results were returned, regardless of how the data was expressed in the sources.

  7. Ontologies

  8. What are Ontologies? • Content theories about the sorts of objects possible in a specified domain • A representation vocabulary, specialized to some domain or subject matter • Provide potential terms for describing knowledge about the domain • Translating the terms in an ontology from, say, English to French does not change the ontology conceptually

  9. What are Ontologies? • Designed for reuse across multiple applications and implementations

  10. Motivation • SELECT EMPDAT FROM PERSTAB WHERE POS = 'mgmnt' • What does it mean? • PERSTAB is a table which lists employee data • What’s an employee? How is an employee different from a contractor? What if I want data on both? • Even if this information is available in English, a human has to read it

  11. Motivation (cntd…) • Given: "Parenthood is a more general relationship than motherhood." • Given: "Mary is the mother of Bill." • Question: "Who are Bill's parents?" • Answer: "Mary is the parent of Bill." • That fact is not stated anywhere, but can be derived by a DAML application.

  12. More formally stated, given the statements (motherOf subProperty parentOf) and (Mary motherOf Bill) • stating them in DAML allows you to conclude (Mary parentOf Bill) • Java code or a stored procedure could do this sort of inference for facts in XML or SQL • But the DAML spec itself says the conclusion is true • In contrast, different Java code could reach a different conclusion
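The sub-property rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule itself, not a DAML reasoner; it is exactly the kind of hand-written code the slide warns could differ between implementations. The function name `infer_super_properties` and the dictionary layout are assumptions made for the example.

```python
def infer_super_properties(triples, sub_property_of):
    """Apply the rule: (p subProperty q) and (s p o)  =>  (s q o).

    triples          -- set of (subject, property, object) facts
    sub_property_of  -- dict mapping a property to its more general property
    Repeats until no new fact appears, so chains of sub-properties are followed.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, prop, obj in list(known):
            general = sub_property_of.get(prop)
            if general is not None and (subject, general, obj) not in known:
                known.add((subject, general, obj))
                changed = True
    return known


facts = {("Mary", "motherOf", "Bill")}
hierarchy = {"motherOf": "parentOf"}
all_facts = infer_super_properties(facts, hierarchy)
```

After the call, `all_facts` contains both the stated fact and the derived `("Mary", "parentOf", "Bill")`.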

  13. Everything is not a nail • Ontology is not always the right tool for the job • Face recognition, vehicle control systems etc – not the right applications for ontology

  14. Many Ways to Use Ontology • As an information engineering tool • Create a database schema • Map the schema to an upper ontology • Use the ontology as a set of reminders for additional information that should be included • As more formal comments • Define an ontology that is used to create a DB or OO system • Use a theorem prover at design time to check for inconsistencies • For taxonomic reasoning • Do limited run-time inference in Prolog, a description logic, or even Java • For first order logical inference • Full-blown use of all the axioms at run time

  15. Upper Ontology • An attempt to capture the most general and reusable terms and definitions

  16. Motivation to capture Upper Ontology • Ontologies may have different names for the same things • type – a relation between a class and an instance • instance – a relation between a class and an instance • isa – a relation between a class and an instance • … • Ontologies may have the same name for different things, and no corresponding terms • before – a relation between two time points • before – a relation between two time intervals • Either use the same upper ontology, or at least map to a common upper ontology

  17. Some Formal Upper Ontologies • DOLCE • Cyc • SUMO

  18. Simple Methodology • Extract nouns and verbs from a source text • Find classes in SUMO for the nouns and verbs • Record a mapping as being either equal, subsuming or instance. • type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser • Create a subclass of SUMO if it's a subsuming mapping • Add properties to the subclass • reusing SUMO properties • extending SUMO properties by creating a &%subrelation of an existing property • Add English definition to the class • define constraints that express how the subclass is more specific than the superclass • Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously

  19. High Level Distinctions • The first fundamental distinction is that between ‘Physical’ (things which have a position in space/time) and ‘Abstract’ (things which don’t)

  20. High Level Distinctions • Partition of ‘Physical’ into ‘Objects’ and ‘Processes’

  21. DBpedia: A Nucleus for a Web of Open Data • DBpedia.org is an effort to: • extract structured information from Wikipedia • make this information available on the Web under an open license • interlink the DBpedia dataset with other datasets on the Web

  22. Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links • Other languages • Other wiki pages • To the web • Redirects • Disambiguates

  23. Extracting Structured Information from Wikipedia • Wikipedia consists of: • 6.9 million articles • in 251 languages • monthly growth rate: 4% • Wikipedia articles contain structured information: • infoboxes, which use a template mechanism • images depicting the article’s topic • categorization of the article • links to external webpages • intra-wiki links to other articles • inter-language links to articles about the same topic in different languages

  24. DBpedia architecture (diagram) • Clients: Traditional Web Browser, Semantic Web Browsers, Web 2.0 Mashups, SNORQL Browser, Query Builder • Access: SPARQL Endpoint, Linked Data • DBpedia datasets, loaded into MySQL and published via Virtuoso • Extraction: Infobox, Categories, Articles • Source: Wikipedia Dumps (Article texts, DB tables)

  25. Extracting Infobox Data (RDF Representation)
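The slide's diagram is not reproduced in this transcript. As a rough sketch of the idea only, the Python below turns `| key = value` lines of an infobox template into (subject, property, value) triples. The sample infobox text, the resource URI and the helper name `infobox_to_triples` are made up for illustration, and the property namespace is an assumption; the real DBpedia extractor handles far more cases (nested templates, links, units).

```python
# Assumed namespace for infobox-derived properties (not stated on the slide).
PROPERTY_NS = "http://dbpedia.org/property/"

# Hypothetical infobox source text, invented for this example.
SAMPLE_INFOBOX = """{{Infobox example
| name = Example Town
| population = 12345
| mayor =
}}"""


def infobox_to_triples(resource_uri, infobox_text):
    """Return one (subject, property, value) triple per non-empty field."""
    triples = []
    for line in infobox_text.splitlines():
        line = line.strip()
        # Only lines of the form "| key = value" carry a field.
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        key, value = key.strip(), value.strip()
        if key and value:  # skip fields left empty in the template
            triples.append((resource_uri, PROPERTY_NS + key.replace(" ", "_"), value))
    return triples


triples = infobox_to_triples("http://dbpedia.org/resource/Example_Town", SAMPLE_INFOBOX)
```

Each triple has the article's resource as subject, a property named after the infobox field, and the field's text as a literal value; the empty `mayor` field yields no triple.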

  26. DBpedia Basics • Structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content. • The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. • In the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.

  27. The DBpedia Dataset • 1,600,000 concepts • including • 58,000 persons • 70,000 places • 35,000 music albums • 12,000 films • described by 91 million triples • using 8,141 different properties • 557,000 links to pictures • 1,300,000 links to external web pages • 207,000 Wikipedia categories • 75,000 YAGO categories

  28. Accessing the DBpedia Dataset over the Web 1. SPARQL Endpoint 2. Linked Data Interface 3. DB Dumps for Download

  29. SPARQL • SPARQL is a query language for RDF. • RDF is a directed, labeled graph data format for representing information in the Web. • This specification defines the syntax and semantics of the SPARQL query language for RDF. • SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

  30. The DBpedia SPARQL Endpoint • http://dbpedia.org/sparql • hosted on an OpenLink Virtuoso server • can answer SPARQL queries like • Give me all sitcoms that are set in NYC • All tennis players from Moscow • All films by Quentin Tarantino • All German musicians that were born in Berlin in the 19th century
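As a minimal sketch of how a client addresses such an endpoint: the SPARQL protocol lets a query be sent as an HTTP GET with the query text in a `query` parameter. The Python below only builds that URL and does not send it, so it runs without network access. The helper name `build_query_url` and the sample query are illustrative assumptions; result formats and any server-specific parameters are left out.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given on the slide


def build_query_url(query, endpoint=ENDPOINT):
    """Return the GET URL that carries `query` in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# A deliberately tiny query: any one triple.
sample_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
url = build_query_url(sample_query)
```

The resulting URL could then be fetched with any HTTP client; `urlencode` takes care of escaping the braces, spaces and question marks in the query text.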

  31. Example • To find everything Bart wrote on the blackboard in season 12 of The Simpsons: • The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples. • The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12". • The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field. • Query: SELECT ?episode ?chalkboard_gag WHERE { ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> . ?episode dbpedia2:blackboard ?chalkboard_gag }
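To make the meaning of the two triple patterns concrete, here is a small Python sketch that evaluates the same join over an in-memory list of triples. The sample triples are hypothetical placeholders (the `ex:` names and gag texts are invented, not DBpedia data), and the function name `chalkboard_gags` is an assumption; a real SPARQL engine does this matching for arbitrary patterns.

```python
CATEGORY = "http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12"

# Hypothetical placeholder data, NOT real DBpedia content.
SAMPLE_TRIPLES = [
    ("ex:Episode_A", "skos:subject", CATEGORY),
    ("ex:Episode_A", "dbpedia2:blackboard", "placeholder gag A"),
    ("ex:Episode_B", "skos:subject", "ex:Some_other_category"),
    ("ex:Episode_B", "dbpedia2:blackboard", "placeholder gag B"),
]


def chalkboard_gags(triples, category):
    """Return (episode, gag) pairs for episodes in `category`.

    Pattern 1: ?episode skos:subject <category>
    Pattern 2: ?episode dbpedia2:blackboard ?chalkboard_gag
    Both patterns share ?episode, so the results are joined on it.
    """
    episodes = {s for s, p, o in triples if p == "skos:subject" and o == category}
    return sorted(
        (s, o) for s, p, o in triples
        if p == "dbpedia2:blackboard" and s in episodes
    )


rows = chalkboard_gags(SAMPLE_TRIPLES, CATEGORY)
```

Only the episode that is in the season 12 category contributes a row; the other episode has a gag but fails the first pattern.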

  32. Possible Improvements • Better data cleansing required. • Improvement in the classification. • Interlink DBpedia with more datasets. • Improvement in the user interfaces. • Performance • Scalability • More Expressiveness

  33. Questions for Discussion • DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors. • Which is the better approach? • Can Freebase or DBpedia be a substitute for Wikipedia? • Freebase: not good, in that we would have two similar things (Wikipedia and Freebase) • DBpedia: not good, in that it extracts its data from a dump • How can we interlink Freebase and DBpedia? • What could be killer applications using DBpedia? • If there are some, fine • If there are none, do we really need a large, general, structured knowledge base?

  34. Uncertainty propagation • Every physical quantity has: • A value or size • Uncertainty (or ‘Error’) • Units • Without these three things, no physical quantity is complete. • When quoting your measured result, follow these simple rules (Ex: A = 1.71 ± 0.01 m): • Always include units! (but if the quantity is dimensionless, say so) • Always quote the main value to the same number of decimal places as the uncertainty • Never quote the uncertainty to more than 1 or 2 significant figures (this would make no sense)
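The quoting rules above can be sketched as a small Python helper. It rounds the uncertainty to a chosen number of significant figures (1 by default) and then rounds the main value to the same decimal place. The function name `format_measurement` and its parameters are assumptions for illustration, and edge cases such as an uncertainty that rounds up to the next power of ten are not treated specially.

```python
from math import floor, log10


def format_measurement(value, uncertainty, unit="", sig_figs=1):
    """Format 'value ± uncertainty unit' following the slide's rules."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Position of the leading digit of the uncertainty (e.g. 0.0123 -> -2).
    exponent = floor(log10(uncertainty))
    # Decimal place at which both numbers are rounded.
    digits = sig_figs - 1 - exponent
    rounded_unc = round(uncertainty, digits)
    rounded_val = round(value, digits)
    decimals = max(0, digits)  # never print a negative number of decimals
    text = f"{rounded_val:.{decimals}f} ± {rounded_unc:.{decimals}f} {unit}"
    return text.strip()


quoted = format_measurement(1.7123, 0.0123, "m")
```

With the inputs shown, `quoted` is the string from the slide's example, `1.71 ± 0.01 m`.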

  35. Terminology: ‘Uncertainty’ and ‘Error’ • The terms Uncertainty and Error are used interchangeably to describe a measured range of possible true values. • The meaning of the term Error is : • NOT the DIFFERENCE between your experimental result & that predicted by theory, or an accepted standard result ! • NOT a MISTAKE in the experimental procedure or analysis ! • Hence, the term Uncertainty is less ambiguous. Nevertheless, we still use terms like ‘propagation of errors’, ‘error bars’, ‘standard error’, etc. • The term “human error” is imprecise - avoid using this as an explanation of the source of error.

  36. Error Propagation using Calculus: Functions of one variable • If the uncertainty in measured x is Δx, what is the uncertainty in a derived quantity z(x)? • Error propagation is just calculus; you do this formally in the “Data Handling” course • Basic principle: if (Δx)/x is small, then to first order $\Delta z = \frac{dz}{dx}\,\Delta x$ • e.g., if $z = x^n$, then $\Delta z = n x^{n-1}\,\Delta x$ • Hence, for this particular function, the percent (or fractional) error in z is $\frac{\Delta z}{z} = n\,\frac{\Delta x}{x}$, or just n times the percent error in x
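A quick numerical check of the power-law result, as a sketch: for a small Δx, the first-order fractional error n·(Δx/x) should be close to the exact fractional change of z = x^n when x is shifted by Δx. The function names and the numbers used are chosen for illustration only.

```python
def first_order_fractional_error(n, x, dx):
    """First-order estimate: fractional error in z = x**n is n * (dx / x)."""
    return n * dx / x


def exact_fractional_change(n, x, dx):
    """Exact fractional change in z = x**n when x is replaced by x + dx."""
    return ((x + dx) ** n - x ** n) / x ** n


# Illustrative numbers: x = 10 with a 1% uncertainty, z = x**2.
estimate = first_order_fractional_error(2, 10.0, 0.1)
exact = exact_fractional_change(2, 10.0, 0.1)
```

Here the estimate is 2% and the exact change is 2.01%; the small gap is the second-order term that the first-order approximation drops.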

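  37. Error Propagation using Calculus: Functions of more than one variable • Suppose the uncertainties in two measured quantities x and y are Δx and Δy; what is the uncertainty in some derived quantity z(x,y)? • For such functions of 2 variables we use partial differentiation: $\Delta z = \frac{\partial z}{\partial x}\,\Delta x + \frac{\partial z}{\partial y}\,\Delta y$ • But combining errors ALWAYS INCREASES total error, so make sure terms add with the same sign: $\Delta z = \left|\frac{\partial z}{\partial x}\right|\Delta x + \left|\frac{\partial z}{\partial y}\right|\Delta y$ • It is better to add in quadrature, i.e. “the root of the sum of the squares”: $\Delta z = \sqrt{\left(\frac{\partial z}{\partial x}\,\Delta x\right)^2 + \left(\frac{\partial z}{\partial y}\,\Delta y\right)^2}$ • We can nearly always handle error propagation in this way by calculus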
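Both ways of combining errors for a function of several variables can be sketched numerically. The Python below estimates each partial derivative with a central finite difference (an assumption of this sketch; analytic derivatives work equally well) and returns the same-sign linear sum and the quadrature sum. The function name `propagate` and the step size `h` are illustrative choices.

```python
import math


def propagate(f, values, uncertainties, h=1e-6):
    """Return (linear_sum, quadrature_sum) of the uncertainty in f(*values).

    Each term is |df/dx_i| * dx_i, with the partial derivative estimated
    by a central difference of half-width h.
    """
    linear = 0.0
    squares = 0.0
    for i, (v, dv) in enumerate(zip(values, uncertainties)):
        up, down = list(values), list(values)
        up[i] = v + h
        down[i] = v - h
        partial = (f(*up) - f(*down)) / (2 * h)
        term = abs(partial) * dv  # same sign for every term
        linear += term
        squares += term ** 2
    return linear, math.sqrt(squares)


# Illustrative case: z = x * y with x = 2.0 ± 0.1 and y = 3.0 ± 0.2.
linear_dz, quadrature_dz = propagate(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
```

For this case the two terms are 0.3 and 0.4, so the linear sum is 0.7 while the quadrature sum is 0.5, which shows that quadrature gives the smaller combined uncertainty.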

  38. Simplified Error Propagation: A short-cut avoiding calculus • Instead of differentiating (∂z/∂x, ∂z/∂y, etc.), a simpler approach is also acceptable: 1. In the derived quantity z, replace x by x + Δx, say 2. Evaluate Δz in the approximation that Δx is small • Ex. 1: $z = x + a$, where a = constant • Ex. 2: $z = bx$, where b = constant • Ex. 3: $z = bx^2$, where b = constant
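The worked results for the three examples are not reproduced in this transcript. Applying the stated recipe (replace x by x + Δx, then keep only terms of first order in Δx) gives the following; this is a reconstruction under that recipe, not the slide's own text.

```latex
\begin{align*}
\text{Ex. 1: } z + \Delta z &= (x + \Delta x) + a
  &&\Rightarrow\ \Delta z = \Delta x \\
\text{Ex. 2: } z + \Delta z &= b\,(x + \Delta x)
  &&\Rightarrow\ \Delta z = b\,\Delta x,
  \quad \frac{\Delta z}{z} = \frac{\Delta x}{x} \\
\text{Ex. 3: } z + \Delta z &= b\,(x + \Delta x)^2
  = bx^2 + 2bx\,\Delta x + b\,(\Delta x)^2
  &&\Rightarrow\ \Delta z \approx 2bx\,\Delta x,
  \quad \frac{\Delta z}{z} \approx 2\,\frac{\Delta x}{x}
\end{align*}
```

In Ex. 3 the term $b\,(\Delta x)^2$ is dropped because Δx is small, which recovers the "n times the fractional error" rule with n = 2.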

  39. Synthetic Data • Any production data applicable to a given situation that are not obtained by direct measurement • Used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data • Often those aspects are personal information (e.g. name, home address, IP address, telephone number, social security number, credit card number, etc.)

  40. Importance • Obtaining actual or real data sets could be difficult, and sometimes impossible due to impediments such as • Privacy issues • Image control • Logistics issues • Time • Cost • Protecting information confidentiality • Data cannot be traced back to an individual • Certain conditions may not be found in the original data

  41. Importance (cntd.) • Used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment • By creating realistic behavior profiles of users and attackers • Ex: Intrusion Detection Systems are trained using synthetic data • Allows a baseline to be set • Ex: Researchers doing clinical trials generate synthetic data to aid in creating a baseline for future studies and testing • More or less realism can be exhibited, according to the selected properties of the original data sets

  42. Synthetic Data Generation • Mostly Scenario based • Evaluating Information Analytics Software • Matching Data Mining Patterns • Evaluate quality of extraction algorithms • Specific Algorithms and generators for a scenario or a set of (similar) scenarios • Patterns from data mining techniques could be used to generate synthetic data sets

  43. Researchers frequently need to explore the effects of certain data characteristics on their models. • To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, synthetic data could be generated having one of several types of graph structure: • random graphs • independent and identically distributed (i.i.d.) connected components • lattice graphs having a ring structure • lattice graphs having a grid structure • forest fire graphs • cluster graphs with nodes arranged in separate clusters (cliques)

  44. Synthetic data is generated with simple forms of realism by: • Domain sampling within a field • Preserving cardinality relationships • In all cases, data generation follows the same steps: • Generate the empty graph structure. • Generate attribute values based on user-supplied prior probabilities. • Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
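The two steps above can be sketched in Python for one of the listed structures, a ring lattice. The sketch first builds the empty graph, then assigns each node a label either by copying an already-labelled neighbour (which induces autocorrelation) or by sampling from user-supplied priors. This sequential pass is a simplification of the "collective" assignment the slide describes, and the names `ring_lattice`, `generate_attributes` and `copy_prob` are assumptions made for the example.

```python
import random


def ring_lattice(n):
    """Step 1: edge list of an n-node ring; node i is linked to (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]


def generate_attributes(n, edges, priors, copy_prob, seed=0):
    """Step 2: assign one label per node.

    priors     -- dict label -> prior probability (user-supplied)
    copy_prob  -- chance of copying a labelled neighbour's value instead of
                  sampling from the priors; higher means more autocorrelation
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    labels = list(priors)
    weights = [priors[label] for label in labels]
    neighbours = {node: [] for node in range(n)}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    values = {}
    for node in range(n):
        labelled = [values[m] for m in neighbours[node] if m in values]
        if labelled and rng.random() < copy_prob:
            values[node] = rng.choice(labelled)
        else:
            values[node] = rng.choices(labels, weights=weights, k=1)[0]
    return values


edges = ring_lattice(6)
values = generate_attributes(6, edges, {"a": 0.7, "b": 0.3}, copy_prob=0.5)
```

Setting `copy_prob` to 0 gives independent draws from the priors, while values near 1 make neighbouring nodes increasingly likely to share a label.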

  45. Data Quality • Some Definitions • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. • The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. • Complete, standards based, consistent, accurate and time stamped.

  46. Data Quality • Data are of high quality if • they are fit for their intended uses in operations, decision making and planning • they correctly represent the real-world construct to which they refer • As data volume increases • the question of internal consistency within data arises, regardless of fitness for use for any external purpose • e.g. a person's age and birth date may conflict within different parts of a database
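The age versus birth date example can be expressed as a simple internal-consistency check. The Python sketch below recomputes each person's age from the stored birth date and reports the records where the stored age disagrees. The record layout (`id`, `birth_date`, `age`), the function names and the sample records are hypothetical.

```python
from datetime import date


def age_on(birth_date, today):
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def find_age_conflicts(records, today):
    """Return the ids of records whose stored age contradicts the birth date."""
    return [r["id"] for r in records if age_on(r["birth_date"], today) != r["age"]]


# Hypothetical records, invented for the example.
records = [
    {"id": 1, "birth_date": date(1990, 5, 1), "age": 34},  # consistent
    {"id": 2, "birth_date": date(1990, 7, 1), "age": 34},  # birthday not yet reached
]
conflicts = find_age_conflicts(records, today=date(2024, 6, 1))
```

Record 2 is flagged because, on the reference date, the person is still 33 according to the stored birth date.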

  47. Data Attributes • There are nearly 200 such attributes, with little agreement on their definitions and measures • Most common are • Accuracy • Correctness • Currency • Completeness • Relevance

  48. Incorrect Data • Includes invalid and outdated information • Can originate from different data sources, through data entry or through data migration and conversion projects • Total cost to the US economy due to data quality problems is over US$600 billion per annum
