Semantic Web In Industry

Semantic Web In Industry R. Guha

Two Levels of the Semantic Web • Deep Semantic Web: • Intelligent agents performing inference • Semantic Web as distributed AI • Small problem … the AI problem is not yet solved • Shallow Semantic Web: using SW/Knowledge Representation techniques for • Data integration • Search • Is starting to see traction in industry

Integration: The new buzzword in business • Huge explosion in the number of new databases, applications, documents, … in the 90s • Lots of redundancy, duplication … => high inefficiency • Economic pressures forcing consolidation and efforts to reduce inefficiency • Two aspects to integration: Process & Data • Process integration depends on data integration

Data Integration for Science • Many experimental fields will generate more data in the next 2 years than exists today • Large part of research consists of writing programs to analyze data, e.g., NASA • Tools to normalize, share, integrate data stuck in the 80s (ftp, perl, …) • Semantic Web could create a “web of data” that changes all this. • Example of the Internet Observatory

Varieties of Data Integration: Data Transformation • Data Transformation Example • Contact Information in SAP, Siebel, PeopleSoft, … • We want to reflect updates in one data source in another • [Diagram labels: XSLT, etc.; App. Server; Clarify; Siebel; PeopleSoft]

Varieties of Data Integration: Data Aggregation • Data Aggregation Example • Clinical trial data at Stanford, UCSF, Mayo … • We want to give a Meta-analyst a uniform view of data from these different clinical trials • Example of how this would have helped recent meta studies such as the estrogen study • [Diagram labels: Relational Views; DBMS; Meta-Analyst; UCSF; Stanford; Mayo]

Data Integration Layers • Coping with software from different vendors • Oracle vs. DB2 vs. SQL Server … this is a solved problem • Coping with different formats • Relational vs. XML vs. ISAM… this too is a solved problem • Coping with different schemas • Solved for the small case where one person understands all the schemas • No products for the case where it is truly distributed • We know how to do it in theory, but lots of practical problems • Coping with data from unknown sources • Wide open … lots of unsolved problems

Typical Data Integration Methodology • Use a common namespace of terms for the concepts in the domain of the data sources being integrated, e.g., Employee, Customer, Patient, weight, height, bodyTemperature, … • Mappings relate data items in data sources to terms in the namespace • Transformation algorithms map queries in terms of the common namespace into corresponding queries in terms of data source vocabularies • Background knowledge about terms is essential for transformations … e.g., Employee subClassOf Person, two people with the same last name, first name and street address are likely to be the same, i.e., the common namespace is really an Ontology • Mappings and common namespace are the workhorse
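The query-transformation step above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the source names (db1, db2), the table and column names, and the `rewrite_query` helper are hypothetical and not taken from any product mentioned in the talk.

```python
# Hypothetical mappings: common-namespace term -> name used in each data source.
MAPPINGS = {
    "db1": {"Patient": "patient", "weight": "wt_kg", "height": "ht_cm"},
    "db2": {"Patient": "subjects", "weight": "body_weight", "height": "body_height"},
}


def rewrite_query(source, cls, properties):
    """Translate a query over common-namespace terms into SQL for one source."""
    mapping = MAPPINGS[source]
    # Alias each source column back to the common term so results line up.
    columns = ", ".join(f"{mapping[p]} AS {p}" for p in properties)
    return f"SELECT {columns} FROM {mapping[cls]}"


def rewrite_for_all_sources(cls, properties):
    """Produce one source-specific query per registered data source."""
    return {source: rewrite_query(source, cls, properties) for source in MAPPINGS}


if __name__ == "__main__":
    for source, sql in rewrite_for_all_sources("Patient", ["weight", "height"]).items():
        print(source, "->", sql)
```

Aliasing every column back to the common term is one simple way to give the caller a uniform view; real systems would also need the background knowledge (class hierarchies, identity heuristics) that this sketch leaves out.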

Role of Semantic Web in Data Integration • The XML stack (XML, XSD, XPath, XQuery, …) does not have the concepts (objects, classes, properties, …) required for representing ontologies • RDF/S does … • Neither of them has a language for expressing mappings • But RDF/S, being closer to logic, has more of the machinery that is required
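To make the "classes and properties" point concrete, here is a small sketch of the kind of inference that RDF/S-style subClassOf statements enable. It uses plain Python tuples rather than a real RDF library; the Manager class and the instance alice are hypothetical additions, and only Employee subClassOf Person comes from the talk.

```python
# Triples as (subject, predicate, object). Only the first one is from the talk.
TRIPLES = {
    ("Employee", "subClassOf", "Person"),
    ("Manager", "subClassOf", "Employee"),   # hypothetical
    ("alice", "type", "Manager"),            # hypothetical
}


def superclasses(cls, triples):
    """Return every class reachable from cls by following subClassOf links."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(instance, triples):
    """Return the asserted types of an instance plus all inferred supertypes."""
    direct = {o for s, p, o in triples if s == instance and p == "type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


if __name__ == "__main__":
    print(sorted(types_of("alice", TRIPLES)))
```

A query for all Person instances can then return alice even though no source ever stated that directly, which is the sort of background knowledge the XML stack has no native way to express.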

Kinds of Mappings • Simple structural • DB1.patient.weight corresponds to Patient’s weight • Conditional structural • If DB1.patient.type equals Outpatient then DB1.patient.foo corresponds to Patient’s visit duration … • Term mappings • CA in DB1 corresponds to California in domain namespace • Object with ssn 7687667 in database 1 corresponds to object with id “aksdks” in database 2
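The mapping kinds listed above can be sketched in a few lines. This is an illustration under assumptions: the row layout, the `state` column, the `visitDuration` key and the `map_patient` helper are hypothetical; the field `foo`, the CA/California pair and the two object identifiers are the examples from the slide.

```python
# Term mapping: a code in DB1 corresponds to a term in the domain namespace.
TERM_MAP = {"CA": "California"}

# Object-level correspondence: the same real-world object under two identifiers.
OBJECT_MAP = {("db1", "7687667"): ("db2", "aksdks")}


def map_patient(row):
    """Map one DB1 patient row (a dict) onto common-namespace properties."""
    # Simple structural: DB1.patient.weight -> Patient's weight.
    out = {"weight": row["weight"]}
    # Conditional structural: foo means visit duration only for outpatients.
    if row["type"] == "Outpatient":
        out["visitDuration"] = row["foo"]
    # Term mapping: translate known codes, pass unknown values through.
    out["state"] = TERM_MAP.get(row["state"], row["state"])
    return out


if __name__ == "__main__":
    row = {"weight": 70, "type": "Outpatient", "foo": 3, "state": "CA"}
    print(map_patient(row))
    print(OBJECT_MAP.get(("db1", "7687667")))
```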

Challenges and non-challenges in data integration • Non-challenge: algorithms for doing the transformations (ISI, MCC, SU & AT&T) • Engineering Challenges • Creating large, useful ontologies that are shared by many • Creating mappings • Research Challenges • Semantic Drift • Fuzzy terms, probabilistic mappings • Trust

Engineering Challenges • Creating large, detailed ontologies is complex and expensive • But it is happening … CrossWorlds for business concepts, MAGE, etc. for medicine • Danger: some of them might turn out to be proprietary • Creating mappings is tedious and time-consuming • Object mappings pose special challenges • Mappings need to be dynamic and constantly updated

Research Challenges with mappings • Semantic Drift • The meaning of terms, as interpreted by different members of a community, could drift over time • Cyc experience shows that Description Logic mechanisms are not adequate for either detecting or fixing this • Fuzzy mappings • E.g., Walmart’s concept of chair is similar to but not the same as MoMA’s concept of chair • Probabilistic mappings • There is an 82% likelihood that Michael Jordan in database 1 is the same as Michael Jordan in database 2
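One way to picture a probabilistic object mapping is to combine field-by-field evidence into a match probability. The sketch below multiplies prior odds by a likelihood ratio per field, a naive-Bayes-style simplification. The ratios, the prior and the field names are invented for illustration and are not estimates from any real data or the method of any system named in the talk.

```python
# Hypothetical likelihood ratios:
#   P(field agrees | same person) / P(field agrees | different people)
AGREE_RATIO = {"last_name": 20.0, "first_name": 10.0, "street": 50.0}
#   P(field differs | same person) / P(field differs | different people)
DISAGREE_RATIO = {"last_name": 0.05, "first_name": 0.1, "street": 0.2}


def match_probability(a, b, prior=0.001):
    """Estimate the probability that records a and b denote the same person.

    Fields are treated as independent evidence, which is a simplifying
    assumption of this sketch.
    """
    odds = prior / (1 - prior)
    for field in AGREE_RATIO:
        if a[field] == b[field]:
            odds *= AGREE_RATIO[field]
        else:
            odds *= DISAGREE_RATIO[field]
    return odds / (1 + odds)


if __name__ == "__main__":
    r1 = {"last_name": "Jordan", "first_name": "Michael", "street": "1 Main St"}
    r2 = {"last_name": "Jordan", "first_name": "Michael", "street": "9 Elm Ave"}
    print(match_probability(r1, r1))  # all three fields agree
    print(match_probability(r1, r2))  # same name, different street
```

The output is a probability rather than a yes/no answer, so a downstream query can carry the uncertainty along instead of committing to one interpretation.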

Other data web related challenges • Trust: How should the program know whether to trust some new data source? • Without this, we will only have closed systems • Options: centralized approaches like UDDI or decentralized approaches like WOTs • Inverse trust: how can I trust you not to indiscriminately distribute my data? A big issue for fresh scientific data • Systems challenges • Caching • Preventing accidental DoS attacks

Forecast for SW and Data Integration • We already have a number of data integration tools on the market • We are seeing the first generation of ontology-based data integration tools from small companies • At least some of the big players will probably have some offerings for doing data integration based on Semantic Web concepts in the near future • Whether they use Semantic Web formats and acronyms is an open question … • These common vocabularies will exhibit very strong network effects

Semantic Web for Search: Going beyond search as Location Bar • Keywords → a particular page • Typically a home page or well known hub page • United airlines → www.united.com • Unix → gnu.org, linux.org, freebsd.org • Search as a smarter location bar • PageRank is ideally suited for this • This is largely a solved problem

Varieties of Search: Research searches • User is searching for info about something • Could be directed – user is looking for a particular property • Price of something, location of some event, … • Or undirected – user is looking for some general class of properties • Reviews/feedback on product, info on person or country • If there is no hub page on the thing, existing search engines perform very poorly • New focus is on this class of searches

Semantic Web for Search • Keyword-based approaches haven’t made significant advances since PageRank • Improvements may be gained by adding a modicum of understanding about the *object* denoted by the search query • Improvements not just in search itself but also in the relevance of search-related advertising

Basic Issues • Need database of potential objects user may be referring to, along with some properties of the object … e.g., its type • Too many objects to manually construct DB • At least 300 million distinct object references on Web • If the search engine does know something more about the search term’s denotation (e.g., it denotes a musician), how can it do better?

Building the Web KB • Many different automated approaches • Simple natural language processing (Riloff, TAP, …) • Scrapers • Machine Learning • Most commercial efforts lead to proprietary KBs • Huge opportunity for wider SW community • Collaborate to actually create the KB

Using the KB • Word Sense Disambiguation, e.g., MSN Search, Teoma • Incorporating data feeds into search results, e.g., MSN with popular musicians • Incorporating object type specific actions, e.g., Google with addresses and stock symbols • Coming soon … KB construction driven by ads
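The uses listed above can be illustrated with a toy sketch: look up the objects a query term might denote, pick a sense from the other query words, then choose an action by object type. The KB entries, cue words, type names and actions are all hypothetical, and this is not a description of how MSN Search, Teoma or Google actually work.

```python
# Hypothetical KB: query term -> candidate objects with a type and cue words.
KB = {
    "jaguar": [
        {"id": "jaguar_animal", "type": "Animal", "cues": {"cat", "habitat", "zoo"}},
        {"id": "jaguar_car", "type": "CarMaker", "cues": {"dealer", "price", "engine"}},
    ],
}

# Hypothetical type-specific actions a search engine might attach to results.
ACTIONS = {"Animal": "show encyclopedia summary", "CarMaker": "show nearby dealers"}


def disambiguate(query):
    """Pick the KB sense whose cue words overlap most with the rest of the query."""
    words = query.lower().split()
    best = None
    for i, word in enumerate(words):
        context = set(words[:i] + words[i + 1:])
        for sense in KB.get(word, []):
            score = len(sense["cues"] & context)
            # Ties keep the earlier sense, so KB order acts as a default ranking.
            if best is None or score > best[0]:
                best = (score, sense)
    return best[1] if best else None


def action_for(query):
    """Return a type-specific action for the query, or None if nothing is known."""
    sense = disambiguate(query)
    return ACTIONS.get(sense["type"]) if sense else None


if __name__ == "__main__":
    print(action_for("jaguar price"))
    print(action_for("jaguar zoo"))
```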

Conclusions • Please help Eric Miller
