Semantic Web & Semantic Web Processes

Semantic Web & Semantic Web Processes. A course at Universidade da Madeira, Funchal, Portugal, June 16-18, 2005. Dr. Amit P. Sheth, Professor, Computer Sc., Univ. of Georgia; Director, LSDIS lab; CTO/Co-founder, Semagix, Inc. Special Thanks: Cartic Ramakrishnan, Karthik Gomadam.


Presentation Transcript


  1. Semantic Web & Semantic Web Processes • A course at Universidade da Madeira, Funchal, Portugal, June 16-18, 2005 • Dr. Amit P. Sheth: Professor, Computer Sc., Univ. of Georgia; Director, LSDIS lab; CTO/Co-founder, Semagix, Inc. • Special Thanks: Cartic Ramakrishnan, Karthik Gomadam

  2. Semantic (Web) Technology Applications • Part III (b): Representative Enterprise Applications

  3. Visualizer Content: BSBQ Application

  4. [Screenshot of the BSBQ application] • Listed items: (1.33) - 12/06/00 - ABC; (2.53) - 12/06/00 - CBS; (5.16) - 12/06/00 - ABC; (2.46) - 12/06/00 - FOX; (1.33) - 12/06/00 - NBC; (5.33) - 12/06/00; (1.33) - 12/06/00 - CBS; (3.57) - 12/06/00 - CBS; (4.27) - 12/06/00 - ABC; (3.44) - 12/06/00 - FOX; (7.24) - 12/06/00 - CBS • Description: Produced by: CNN; Posted Date: 12/07/2000; Reporter: David Lewis; Event: Election 2000; Location: Tallahassee, Florida, USA; People: Al Gore • Story text: TALLAHASSEE, Florida (CNN) – Though the two presidential candidates have until noon Wednesday to file briefs in Al Gore's appeal to the Florida Supreme Court, the outcome of two trials set on the same day in Leon County, Florida, may offer Gore his best hope for the presidency. Democrats in Seminole County are seeking to have 15,000 absentee ballots thrown out in that heavily Republican jurisdiction -- a move that would give Gore a lead of up to 5,000 votes statewide. Lawyers for the plaintiff, Harry Jacobs, claim the ballots should be rejected because they say County Elections Supervisor Sandra Goard allowed Republican workers to fill out voter identification numbers on 2,126 incomplete absentee ballot applications sent in by GOP voters, while refusing to allow Democratic workers to do the same thing for Democratic voters. The GOP says that suit, and one similar to it from Martin County, demonstrates Democratic Party politics at its most desperate. Gore is not a party to either of those lawsuits. On Tuesday, the judge in the • Breaking News headlines: Gore Demands That Recount Restart; Gore Says Fla. Can't Name Electors; Bush Meets Colin Powell at Ranch; Market Tumbles on Earnings Warning; Barak Outlines His Peace Plan • Listed items: (1.33) - 12/06/00 - ABC; (2.33) - 12/06/00 - CBS; (3.12) - 12/06/00 - NNS; (0.32) - 12/06/00 - CBS; (1.33) - 12/06/00 - CBS

  5. Equity Research Dashboard with Blended Semantic Querying and Browsing • Automatic 3rd party content integration • Focused relevant content organized by topic (semantic categorization) • Related relevant content not explicitly asked for (semantic associations) • Automatic content aggregation from multiple content providers and feeds • Competitive research inferred automatically • Source: Sheth et al., 2002, Managing Semantic Content for the Web

  6. Global Bank • Aim • Legislation (PATRIOT ACT) requires banks to identify ‘who’ they are doing business with • Problem • Volume of internal and external data needed to be accessed • Complex name matching and disambiguation criteria • Requirement to ‘risk score’ certain attributes of this data • Approach • Creation of a ‘risk ontology’ populated from trusted sources (OFAC etc.) • Sophisticated entity disambiguation • Semantic querying, rules specification & processing • Solution • Rapid and accurate KYC checks • Risk scoring of relationships allowing for prioritisation of results • Full visibility of sources and trustworthiness • Source: Amit Sheth, From Semantic Search & Integration to Analytics, Proceedings of the KMWorld, October 26, 2004, Santa Clara, CA.

  7. The Process • Ahmed Yaseer: • Appears on Watchlist ‘FBI’ • Works for Company ‘WorldCom’ • Member of organization ‘Hamas’ • [Diagram: Ahmed Yaseer is linked by "appears on Watchlist" to FBI Watchlist (Watch list), by "works for Company" to WorldCom (Company), and by "member of organization" to Hamas (Organization)]
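
The relationships on this slide can be written down as subject-predicate-object triples. The following is a minimal Python sketch of that representation, not the implementation used in the application; the names `TRIPLES`, `ENTITY_TYPES`, and `facts_about` are illustrative assumptions, and only the three facts and the type labels come from the slide.

```python
# Illustrative only: the slide's three facts as (subject, predicate, object) triples.
TRIPLES = [
    ("Ahmed Yaseer", "appears on Watchlist", "FBI Watchlist"),
    ("Ahmed Yaseer", "works for Company", "WorldCom"),
    ("Ahmed Yaseer", "member of organization", "Hamas"),
]

# Entity types taken from the diagram labels on the slide.
ENTITY_TYPES = {
    "FBI Watchlist": "Watch list",
    "WorldCom": "Company",
    "Hamas": "Organization",
}


def facts_about(subject, triples=TRIPLES):
    """Return {predicate: object} for every triple whose subject matches."""
    return {p: o for s, p, o in triples if s == subject}


if __name__ == "__main__":
    for predicate, obj in facts_about("Ahmed Yaseer").items():
        print(f"{predicate}: {obj} ({ENTITY_TYPES[obj]})")
```

Keeping the facts as plain triples makes it straightforward to add further relationship types without changing the lookup code.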

  8. Global Investment Bank: Establishing New Account • Example of Fraud Prevention application used in financial services • User will be able to navigate the ontology using a number of different interfaces • Scores the entity based on the content and entity relationships • [Diagram labels: World Wide Web content; Blogs, RSS; Public Records; Law Enforcement; Regulators; Watch Lists; Unstructured text, Semi-structured Data; Semi-structured Government Data]

  9. Law Enforcement Agency • Aim • Provision of an overarching intelligence system that provides a unified view of people and related information • Problem • Need to create unique entities from across multiple disparate, non-standardised databases; Requirement to disambiguate ‘dirty’ data • Need to extract insight from unstructured text • Approach • Multiple database extractors to disambiguate data and form relevant relationships • Modelling of behaviours/patterns within very large ontology (6Mn+ entities) • Solution • Merged and linked case data from multiple sources using effective identification, disambiguation, and link analysis • Dynamic annotation of documents • Single query across multiple datasets • 360 view of an individual and relevant associations

  10. Complex querying and characteristic modelling across information sources • [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Application of bespoke and pre-configured ‘profiles’ for detailed investigation

  11. [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Profiling based on link analysis through indirect relationships with other cases and information • Profile based on direct matching with case characteristics • User configurable scoring profiles

  12. [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Free text searching across aggregated information sources • Example query: "Gisondi, white ford expedition, main street, assault, traffic offences"

  13. [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Unified view of direct and indirect results that best match the complex query and the profile

  14. [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Direct and indirect relationship scoring driven by risk weightings • Aggregated knowledge from disparate sources • Knowledge annotation of known entities from within free text
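
One way to picture "direct and indirect relationship scoring driven by risk weightings" is a bounded graph traversal in which each path contributes the product of its relationship weights times the risk of the entity it reaches. The sketch below is an assumption-laden illustration, not the scoring method of the application described here: the weights in `RELATION_WEIGHT`, the entities in `GRAPH`, the `BASE_RISK` values, and the `risk_score` function are all hypothetical.

```python
# Hypothetical risk weight per relationship type (not from the source).
RELATION_WEIGHT = {"appears_on": 1.0, "member_of": 0.8, "associate_of": 0.5}

# Hypothetical relationship graph: entity -> list of (relation, neighbour).
GRAPH = {
    "person_a": [("associate_of", "person_b")],
    "person_b": [("appears_on", "watchlist_x")],
    "watchlist_x": [],
}

# Hypothetical intrinsic risk of flagged entities; everything else is 0.
BASE_RISK = {"watchlist_x": 1.0}


def risk_score(entity, graph=GRAPH, max_hops=2):
    """Sum risk reachable within max_hops.

    A path contributes (product of its relation weights) * (base risk of
    the entity at its end), so indirect links count less than direct ones.
    """
    total = 0.0
    frontier = [(entity, 1.0)]  # (node, accumulated path weight)
    for _ in range(max_hops):
        next_frontier = []
        for node, path_weight in frontier:
            for relation, neighbour in graph.get(node, []):
                weight = path_weight * RELATION_WEIGHT[relation]
                total += weight * BASE_RISK.get(neighbour, 0.0)
                next_frontier.append((neighbour, weight))
        frontier = next_frontier
    return total


if __name__ == "__main__":
    print(risk_score("person_b"))  # direct link to the watch list
    print(risk_score("person_a"))  # indirect link through person_b
```

Bounding the traversal with `max_hops` keeps the sketch terminating even if the graph contains cycles; a production system would need a more careful treatment of repeated paths.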

  15. [Workflow: Profile Creation → Complex Querying → Summary of Results → Investigation] • Scoring of key characteristics to drive relevance to original profile and query • Identification of investigation path • Visualisation of results

  16. Legitimate Document Access Control Application demo

  17. Ontology Quality • Many real-world ontologies may be described as semi-formal ontologies • populated with partial or incomplete knowledge • may contain occasional inconsistencies, or occasionally violate constraints (e.g. all schema level constraints may not be observed in the knowledgebase that instantiates the ontology schema) • often ontology is populated by many persons or by extracting and integrating knowledge from multiple sources • analogy is “dirty data” which is usually a fact of life in most enterprise databases. From Semantic Search & Integration to Analytics
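
To make the point about schema-level constraints concrete, the following Python sketch checks instance facts against a small schema and reports the ones that do not conform, including facts whose subject has no known type (partial knowledge). The schema, instances, and the `find_violations` helper are hypothetical and only illustrate the idea of "dirty" knowledge in a populated ontology.

```python
# Hypothetical schema: property -> (expected subject class, expected object class).
SCHEMA = {"works_for": ("Person", "Company")}

# Hypothetical instance typing; "bob" is deliberately missing (partial knowledge).
INSTANCE_TYPES = {"alice": "Person", "acme": "Company", "paris": "City"}

# Hypothetical facts extracted from several sources.
FACTS = [
    ("alice", "works_for", "acme"),   # conforms to the schema
    ("alice", "works_for", "paris"),  # object has the wrong class
    ("bob", "works_for", "acme"),     # subject has no known class
]


def find_violations(facts, schema=SCHEMA, types=INSTANCE_TYPES):
    """Return (fact, reason) pairs for facts that break a schema constraint."""
    violations = []
    for fact in facts:
        subject, prop, obj = fact
        if prop not in schema:
            violations.append((fact, "unknown property"))
            continue
        domain, range_ = schema[prop]
        if types.get(subject) != domain:
            violations.append((fact, f"subject is not a known {domain}"))
        if types.get(obj) != range_:
            violations.append((fact, f"object is not a known {range_}"))
    return violations


if __name__ == "__main__":
    for fact, reason in find_violations(FACTS):
        print(fact, "->", reason)
```

A check like this reports violations rather than rejecting the data, which matches the observation that real knowledge bases tolerate occasional inconsistencies.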

  18. Ontology Representation Expressiveness • Applications vary in terms of expressiveness of representation needed. • Trade-off between expressive power and computational complexity applies both to knowledge creation/maintenance and to inference mechanisms for such languages. It is often very difficult to capture the knowledge that instantiates the more expressive constructs/constraints. • Many business applications end up using models/languages that lie closer to less expressive languages. • On the other hand, we have seen a few applications, especially in scientific domains such as biology, where more expressive languages are needed.

  19. Ontology Size / Population / Freshness • Ontology population is critical. Among the ontologies developed by Semagix or using its technology, the median ontology size is over 1 million instances/facts and relationship instances each (at least two have exceeded 10 million instances). This level of knowledge makes the system very powerful (as it is applied ...). Furthermore, in many cases, it is necessary to keep these ontologies current or updated with facts and knowledge on a daily or more frequent basis. Both the scale and freshness requirements dictate that populating ontologies with instance data needs to be automated.

  20. Metadata Extraction • Large-scale metadata extraction and semantic annotation is possible. IBM WebFountain [Dill et al 2003] demonstrates the ability to annotate on a Web scale (i.e., over 4 billion pages), while Semagix Freedom-related technology [Hammond et al 2002] demonstrates capabilities that work for a few million documents per day per server. However, the general trade-off of depth versus scale applies. Storage and manipulation of metadata for millions to hundreds of millions of content items requires database techniques, with the challenge of improving performance and scale in the presence of more complex structures.

  21. Semantic Technology Building Blocks • A vast majority of the Semantic (Web) Technology Applications that have been developed or envisioned rely on three crucial capabilities: ontology creation, semantic annotation (metadata extraction) and querying/inferencing. Enterprise-scale applications share many requirements in these three respects with pan-Web applications. All these capabilities must scale to many millions of documents and concepts (rather than hundreds to thousands) for current applications. Applications requiring billions of documents and concepts have also been discussed (esp. in the intelligence and government space) but not yet deployed.

  22. Primary Technical Capabilities/Key Research Challenges • Two of the most basic “semantic” techniques are “named entity identification” and “semantic ambiguity resolution”. [It would be nice to have relationship extraction too.] A tool for annotation is of little value if it does not support ambiguity resolution. Both require highly multidisciplinary approaches, borrowing from NLP/lexical analysis, statistical and IR techniques, and possibly machine learning techniques. A high degree of automation is possible in meeting many real-world semantic disambiguation requirements, although pathological cases will always exist and complete automation is unlikely.
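
As a rough illustration of the two techniques named on this slide, the Python sketch below spots a known surface form in text (entity identification) and picks among candidate entities by counting overlapping context words (ambiguity resolution). This is a deliberately simple, hypothetical approach chosen for illustration; the candidate list, cue words, and the `resolve` function are assumptions and do not describe any particular product.

```python
import re

# Hypothetical candidates: surface form -> list of (entity, context cue words).
CANDIDATES = {
    "palm": [
        ("Palm, Inc. (company)", {"handheld", "device", "shares", "company"}),
        ("palm (tree)", {"tree", "beach", "leaves", "tropical"}),
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def resolve(text, candidates=CANDIDATES):
    """Map each known surface form found in text to its best-matching entity.

    The best match is the candidate sharing the most cue words with the
    text; on a tie the first listed candidate wins, which a real system
    would more likely treat as "unresolved".
    """
    context = set(tokenize(text))
    resolved = {}
    for surface, options in candidates.items():
        if surface in context:
            best = max(options, key=lambda option: len(option[1] & context))
            resolved[surface] = best[0]
    return resolved


if __name__ == "__main__":
    print(resolve("Palm shares fell after the handheld device launch"))
    print(resolve("A palm tree on the beach"))
```

Cases with little or no context are exactly the pathological ones the slide mentions, where full automation is unlikely to be reliable.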

  23. Content Heterogeneity • Support for heterogeneous content is key: it is too hard to deploy separate products within a single enterprise to deal with structured, semi-structured and unstructured data/content management. New applications involve extensive types of heterogeneity in format, media and access/delivery mechanisms (e.g., news feed in RSS, NewsML news, Web-posted article in HTML or served up dynamically through database query and XSLT transformation, analyst report in PDF or Word, subscription service with API-based access to Lexis/Nexis, the enterprise’s own relational databases and content management systems such as Documentum or Notes, e-mails, etc.). Semi-structured data (XML-based data and RDF-based metadata) is growing at an explosive rate.

  24. Processing • Semantic query processing with the ability to query both ontology and metadata to retrieve heterogeneous content is highly valuable. Consider “Give me all articles on the competitors of Intel”, where the ontology gives information on competitors and supports semantics (with the understanding that “Palm” is a company and that “Palm” and “Palm, Inc.” are the same in this case), and metadata identifies the company to which an article refers, regardless of the format of the article. • Analytical applications could require sub-second response time for tens of concurrent complex queries over a large metadata base and ontology, and can benefit from further database research. High-performance and highly scalable query processing techniques that deal with more complex representations than database schemas, and with more explicit roles for relationships, are important. We have not found great use for DL (description logic) reasoning.
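
The "competitors of Intel" query can be sketched as a two-step lookup: the ontology supplies the competitor relationship and the name equivalences, and the metadata links each article to a company regardless of format. The Python below is a minimal illustration under stated assumptions; the data in `COMPETITORS`, `ALIASES`, and `ARTICLES`, and the function names, are hypothetical examples rather than content from the course.

```python
# Hypothetical ontology fragment: company -> set of competitor companies.
COMPETITORS = {"Intel": {"Advanced Micro Devices"}}

# Hypothetical name equivalences: surface name -> canonical ontology name.
ALIASES = {
    "AMD": "Advanced Micro Devices",
    "Advanced Micro Devices, Inc.": "Advanced Micro Devices",
}

# Hypothetical metadata: one record per article, in whatever format it arrived.
ARTICLES = [
    {"id": "a1", "format": "HTML", "company": "AMD"},
    {"id": "a2", "format": "PDF", "company": "Advanced Micro Devices, Inc."},
    {"id": "a3", "format": "RSS", "company": "Intel"},
]


def canonical(name):
    """Map a surface name to its canonical ontology name, if one is known."""
    return ALIASES.get(name, name)


def articles_on_competitors(company):
    """Return ids of articles whose company is a competitor of `company`."""
    targets = COMPETITORS.get(canonical(company), set())
    return [a["id"] for a in ARTICLES if canonical(a["company"]) in targets]


if __name__ == "__main__":
    print(articles_on_competitors("Intel"))
```

The query never inspects the article format, which is the point of separating ontology, metadata, and content in this way.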
