OIC-2007 Ontology for the Intelligence Community
Recent European Developments in the Semantic Web
Dr. Mark Greaves
Vulcan Inc.
markg@vulcan.com
(206) 342-2276

Roots of this Talk: Agile Computing meets KR&R
• DARPA PM: 5/2001 to 5/2004
  • A vision of agile computing: a robust distributed infrastructure for dynamic reliable computing
    • Oriented towards responsiveness rather than prespecified optimality
    • Provides “ility” and QoS arguments
    • Supports adaptive, survivable workflows
    • Leverages local rules over global ones
    • Is a step beyond interoperability
  • Agile computing requires goals, plans, and other semantic/intentional notions
• Currently with Vulcan (Seattle, WA)
  • Vulcan (www.vulcan.com) is the corporate vehicle through which Paul Allen manages his assets
  • Areas include music, movies, sports teams, aerospace, philanthropy, personal tech, energy and greentech, cable TV, venture capital, genetic research, AI…
  • I am responsible for the AI/KR&R research portfolio, including Project Halo and KR technology for Vulcan Ventures
  • Work with McDonald-Bradley on IC-related matters
[Figure labels: Programs and Seedlings; Core Technologies]

At the End of the 90s: Traditional KR and the Google Property
• We seek KR systems that have the “Google Property”: they get (much) better as they get bigger
  • Google PageRank™ yields better relevance judgments as it indexes more pages
  • Current KR&R systems have the antithesis of this property
• So what are the components of a scalable KR&R system?
  • Distributed, robust, reliable infrastructure
  • Multiple linked ontologies and points of view
    • Single ontologies are feasible only at the program/agency level
  • Mixture of deep and shallow knowledge repositories
  • Simulations and procedural knowledge components
    • “Knowing how” and “knowing that”
  • Embrace uncertainty, defaults, context, and nonmonotonicity in all components
  • Uncertainty in the KB: you don’t know what you know, things go away, contradiction is rampant, computing must be resource-aware, surveying the KB is not possible
[Figure: “KR&R Goals”. Axes: Quality of Answers vs. KR&R System Scale (Number of Assertions, Number of Ontologies/POVs, Number of Rules, Linkages to other KBs, Reasoning Engine Types, …). Labels: Ideal KR&R; KR&R now]
Scalable KR&R Systems should look just like the Web!! (coupled with great question-answering technology)

The Beginnings of the SemWeb: DARPA’s DAML Program
Problem:
• Computers cannot process most of the information stored on web pages
• Computers require explicit knowledge to reason with web pages
• People use implicit knowledge to reason with web pages
Solution:
• Augment the web to link machine-readable knowledge to web pages
• Extend RDF with Description Logic
• Use a frame-based language design
• Create the first fully distributed web-scale knowledge base out of networks of hyperlinked facts and data
Approach:
• Design a family of new web languages
  • Basic knowledge representation (OWL)
  • Reasoning (SWRL, OWL/P, OWL/T)
  • Process representation (OWL/S)
• Build definition and markup tools
• Link new knowledge to existing web page elements
• Test design approach in the IC and others
• Standardize the new web languages
[Figure: Semantic Web (OWL over HTTP), connected by “Links via URLs” to the Existing Web (HTML/XML over HTTP)]

DAML
$45M over 5 years (FY01-FY05)
• Operational Problem
  • No Automatic Semantic Integration of (Intelligence) Data Sources on the Web
• Technical Problem
  • Representation of ontological (type-class-relation) metadata coupled to web data
  • Agent-based data integration and tractable reasoning across multiple www servers
• Early semantic web pilots with various members of the IC
  • SWIG coordination and data sharing group within the IC
[Figure: two layers. Semantic Web: Knowledge Integration Layer (SWRL, OWL/Trust, Inference; Threat Ontology, Facility Ontology, Geo-Spatial Ontology, Sensor Ontology; OWL). Existing Web: Hand-coded HTML and XML pages (World Wide Web)]

DAML Program Technical Flow
[Figure: Web Ontology Language (OWL); OWL/S: Semantic Web Services; SWRL: Rules; OWL/P: Proof; OWL/T: Trust. Legend: Completed standards process; Started standards process; Unfinished]
DAML Program Elements
• Web Ontology Language (OWL) (2/10/04)
  • Enables knowledge representation and tractable inference in a web standard format
  • Based on Description Logics and RDF
• OWL Reasoning Languages
  • SWRL and SWRL-FOL: Supports business rules, policies, and linking between distinct OWL ontologies
  • OWL/P Proof Language: Allows software components to exchange chains of reasoning
  • OWL/T Trust Language: Represents trust that OWL and SWRL inferences are valid
• Semantic Web Services (OWL/S)
  • Allows discovery, matching, and execution of web services based on action descriptions
  • Unifies semantic data models (OWL) with process models (Agent) and shows how to dynamically compose web services
• OWL Tools
  • www.semwebcentral.org and www.daml.org
Each DAML Program Element includes specifications, software tools, coordination teams, and use cases

Impact
[Figure: Google search results for “darpa” on 10/21/04, marking results #2 and #3]

The Semantic Web in 2007
[Figure: “The Famous Semantic Web Technology Stack”. First build labels: Still Research; Cutting Edge; Mature. Later build labels: Active Research and Standards Activity; Commercial Cutting Edge; Mature]

Completing the Semantic Web Picture
[Figure: the technology stack (Active Research and Standards Activity; Commercial Cutting Edge; Mature) with added labels: Better Reasoning Systems; A Huge Base of RDF data; Combined RDF/OWL and RDBMS Systems; More Ontologies; Tag Systems; Microformats; Social Authorship]
Other Technologies Impact the Semantic Web

Where is the Current US Semantic Web Action?
• Some Venture Capital
  • Vulcan, Crosslink, In-Q-Tel
• A modest amount of Federal funding
• Interesting corporate developments
  • Startup: Radar, Metaweb, Evri...
  • Mature: Yahoo!, Oracle, Lilly...
• Focus is mostly Database dimension of Semweb
  • RDBMS scale and orientation, powerful analytics (= powerful logics and inference engines)
  • Centralized workflows for ontology definition and management
  • Use cases surrounding data integration
• Emerging microformats and structured blogging (e.g., Twine)
  • ... But mainly enterprise concerns

Where is the Current European Semantic Web Action?
• Follow the money
  • Currently >€50M/year public funding from the European Commission (Mark’s estimate)
  • Framework 6 (2002-6): 17 separate semantics IT programs
  • Framework 7 (2007-13): €1B/year for information and communications technologies
• Two Dedicated Multi-site R&D Institutes
  • Semantic Technology Institute International
  • DERI: 100+ people, major sites in Galway, Innsbruck, Korea
• Focus is the Social and Web Dimensions of Semweb
  • Web-scale, social networks, simple scalable imperfect inference
  • Ontology and data dynamism, imperfections, versioning
  • Semantically-boosted collaboration with limited knowledge engineer involvement
  • A base of socially-curated semantic data
• Explicit European vs. US competitiveness theme

Talk Outline: European Work Beyond RDF and OWL
• Web-Scale Semantics
  • Semantic MediaWiki
  • DBpedia and Linking Open Data
  • Networked Ontologies (NeOn)
• Web-scale Inference
  • Shallow reasoning and the Large Knowledge Collider
Social and Web Dimensions of Semantic Web

Semantic Wikis: The Main Idea
• Wikis are tools for Publication and Consensus
• MediaWiki (software for Wikipedia, Wikimedia, Wikinews, Wikibooks, etc.)
  • Most successful Wiki software
    • High performance: 10K pages/sec served, scalability demonstrated
    • LAMP web server architecture, GPL license
  • Publication: simple distributed authoring model
    • Wikipedia: >2M articles, >180M edits, 750K media files, #8 most popular web site in October
  • Consensus achieved by global editing and rollback
    • Fixpoint hypothesis (2:1 discussion/content ratio), consensus is not static
    • Gardener/admin role for contentious cases
• Semantic Wikis apply the wiki idea to basic (typically RDFS) structured information
  • Authoring includes instances, data types, vocabularies, classes
  • Natural language text for explanations
  • Automatic list generation from structured data, basic analytics
  • Searching replaces category proliferation
  • Reuse of wiki knowledge
Semantic Wiki Hypotheses:
(1) Significant interesting non-RDBMS Semantic Data can be collected cheaply
(2) Wiki mechanisms can be used to maintain consensus on vocabularies and classes

Semantic MediaWiki
• Knowledge Authoring Capabilities (SMW 1.0 plus Halo Extension)
  • Syntax highlighting when editing a page
  • Semantic toolbar in edit mode
    • Displays annotations present on the page that is edited
    • Allows changing annotation values without locating the annotation in the wiki text
  • Autocompletion for all instances, properties, categories and templates
  • Increased expressivity through n-ary relations (available with the SMW 1.0 release)
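
To make "annotations present on the page" concrete, here is a minimal sketch of reading and updating inline annotations in wiki text. It assumes the `[[property::value]]` inline form (with an optional `|label` suffix); the slides do not show the syntax, and the function names `extract_annotations` and `set_annotation` are hypothetical, not part of SMW or the Halo Extension.

```python
import re

# Assumed annotation form: [[property::value]] with an optional |label suffix.
# Plain links such as [[Chemistry]] contain no "::" and are ignored.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_annotations(wikitext):
    """Return (property, value) pairs found in the wiki text, in order."""
    return [(prop.strip(), value.strip())
            for prop, value in ANNOTATION.findall(wikitext)]


def set_annotation(wikitext, prop, new_value):
    """Replace the value of every annotation for `prop`; other text is untouched.

    Note: this simplified version drops any |label on the rewritten annotation.
    """
    def swap(match):
        if match.group(1).strip() != prop:
            return match.group(0)
        return f"[[{match.group(1)}::{new_value}]]"
    return ANNOTATION.sub(swap, wikitext)


text = ("Sodium is an [[is a::alkali metal]] with "
        "[[atomic number::12|twelve]]; see [[Chemistry]].")
annotations = extract_annotations(text)
fixed = set_annotation(text, "atomic number", "11")
```

A toolbar like the one described could list `annotations` beside the editor and call something like `set_annotation` so the author never has to find the markup by hand.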

Semantic MediaWiki
• Semantic Navigation Capabilities (SMW 1.0 plus Halo Extension)
  • GUI-based ontology browser, enables browsing of the wiki's taxonomy and lookup of instance and property information
  • Linklist in edit mode, enables quick access of pages that are within the context of the page being currently edited
  • Search input field with autocompletion, to prevent typing errors and give a fast overview of relevant content

Semantic MediaWiki
• Knowledge Retrieval Capabilities (SMW 1.0 plus Halo Extension)
  • Combined text-based and semantic search
  • Basic reasoning in ask queries with sub-/super-category/-property reasoning and resolution of redirects (equality reasoning)
  • GUI-based query formulation interface for intuitive assembly and output generation of ASK queries (no SQL/MQL/SPARQL)
• Fully open source under GPL
• Extensive formal user testing
• Download at: http://www.ontoworld.org/wiki/Halo_Extension
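
The "basic reasoning in ask queries" bullet can be illustrated with a small sketch: a category query that also returns members of subcategories, plus redirect resolution treated as equality. The data and the function names (`resolve`, `subcategories`, `ask_category`, `is_member`) are hypothetical toy examples, not the Halo Extension's implementation.

```python
from collections import deque

# Hypothetical toy wiki data (illustrative only).
SUBCATEGORY_OF = {
    "Alkali metal": "Metal",
    "Metal": "Chemical element",
    "Noble gas": "Chemical element",
}
REDIRECTS = {"Natrium": "Sodium"}
PAGE_CATEGORY = {"Sodium": "Alkali metal", "Neon": "Noble gas", "Iron": "Metal"}


def resolve(page):
    """Follow redirects to the target page (a simple form of equality reasoning)."""
    seen = set()
    while page in REDIRECTS and page not in seen:
        seen.add(page)
        page = REDIRECTS[page]
    return page


def subcategories(category):
    """Return the category together with all direct and indirect subcategories."""
    found = {category}
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for child, parent in SUBCATEGORY_OF.items():
            if parent == current and child not in found:
                found.add(child)
                queue.append(child)
    return found


def ask_category(category):
    """Pages filed under the category or any of its subcategories, sorted."""
    wanted = subcategories(category)
    return sorted(page for page, cat in PAGE_CATEGORY.items() if cat in wanted)


def is_member(page, category):
    """Membership test that first resolves redirects."""
    return resolve(page) in ask_category(category)
```

With this data, asking for "Metal" returns Iron and Sodium even though Sodium is only filed under "Alkali metal", and the redirect "Natrium" is treated as the same page as "Sodium".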

Cool Stuff... But Does it Work?
Semantic Wiki Hypotheses:
(1) Significant interesting non-RDBMS Semantic Data can be collected cheaply
(2) Wiki mechanisms can be used to maintain consensus on vocabularies and classes
• User tests were performed in Chemistry
  • 20 graduate students were each paid for 20 hours (over 1 month) to collaborate on semantic annotation for chemistry
  • ~700 Wikipedia base articles
  • US high-school AP exams were provided as content guidance
• Results
  • Sparse: 1164 pages (entities), average 5 assertions per entity
  • 226 Relations (1123 relation-statements) and 281 attributes (4721 attribute-statements)
  • Many bizarre attributes and relations
  • Very difficult to use with a reasoner
• Ongoing Vulcan-sponsored work on Semantic MediaWiki
  • Higher-quality authoring: Phase II Halo Wiki extensions done by February
  • Higher-quality editing: support for Semantic Gardeners (RKF lesson learned)
• Very little US-based awareness of these issues, let alone their solutions
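
The reported figures can be cross-checked with a few lines of arithmetic. This sketch assumes that "assertions" means relation-statements plus attribute-statements, and that all statements were produced during the paid hours; neither assumption is stated on the slide.

```python
# Figures taken from the slide.
pages = 1164
relation_statements = 1123
attribute_statements = 4721
students = 20
hours_per_student = 20

# Assumption: an "assertion" is either a relation-statement or an attribute-statement.
total_statements = relation_statements + attribute_statements
avg_per_entity = total_statements / pages

# Assumption: every statement was authored within the paid hours.
paid_hours = students * hours_per_student
statements_per_hour = total_statements / paid_hours

print(f"{total_statements} statements, {avg_per_entity:.2f} per entity, "
      f"{statements_per_hour:.2f} per paid hour")
```

Under those assumptions the total is 5844 statements, which matches the slide's "average 5 assertions per entity" (about 5.02).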

DBpedia: Populating the Semantic Web
• Mine Wikipedia for assertions
  • Scrape Wikipedia Factboxes
    • ~15M triples
  • High-confidence shallow English parsing
• DBpedia dataset
  • ~2M things, ~100M triples
  • Classifications via Wikipedia categories and WordNet synsets
  • One of the largest broad knowledge bases in the world
• Simple queries over extracted data
  • Public SPARQL endpoint
  • “Sitcoms set in NYC”
  • “Soccer players from team with stadium with >40000 seats, who were born in a country with more than 10M inhabitants”
• We created a Semantic MediaWiki instance augmented by DBpedia data
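
A query like “Sitcoms set in NYC” boils down to joining triple patterns. The following is a minimal in-memory sketch of that idea, not a SPARQL client: the triples and the predicate names `type` and `setIn` are made up for illustration and are not DBpedia's actual vocabulary.

```python
# Toy triples; predicate names are illustrative, not DBpedia's vocabulary.
TRIPLES = [
    ("Seinfeld", "type", "Sitcom"),
    ("Seinfeld", "setIn", "New York City"),
    ("Friends", "type", "Sitcom"),
    ("Friends", "setIn", "New York City"),
    ("Cheers", "type", "Sitcom"),
    ("Cheers", "setIn", "Boston"),
    ("Law & Order", "type", "Drama"),
    ("Law & Order", "setIn", "New York City"),
]


def match(pattern, triples=TRIPLES):
    """Yield triples matching a (subject, predicate, object) pattern.

    None acts as a wildcard in any position.
    """
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


def sitcoms_set_in(city):
    """Join two patterns on the subject: ?s type Sitcom AND ?s setIn city."""
    sitcoms = {s for s, _, _ in match((None, "type", "Sitcom"))}
    located = {s for s, _, _ in match((None, "setIn", city))}
    return sorted(sitcoms & located)


result = sitcoms_set_in("New York City")
```

The soccer-player query on the slide has the same shape, only with a longer chain of joined patterns and numeric filters.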

Linking Open Data
• W3C Project primarily carried out in Europe
• Goals
  • Create a single, simple access mechanism for web RDF data
  • Build a data commons by making open data sources available on the Web as RDF
  • Set RDF links between data items from different data sources
• Total dataset
  • ~2B triples, and ~3B RDF links
  • Growing all the time

Networked Ontology Project (NeOn)
• Ever try to use 3-4 networked ontologies?
  • Location and characterization of ontology resources
  • Version control under multiple revisions
  • SOA and mapping management
  • Lifecycle issues
• NeOn is an EC Framework 6 Program (2006-2009)
  • ~€15M, 14 partners including UN FAO, pharmaceutical distribution
• Goals:
  • To create the first ever service-oriented, open infrastructure, and associated methodology
  • To support the overall development life-cycle of a new generation of large scale, complex, semantic applications
  • To handle multiple networked ontologies in a particular context, which are highly dynamic and constantly evolving
• Outputs: The open source (GPL) NeOn toolkit: http://www.neon-toolkit.org/

Web-Scale Reasoning: Scalable, Tolerant, and Dynamic
The Larger KR Environment: The Evolving Web
[Figure labels: Blog proliferation; Meme propagation; Intelligent filtering; Trust networks; Wikipedia; Intelligent data aggregation; Personalized search; Email/desktop search; Enterprise intelligence; Search Evolution; Blogosphere; The Intelligent Web; Datalator; PAL/CALO; Halo; Analytics/Software agents/Reasoning; Smart Browsing; RSS feeds; Furl; Pluck; Onfolio; UCMore; A9 (search history); Inference/Intelligent discovery; Context extraction; Data integration; Ontoprise; SemanticInterconnect; Google sets; Social networks; Radar Networks; LinkedIn; Schemalogic; A9 (discovery); Blinkx]

Semantics at Web Scale
• Semantics are always changing
  • Per minute, there are:
    • 100 edits in Wikipedia
    • 200 tags in del.icio.us
    • 270 image uploads to flickr
    • 1100 blog entries
  • Will the Semantic Web be less dynamic?
• There is no “right ontology”
  • Ontologies are abstractions
  • Different applications lead to different ontologies
  • Ontology authors make design choices all the time
  • Google Base: >100K schemas
• Intentionally false material (Spam)
  • Lesson of the HTML <META> tag
Material from Denny Vrandečić, AIFB

Consequences
• Semantic Technologies at Web Scale?
  • Sindice (www.sindice.com) is now reporting over 3B triples
  • 20% of 30 billion pages @ 1000 triples per page = 6 trillion triples
  • 30 billion and 1000 are underestimates; imagine 6 years from now…
• Classical reasoning approaches to Semantic Web will not scale
  • Examples of current attempts at scalability
    • Identify subsets of OWL (OWL-DL, OWL-DLP)
      • “Reducing the expressive power of a logic does not solve any problems faster; its only effect is to make some problems impossible to state.” (John Sowa)
    • Identify alternative semantics for OWL
      • e.g. LP-style semantics
    • Scalability by muscle-power
• What do we know about distributed, heuristic, approximate, probabilistic inference?
• What do we know about average complexity over web-authored data?
  • Classical (worst case) complexity is a poor guide for usefulness
Gartner (May 2007, G00148725): “By 2012, 70% of public Web pages will have some level of semantic markup, 20% will use more extensive Semantic Web-based ontologies”
Material from Frank van Harmelen, Vrije Universiteit Amsterdam
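
The back-of-the-envelope estimate on this slide can be reproduced directly. The inputs are the slide's own numbers; the comparison against Sindice's reported count is an added illustration, and it uses 3B as a round figure for "over 3B".

```python
# Inputs from the slide.
web_pages = 30_000_000_000          # 30 billion pages
ontology_share_percent = 20         # Gartner: 20% use more extensive ontologies
triples_per_page = 1_000

# Integer arithmetic keeps the result exact.
semantic_pages = web_pages * ontology_share_percent // 100
projected_triples = semantic_pages * triples_per_page

# Illustration: how far the projection is from Sindice's reported count,
# taking "over 3B" as exactly 3B for the purpose of the ratio.
sindice_triples = 3_000_000_000
growth_factor = projected_triples // sindice_triples

print(f"{projected_triples:,} projected triples, "
      f"about {growth_factor:,}x the Sindice count")
```

This gives 6 trillion triples, roughly 2000 times what Sindice reported at the time, which is the gap the "will not scale" argument rests on.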

More Consequences
• Sloppy Ontologies and the need for sloppy reasoning
  • OWL has no support for “almost”, “yes, except for a few”, etc.
  • This was OK, as long as ontologies were well-designed, carefully populated, well maintained over definite problem spaces
  • Increasingly, ontologies are
    • Made by non-experts
    • Made by automatic scraping from file directories, mail folders, todo lists and contact lists
    • Made by machine learning from examples
  • Example: “post-doc” ≈ “young-researcher”
• Need for any time, any cost answers
  • Current inference systems are abrupt and expensive
  • Want to select quality and timeliness
• Completeness will be unachievable in practice
  • Data sources will be partial
  • Insufficient time to wait for an answer
(Courtesy Dieter Fensel)

The Large Knowledge Collider (LarKC)
• EC Framework 7 Program
• Goals of LarKC
  • Scaling to infinity
    • Give up soundness & completeness
    • Combine reasoning/retrieval and search
    • Heavy emphasis on probability, decision theory, anytime algorithms
  • Reasoning pipeline
    • Plugin architecture, with sampling
    • Explicit cost models
  • Public releases of LarKC platform
    • Public APIs enabling others to develop plug-ins
  • Encourage participation through Thinking@home
    • Kind of like SETI@Home
• Start in April 2008
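
As a rough illustration of "sampling" combined with "explicit cost models", here is a budget-bounded estimator in the spirit of an anytime algorithm: it trades completeness for a bounded cost and returns the best estimate it has when the budget runs out. This is a generic sketch, not LarKC's design; the function `anytime_estimate` and its parameters are hypothetical.

```python
import random


def anytime_estimate(items, predicate, budget, cost_per_check=1, seed=0):
    """Estimate the fraction of items satisfying `predicate` under a cost budget.

    Samples items at random (with replacement) and stops once another check
    would exceed the budget. Returns (estimate, checks_done, cost_spent);
    the estimate is None if the budget allows no checks at all.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = checked = spent = 0
    estimate = None
    while spent + cost_per_check <= budget:
        item = rng.choice(items)
        hits += bool(predicate(item))
        checked += 1
        spent += cost_per_check
        estimate = hits / checked
    return estimate, checked, spent


# Toy "knowledge base": which of 1000 entities satisfy some property?
entities = list(range(1000))
estimate, checked, spent = anytime_estimate(
    entities, lambda x: x % 4 == 0, budget=200)
```

A larger budget buys a tighter estimate, and a caller with less time simply passes a smaller one; the answer is approximate in both cases, which is the trade the slide describes as giving up soundness and completeness.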