The OKKAM project: the quest for a web of uniquely identified entities. Stefano Bocconi
OKKAM Introduction
What is OKKAM? • 30-month IP European project (http://fp7.okkam.org/), started 01/01/2008 • “Enable the Web of Entities, a global digital space for publishing and managing information about entities, where every entity is uniquely identified, and links between entities can be explicitly specified and exploited in a variety of scenarios.” • Entity Name Server (like DNS), one resource -> one ID
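The "one resource -> one ID" idea can be illustrated with a minimal sketch. This is a toy in-memory registry, not the OKKAM Entity Name Server: the class name, the `resolve` and `lookup` methods, the `example.org` base URI and the name-plus-kind matching rule are all assumptions made for illustration.

```python
import uuid


class EntityNameServer:
    """Toy in-memory registry: each distinct entity gets exactly one ID.

    Hypothetical sketch only; the real service would use matching and
    ranking over rich entity profiles, not exact normalized names.
    """

    def __init__(self, base="http://example.org/entity/"):
        self._base = base              # assumed placeholder namespace
        self._id_by_key = {}           # normalized (kind, name) -> ID
        self._profile_by_id = {}       # ID -> stored description

    @staticmethod
    def _key(name, kind):
        # Normalize case and whitespace so trivial variants collapse.
        return (kind.strip().lower(), " ".join(name.split()).lower())

    def resolve(self, name, kind):
        """Return the existing ID for this entity, or mint a new one."""
        key = self._key(name, kind)
        if key not in self._id_by_key:
            entity_id = self._base + uuid.uuid4().hex
            self._id_by_key[key] = entity_id
            self._profile_by_id[entity_id] = {"name": name, "kind": kind}
        return self._id_by_key[key]

    def lookup(self, entity_id):
        """Return the stored profile for an ID, or None if unknown."""
        return self._profile_by_id.get(entity_id)
```

Asking twice for the same entity returns the same identifier, which is the property the DNS analogy points at; everything hard (deciding that two descriptions denote the same entity) is hidden in `_key` here.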
The 3 Pillars • Infrastructure • Distributed, large-scale repository • Matching and ranking algorithms, entity lifecycle • Privacy and Security • Okkamized Content • “Okkamizers” (services) and OKKAM-empowered tools • Entity Centric Applications • Authoring tools • Search engine • Product-centered knowledge management solution
OKKAM Use Case
Entity Centric Authoring Environment • Editor (e.g. Word) with an OKKAM plug-in • Entities are recognized in documents, giving the possibility to provide additional information • Fields of application: • FEBS Letters, journal of molecular biosciences, focus on proteins and their interactions • ANSA, Italian news agency, focus on people, events, political parties, places, etc. • Scientific papers, automatic references
Partners • ANSA • News production, authoring and distribution • Content provider • University of Trento • Metadata extraction • Semantic Web technologies • Elsevier • Entity-centric publishing as a future! • Text mining, linking experiments ongoing • Expert System SPA • Semantic intelligence solution provider • Natural Language Processing
Entity Centric Authoring Environment • Natural Language Processing • Determining whether something is an entity • Providing context info to query the OKKAM repository • Information integration • From external sources via the OKKAM id • Creation of new OKKAM ids • Updating profile information • Web-service-based architecture to reuse functionality
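The steps on this slide (recognize an entity, query the repository, create a new id or update the profile) can be sketched as follows. This is a simplified illustration under loud assumptions: `find_candidates` is a naive capitalized-word heuristic standing in for real NLP, the repository is a plain dictionary, and the `urn:example:` identifiers are placeholders, not OKKAM ids.

```python
import re
import uuid


def find_candidates(text):
    """Stand-in for NLP: treat runs of capitalized words as candidates.

    Deliberately naive; it also picks up ordinary sentence-initial words.
    """
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def annotate(text, repository):
    """Link each candidate mention in `text` to an id in `repository`.

    `repository` is a hypothetical dict: lowercased name -> record.
    Unknown mentions get a new id; known ones have their profile updated.
    """
    annotations = {}
    for mention in find_candidates(text):
        key = mention.lower()
        record = repository.get(key)
        if record is None:
            # No match in the repository: create a new identifier.
            record = {"id": "urn:example:" + uuid.uuid4().hex, "contexts": []}
            repository[key] = record
        # Update profile information with the context of this mention.
        record["contexts"].append(text)
        annotations[mention] = record["id"]
    return annotations
```

Calling `annotate` on two documents that both mention the same name yields the same id in both, and the stored record accumulates both contexts.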
OKKAM Theoretical issues
Relevant areas • Identity management • OKKAM is horizontal, not vertical. Integration of existing ID systems (DOI, OpenID) • Entity Identity • Data-level & schema level matching • Adaptation • Large scale repository management • Queries, ranking • Information Integration & Grounding of the Semantic Web • Models of security, privacy and trust • Some info private to third parties
Entities • Individuals, particulars, instances • Products, organizations, associations, countries, events, publications, hotels, people • Fictional objects (e.g. Pegasus), from the past (e.g. Plato), abstract (e.g. Gödel's theorem) • No universal objects, like classes or properties • “forcing” the use of the same URIs for logical resources is in principle likely to fail, as people tend to have different views even about the same domain • No fixed schema to store info (loss of generality)
Open issues about entities • ANSA case: event “Microsoft acquires Yahoo!” • I need to retrieve exactly that, and compare the same news from Reuters • Is it an entity? Or a combination of entities? • Separation between entities and knowledge about entities • Do we want to say something about acquisition as a class? • Any class is an instance at some conceptualization level (and vice versa)?
Open issues/thoughts • Can there be such a thing as a private entity? • Trust, authority, the SW never cared • TF-IDF under the hood… Sweeping the problem under the rug? • No enforcement of a schema or hierarchy, BUT good P&R and distributed databases
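To make the TF-IDF remark concrete, here is a minimal sketch of ranking schema-free entity profiles against a query by TF-IDF cosine similarity. The slides do not say how OKKAM applies TF-IDF, so this is a generic textbook-style formulation (with a smoothed IDF), and the function names and sample profiles are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many profiles each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, an assumed variant chosen to avoid zero weights.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, profiles):
    """Return (score, profile) pairs, best match first."""
    vectors, idf = tfidf_vectors(profiles)
    q_tokens = tokenize(query)
    q_tf = Counter(q_tokens)
    # Query terms unseen in any profile are ignored.
    q_vec = {t: (c / len(q_tokens)) * idf[t] for t, c in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, v), p) for v, p in zip(vectors, profiles)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the profiles are treated as bags of words, no schema is needed, which is the appeal; the open question raised on the slide is whether such term statistics alone can decide that two descriptions denote the same entity.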
Questions? Thank you!!
OKKAM Related Work
Connections • Envisioned collaborations • The Large Knowledge Collider (platform for massive distributed incomplete reasoning) • What would you need to use it? • Need for particular entities to be modelled? • Can your research (potentially) contribute to OKKAM? • Do you see potentials/pitfalls?
Online sources • Online article databases • Science Direct • PubMed: a service of the U.S. National Library of Medicine that includes over 17 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s • MEDLINE: source of life sciences and biomedical bibliographic information, with nearly eleven million records (not public) • FEBS Letters (not public) • Databases of proteins • MINT, the Molecular INTeraction database • UniProt (Universal Protein Resource): catalog of information on proteins • Molecular Interaction (MI) • Controlled vocabularies • EMTREE: Elsevier’s Life Science Thesaurus, a hierarchically structured, controlled vocabulary for biomedicine and related life sciences.
Related initiatives: Sources • DBpedia: 218 million RDF triples, 2,180,000 “things”, including at least 80,000 persons, 293,000 places, 62,000 music albums, 36,000 films extracted from English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian versions of Wikipedia (02/08). • Freebase is an open database of the world’s information. It covers millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC, it contains structured information on many popular topics, like movies, music, people and locations, all reconciled and freely available via an open API. • OpenCyc
Related initiatives: Parsers • Zemanta: a tool for bloggers that parses the text and recognizes named entities from Wikipedia, IMDb, Amazon, YouTube and maps, suggests tags, and suggests pictures from CC, Flickr and Getty as well as related articles from media and the blogosphere • Open Calais: a web service that, using natural language processing, machine learning and other methods, analyzes your document and finds the entities within it, as well as the facts and events hidden within your text. The web service is free for commercial and non-commercial use. Tools are being developed that use the web service functionality. • Powerset: NLP-enhanced search engine for Wikipedia • GATE-based web services and tools (KIM, Melita, SHOW, Annotea)
Related initiatives: People IDs • OpenID: open platform for cross-site identification, with different free providers. The username is a URL pointing at the provider • Xing, LinkedIn: business social connections • Wink: people search
Related initiatives: Editors • Tabulator: Firefox extension to edit RDF data online with completion
OKKAM Feedback
Strong Points • Very clear and understandable presentation, well presented, a lot of discussion • Good question answering: listens to questions, appropriate answers: good! Very good talk, stimulates discussion • Good presentation organization • Interesting presentation, well explained. Good interaction with the audience. Slides about entities and issues interesting!
Weak Points • Not clear what the timing/scope of the project is: very ambitious project! • What about the decentralized & autonomous principles of the Web? • Did not mention other systems that tag, for example, Web pages based on ontologies, like GATE-based web services and tools (KIM, Melita, SHOW, Annotea..) • Introduction about Web, IDs, ontologies was too vague for people not familiar with these issues • Too much of a “sales” talk. After 15 min still no in-depth problems/solutions: only arguments of use and an OKKAM-specific overview. I would like more insight into how to solve the problem, since we all understand the problem very well.
Suggestions • Some info on the current status/starting date would be nice • Before “Research areas”, add a figure to explain the mapping performed (one id->resource); it would allow easier comparison with DNS systems. • The architecture looks to be centralized. Why not use a totally distributed one instead? There exist some P2P DNS systems you could take inspiration from. • Skip the “Research areas” slide in such a short presentation. The goal is clear: focus on your solution and mention the problems from “research areas” when they are applicable • Don’t go into implementation details. Focus on the high-level concepts, methods and solutions. The problem and solutions are also valid outside OKKAM: talk about these. • General: Good presentation, but talk more about your work and issues instead of about OKKAM in general