Linked Enterprise Data. Leveraging the Semantic Web stack in a corporate environment ISWC 2012 – Boston Fabrice LACROIX – lacroix@antidot.net. Antidot – who we are. French-based Software Vendor Since 1999 | Paris, Lyon, Aix-en-Provence Information access | Data management
Linked Enterprise Data Leveraging the Semantic Web stack in a corporate environment ISWC 2012 – Boston Fabrice LACROIX – lacroix@antidot.net
Antidot – who we are • French-based Software Vendor • Since 1999 | Paris, Lyon, Aix-en-Provence • Information access | Data management • Mission: Provide our customers with innovative customizable solutions that help them create value with their data, and make their employees more aware and efficient.
Clients Enterprises Publishing E-commerce Healthcare
Unstructured documents • files, ECM, collaborative spaces • intranet, extranet, Web sites • e-mails, instant messaging
Structured data • CRM, ERP, directory • knowledge bases • business applications (production, support)
IS are bloated • 1 practice => 1 need => 1 application => 1 silo • Information system is driven by the process • Data are numerous, various and scattered
Solutions or workarounds? • BI • MDM • SOA • Search
Solutions and workarounds • Enterprise Search brings little value to users • Document oriented • Does not solve real business problems (Google-like, Verity-like)
What we want ERP CRM Production LDAP ECM Support Files
Changing the paradigm • Switching from an application view to a data centric way of thinking.
Bring out the implicit • Build the Giant Enterprise Graph
LED • Linked Enterprise Data application of the Semantic Web technologies and Linked Data principles to the enterprise infrastructure
What works for the Web… • Federating silos on the Web http://www.w3.org/People/Ivan/CorePresentations/RDFTutorial/Slides.html#(102)
…can’t always be used in corporate IS • Legacy apps can’t be "Sparql’ed" • The 80% of data that is un- or semi-structured doesn’t fit in the model as such • Defining vocabularies/ontologies for silos is too complex and expensive • Users don’t want RDF per se but valuable information • External data is available in XML/JSON through Web Services • Staff are trained for RDB, XML, Web apps • No-risk and stability strategy: SemWeb technology is considered new and immature
The RDF/storage approach • Setting up a global RDF repository does not work either • IT departments are afraid of the "RDF everywhere" activists
Semantic Web technology is still the right solution in a corporate environment, BUT it is not an end in itself: JUST use it as a means
Just do it • Think of it as a stream paradigm • build new objects using existing data • without interfering with the existing infrastructure • with SemWeb somewhere under the hood
Enterprise Graph HowTo • Construct the graph • generate triples from data • create triples from documents • Leverage the graph • enrich • infer • Browse the graph • select resources • build objects • Trash the graph
How: extract & normalize • Harvest and normalize • as in an ETL • fetch, clean, transform… • normalize records (names, IDs) to prepare the linking step • For databases • db2triples: an RDB2RDF implementation by Antidot (open source, W3C validated)
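The "normalize records (names, IDs)" step can be illustrated with a minimal sketch. The record layout, the field names and the two helper functions below are assumptions made for the example, not part of Antidot's tooling.

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Strip accents and punctuation, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def normalize_id(raw: str) -> str:
    """Keep letters and digits only, uppercased, so ' c-0042' matches 'C0042'."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

# Hypothetical records for the same customer, coming from two silos.
crm_record = {"id": " c-0042", "name": "Électricité  Dupont"}
erp_record = {"id": "C0042", "name": "ELECTRICITE DUPONT"}

same_id = normalize_id(crm_record["id"]) == normalize_id(erp_record["id"])
same_name = normalize_name(crm_record["name"]) == normalize_name(erp_record["name"])
```

Once both silos agree on the normalized key, the linking step can mint one URI per real-world entity instead of one per silo.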
How: semantize • Don’t transform everything into RDF • cherry-pick a subset of interesting fields for each object and create their RDF triple counterparts • interesting == needed for linking or inferring
How: semantize • Triples generation • Be smart: avoid upfront ontology design, use small vocabularies • Be pragmatic: transform XML tags and field names to predicates • Be agile: only insert what you need. And when you need more, add more. • Semantic Web fuels the modeling, linking and information building process
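As a rough sketch of "transform XML tags and field names to predicates" combined with cherry-picking, the snippet below turns one XML record into plain (subject, predicate, object) tuples. The `http://example.org/` namespace, the sample record and the `record_to_triples` helper are assumptions for illustration; a real pipeline would likely emit proper RDF through a library.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical namespace

def record_to_triples(xml_text, id_field, keep):
    """Create one triple per kept field; the tag name becomes the predicate."""
    root = ET.fromstring(xml_text.strip())
    subject = f"{BASE}{root.tag}/{root.findtext(id_field)}"
    triples = []
    for child in root:
        # Only insert what is needed for linking or inferring.
        if child.tag in keep and child.text:
            predicate = f"{BASE}vocab/{child.tag}"
            triples.append((subject, predicate, child.text.strip()))
    return triples

xml_record = """
<customer>
  <id>C0042</id>
  <name>Dupont</name>
  <account_manager>jdoe</account_manager>
  <fax>n/a</fax>
</customer>
"""

triples = record_to_triples(xml_record, "id", keep={"name", "account_manager"})
```

Adding a field later only means extending `keep`, which matches the "when you need more, add more" advice.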
How: semantize • Unstructured documents • Extract metadata and transform it to RDF as needed. • Ex: author => dc:creator • Use text mining to extract named entities: people, organizations, products… • build the entity lists from the existing data sources: directory for employees, CRM for companies and people, ERP for products • create triples like doc_URI quotes entity_URI
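A minimal sketch of both ideas: the author metadata is mapped to dc:creator, and entities are spotted with a plain dictionary lookup standing in for real text mining. The gazetteer content, the `quotes` predicate URI and the `document_triples` helper are assumptions for the example.

```python
import re

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"  # Dublin Core element
QUOTES = "http://example.org/vocab/quotes"  # hypothetical predicate

# Hypothetical gazetteer built from the directory, CRM and ERP: label -> URI.
gazetteer = {
    "jane doe": "http://example.org/employee/jdoe",
    "acme": "http://example.org/company/acme",
}

def document_triples(doc_uri, metadata, text):
    """Map metadata to RDF and link the document to the entities it mentions."""
    triples = []
    if "author" in metadata:
        triples.append((doc_uri, DC_CREATOR, metadata["author"]))
    lowered = text.lower()
    for label, entity_uri in gazetteer.items():
        # Whole-word match only, so "acme" does not fire inside another word.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            triples.append((doc_uri, QUOTES, entity_uri))
    return triples

triples = document_triples(
    "http://example.org/doc/17",
    {"author": "John Smith"},
    "Meeting notes: Jane Doe will call Acme next week.",
)
```

Because the gazetteer comes from the enterprise's own sources, every match already points at a URI that exists elsewhere in the graph.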
How: semantize • Unstructured documents • Compare documents using various dedicated algorithms • is the same • is included • is similar • is related • Generate new triples • create triples like <docA> is_sub_version_of <docB>
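The slides do not say which comparison algorithms are used. As one possible illustration, the sketch below compares word shingles and uses Jaccard overlap to pick a relation; the predicate names, the shingle size and the 0.5 threshold are all assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def compare(doc_a, doc_b, similar_at=0.5):
    """Name the relation from doc_a to doc_b, or None if they share nothing."""
    a, b = shingles(doc_a), shingles(doc_b)
    if not a or not b:
        return None  # too short to compare
    if a == b:
        return "is_same_as"
    if a <= b:
        return "is_included_in"
    jaccard = len(a & b) / len(a | b)
    if jaccard >= similar_at:
        return "is_similar_to"
    return "is_related_to" if jaccard > 0 else None

doc_a = "the quarterly report covers sales in europe"
doc_b = "the quarterly report covers sales in europe and asia"

relation = compare(doc_a, doc_b)
new_triple = ("docA", relation, "docB")
```

Each non-empty result becomes a new triple between the two document URIs, in the same way as the is_sub_version_of example.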
How: enrich • Enrich the graph • run specific algorithms to generate more links and triples (classifiers, topic detection, …) • insert external data gathered from the LOD or other external datasets or APIs
How: infer • Create new knowledge • add rules according to your needs IF a coworker is quoted in documents AND this coworker belongs to a business unit THEN the business unit is bound to the documents
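The IF/THEN rule above can be sketched as a single forward-chaining pass over an in-memory set of triples. The predicate names (`quotes`, `member_of`, `bound_to`) and the sample identifiers are assumptions; in a real deployment the rule would more likely be expressed in the triplestore's rule language.

```python
QUOTES = "quotes"        # document quotes a coworker (hypothetical predicate)
MEMBER_OF = "member_of"  # coworker belongs to a business unit (hypothetical)
BOUND_TO = "bound_to"    # business unit is bound to a document (hypothetical)

def infer_unit_bindings(triples):
    """Apply the rule once and return only the newly derived triples."""
    memberships = [(s, o) for s, p, o in triples if p == MEMBER_OF]
    inferred = set()
    for doc, predicate, person in triples:
        if predicate != QUOTES:
            continue
        for member, unit in memberships:
            if member == person:
                inferred.add((unit, BOUND_TO, doc))
    return inferred - set(triples)

graph = {
    ("doc:17", QUOTES, "emp:jdoe"),
    ("emp:jdoe", MEMBER_OF, "bu:sales"),
    ("doc:18", QUOTES, "emp:asmith"),  # no known business unit: nothing inferred
}

new_triples = infer_unit_bindings(graph)
graph |= new_triples
```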
How: build • Build • select the resources corresponding to object seeds (using Sparql queries) • for each seed, follow links smartly in order to create basic objects
How: build • Finalize • decorate the new knowledge objects with data set apart (not loaded in the triplestore) • now we have rich user-actionable objects
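The build and finalize steps can be sketched end to end on a small in-memory graph. A plain filter stands in for the Sparql seed query, links are followed one hop only, and `side_store` plays the role of the data kept outside the triplestore; all names and sample values are assumptions.

```python
RDF_TYPE = "rdf:type"

def select_seeds(triples, rdf_class):
    """Stand-in for a Sparql SELECT: every resource of the given class."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == rdf_class)

def build_object(triples, seed, follow):
    """Follow the chosen links one hop from the seed to make a basic object."""
    obj = {"uri": seed}
    for s, p, o in triples:
        if s == seed and p in follow:
            obj.setdefault(p, []).append(o)
    return obj

def finalize(obj, side_store):
    """Decorate the object with data that was never loaded as triples."""
    return {**obj, **side_store.get(obj["uri"], {})}

graph = [
    ("cust:42", RDF_TYPE, "Customer"),
    ("cust:42", "has_contract", "contract:7"),
    ("cust:42", "has_ticket", "ticket:301"),
    ("cust:42", "has_ticket", "ticket:302"),
    ("emp:jdoe", RDF_TYPE, "Employee"),
]
side_store = {"cust:42": {"display_name": "Dupont"}}

objects = [
    finalize(build_object(graph, seed, {"has_contract", "has_ticket"}), side_store)
    for seed in select_seeds(graph, "Customer")
]
```

The resulting dictionaries are ordinary objects that can be indexed or served without the consumer knowing that triples were involved.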
How: expose • Make the new information available to users and to the entire IS [Diagram labels: Harvest, Normalize, Semantize, Enrich, Classify, Annotate, Relational DB, RDF Triplestore (Linked Data), Indexation, AFS search engine]
Conclusion • It works! • The triples we create and the inference rules we add are dictated by the goal / application • usage and value oriented • We benefit from the lazy-flexible-dynamic modeling of RDF-RDFS-OWL • we are agile • What matters is the graph. But the graph is not the triplestore • storage independent
There’s an app for that • Antidot Information Factory • a software solution designed specifically to leverage structured and unstructured data • enable large-scale processing of existing data • automate publishing of enriched or newly created information • Harvest → Normalize → Semantize → Enrich → Build → Expose
The Giant Enterprise Graph • Now we have a path to let SemWeb enter the enterprise
Thanks for your attention • QUESTIONS? • Discuss, Understand, Learn, Exchange • www.antidot.net • info@antidot.net