1. UIMA and Semantic Search: Introductory Overview
www.ibm.com/research/uima
David A. Ferrucci
Senior Manager, Semantic Analysis and Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
WF – Data, Scale, High-Volume Crawls, greater and greater Throughput
UIMA – Enabling an accelerating ecosystem for creating reusable analytics and applications
2. 2 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
3. 3
4. 4
5. 5
6. 6 Overlaying Semantics vs. Extracting Knowledge: Improve Search Recall
Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
7. 7 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
8. UIMA – Quick Overview Architecture, Software Framework and Tooling Enabling Semantic Analysis
The Foundation of Semantic Search
9. 9
UIMA provides a development framework for building, describing, and integrating component analytics.
10. 10 Language, Speaker Identifiers
Tokenizers
Classifiers
Part of Speech Detectors
Document Structure Detectors
Parsers, Translators
Named-Entity Detectors
Face Recognizers
Relationship Detectors
Modality
Human Language
Domain of Interest
Source: Style and Format
Input/Output Semantics
Privacy/Security
Precision/Recall Tradeoffs
Performance/Precision Tradeoffs...
The right analysis for the job will likely be a best-of-breed combination integrating capabilities across many dimensions.
11. 11 UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for downstream processing. Another way of looking at the annotation process is in terms of the structure that is built up by each step in the flow.
This animation helps illustrate how annotators iterate over annotations, infer new annotations, and add them to a common data structure we call the Common Analysis Structure, or CAS.
A parser, for example, looks at tokens and infers and records grammatical-structure annotations like <click> “Noun Phrase”, “Verb Phrase”, and “Prepositional Phrase.”
These are added to the CAS as stand-off annotations. Stand-off annotations are data structures that are not embedded in the original document, like inline XML tags; instead they point into the document, indicating the span of text they label, as illustrated in this picture. (The community established the benefits of stand-off annotation structures prior to our work on UIMA: they do not corrupt the original document, and they allow different interpretations, for example overlapping spans.)
Next <click> a named-entity detector may get access to the CAS. It would consider the grammatical-structure annotations, as well as perhaps the tokens, to infer and record named-entity annotations <click>
like “Government Title”, “Person”, “Government Official”, “Country”, etc.
A relationship annotator might iterate over named entities to infer and record relationships between them, <click>
like “Located In”, or what we often call the “at” relation. This is itself an object in the CAS, with properties linking the relation annotation to its argument annotations: the entity and the location.
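The stand-off idea can be sketched with a toy data structure (this is illustrative, not the real UIMA API): each annotation carries begin/end character offsets into the unmodified text, so multiple, even overlapping, labels can coexist without touching the document itself.

```python
# Toy model of stand-off annotation: labels point into the text
# by character offsets; the source text itself is never modified.

class Annotation:
    def __init__(self, label, begin, end):
        self.label, self.begin, self.end = label, begin, end

    def covered_text(self, text):
        return text[self.begin:self.end]

text = "Washington is in the United States."
cas = [
    Annotation("Person", 0, 10),   # one interpretation of "Washington"
    Annotation("Place", 0, 10),    # an overlapping, competing interpretation
    Annotation("Country", 21, 34),
]

for a in cas:
    print(a.label, "->", a.covered_text(text))
```

Because the labels are stand-off, the two readings of "Washington" over the same span coexist in the CAS; an inline-markup scheme could not represent both without mangling the original text.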
12. 12
13. 13 UIMA: Unstructured Information Management Architecture Open Software Architecture and Emerging Standard
Platform independent standard for interoperable text and multi-modal analytics
Under Development: UIMA Standards Technical Committee Initiated under OASIS
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
Software Framework Implementation
SDK Available on IBM Alphaworks
http://www.alphaworks.ibm.com/tech/uima
Tools, Utilities, Runtime, Extensive Documentation
Creation, Integration, Discovery, Deployment of analytics
Java, C++, Perl, Python (others possible)
Supports co-located and service-oriented deployments (e.g., SOAP)
Cross-language, high-performance APIs to a common data structure (CAS)
Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2)
Apache UIMA open-source project
http://incubator.apache.org/uima/
14. 14 Consider a straightforward processing flow: a simple application that acquires documents from a collection, analyzes them, say first for tokens and then for named entities, and indexes the results to support search.
Now consider that we want to beef this application up. Our colleagues have developed some neat technologies, in particular a relation detector and some co-reference technology that can determine whether two different mentions (or strings) in a document likely refer to the same individual or entity.
<click>
In addition to these analysis capabilities, some other bright folks have developed something that extracts relations over entities and populates a relational database <click>
and this should be used to support efficient DB search for all the facts discovered about a particular entity.
<click>
So you want to add all this stuff to your original application.
A design point for UIMA was to make it possible and easy to plug in all these sorts of things.
The idea is that the application developer can use the UIMA middleware to incrementally extend an application’s analysis capability at the per-document analysis level, but also at the collection level, where we build different types of structured resources representing inferences over an entire collection, such as search engine indices and databases.
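The incremental-extension idea can be sketched as a pipeline of interchangeable stages (a simplified model, not the UIMA API; the stage functions here are invented for illustration): each stage reads and augments a shared analysis structure, so a new analytic like a relation detector plugs in without changing the existing stages.

```python
# Toy pipeline: each stage augments a shared "cas" dict in place,
# so new stages (relation detection, a DB consumer) plug in freely.

def tokenizer(cas):
    cas["tokens"] = cas["text"].split()

def named_entities(cas):
    # trivially tag capitalized tokens as entities (illustration only)
    cas["entities"] = [t for t in cas["tokens"] if t[:1].isupper()]

def relation_detector(cas):
    # pair adjacent entities as a stand-in for real relation logic
    ents = cas["entities"]
    cas["relations"] = list(zip(ents, ents[1:]))

def run_pipeline(text, stages):
    cas = {"text": text}
    for stage in stages:
        stage(cas)
    return cas

cas = run_pipeline("Alice met Bob",
                   [tokenizer, named_entities, relation_detector])
print(cas["relations"])  # [('Alice', 'Bob')]
```

Extending the application means appending another stage (or a consumer that writes to a database) to the list; nothing upstream has to change.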
15. 15 UIMA Component Architecture
The UIMA Collection Processing Architecture supports applying per-document Analysis Engines to collections of documents and building analysis results over entire collections: for example, search engine indices over all the tokens in the collection, term-frequency tables, or collection-wide entity co-reference chains. All of these are types of results that reflect inferences over the collection as a whole and are built up from per-document analyses. The most aggregate structure in this architecture is the Collection Processing Engine.
Let's take a look at how basic UIMA component aggregation and encapsulation builds up from a simple annotator to a collection processing engine.
We start with a simple annotator.
<click>
This gets encapsulated into a pluggable container called the analysis engine.
<click>
These may be aggregated into a workflow and encapsulated to form an aggregate analysis engine with the same interface as its more primitive counterpart.
<click>
This resultant capability may be inserted as the core analysis in a collection processing engine.
<click>
This component is an aggregate that includes a Collection Reader, to acquire documents from an external source, and CAS Consumers
<click>
to use the results of per-document analysis to build structured resources, like databases and indices.
And of course behind the scenes is the Common Analysis Structure (CAS), providing all these components with shared access to the artifact and the evolving analysis.
Worth noting, and as we shall see in detail later, all these UIMA components have what we call component descriptors, which contain declarative metadata describing their structure and functional capabilities. This component metadata is key for supporting composition, workflow validation or optimization, discovery, reuse, and tooling.
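The aggregation pattern (annotator, wrapped as an analysis engine, chained into an aggregate engine with the same interface) can be sketched as follows. This is a toy model of the idea, not the real UIMA classes:

```python
# Toy model of UIMA-style aggregation: a primitive engine wraps one
# annotator; an aggregate engine chains engines behind the same
# process(cas) interface, so callers cannot tell them apart.

class PrimitiveEngine:
    def __init__(self, annotator):
        self.annotator = annotator

    def process(self, cas):
        self.annotator(cas)

class AggregateEngine:
    def __init__(self, engines):
        self.engines = engines  # primitives or other aggregates

    def process(self, cas):
        for engine in self.engines:
            engine.process(cas)

def tokenize(cas):
    cas["tokens"] = cas["text"].split()

def count(cas):
    cas["n_tokens"] = len(cas["tokens"])

engine = AggregateEngine([PrimitiveEngine(tokenize), PrimitiveEngine(count)])
cas = {"text": "the quick brown fox"}
engine.process(cas)
print(cas["n_tokens"])  # 4
```

Because aggregates expose the same `process(cas)` interface as primitives, an aggregate can itself be dropped into a larger workflow, which is exactly the encapsulation step the slide describes.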
16. 16 CAS Multiplier: A generalization of the Collection Reader
17. 17 Example: Inline Segmenter
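A CAS Multiplier generalizes the Collection Reader: instead of only acquiring documents from an external source at the head of the flow, a component in mid-flow may emit multiple new CASes from one input, as an inline segmenter does when it splits a large document into pieces for downstream analysis. A toy sketch of that idea (not the UIMA API):

```python
# Toy CAS multiplier: a segmenter takes one input "CAS" and yields
# one new CAS per segment, which flows on to downstream analysis.

def inline_segmenter(cas):
    # split on blank lines; each paragraph becomes its own CAS
    for segment in cas["text"].split("\n\n"):
        yield {"text": segment}

def tokenizer(cas):
    cas["tokens"] = cas["text"].split()
    return cas

big_cas = {"text": "First paragraph here.\n\nSecond paragraph here."}
segments = [tokenizer(c) for c in inline_segmenter(big_cas)]
print(len(segments))  # 2
```

A Collection Reader is then just the degenerate case: a multiplier with no input CAS that emits one CAS per document in the collection.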
18. 18
19. 19 UIMA Component Repository at CMU http://uima.lti.cs.cmu.edu/index.html
20. Semantic Search Overview Using UIMA to Advance Search and Discovery
21. 21
22. 22 UIMA Pipeline for Keyword Search
23. 23 UIMA Pipeline for Semantic Search
martin
<person> martin </person>
<organization> martin </organization>
<owner> <person> </person> <organization> martin </organization> </owner>
+<disease> </disease> <facility> </facility>
24. 24 We index and search over tokens AND the semantic annotations
martin
<person> martin </person>
<organization> martin </organization>
<owner> <person> </person> <organization> martin </organization> </owner>
+<disease> </disease> <facility> </facility>
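Indexing tokens together with their semantic annotations can be sketched as a tiny inverted index (illustrative only; the type-qualified term syntax below is invented for the sketch): each document contributes postings for its plain tokens and for type-qualified terms, so a query can constrain a keyword like "martin" to a particular semantic type.

```python
# Toy semantic index: postings map both plain tokens and
# type-qualified terms like "person:martin" to document ids.
from collections import defaultdict

def build_index(docs):
    postings = defaultdict(set)
    for doc_id, (tokens, annotations) in docs.items():
        for t in tokens:
            postings[t].add(doc_id)
        for label, t in annotations:
            postings[f"{label}:{t}"].add(doc_id)
    return postings

docs = {
    1: (["martin", "founded", "acme"], [("person", "martin")]),
    2: (["lockheed", "martin", "builds"], [("organization", "martin")]),
}
postings = build_index(docs)

print(sorted(postings["martin"]))         # [1, 2] -- both senses
print(sorted(postings["person:martin"]))  # [1] -- only the Person sense
```

A plain keyword query still works (worst case degrades to keyword search), while a typed query filters out the wrong senses, which is the recall/precision point the surrounding slides make.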
25. 25 Overlaying Semantics vs. Extracting Knowledge: Improve Search Recall
Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
26. 26 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
27. 27 Concluding Remarks
Raising the Search Bar
Known-Item Search must evolve into Knowledge Gathering and Synthesis
Semantic Search can improve precision and recall
Graceful degradation: Worst-Case should be keyword search
Semantic Analysis is Key
Perfect, consistent or massive manual semantic annotation NOT likely
Automated annotation is essential
Many annotators must emerge
Must be easy to discover, combine, aggregate and deploy
UIMA is an enabling Integration Platform
Approximate Semantics
Universal semantic consensus won’t happen
But approximations can work to improve search applications
Improve precision, recall and density across artifact boundaries