1. UIMA and Semantic Search: Introductory Overview
www.ibm.com/research/uima
David A. Ferrucci
Senior Manager, Semantic Analysis and Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
WF – Data, Scale, High-Volume Crawls, greater and greater Throughput
UIMA – Enabling an accelerating ecosystem for creating reusable analytics and applications
2. 2 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
3. 3
4. 4
5. 5
6. 6 Overlaying Semantics vs. Extracting Knowledge: Improve Search Recall
Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
7. 7 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
8. UIMA – Quick Overview Architecture, Software Framework and Tooling Enabling Semantic Analysis
The Foundation of Semantic Search
9. 9
UIMA provides a development framework for building, describing, and integrating component analytics.
10. 10 Language, Speaker Identifiers
Tokenizers
Classifiers
Part of Speech Detectors
Document Structure Detectors
Parsers, Translators
Named-Entity Detectors
Face Recognizers
Relationship Detectors
Modality
Human Language
Domain of Interest
Source: Style and Format
Input/Output Semantics
Privacy/Security
Precision/Recall Tradeoffs
Performance/Precision Tradeoffs...
The right analysis for the job will likely be a best-of-breed combination integrating capabilities across many dimensions.
11. 11 UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for downstream processing. Another way of looking at the annotation process is in terms of the structure that is built up by each step in the flow.
This animation helps illustrate how annotators iterate over annotations, infer new annotations, and add them to a common data structure we call the Common Analysis Structure, or CAS.
A parser, for example, looks at tokens and infers and records grammatical-structure annotations like <click> “Noun Phrase”, “Verb Phrase”, and “Prepositional Phrase.”
These are added to the CAS as stand-off annotations. Stand-off annotations are data structures that are not embedded in the original document, like inline XML tags; instead they point into the document, indicating the span of text they label, as illustrated in this picture. (The community established the benefits of stand-off annotation structures prior to our work on UIMA: they do not corrupt the original document, and they allow different interpretations, for example overlapping spans.)
Next <click> a named-entity detector may get access to the CAS. It would consider the grammatical-structure annotations, as well as perhaps the tokens, to infer and record named-entity annotations <click>
like “Government Title”, “Person”, “Government Official”, “Country”, etc.
A relationship annotator might iterate over named entities to infer and record relationships between them, <click>
like “Located In”, or what we often call the “at” relation. This is itself an object in the CAS, with properties linking the relation annotation to its argument annotations: the entity and the location.
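The stand-off idea can be sketched with a toy data structure (this is illustrative, not the real UIMA API): each annotation carries begin/end character offsets into the unmodified text, so multiple, even overlapping, labels can coexist without touching the document itself.

```python
# Toy model of stand-off annotation: labels point into the text
# by character offsets; the source text itself is never modified.

class Annotation:
    def __init__(self, label, begin, end):
        self.label, self.begin, self.end = label, begin, end

    def covered_text(self, text):
        return text[self.begin:self.end]

text = "Washington is in the United States."
cas = [
    Annotation("Person", 0, 10),   # one interpretation of "Washington"
    Annotation("Place", 0, 10),    # an overlapping, competing interpretation
    Annotation("Country", 21, 34),
]

for a in cas:
    print(a.label, "->", a.covered_text(text))
```

Because the labels are stand-off, the two readings of "Washington" over the same span coexist in the CAS; an inline-markup scheme could not represent both without mangling the original text.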
12. 12
13. 13 UIMA: Unstructured Information Management Architecture Open Software Architecture and Emerging Standard
Platform independent standard for interoperable text and multi-modal analytics
Under Development: UIMA Standards Technical Committee Initiated under OASIS
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
Software Framework Implementation
SDK Available on IBM Alphaworks
http://www.alphaworks.ibm.com/tech/uima
Tools, Utilities, Runtime, Extensive Documentation
Creation, Integration, Discovery, Deployment of analytics
Java, C++, Perl, Python (others possible)
Supports co-located and service-oriented deployments (e.g., SOAP)
Cross-language, high-performance APIs to a common data structure (CAS)
Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2)
Apache UIMA open-source project
http://incubator.apache.org/uima/
14. 14 Consider a straightforward processing flow: a simple application that acquires documents from a collection, analyzes them, say first for tokens and then for named entities, and indexes the results to support search.
Now consider that we want to beef this application up. Our colleagues have developed some neat technologies, in particular a relation detector and some co-reference technology that can determine whether two different mentions (or strings) in a document likely refer to the same individual or entity.
<click>
In addition to these analysis capabilities, some other bright folks have developed something that extracts relations over entities and populates a relational database <click>
and this should be used to support efficient DB search for all the facts discovered about a particular entity.
<click>
So you want to add all this stuff to your original application.
A design point for UIMA was to make it possible and easy to plug in all these sorts of things.
The idea is that the application developer can use the UIMA middleware to incrementally extend an application’s analysis capability at the per-document analysis level, but also at the collection level, where we build different types of structured resources representing inferences over an entire collection, such as search engine indices and databases.
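The incremental-extension idea can be sketched as a pipeline of interchangeable stages (a simplified model, not the UIMA API; the stage functions here are invented for illustration): each stage reads and augments a shared analysis structure, so a new analytic like a relation detector plugs in without changing the existing stages.

```python
# Toy pipeline: each stage augments a shared "cas" dict in place,
# so new stages (relation detection, a DB consumer) plug in freely.

def tokenizer(cas):
    cas["tokens"] = cas["text"].split()

def named_entities(cas):
    # trivially tag capitalized tokens as entities (illustration only)
    cas["entities"] = [t for t in cas["tokens"] if t[:1].isupper()]

def relation_detector(cas):
    # pair adjacent entities as a stand-in for real relation logic
    ents = cas["entities"]
    cas["relations"] = list(zip(ents, ents[1:]))

def run_pipeline(text, stages):
    cas = {"text": text}
    for stage in stages:
        stage(cas)
    return cas

cas = run_pipeline("Alice met Bob",
                   [tokenizer, named_entities, relation_detector])
print(cas["relations"])  # [('Alice', 'Bob')]
```

Extending the application means appending another stage (or a consumer that writes to a database) to the list; nothing upstream has to change.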
15. 15 UIMA Component Architecture
The UIMA Collection Processing Architecture supports applying per-document Analysis Engines to collections of documents and building analysis results over entire collections: for example, search engine indices over all the tokens in the collection, term-frequency tables, or collection-wide entity co-reference chains. All of these are types of results that reflect inferences over the collection as a whole and are built up from per-document analyses. The most aggregate structure in this architecture is the Collection Processing Engine.
Let's take a look at how basic UIMA component aggregation and encapsulation builds up from a simple annotator to a collection processing engine.
We start with a simple annotator.
<click>
This gets encapsulated into a pluggable container called the analysis engine.
<click>
These may be aggregated into a workflow and encapsulated to form an aggregate analysis engine with the same interface as its more primitive counterpart.
<click>
This resultant capability may be inserted as the core analysis in a collection processing engine.
<click>
This component is an aggregate that includes a Collection Reader, to acquire documents from an external source, and CAS Consumers
<click>
to use the results of per-document analysis to build structured resources, like databases and indices.
And of course behind the scenes is the Common Analysis Structure (CAS), providing all these components with shared access to the artifact and the evolving analysis.
Worth noting, and as we shall see in detail later, all these UIMA components have what we call component descriptors, which contain declarative metadata describing their structure and functional capabilities. This component metadata is key for supporting composition, workflow validation or optimization, discovery, reuse, and tooling.
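The aggregation pattern (annotator, wrapped as an analysis engine, chained into an aggregate engine with the same interface) can be sketched as follows. This is a toy model of the idea, not the real UIMA classes:

```python
# Toy model of UIMA-style aggregation: a primitive engine wraps one
# annotator; an aggregate engine chains engines behind the same
# process(cas) interface, so callers cannot tell them apart.

class PrimitiveEngine:
    def __init__(self, annotator):
        self.annotator = annotator

    def process(self, cas):
        self.annotator(cas)

class AggregateEngine:
    def __init__(self, engines):
        self.engines = engines  # primitives or other aggregates

    def process(self, cas):
        for engine in self.engines:
            engine.process(cas)

def tokenize(cas):
    cas["tokens"] = cas["text"].split()

def count(cas):
    cas["n_tokens"] = len(cas["tokens"])

engine = AggregateEngine([PrimitiveEngine(tokenize), PrimitiveEngine(count)])
cas = {"text": "the quick brown fox"}
engine.process(cas)
print(cas["n_tokens"])  # 4
```

Because aggregates expose the same `process(cas)` interface as primitives, an aggregate can itself be dropped into a larger workflow, which is exactly the encapsulation step the slide describes.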
16. 16 CAS Multiplier: A generalization of the Collection Reader
17. 17 Example: Inline Segmenter
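A CAS Multiplier generalizes the Collection Reader: instead of only acquiring documents from an external source at the head of the flow, a component in mid-flow may emit multiple new CASes from one input, as an inline segmenter does when it splits a large document into pieces for downstream analysis. A toy sketch of that idea (not the UIMA API):

```python
# Toy CAS multiplier: a segmenter takes one input "CAS" and yields
# one new CAS per segment, which flows on to downstream analysis.

def inline_segmenter(cas):
    # split on blank lines; each paragraph becomes its own CAS
    for segment in cas["text"].split("\n\n"):
        yield {"text": segment}

def tokenizer(cas):
    cas["tokens"] = cas["text"].split()
    return cas

big_cas = {"text": "First paragraph here.\n\nSecond paragraph here."}
segments = [tokenizer(c) for c in inline_segmenter(big_cas)]
print(len(segments))  # 2
```

A Collection Reader is then just the degenerate case: a multiplier with no input CAS that emits one CAS per document in the collection.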
18. 18
19. 19 UIMA Component Repository at CMU http://uima.lti.cs.cmu.edu/index.html
20. Semantic Search Overview Using UIMA to Advance Search and Discovery
21. 21
22. 22 UIMA Pipeline for Keyword Search
23. 23 UIMA Pipeline for Semantic Search
martin
<person> martin </person>
<organization> martin </organization>
<owner> <person> </person> <organization> martin </organization> </owner>
+<disease> </disease> <facility> </facility>
24. 24 We index and search over tokens AND the semantic annotations
martin
<person> martin </person>
<organization> martin </organization>
<owner> <person> </person> <organization> martin </organization> </owner>
+<disease> </disease> <facility> </facility>
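Indexing tokens together with their semantic annotations can be sketched as a tiny inverted index (illustrative only; the type-qualified term syntax below is invented for the sketch): each document contributes postings for its plain tokens and for type-qualified terms, so a query can constrain a keyword like "martin" to a particular semantic type.

```python
# Toy semantic index: postings map both plain tokens and
# type-qualified terms like "person:martin" to document ids.
from collections import defaultdict

def build_index(docs):
    postings = defaultdict(set)
    for doc_id, (tokens, annotations) in docs.items():
        for t in tokens:
            postings[t].add(doc_id)
        for label, t in annotations:
            postings[f"{label}:{t}"].add(doc_id)
    return postings

docs = {
    1: (["martin", "founded", "acme"], [("person", "martin")]),
    2: (["lockheed", "martin", "builds"], [("organization", "martin")]),
}
postings = build_index(docs)

print(sorted(postings["martin"]))         # [1, 2] -- both senses
print(sorted(postings["person:martin"]))  # [1] -- only the Person sense
```

A plain keyword query still works (worst case degrades to keyword search), while a typed query filters out the wrong senses, which is the recall/precision point the surrounding slides make.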
25. 25 Overlaying Semantics vs. Extracting Knowledge: Improve Search Recall
Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
26. 26 Billboard 27M hits on Yahoo
Billboard outdoor sign 206K hits on Yahoo
Sign 1B hits on Yahoo
Sign + Billboard = 5M hits on Yahoo
Outdoor + sign = 54M hits on Yahoo
27. 27 Concluding Remarks
Raising the Search Bar
Known-Item Search must evolve into Knowledge Gathering and Synthesis
Semantic Search can improve precision and recall
Graceful degradation: Worst-Case should be keyword search
Semantic Analysis is Key
Perfect, consistent or massive manual semantic annotation NOT likely
Automated annotation is essential
Many annotators must emerge
Must be easy to discover, combine, aggregate and deploy
UIMA is an enabling Integration Platform
Approximate Semantics
Universal semantic consensus won’t happen
But approximations can work to improve search applications
Improve precision, recall and density across artifact boundaries