A New Content Processing Framework for Search Applications Iain Fletcher


Agenda • Briefly About Search Technologies • Key Issues for Enterprise Search • A New Content Processing Framework for Search Applications • How do we use it? • What does it look like? • Use case example

Search Technologies overview • The leading IT services company focused on search engines • Consulting • Implementation • Managed services • Technology independent, working with most of the leading search engines • 90 staff, 250+ customers

Search Technologies overview Ascot, UK Boston, MA Cincinnati, OH Herndon, VA San Diego, CA San Jose, CR

Executive team # years in the search engine industry

Selected customers

A New Content Processing Framework for Search Applications

Enterprise Search - An Indifferent Reputation • Major surveys show that no progress has been made during the last 10 years • Searchers are successful in finding what they seek 50% of the time or less • 2001, IDC, “Quantifying Enterprise Search” • More than half cannot find the information they need using their Enterprise search system • 2011, MindMetre/SmartLogic, “Mind the Enterprise Search Gap”

Search Fundamentals

Metadata Supports Relevance Ranking

Metadata Supports Relevance Ranking Supported by great metadata! • Title • Meta description • URL • Inbound links • Alt tag text • Etc. • Provided for free by millions of SEO practitioners

Key Issues • Almost all modern search functions are driven by data structure

Key Issues • The majority of serious problems in serious search systems are caused by data quality issues Also... • “Big Data” and BI from unstructured data will face the same challenges • Can you trust an analysis if you are unsure of data provenance?

Data quality examples • The subscription portal caught out by template information • The Intranet search skewed by a new piece of hardware • The Intranet search where great quality was the problem!

Key Issues • Data structure and quality issues are addressed in the indexing pipelines of search engines • Cleaning, enriching, normalizing, granularizing... • It is about process as much as technology • And data constantly evolves • Sometimes the built-in indexing pipeline is not good enough (issues with scale, flexibility or transparency) • Some search engines don’t really have one • We’ve written our own
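
The cleaning, enriching and normalizing steps named above can be pictured as a chain of small document-to-document functions. The following is a minimal sketch under stated assumptions, not Aspire's actual API: the class name `IndexingPipelineSketch`, the field names (`title`, `language`, `body`, `wordCount`) and the three stages are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a document is a map of field names to values,
// and each pipeline stage is a function from document to document.
class IndexingPipelineSketch {
    private final List<UnaryOperator<Map<String, String>>> stages = new ArrayList<>();

    IndexingPipelineSketch add(UnaryOperator<Map<String, String>> stage) {
        stages.add(stage);
        return this;
    }

    // Runs every stage in order on a copy of the input document.
    Map<String, String> process(Map<String, String> doc) {
        Map<String, String> current = new HashMap<>(doc);
        for (UnaryOperator<Map<String, String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Cleaning: trim the title and collapse runs of whitespace.
    static Map<String, String> cleanTitle(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        out.put("title", out.getOrDefault("title", "").trim().replaceAll("\\s+", " "));
        return out;
    }

    // Normalizing: store the language code in lower case.
    static Map<String, String> normalizeLanguage(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        out.put("language", out.getOrDefault("language", "").trim().toLowerCase(Locale.ROOT));
        return out;
    }

    // Enriching: add a word count derived from the body text.
    static Map<String, String> addWordCount(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        String body = out.getOrDefault("body", "").trim();
        int words = body.isEmpty() ? 0 : body.split("\\s+").length;
        out.put("wordCount", Integer.toString(words));
        return out;
    }

    // An example pipeline wiring the three stages together.
    static IndexingPipelineSketch example() {
        return new IndexingPipelineSketch()
                .add(IndexingPipelineSketch::cleanTitle)
                .add(IndexingPipelineSketch::normalizeLanguage)
                .add(IndexingPipelineSketch::addWordCount);
    }
}
```

Because every stage has the same shape, stages can be reordered, removed or shared between pipelines without touching the others, which is the property the slides attribute to a component architecture.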

Document Processing Methodology for Search (DPMS) • The Philosophy • Understand the Document Model • Understand the User Model • Includes business-level requirements • Create the Search Engine Model • Search = the pivot point between User and Data • Document everything

DPMS – The Methodology [Diagram, three phases] • 1 Assessment: expert assessment and recommendations (Search Technologies Architect and Business Analyst); Assessment Report • 2 Detailed Analysis: DPMS Analysis (Knowledge Engineer, Business Analyst, etc.); Review (Architect, Domain Experts, Peers); DMDs • 3 Execution: Implementation (Developer); Validate DMDs; Aspire; Validation; Search Engine

DPMS – The Implementation

Introducing “Aspire” • Think of it as a stand-alone indexing pipeline with a framework + component architecture • Framework built for scalability, performance and flexibility – designed to use cloud elasticity • Components built to be autonomous and transparent

Technology Suite • 100% Java • OSGi™ See www.osgi.org • The Dynamic Module System for Java™ • Apache Felix • Open source implementation of OSGi • Jetty • Embedded HTTP server • Maven & Maven Repositories • For component deployment

Component Configuration • Any number of document processing pipelines can be used in an application • Disparate data sources will need different treatment • Components can be shared where appropriate • Configurations are easy to change

Component autonomy • Components communicate via XML • Each component has a known and transparent input and output, and can be tested in isolation • This simplifies problem diagnosis, promotes transparency and controls cost-of-ownership
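
To illustrate what "XML in, XML out" can mean for a single component, here is a minimal sketch that uses only the JDK's built-in XML APIs. It is not an Aspire component: the class name `TitleNormalizerComponentSketch` and the element names `doc`, `title` and `titleNormalized` are assumptions made for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical component: reads a document as XML, appends one derived
// element, and returns the result as XML. Nothing else is touched.
class TitleNormalizerComponentSketch {
    static String process(String inputXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(inputXml)));
            Element root = doc.getDocumentElement();

            // Known input: an optional <title> element under the root.
            NodeList titles = root.getElementsByTagName("title");
            String title = titles.getLength() > 0 ? titles.item(0).getTextContent() : "";

            // Known output: the same document plus <titleNormalized>.
            Element normalized = doc.createElement("titleNormalized");
            normalized.setTextContent(title.trim().toLowerCase(Locale.ROOT));
            root.appendChild(normalized);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process input XML", e);
        }
    }
}
```

Since the whole contract is a string of XML in and a string of XML out, a component like this can be exercised with a hand-written input document and no running pipeline, which is what testing in isolation amounts to.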

Data Quality Monitoring • Components have built-in quarantine systems to monitor data quality • Content is constantly evolving • This provides transparency and enables content issues to be diagnosed and resolved faster
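
The slides do not describe how the quarantine works internally. One plausible reading, sketched below under that assumption, is that a component sets aside documents that fail a check, records the reason, and exposes a rate that can be watched over time. The class name `QuarantineSketch`, the two checks and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-component quarantine: documents that fail
// a quality check are kept aside with a reason instead of being indexed.
class QuarantineSketch {
    static final class QuarantinedDoc {
        final Map<String, String> doc;
        final String reason;

        QuarantinedDoc(Map<String, String> doc, String reason) {
            this.doc = doc;
            this.reason = reason;
        }
    }

    private final List<Map<String, String>> accepted = new ArrayList<>();
    private final List<QuarantinedDoc> quarantined = new ArrayList<>();

    // Routes a document to the accepted list or to quarantine.
    void submit(Map<String, String> doc) {
        String reason = check(doc);
        if (reason == null) {
            accepted.add(doc);
        } else {
            quarantined.add(new QuarantinedDoc(doc, reason));
        }
    }

    // Returns null when the document passes, otherwise the first failure reason.
    static String check(Map<String, String> doc) {
        String title = doc.get("title");
        if (title == null || title.trim().isEmpty()) {
            return "missing title";
        }
        String body = doc.get("body");
        if (body == null || body.trim().isEmpty()) {
            return "empty body";
        }
        return null;
    }

    List<Map<String, String>> accepted() {
        return accepted;
    }

    List<QuarantinedDoc> quarantined() {
        return quarantined;
    }

    // Share of submitted documents that ended up in quarantine.
    double quarantineRate() {
        int total = accepted.size() + quarantined.size();
        return total == 0 ? 0.0 : (double) quarantined.size() / total;
    }
}
```

A rising quarantine rate after a source system changes its templates is the kind of signal that lets a content issue be spotted before it shows up as poor search results.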

The Component Library • Search Technologies maintains a library of components • Currently there are more than 70 • Components can be as simple as 3 lines of Groovy script, or as complex as third-party technologies • Many applications can be addressed using existing components + configuration

Component Upgrading • Components can be upgraded in-situ from a cloud-based service, without stopping/restarting the system • Helpful in the maintenance of complex or mission-critical systems

Component control • Every component has its own control / status page

A very simple example

Security expansion example

Patent Assignee Name Normalization
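
The slide content for this example did not survive in the transcript, so the rules used in the original are unknown. As a purely hypothetical illustration of what assignee name normalization can involve, the sketch below upper-cases the name, drops punctuation that varies between filings, and strips trailing legal-form suffixes so that variants of one company name collapse to a single key. The class name and the suffix list are assumptions.

```java
import java.util.Locale;

// Hypothetical sketch: reduce variants of a company name to one key.
// The suffix list is illustrative and far from complete.
class AssigneeNameNormalizerSketch {
    private static final String SUFFIXES =
            "(INCORPORATED|INC|CORPORATION|CORP|COMPANY|CO|LIMITED|LTD|LLC|GMBH|AG)";

    static String normalize(String raw) {
        String s = raw.toUpperCase(Locale.ROOT);
        s = s.replaceAll("[.,]", " ");        // punctuation differs between filings
        s = s.trim().replaceAll("\\s+", " "); // collapse whitespace

        // Strip trailing legal-form suffixes repeatedly, e.g. "CO LTD".
        String previous;
        do {
            previous = s;
            s = s.replaceAll("\\s+" + SUFFIXES + "$", "");
        } while (!s.equals(previous));
        return s;
    }
}
```

With these rules, "International Business Machines Corp." and "international business machines corporation" both normalize to the same key, so a navigator built on assignee can count them as one entity. A production system would also need alias tables for acquisitions and abbreviations, which simple string rules cannot cover.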

Complexity example • CPA Global Discover • The world’s leading patent research portal • 80 million patents from 95 patent offices • More than a dozen navigators built • Numerous graphical search results display options • Whole document comparison features

In Summary • Many applications today don’t need this level of diligence • But as data and data dynamism grow, more will • A stand-alone unstructured content processing system can serve multiple applications, and makes sense for some companies • Method. Diligence. Transparency – it’s not rocket science... • Applying this approach to enterprise search is a key part of moving user satisfaction forward during the next few years

Thank You! Iain Fletcher ifletcher@searchtechnologies.com • http://uk.linkedin.com/in/iainfletcher
