A New Content Processing Framework for Search Applications Iain Fletcher ifletcher@searchtechnologies.com
Agenda • Briefly About Search Technologies • Key Issues for Enterprise Search • A New Content Processing Framework for Search Applications • How do we use it? • What does it look like? • Use case example
Search Technologies overview • The leading IT services company focused on search engines • Consulting • Implementation • Managed services • Technology independent, working with most of the leading search engines • 90 staff, 250+ customers
Search Technologies overview • Ascot, UK • Boston, MA • Cincinnati, OH • Herndon, VA • San Diego, CA • San Jose, CR
Executive team • Number of years in the search engine industry
Enterprise Search - An Indifferent Reputation • Major surveys show that no progress has been made during the last 10 years • Searchers are successful in finding what they seek 50% of the time or less • 2001, IDC, “Quantifying Enterprise Search” • More than half cannot find the information they need using their Enterprise search system • 2011, MindMetre/SmartLogic, “Mind the Enterprise Search Gap”
Metadata Supports Relevance Ranking • Supported by great metadata! • Title • Meta description • URL • Inbound links • Alt tag text • Etc. • Provided for free by millions of SEO practitioners
Key Issues • Almost all modern search functions are driven by data structure
Key Issues • The majority of serious problems in serious search systems are caused by data quality issues • Also: “Big Data” and BI from unstructured data will face the same challenges • Can you trust an analysis if you are unsure of data provenance?
Data quality examples • The subscription portal caught out by template information • The Intranet search skewed by a new piece of hardware • The Intranet search where great quality was the problem!
Key Issues • Data structure and quality issues are addressed in the indexing pipelines of search engines • Cleaning, enriching, normalizing, granularizing... • It is about process as much as technology • And data constantly evolves • Sometimes the built-in indexing pipeline is not good enough (issues with scale, flexibility or transparency) • Some search engines don’t really have one • We’ve written our own
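To make the cleaning, normalizing and enriching steps concrete, here is a minimal Java sketch of an indexing pipeline as an ordered list of stages. Everything in it is an assumption made for illustration: the class name PipelineSketch, the field names (body, author, word_count) and the specific rules are hypothetical and are not taken from Aspire or from any search engine.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a document is a map of field name -> value,
// and each pipeline stage is a function from document to document.
class PipelineSketch {

    // Cleaning: replace markup-like tags with spaces and collapse whitespace.
    static Map<String, String> clean(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String body = out.getOrDefault("body", "");
        out.put("body", body.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
        return out;
    }

    // Normalizing: bring the author field to one consistent form.
    static Map<String, String> normalize(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String author = out.getOrDefault("author", "").trim().toLowerCase(Locale.ROOT);
        out.put("author", author);
        return out;
    }

    // Enriching: derive a new metadata field from existing content.
    static Map<String, String> enrich(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String body = out.getOrDefault("body", "");
        int words = body.isEmpty() ? 0 : body.split(" ").length;
        out.put("word_count", Integer.toString(words));
        return out;
    }

    // Run the stages in a fixed order; each stage sees the previous stage's output.
    static Map<String, String> process(Map<String, String> doc) {
        List<UnaryOperator<Map<String, String>>> stages =
                List.of(PipelineSketch::clean, PipelineSketch::normalize, PipelineSketch::enrich);
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("body", "<p>Patent   search</p> portal");
        doc.put("author", "  I. FLETCHER ");
        System.out.println(process(doc));
    }
}
```

Because each stage copies its input instead of modifying it, a stage can be run and inspected on its own, which is one simple way to keep a pipeline transparent.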
Document Processing Methodology for Search (DPMS) • The Philosophy • Understand the Document Model • Understand the User Model • Includes business-level requirements • Create the Search Engine Model • Search = the pivot point between User and Data • Document everything
DPMS – The Methodology • 1 Assessment: expert assessment and recommendations (Search Technologies Architect and Business Analyst) • Assessment Report • 2 Detailed Analysis: DPMS Analysis (Knowledge Engineer, Business Analyst, etc.) • Review (Architect, Domain Experts, Peers) • DMDs • 3 Execution: Implementation (Developer) • Validation • Validate DMDs • Aspire • Search Engine
Introducing “Aspire” • Think of it as a stand-alone indexing pipeline with a framework + component architecture • Framework built for scalability, performance and flexibility – designed to use cloud elasticity • Components built to be autonomous and transparent
Technology Suite • 100% Java • OSGi™ (see www.osgi.org) • The Dynamic Module System for Java™ • Apache Felix • Open source implementation of OSGi • Jetty • Embedded HTTP server • Maven & Maven Repositories • For component deployment
Component Configuration • Any number of document processing pipelines can be used in an application • Disparate data sources will need different treatment • Components can be shared where appropriate • Configurations are easy to change
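One way to picture several pipelines sharing components through configuration is the following minimal Java sketch. It is not Aspire's configuration format: the registry, the pipeline names (intranet, email) and the component names (stripTags, collapseWhitespace, dropSignature) are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: components are registered once by name, and each
// pipeline is configured as an ordered list of component names.
class PipelineConfigSketch {

    // Registry of reusable components.
    static final Map<String, UnaryOperator<String>> COMPONENTS = new LinkedHashMap<>();

    // Pipeline definitions; changing a pipeline means editing one list.
    static final Map<String, List<String>> PIPELINES = new LinkedHashMap<>();

    static {
        COMPONENTS.put("stripTags", s -> s.replaceAll("<[^>]*>", ""));
        COMPONENTS.put("collapseWhitespace", s -> s.replaceAll("\\s+", " ").trim());
        // Drops everything from a "-- " signature delimiter line to the end of the text.
        COMPONENTS.put("dropSignature", s -> s.replaceAll("(?s)\\n-- \\n.*$", ""));

        // Two data sources get different treatment but share collapseWhitespace.
        PIPELINES.put("intranet", List.of("stripTags", "collapseWhitespace"));
        PIPELINES.put("email", List.of("dropSignature", "collapseWhitespace"));
    }

    // Run the named pipeline over a piece of text.
    static String run(String pipeline, String text) {
        List<String> names = PIPELINES.get(pipeline);
        if (names == null) {
            throw new IllegalArgumentException("Unknown pipeline: " + pipeline);
        }
        String current = text;
        for (String name : names) {
            current = COMPONENTS.get(name).apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(run("intranet", "<h1>Holiday  policy</h1>"));
        System.out.println(run("email", "Hello   team\n-- \nIain"));
    }
}
```

Keeping the pipeline definitions as data, separate from the component code, is what makes a configuration easy to change without touching the components themselves.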
Component autonomy • Components communicate via XML • Each component has a known and transparent input and output, and can be tested in isolation • This simplifies problem diagnosis, promotes transparency and controls cost-of-ownership
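The idea of a component with a known XML input and output that can be tested in isolation can be sketched in Java as below, using only the standard JDK XML APIs. The document shape (a doc element with a title child) and the component itself (addTitleLength) are assumptions made for this example; the slides do not describe Aspire's actual XML format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: a component takes an XML string and returns an XML
// string, so it can be exercised on its own with a hand-written input.
class XmlComponentSketch {

    // Adds a <titleLength> element holding the length of the trimmed title.
    static String addTitleLength(String inputXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(inputXml)));
            Element root = doc.getDocumentElement();

            // A missing title is treated as an empty title.
            NodeList titles = root.getElementsByTagName("title");
            String title = titles.getLength() > 0
                    ? titles.item(0).getTextContent().trim()
                    : "";

            Element length = doc.createElement("titleLength");
            length.setTextContent(Integer.toString(title.length()));
            root.appendChild(length);

            // Serialize the modified DOM back to a string without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Component failed on input: " + inputXml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addTitleLength("<doc><title>Enterprise Search</title></doc>"));
    }
}
```

Since the component's whole contract is the XML it accepts and the XML it returns, a problem can be reproduced by replaying a single input document through a single component.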
Data Quality Monitoring • Components have built-in quarantine systems to monitor data quality • Content is constantly evolving • This provides transparency and enables content issues to be diagnosed and resolved faster
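A quarantine can be pictured as a holding area for documents that fail a component's checks, kept together with the reason, so that processing continues and the failures remain visible. The Java sketch below is illustrative only: the class QuarantineSketch, the title and date checks, and the quarantine-rate metric are hypothetical and do not describe how Aspire implements its quarantine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: documents that fail validation are set aside with a
// reason instead of stopping the pipeline or being silently indexed.
class QuarantineSketch {

    // A quarantined document together with the reason it was held back.
    static final class Held {
        final Map<String, String> doc;
        final String reason;

        Held(Map<String, String> doc, String reason) {
            this.doc = doc;
            this.reason = reason;
        }
    }

    final List<Map<String, String>> accepted = new ArrayList<>();
    final List<Held> quarantine = new ArrayList<>();

    // Returns null when the document passes, otherwise a human-readable reason.
    static String check(Map<String, String> doc) {
        String title = doc.get("title");
        if (title == null || title.trim().isEmpty()) {
            return "missing title";
        }
        String date = doc.get("date");
        if (date == null || !date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            return "bad date: " + date;
        }
        return null;
    }

    // Route each document to the accepted list or to quarantine.
    void submit(Map<String, String> doc) {
        String reason = check(doc);
        if (reason == null) {
            accepted.add(doc);
        } else {
            quarantine.add(new Held(doc, reason));
        }
    }

    // Share of submitted documents that were quarantined: a simple quality signal.
    double quarantineRate() {
        int total = accepted.size() + quarantine.size();
        return total == 0 ? 0.0 : (double) quarantine.size() / total;
    }

    public static void main(String[] args) {
        QuarantineSketch q = new QuarantineSketch();
        q.submit(Map.of("title", "Patent search", "date", "2012-05-01"));
        q.submit(Map.of("title", "", "date", "2012-05-01"));
        q.submit(Map.of("title", "Intranet news", "date", "01/05/2012"));
        for (Held held : q.quarantine) {
            System.out.println(held.reason);
        }
        System.out.println(q.quarantineRate());
    }
}
```

Watching the quarantine rate over time is one way to notice when evolving content starts to break assumptions that a component was built on.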
The Component Library • Search Technologies maintains a library of components • Currently there are more than 70 • Components can be as simple as 3 lines of Groovy script, or as complex as third-party technologies • Many applications can be addressed using existing components + configuration
Component Upgrading • Components can be upgraded in-situ from a cloud-based service, without stopping/restarting the system • Helpful in the maintenance of complex or mission-critical systems
Component control • Every component has its own control / status page
Complexity example • CPA Global Discover • The world’s leading patent research portal • 80 million patents from 95 patent offices • More than a dozen navigators built • Numerous graphical search results display options • Whole document comparison features
In Summary • Many applications today don’t need this level of diligence • But as data and data dynamism grow, more will • A stand-alone unstructured content processing system can serve multiple applications, and makes sense for some companies • Method. Diligence. Transparency – it’s not rocket science... • Applying this approach to enterprise search is a key part of moving user satisfaction forward during the next few years
Thank You! Iain Fletcher ifletcher@searchtechnologies.com • http://uk.linkedin.com/in/iainfletcher