Explore the essential practices for integrating text into a traditional IT environment. Learn about key considerations, common misconceptions, application analysis, and evolving technology stacks in this comprehensive guide. Understand the challenges and solutions of text indexing and search in enterprise settings.
Integrating Text Into an Enterprise IT Environment
February 25, 2003
Curt Monash, Ph.D.
President, Monash Information Services
curtmonash@monash.com
www.monash.com
Agenda for this talk • How text indexing and search work – and what they assume • Fitting text into a traditional IT context • Sorting out your text application needs • Key considerations in text application architecture
There are no miracles or magic bullets • “Search engines” aren’t the answer • “Content management” isn’t the answer • Clustering isn’t the answer • XML isn’t the answer No one technology solves all search problems
Gresham’s Law of Coinage Bad (i.e., debased) coinage drives out good
Monash’s Law of Jargon Bad uses of (recently coined) jargon drive out good ones • Example: “Content management” can mean almost anything
Best practices for text apps are the same as for any other major IT challenge • Understand your application needs • Use safely proven technology where you can • Push the boundaries of technology where you must • Ask your users to make small changes in the way they do their jobs
Key takeaways • The classical “technology stack” is evolving nicely to accommodate text • Standalone search-in-a-box doesn’t solve very many problems • Careful application analysis is crucial • It’s not just data design and workflow • Security needs to be designed in
Part 1 How text indexing and search work – and what they assume
Different application contexts • Different kinds of problems • Different available resources
Recall vs. Precision • Recall = What percentage of the valid hits did you get? • Crucial if you actually need 100% • Precision = What percentage of the (top) hits returned really are valid? • Important for user satisfaction and efficiency • But how is “valid” measured???
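To make the two measures concrete, here is a minimal sketch (the document IDs and relevance judgments are invented) computing both for one query:

```python
# Minimal recall/precision example with made-up document IDs.
relevant = {"doc1", "doc2", "doc3", "doc4"}   # every valid hit in the corpus
returned = ["doc2", "doc9", "doc1", "doc7"]   # what the engine returned, ranked

hits = [d for d in returned if d in relevant]
recall = len(hits) / len(relevant)     # found 2 of 4 valid docs -> 0.50
precision = len(hits) / len(returned)  # 2 of 4 returned docs are valid -> 0.50

print(f"recall={recall:.2f} precision={precision:.2f}")
```

Note that both numbers depend entirely on the `relevant` set, which is exactly the "but how is 'valid' measured" problem above.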
Three fundamentally different scenarios • Article search • Web search • OLTP application text search
Article search • Very high recall may be needed • Metadata may be reliable • Document style and structure may be predictable This is the “traditional” information retrieval challenge
Successful only in clear-cut research markets • Legal – Lexis • Investments – simple-minded apps where stock symbols are the perfect keyword • Intelligence community? • Business “competitive intelligence” • Scientific/medical
The “Daily Me” hasn’t arrived yet • How well does the user understand information retrieval? • Who has time to read anyway? • Failures include Newsedge, Northern Light, et al. • “Personalized” portals are wimpy, and nobody seems to care
Web search • Precision is usually a bigger problem than recall (300,000 hits!) • Metadata is unreliable (no standards, deliberate deception) • Style and structure are enormously varied
Users like Google – but how are they using it? • What they’re finding is good web sites • They still have to navigate to the specific page
OLTP app text search • 100% precision is assumed for the overall app … • … so text search had better not be the only way to find documents • The relational record probably is the metadata • Hot future area • Usage is creeping up • Functionality is still primitive • App dev tools are improving dramatically, albeit from a dismal starting point
Lessons from Amazon.com • Search-based navigation can work • The user needs a clear understanding of what s/he is looking for • If you make an imprecise query, you have to accept an imprecise result set
It all starts with word search • Big, specialized inverted-list index • Huge but sparse • Analogous to bit-map indexes or star schemas • Digrams/trigrams/n-grams, offsets, stopwords • Fortunately, integration into RDBMSs has been largely solved
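A minimal sketch of the inverted-list structure described above, with word offsets and a stopword list (the whitespace tokenizer and tiny stopword set are simplifications):

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "in"}

def build_index(docs):
    """Map each word to a sparse posting list of (doc_id, offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            if word not in STOPWORDS:
                index[word].append((doc_id, offset))
    return index

docs = {1: "the quick brown fox", 2: "the lazy brown dog"}
index = build_index(docs)
print(index["brown"])  # [(1, 2), (2, 2)] -- sparse: most words miss most docs
```

The stored offsets are what make phrase and proximity queries possible without rescanning the documents.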
The ranking problem • What does 75% relevance mean? • How do you combine rankings from different subsystems? • The SAME query against the SAME data can give different results in different search engines • The SAME query against the SAME search engine can give different results if you add irrelevant data
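One naive way to combine subsystem rankings is to normalize each score list into [0, 1] and take a weighted sum; the sketch below (the subsystems, weights, and scores are all invented) also shows why results are unstable:

```python
def normalize(scores):
    """Scale raw scores into [0, 1]; each subsystem uses its own raw scale."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def combine(subsystem_scores, weights):
    """Weighted sum of normalized scores, highest combined score first."""
    merged = {}
    for scores, w in zip(map(normalize, subsystem_scores), weights):
        for doc, s in scores.items():
            merged[doc] = merged.get(doc, 0.0) + w * s
    return sorted(merged, key=merged.get, reverse=True)

keyword = {"a": 12.0, "b": 7.5, "c": 2.0}   # e.g., a tf-idf-style score
metadata = {"a": 0.2, "b": 0.9, "c": 0.8}   # e.g., a category-match score
print(combine([keyword, metadata], weights=[0.5, 0.5]))  # ['b', 'a', 'c']
```

Because normalization depends on each subsystem's score range, adding an irrelevant document can shift the merged order – exactly the instability described above.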
Major issues for (key)word search • Ambiguity • Vagueness • Information overload
Major tools • Traditional linguistic techniques • “Automagic” clustering • Traditional metadata • Socioheuristics
Traditional linguistic techniques • Synonyms and other semantic clues • Topic sentences and other syntactic clues • Standard document structure
Query translation/expansion • Thesaurus • End-user extensible • Spelling correction • Traditional (e.g., drop the vowels) • Modern (e.g., compare to query logs)
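A minimal sketch of both techniques, using a hypothetical thesaurus and query log (real products ship far larger ones); the "modern" correction simply snaps a word to the closest term seen in past queries:

```python
import difflib

# Hypothetical thesaurus and query-log tables.
THESAURUS = {"car": ["auto", "automobile"], "fix": ["repair"]}
QUERY_LOG = ["car repair", "auto loan", "engine repair"]

def expand(query):
    """Expand each query term with its synonyms (OR semantics per term)."""
    return [[w] + THESAURUS.get(w, []) for w in query.lower().split()]

def correct(word, vocabulary):
    """Snap a misspelled word to the closest term from real past queries."""
    match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return match[0] if match else word

vocab = {w for q in QUERY_LOG for w in q.split()}
print(expand("car fix"))        # [['car', 'auto', 'automobile'], ['fix', 'repair']]
print(correct("repar", vocab))  # 'repair'
```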
Automagic clustering and information discovery • Nice mathematical buzzwords • Bayesian statistics, etc. • It all boils down to “distance” measured in a very high-dimensional vector space • Nice social science buzzwords too • Semiotics, etc. • Same appeal as neural networks • The computer “discovers” what humans can’t
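The "distance" in question is often just cosine similarity between word-count vectors, with clustering then grouping documents whose pairwise distances are small; a minimal sketch with invented snippets:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words vectors (higher = closer)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = Counter("the engine overheats under heavy load".split())
d2 = Counter("engine overheats when load is heavy".split())
d3 = Counter("quarterly revenue grew nine percent".split())
print(cosine(d1, d2), cosine(d1, d3))  # near each other vs. no overlap at all
```

Each distinct word is one dimension, so a real corpus yields a vector space with tens or hundreds of thousands of dimensions – hence "very high-dimensional."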
Clustering technology isn’t sufficiently advanced yet to be “magic” • Same weaknesses as neural networks too • Lack of reliability • Lack of transparency • Lack of predictability! • Legacy of failure • Search engines: Excite, Northern Light • “Employee Internet Management” (i.e., porn/gambling filter) companies
Traditional metadata • Typically supplied by the author/editor, or by a librarian • Keywords, etc. • Who/What/Where/When
Socioheuristics • Measures of page popularity • Guesses at author expertise
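A popularity measure of this kind can be as simple as iterating link endorsements until they stabilize, in the spirit of (though far cruder than) PageRank; the link graph below is invented:

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def popularity(links, damping=0.85, iterations=50):
    """Iteratively share each page's score across its outbound links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outbound in links.items():
            for target in outbound:
                new[target] += damping * rank[page] / len(outbound)
        rank = new
    return rank

print(sorted(popularity(LINKS).items(), key=lambda kv: -kv[1]))  # "c" wins
```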
Sorting through the metadata Since unaided “search” often works badly, metadata is crucial
So what is metadata in a search context? • Standard definition of “metadata”: Data about data • Actually, relational metadata usually is data about data structures • But in the text world, metadata usually is data about the data itself
Categories of text metadata • Library-like • Extracted from the document • Implicit in the corpus • OLTP-like
Classical document metadata • Comes from the library tradition (i.e., card catalogs) … • … and/or from early online document stores used by librarians • Examples: • Title, author, date, etc. • Hand-selected classification/categorization • Hand-selected keywords • Can be created by author, editor, “librarian”
Extracted metadata • In essence, precomputed text search • Examples: • Key words (or keywords) and concepts • Titles and metatags • Topic sentences, summaries • Author, etc.
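In that spirit, extraction can be as crude as pulling a title line, a topic-sentence summary, and the most frequent non-stopword terms; a sketch whose heuristics are invented, not any product's actual method:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "that"}

def extract_metadata(text):
    """Crude 'precomputed search': title, summary, and top keywords."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    title = lines[0]                          # first nonblank line as title
    body = " ".join(lines[1:])
    summary = body.split(". ")[0]             # topic-sentence heuristic
    words = [w.strip(".,").lower() for w in body.split()]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {"title": title, "summary": summary,
            "keywords": [w for w, _ in counts.most_common(3)]}

doc = "Pump Maintenance Guide\nCheck the seals monthly. Replace seals that leak."
print(extract_metadata(doc))
```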
Implicit metadata – location, location, location • Where on the net is the document? • Judge a document by its neighbors • Major problem – unstable net topology • URL patterns can’t be relied on, unfortunately • Google’s original algorithms were based on analyzing linking behavior on the public WWW
Automatic metadata in “traditional” OLTP apps • Examples: comment fields in apps such as CRM/call reports, maintenance/damage reports, and web feedback forms • Limited more by application designers’ imagination than by the data itself
Part 2 Fitting text into a traditional IT context
Benefits of storage in standard DBMS • System management (e.g., backup, failover) • Standard programming languages/APIs • Security!!
Old objections to DBMS-based storage are invalid • Performance – proprietary systems can’t index email in real time either • Specialized functionality – the DBMSs have long feature lists too
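As one concrete illustration of text living inside a standard DBMS, SQLite's FTS5 extension (assuming your Python build includes it, as most modern builds do) lets a full-text MATCH sit alongside an ordinary relational predicate in one query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Relational table holds the structured record; the FTS table holds the text.
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, dept TEXT)")
conn.execute("CREATE VIRTUAL TABLE report_text USING fts5(body)")

conn.execute("INSERT INTO reports VALUES (1, 'maintenance')")
conn.execute("INSERT INTO report_text(rowid, body) VALUES (1, 'pump seal leaking badly')")
conn.execute("INSERT INTO reports VALUES (2, 'sales')")
conn.execute("INSERT INTO report_text(rowid, body) VALUES (2, 'pump order shipped')")

# One query mixes a relational filter with a ranked text search.
rows = conn.execute("""
    SELECT reports.id, reports.dept
    FROM reports JOIN report_text ON report_text.rowid = reports.id
    WHERE reports.dept = 'maintenance' AND report_text MATCH 'pump'
    ORDER BY rank
""").fetchall()
print(rows)  # [(1, 'maintenance')]
```

Here the relational row really is the metadata: the text search is scoped by department, and the same backup, API, and security machinery covers both kinds of data.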
All enterprise data architectures are supported • Central everything • Central index, distributed storage • Distributed/federated everything
Application development technology and tools are just emerging • SQL/MM • Search controls, etc. • Emerging XML-centric technology • Customizable “content management” systems
Canned text apps are a mixed bag • Document management for regulatory filings • Information discovery • Generic search
Part 3 Sorting out your application needs
Different applications have very different profiles • Precision/recall of result • Quality of input • Security
Basic application types, Group 1 – the fuzzies • Portal (e.g., self-service HR) • Best case for generic WWW-like search • Notes/Exchange/Email • It’s not clear what functionality is really needed • Active area of research/development • Information discovery
Basic application types, Group 2 – OLTP • Heavy-duty transaction processing (ERP, supply chain, etc.) • Search is tangential • Direct touch CRM • Basic search is underutilized but gaining ground • Online sales/marketing (very different in different industries) • Search part of the app unlikely to be very demanding … • … except from a security standpoint
Basic application types, Group 3 – Heavy-duty analytic aids • BI/CPM/Analytic apps • Great for taming the numerical part of the information tangle • Text search is largely irrelevant • Product lifecycle management (engineering-centric) • Text is an afterthought • Product lifecycle management (regulatory-centric) • Documentum et al. offer “compliance” solutions • Online maintenance manuals • This is a biggie for text!!
Part 4 Key considerations in text application architecture
Five big issues • Database integration • Realistic options for document metadata • Document stylistic consistency (local) • Quality-of-search application requirements • Security
Text database integration vs. relational database integration • Remote indexing is an option • Data cleaning and consistency issues are different • Performance issues are different • Everything is a little more primitive
Document metadata – consider the source • Author/editor – can’t be relied on • Implicit metadata – great if you trust your policies/procedures • Extracted metadata – same strengths/weaknesses as general text search • From a relational OLTP app – nice if you have it