Integrating Text Into an Enterprise IT Environment February 25, 2003. Curt Monash, Ph.D. President Monash Information Services curtmonash@monash.com www.monash.com. Agenda for this talk. How text indexing and search work – and what they assume Fitting text into a traditional IT context
Integrating Text Into an Enterprise IT Environment February 25, 2003 Curt Monash, Ph.D. President Monash Information Services curtmonash@monash.com www.monash.com
Agenda for this talk • How text indexing and search work – and what they assume • Fitting text into a traditional IT context • Sorting out your text application needs • Key considerations in text application architecture
There are no miracles or magic bullets • “Search engines” aren’t the answer • “Content management” isn’t the answer • Clustering isn’t the answer • XML isn’t the answer No one technology solves all search problems
Gresham’s Law of Coinage Bad (i.e., debased) coinage drives out good
Monash’s Law of Jargon Bad uses of (recently coined) jargon drive out good ones • Example: “Content management” can mean almost anything
Best practices for text apps are the same as for any other major IT challenge • Understand your application needs • Use safely proven technology where you can • Push the boundaries of technology where you must • Ask your users to make small changes in the way they do their jobs
Key takeaways • The classical “technology stack” is evolving nicely to accommodate text • Standalone search-in-a-box doesn’t solve very many problems • Careful application analysis is crucial • It’s not just data design and workflow • Security needs to be designed in
Part 1 How text indexing and search work – and what they assume
Different application contexts • Different kinds of problems • Different available resources
Recall vs. Precision • Recall = What percentage of the valid hits did you get? • Crucial if you actually need 100% • Precision = What percentage of the (top) hits returned really are valid? • Important for user satisfaction and efficiency • But how is “valid” measured???
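The two definitions above can be made concrete with a minimal sketch. It assumes that "valid" has already been decided by a human judge for each document, which is exactly the hard part the slide questions. The function name `precision_recall` and the document ids are illustrative only.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: list of document ids the engine returned (the top hits)
    relevant: set of document ids a human judged valid for the query
    """
    returned_set = set(returned)
    true_hits = returned_set & relevant
    # Precision: share of the returned hits that really are valid.
    precision = len(true_hits) / len(returned_set) if returned_set else 0.0
    # Recall: share of all valid documents that were returned.
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents are valid, the engine returns 5 hits.
relevant_docs = {"d1", "d2", "d3", "d4"}
top_hits = ["d1", "d2", "d7", "d8", "d9"]
precision, recall = precision_recall(top_hits, relevant_docs)
print(precision, recall)
```

Here two of the five returned hits are valid (precision 0.4) and two of the four valid documents were found (recall 0.5). Returning more hits can only raise recall, but it usually lowers precision.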
Three fundamentally different scenarios • Article search • Web search • OLTP application text search
Article search • Very high recall may be needed • Metadata may be reliable • Document style and structure may be predictable This is the “traditional” information retrieval challenge
Successful only in clear-cut research markets • Legal – Lexis • Investments • Simple-minded apps • Stock symbols are the perfect keyword • Intelligence community? • Business “competitive intelligence” • Scientific/medical
The “Daily Me” hasn’t arrived yet • How well does the user understand information retrieval? • Who has time to read anyway? • Failures include Newsedge, Northern Light, et al. • “Personalized” portals are wimpy, and nobody seems to care
Web search • Precision is usually a bigger problem than recall (300,000 hits!) • Metadata is unreliable (no standards, deliberate deception) • Style and structure are enormously varied
Users like Google. But how are they using it? • What they’re finding is good web sites • They still have to navigate to the specific page
OLTP app text search • 100% precision is assumed for the overall app … • … so text search had better not be the only way to find documents • The relational record probably is the metadata • Hot future area • Usage is creeping up • Functionality is still primitive • App dev tools are improving dramatically, albeit from a dismal starting point
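As a rough sketch of "the relational record is the metadata", the example below narrows the candidate set with ordinary relational columns and only then applies a text condition to a comment field. The `call_report` table, its columns, and the sample rows are hypothetical, and a plain SQL `LIKE` substring match stands in for whatever text index a real DBMS would provide.

```python
import sqlite3

# In-memory database with a hypothetical CRM call-report table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_report ("
    "id INTEGER PRIMARY KEY, customer TEXT, report_date TEXT, comment TEXT)"
)
conn.executemany(
    "INSERT INTO call_report (customer, report_date, comment) VALUES (?, ?, ?)",
    [
        ("Acme", "2003-01-10", "Customer reports pump seal leaking again"),
        ("Acme", "2002-06-02", "Routine visit, no issues"),
        ("Globex", "2003-01-15", "Seal replaced after leak"),
    ],
)


def find_reports(customer, since, keyword):
    """Filter on relational columns first, then on the free-text comment."""
    rows = conn.execute(
        "SELECT id, comment FROM call_report "
        "WHERE customer = ? AND report_date >= ? AND comment LIKE ? "
        "ORDER BY report_date",
        (customer, since, f"%{keyword}%"),
    )
    return rows.fetchall()


acme_hits = find_reports("Acme", "2003-01-01", "seal")
print(acme_hits)
```

The customer and date columns give exact answers, so the text condition is only a refinement and never the sole way to reach a record.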
Lessons from Amazon.com • Search-based navigation can work • The user needs a clear understanding of what s/he is looking for • If you make an imprecise query, you have to accept an imprecise result set
It all starts with word search • Big, specialized inverted-list index • Huge but sparse • Analogous to bit-map or star schema • Digrams/trigrams/n-grams, offsets, stopwords • Fortunately, integration into RDBMS has been largely solved
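The index described above can be sketched in a few lines. This is a toy illustration under simple assumptions (whitespace-style tokenizing, a tiny stopword list, everything in memory), not how any particular product stores its index. All function names are illustrative.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each non-stopword to {doc_id: [word offsets in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(tokenize(text)):
            if word in STOPWORDS:
                continue  # stopwords are skipped but still consume an offset
            index[word].setdefault(doc_id, []).append(offset)
    return index


def trigrams(word):
    """Character trigrams of a word, usable for fuzzy or substring matching."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def search_all(index, query):
    """Return ids of documents containing every non-stopword query term."""
    terms = [t for t in tokenize(query) if t not in STOPWORDS]
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)


docs = {
    1: "The pump seal is leaking",
    2: "Replace the seal and test the pump",
    3: "Invoice for the pump",
}
index = build_inverted_index(docs)
print(search_all(index, "pump seal"))
print(index["seal"])
print(trigrams("seal"))
```

Each word points only to the few documents that contain it, which is why the structure is huge in vocabulary but sparse per entry. Keeping the offsets is what makes phrase and proximity search possible later.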
The ranking problem • What does 75% relevance mean? • How do you combine rankings from different subsystems? • The SAME query against the SAME data can give different results in different search engines • The SAME query against the SAME search engine can give different results if you add irrelevant data
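The last point can be reproduced with a small sketch. It assumes a plain TF-IDF score (term count times ln(N / document frequency)), which is one common textbook scheme and not necessarily what any given engine uses. Because N is the size of the whole collection, adding documents that never mention the query terms still changes every score.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank doc ids by a plain TF-IDF score: sum of tf * ln(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = sum(
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in query.lower().split()
            if doc_freq[term]
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


corpus = {"A": "text text text", "B": "search", "C": "text", "D": "text"}
before = tfidf_rank(corpus, "text search")[:2]

# Add 20 hypothetical documents that contain neither query term.
bigger = dict(corpus)
bigger.update({f"x{i}": "quarterly invoice" for i in range(20)})
after = tfidf_rank(bigger, "text search")[:2]

print(before, after)
```

In the small collection "search" is rare and "text" is common, so B outranks A. After the irrelevant documents arrive both terms look rare, and A's three occurrences of "text" push it ahead of B.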
Major issues for (key)word search • Ambiguity • Vagueness • Information overload
Major tools • Traditional linguistic techniques • “Automagic” clustering • Traditional metadata • Socioheuristics
Traditional linguistic techniques • Synonyms and other semantic clues • Topic sentences and other syntactic clues • Standard document structure
Query translation/expansion • Thesaurus • End-user extensible • Spelling correction • Traditional (e.g., drop the vowels) • Modern (e.g., compare to query logs)
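A minimal sketch of the three techniques on this slide follows. The thesaurus entries, vocabulary, and query-log counts are made up for illustration, and the log-based corrector uses `difflib.get_close_matches` from the Python standard library as a stand-in for a real similarity measure.

```python
import difflib

# Hypothetical thesaurus; an end user could add entries to this dict.
THESAURUS = {"car": ["automobile", "vehicle"]}


def expand(term):
    """Query expansion: the original term plus its thesaurus synonyms."""
    return [term] + THESAURUS.get(term, [])


def skeleton(word):
    """Traditional spelling key: drop the vowels after the first letter."""
    word = word.lower()
    return word[:1] + "".join(c for c in word[1:] if c not in "aeiou")


def correct_by_skeleton(word, vocabulary):
    """Return the first vocabulary word with the same consonant skeleton."""
    matches = [v for v in vocabulary if skeleton(v) == skeleton(word)]
    return matches[0] if matches else word


def correct_by_query_log(word, query_log_counts):
    """Modern approach: prefer the most frequently logged similar query."""
    close = difflib.get_close_matches(
        word, list(query_log_counts), n=5, cutoff=0.75
    )
    return max(close, key=query_log_counts.get) if close else word


query_log = {"maintenance": 120, "maintain": 15, "manual": 40}
print(expand("car"))
print(correct_by_skeleton("recieve", ["invoice", "receive"]))
print(correct_by_query_log("maintenence", query_log))
```

The skeleton approach needs only a word list, while the query-log approach needs usage data but adapts to the vocabulary people actually type.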
Automagic clustering and information discovery • Nice mathematical buzzwords • Bayesian statistics, etc. • It all boils down to “distance” measured in a very high-dimensional vector space • Nice social science buzzwords too • Semiotics, etc. • Same appeal as neural networks • The computer “discovers” what humans can’t
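To illustrate what "distance in a very high-dimensional vector space" means, the sketch below treats each distinct word as one dimension and measures cosine distance between documents. Cosine distance over raw word counts is an assumption chosen for simplicity; commercial clustering products use their own, often undisclosed, measures.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared words."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


doc1 = vectorize("pump seal leak")
doc2 = vectorize("seal leak repair")
doc3 = vectorize("stock price earnings")

d12 = cosine_distance(doc1, doc2)
d13 = cosine_distance(doc1, doc3)
print(d12, d13)
```

The two maintenance notes share two of their three words and end up close together, while the finance note shares none and sits at the maximum distance. A clustering algorithm groups documents by repeating this kind of comparison across the whole collection.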
Clustering technology isn’t sufficiently advanced yet to be “magic” • Same weaknesses as neural networks too • Lack of reliability • Lack of transparency • Lack of predictability! • Legacy of failure • Search engines: Excite, Northern Light • “Employee Internet Management” (i.e., porn/gambling filter) companies
Traditional metadata • Typically supplied by the author/editor, or by a librarian • Keywords, etc. • Who/What/Where/When
Socioheuristics • Measures of page popularity • Guesses at author expertise
Sorting through the metadata Since unaided “search” often works badly, metadata is crucial
So what is metadata in a search context? • Standard definition of “metadata”: Data about data • Actually, relational metadata usually is data about data structures • But in the text world, metadata usually is data about the data itself
Categories of text metadata • Library-like • Extracted from the document • Implicit in the corpus • OLTP-like
Classical document metadata • Comes from the library tradition (i.e., card catalogs) … • … and/or from early online document stores used by librarians • Examples: • Title, author, date, etc. • Hand-selected classification/categorization • Hand-selected keywords • Can be created by author, editor, “librarian”
Extracted metadata • In essence, precomputed text search • Examples: • Key words (or keywords) and concepts • Titles and metatags • Topic sentences, summaries • Author, etc.
Implicit metadata – location, location, location • Where on the net is the document? • Judge a document by its neighbors • Major problem – unstable net topography • URL patterns can’t be relied on, unfortunately • Google’s original algorithms were based on behavior analysis on the public WWW
Automatic metadata in “traditional” OLTP apps • Examples • Comment fields in apps such as • CRM/call report • Maintenance/damage report • Web feedback forms • Limited more by application imagination than by the data itself
Part 2 Fitting text into a traditional IT context
Benefits of storage in standard DBMS • System management (e.g., backup, failover) • Standard programming languages/APIs • Security!!
Old objections to DBMS-based storage are invalid • Performance – proprietary systems can’t index email in real time either • Specialized functionality – DBMSs have long feature lists too
All enterprise data architectures are supported • Central everything • Central index, distributed storage • Distributed/federated everything
Application development technology and tools are just emerging • SQL/MM • Search controls, etc. • Emerging XML-centric technology • Customizable “content management” systems
Canned text apps are a mixed bag • Document management for regulatory filings • Information discovery • Generic search
Part 3 Sorting out your application needs
Different applications have very different profiles • Precision/recall of result • Quality of input • Security
Basic application types, Group 1 – the fuzzies • Portal (e.g., self-service HR) • Best case for generic WWW-like search • Notes/Exchange/Email • Not clear what the real functionality needed is • Active area of research/development • Information discovery
Basic application types, Group 2 – OLTP • Heavy-duty transaction processing (ERP, supply chain, etc.) • Search is tangential • Direct touch CRM • Basic search is underutilized but gaining ground • Online sales/marketing (very different in different industries) • Search part of the app unlikely to be very demanding … • … except from a security standpoint
Basic application types, Group 3 – Heavy-duty analytic aids • BI/CPM/Analytic apps • Great for taming the numerical part of the information tangle • Text search is largely irrelevant • Product lifecycle management (engineering-centric) • Text is an afterthought • Product lifecycle management (regulatory-centric) • Documentum et al. offer “compliance” solutions • Online maintenance manuals • This is a biggie for text!!
Part 4 Key considerations in text application architecture
Five big issues • Database integration • Realistic options for document metadata • Document stylistic consistency (local) • Quality-of-search application requirements • Security
Text database integration vs. relational database integration • Remote indexing is an option • Data cleaning and consistency issues are different • Performance issues are different • Everything is a little more primitive
Document metadata – consider the source • Author/editor – can’t be relied on • Implicit metadata – great if you trust your policies/procedures • Extracted metadata – same strengths/weaknesses as general text search • From a relational OLTP app – nice if you have it