EP118 What are you Searching For?. Dmitry Chernizer Enterprise Systems Architect Dcherniz@sybase.com. In My generation… Concept Search Module Content Search Module Sample Configurations Summary. Agenda. Back in my day, people used to walk to their information. Up Hill… Both Ways!
EP118 What are you Searching For? • Dmitry Chernizer • Enterprise Systems Architect • Dcherniz@sybase.com
In My generation… Concept Search Module Content Search Module Sample Configurations Summary Agenda
Back in my day, people used to walk to their information. Uphill… Both Ways! Now everything is at your fingertips. www.Research.com www.StockAdvice.com www.NewJob.com Billions of pages of text, HTML, XML Most of them useless and outdated information It’s not about the 20 docs you have It’s about the 5 pages you need! The e-Volution of Unstructured Data
Hierarchical The e-Volution of Unstructured Data How did we get here? Word Crash! Text Relational PDF HTML
A way to store non-relational (maybe hierarchical) data A standard way to express complex relationships Gather data Process & store it Query & display it Ability to assign a life-cycle to a piece of information Why you ask? Because your brain works that way The e-Volution of Unstructured Data What is knowledge management?
What you need to know… • Less than Einstein • More than this guy… • Two kinds of Search Engines • Concept Based Search • Content Based Search
What you need to know… Concept Based Search Deals with processing unstructured requests Content Based Search Deals with processing structured requests
Concept Based Search Why and Where does it fit in? Personalization Content Management Continuous Availability Integration Security
The Purpose The Process Bayesian Inference Shannon’s Information Theory Adaptive Probabilistic Concept Modeling Dynamic Reasoning Engine Examples Concept Search Engine Basics Note: The Sybase Concept Search Engine uses embedded technology from Autonomy
Automate the process of getting the right information to the right person Improve the efficiency of information retrieval Enable the dynamic personalization of digital content Natural language content search and retrieval Automatic categorization by an agent Automatic Content Personalization Concept Search Engine Basics: The Purpose…
Advanced Concept matching techniques High-performance pattern matching algorithms Can analyze a text and identify the key concepts within the document Based on frequency and relationships of terms correlated with meaning Language Independent Concept Search Engine Basics: The Process…
Keyword, Boolean and Proximity Searches: Exacerbate / increase information overload Can’t tell how relevant a document is to the subject being researched Only track simple occurrence of keywords (e.g., "CD AND (NOT (financial OR money OR invest*)) AND music") May track proximity of content but not relevant content Lack of localization (English Wizard… hey I’m NOT!) Concept Search Engine Basics: Limitation of other approaches…
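To make the limitation concrete, here is a minimal Python sketch of the Boolean query quoted on this slide. The function name `matches` and the sample documents are hypothetical, and the whitespace tokenizer is a simplifying assumption; it is not how any particular engine parses queries. The point is that the result is only True or False, so two matching documents cannot be ranked against each other.

```python
def matches(text):
    """Evaluate: CD AND (NOT (financial OR money OR invest*)) AND music."""
    words = text.lower().split()  # naive whitespace tokenizer (assumption)

    def has(word):
        return word in words

    def has_prefix(prefix):
        # Models the trailing wildcard in invest*
        return any(w.startswith(prefix) for w in words)

    excluded = has("financial") or has("money") or has_prefix("invest")
    return has("cd") and not excluded and has("music")


# Hypothetical sample documents
docs = [
    "my favourite music cd",
    "music cd sales and investment outlook",
    "cd cd cd music music review of every album",
]
results = [matches(d) for d in docs]
```

The first and third documents both come back as plain matches, even though one mentions the terms once and the other repeatedly; nothing in the answer says which is more relevant.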
Developed by Thomas Bayes, 18th century cleric and mathematician Central tenet of modern statistical probability modeling Calculates probabilistic relationship between multiple variables and the extent to which each impacts the other Used in pattern and fingerprint recognition Bayesian Inference Okay maybe not this guy
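As a small illustration of the probabilistic relationship this slide describes, the sketch below applies Bayes' rule to one term and one topic. The function name `posterior` and all the numbers are made up for the example; this is not the formula used inside the Concept Search Engine.

```python
def posterior(prior, likelihood, false_alarm):
    """P(topic | term) by Bayes' rule.

    prior       -- P(topic)
    likelihood  -- P(term | topic)
    false_alarm -- P(term | not topic)
    """
    evidence = likelihood * prior + false_alarm * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 10% of documents are about music; the term "CD"
# appears in 60% of music documents and in 5% of all other documents.
p_music_given_cd = posterior(prior=0.10, likelihood=0.60, false_alarm=0.05)
```

With these assumed inputs, seeing "CD" raises the probability that a document is about music from 10% to roughly 57%.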
Developed by Claude Shannon in 1948 Words which are less frequent across all documents, but appear in a cluster of documents, are more distinguishing and tend to convey more information Ideas can be inferred from related content An inference engine may be used to parse and build content Shannon’s Information Theory
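The idea that rarer words carry more information can be sketched with Shannon's self-information, -log2(p), where p is estimated here as the fraction of documents containing the term. The tiny corpus and the helper names `doc_frequency` and `information_bits` are hypothetical, and the estimate of p is a simplifying assumption for illustration only.

```python
import math

# Hypothetical four-document corpus
corpus = [
    "the cd market and the music industry",
    "the bank and the money market",
    "the music of the baroque era",
    "the investment bank report",
]


def doc_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for d in docs if term in d.split())


def information_bits(term, docs):
    """Self-information -log2(p), with p = share of documents containing the term."""
    p = doc_frequency(term, docs) / len(docs)
    return -math.log2(p)


bits = {t: information_bits(t, corpus) for t in ("the", "music", "baroque")}
```

In this corpus "the" appears everywhere and carries 0 bits, "music" appears in half the documents and carries 1 bit, and "baroque" appears in one of four and carries 2 bits.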
Adaptive, Probabilistic Concept Modeling • Bayesian Inference + Shannon’s Information Theory • Dynamic Reasoning Engine (DRE) generates networks of concepts • Terms are weighted; relationships are established • The unstructured content “portal” metaphor
Core Engine of the Concept Search Logic Uses the APCM algorithms to extract content Generates relative weight of document relevance, base summary and/or result set (non-tabular) Generates query plans for unstructured data May be stored as Templates for reusable queries May be used by agent processes for aggregation Accessed through Enterprise Portal Search API Concept Search Inference Engine: Also known as Dynamic Reasoning Engine…
Automatically gathers text content from local file systems and imports external files into an index Can gather document sets in a local file system Can spider mapped drives Can load a single document as discrete sets Uses Verity, Keyview & Adobe filters to work on ASCII text Will continually check for new content Auto Indexer
Automated Content Categorizer Stores categories or reusable queries known as “Agents” Agents can be shared or used to find people with similar interests Agent Process
Allows ‘auto-spidering’ of web sites to gather data Converts web content to an indexable format May be used to fetch content from many sites simultaneously Can return meta-data and conventional text content Obtains Web Pages behind Firewalls and through Proxy Servers Obtains Web Pages protected by a login Obtains Web Pages using Cookies Knowledge Fetch Process
Auto Indexer HTML E-Mail News Inference Engine PDF The Knowledge Management Server: A Portal Service… Sybase Enterprise Portal Open Client IBM DRDA SQL*Net ODBC/JDBC File I/O POP3 Exchange Lotus Notes Application Service Engine HTTP HTTPS HTTP HTTPS Word
Encapsulates the Search API into a set of EP components Components can be accessed by other EP services, such as security servlets, messaging or other EJBs Allows load balancing across server clusters Secure Search and Profile Locking Allows extending of the Dynamic Reasoning Engine via ANY component model (Java, C, ActiveX, Server-Side JavaScript, etc.) Enterprise Portal Search Services
EP Data Store Sample Architectures Load Balancing Hardware Firewall Client Web Server Presentation Layer External Spider Agent Concept Search Inference Engine Application Engine Knowledge Server Agents Knowledge Server Internal Spider Agents Fetch Agent Fetch Agent Unstructured Data Repositories Data repository Intranet DMZ Ring
Storage Overhead • No content stored, just terms & weights: • ~30 - 50% of original document size • Content stored, plus terms & weights: • ~150% of original document size • Content, proximity & phrase matching, and terms & weights: • ~250% of original document size
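The percentages on this slide translate directly into a sizing estimate. The sketch below is plain arithmetic on those figures; the mode names and the function `estimate_index_size` are hypothetical labels chosen for the example, and the approximate slide figures are treated as exact.

```python
# Overhead as (low %, high %) of the original document size, from the slide
OVERHEAD_PERCENT = {
    "terms_only": (30, 50),                 # no content stored, terms & weights
    "content_and_terms": (150, 150),        # content plus terms & weights
    "content_proximity_phrase": (250, 250), # plus proximity & phrase matching
}


def estimate_index_size(source_gb, mode):
    """Return (low, high) estimated storage in GB for the chosen mode."""
    low, high = OVERHEAD_PERCENT[mode]
    return source_gb * low / 100, source_gb * high / 100


# Example: 100 GB of source documents
terms_only = estimate_index_size(100, "terms_only")
full = estimate_index_size(100, "content_proximity_phrase")
```

For 100 GB of documents this gives about 30 to 50 GB when only terms and weights are kept, and about 250 GB with content, proximity and phrase matching.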
Content Based Search Why and Where does it fit in? Personalization Content Management Continuous Availability Integration Security
The Purpose The Process Content Search Basics Full Text Search Specialty Data Store Sample Architectures Content Full Text Search Engine Basics Note: The Sybase Content Full Text Search Engine uses embedded technology from Verity
Structured (SQL) Access to Unstructured Data Adaptive Server (or EP) indexes documents stored in external data stores Indexes are maintained within a collection It understands words and language constructs It understands many document types, e.g., MS Word, HTML, SGML, PDF, etc. Content Search Engine Basics: The Purpose…
HTML EP Data Store PDF Content Search Engine Basics: The Process… Sybase Enterprise Portal Application Service Engine SQL Query Specialty Data Store Text Word
Queries are issued against a collection Results include a document identifier and a score Score indicates how well a document matched the query Can understand and index many foreign languages Includes rules for understanding words and constructs of the specified language Content Search Engine Basics: The Process…
Queries are issued against a collection Results include a document identifier and a score Score indicates how well a document matched the query Content Search Engine Basics: The Process… Collection - A Find documents where “blue” is near “red” ID = 68, score = 98 ID = 17, score = 71
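The query-against-a-collection flow on this slide can be sketched as a toy proximity search that returns (ID, score) pairs. Everything here is an assumption made for illustration: the scoring formula, the `window` parameter, the function names and the sample texts are hypothetical and are not the engine's actual scoring, so the scores differ from the ones shown on the slide.

```python
def near_score(text, a, b, window=5):
    """Toy 0-100 score: higher when a and b occur closer together."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return 0
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    if gap > window:
        return 0
    return round(100 * (window - gap + 1) / window)


def search_near(collection, a, b, window=5):
    """Return [(doc_id, score), ...] for matching documents, best first."""
    hits = [(doc_id, near_score(text, a, b, window))
            for doc_id, text in collection.items()]
    return sorted((h for h in hits if h[1] > 0),
                  key=lambda h: h[1], reverse=True)


# Hypothetical collection keyed by document ID
collection_a = {
    68: "the blue and red flag",
    17: "blue sky over a small red barn",
    5: "a blue car",
}
results = search_near(collection_a, "blue", "red")
```

Document 68 ranks first because its two terms sit closer together, document 17 still matches with a lower score, and document 5 is dropped because it has no "red" at all.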
Can understand and index many foreign languages Includes rules for understanding words and constructs of the specified language Content Search Engine Basics: The Process… Hola! Buongiorno! Mahalo! Kem-Cho!
Specialty Data Store EP Data Store Content Search Engine Basics: The Process… Indexed data and index in two separate data stores Indexed Data Indexed Data • Updates, synchronization, backup, recovery?
Data Store propagates source changes to the collection An events table (text_events) is used to log changes to the source tables Data Store must be notified that changes exist Backups of both data stores must be synchronized Full Text Search is a Specialty Data Store: Yes, but…
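The change-propagation pattern on this slide can be sketched with in-memory structures. Only the name `text_events` comes from the slide; the dictionaries, the function names and the event layout are hypothetical, and the real product keeps these in database tables rather than Python objects. The sketch shows why the index stays stale until the data store is told that changes exist.

```python
source = {}        # source table: id -> text (hypothetical stand-in)
text_events = []   # change log: (operation, id), named after the slide
collection = {}    # text index: id -> set of indexed terms (hypothetical)


def write_source(doc_id, text):
    """Insert or update a source row and log the change."""
    op = "update" if doc_id in source else "insert"
    source[doc_id] = text
    text_events.append((op, doc_id))


def delete_source(doc_id):
    """Delete a source row and log the change."""
    source.pop(doc_id, None)
    text_events.append(("delete", doc_id))


def propagate():
    """Apply all logged changes to the collection; return how many were applied."""
    applied = 0
    while text_events:
        op, doc_id = text_events.pop(0)
        if op == "delete" or doc_id not in source:
            collection.pop(doc_id, None)
        else:
            collection[doc_id] = set(source[doc_id].lower().split())
        applied += 1
    return applied


write_source(1, "Digital near Compaq")
stale_before = dict(collection)   # nothing indexed yet
applied = propagate()             # the "notify" step
```

Until `propagate()` runs, a query against the collection would miss the new row, which is also why backups of the two stores need to be taken at a consistent point.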
Specialty Data Store EP Data Store Full Text Search is a Specialty Data Store: Sybase Provides… • Integrated backup and restore facility • Backup / Recover database and text indexes • Online configuration • Configure Full Text Search at runtime dump database...
Enhanced Full Text Search Features • Clustering: a feature for grouping similar documents • Clusters are inherently fuzzy - the algorithm merely attempts to group similar documents • Query By Example: provides ability to search for documents that are similar to one or more segments of text • select summary, score, copy • from t1 t, vt1 v • where t.id = v.id and • index_any = '<like> ("Space the final frontier")'
Custom Thesaurus allows users to build a thesaurus specific to their application. Synonym Maps for proximity search control: 1 synonyms: (list: "red, ruby, scarlet, fuchsia, magenta" list: "blue <or> azure")$$ A Text Index is used by joining the source table and the index table select score, copy from story_index i, stories s where i.id = s.id and i.score > 70 and i.index_any = "Digital <near> Compaq" Enhanced Full Text Search Features
Sybase provides two types of Knowledge Management: Concept Search and Content Search Technology Futures include an unstructured data server, XML search and indexing, XSL translation and other ways of managing hybrid data. Summary
Yes it can be done Content, Concept We have it all Summary