Explore the benefits and drawbacks of latent semantic indexing (LSI) for database searches, using EBSCO Publishing as the use case. See how LSI supports conceptual search, integrates diverse content types, and enables real-time content integration.
Finding Stuff: LSI and Database Searching • A Business Use Case • Joe Tragert, EBSCO Publishing • Bentley, June 26, 2006
Overview • EBSCO Publishing overview • Latent Semantic Indexing pros and cons • Integrating diverse content types – the Executive Daily Brief use case • Discovering obfuscated records – the US PTO example
EBSCO Industries • Ranked #162 in Forbes “America’s Largest Private Companies” in 2005
EBSCO Publishing • Research & reference solutions • Corporate • Medical • Academic • Public Library • K-12 • 73 terabytes of content, configured into over 100 different proprietary full-text databases • Redistributes 100+ third-party reference products • Founded in 1987, 550 employees worldwide, HQ in Ipswich, MA
Latent Semantic Indexing • Searching is focused on the words, not indices or metadata. • The engine can be “trained” to optimize results by domain (engineering, medical, general business, etc.) • Engine creates a vector space based upon the data it sees. All articles are placed within that vector space. • Updates are quickly assigned values within the vector space, enabling real-time integration of RSS feeds. • Multiple data sources are integrated rapidly, requiring a few hours to a few days.
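A minimal sketch of the vector-space idea above, in plain Python. A toy corpus becomes a term-document matrix, the top concept axes are found by power iteration (a stand-in for a production SVD), and a new item is "folded in" by projecting it onto the existing axes without rebuilding the space. The corpus, the function names, and the choice of two axes are illustrative assumptions, not Content Analyst's or EBSCO's implementation.

```python
import math

# Toy corpus (illustrative only): three vehicle texts and two knee-implant texts.
DOCS = [
    "motorcycle engine wheel frame",
    "motorcycle saddle riding vehicle engine",
    "saddle riding vehicle wheel frame",      # never says "motorcycle"
    "prosthetic knee joint implant",
    "knee implant surgery joint",
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def to_vector(text, index):
    """Term-count vector over a fixed vocabulary; unknown words are ignored."""
    vec = [0.0] * len(index)
    for word in text.lower().split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

def concept_axes(columns, k, iters=300):
    """Top-k left singular vectors of the term-document matrix A,
    found by power iteration on A * A^T with deflation."""
    n_terms = len(columns[0])
    axes = []
    for _ in range(k):
        v = [1.0] * n_terms
        for _ in range(iters):
            for u in axes:                      # remove axes already found
                p = dot(v, u)
                v = [x - p * y for x, y in zip(v, u)]
            w = [0.0] * n_terms                 # w = A * (A^T * v)
            for col in columns:
                c = dot(col, v)
                w = [x + c * y for x, y in zip(w, col)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:                     # matrix rank is below k
                break
            v = [x / norm for x in w]
        axes.append(v)
    return axes

def project(vec, axes):
    """Place a term vector in the k-dimensional concept space."""
    return [dot(vec, u) for u in axes]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

vocab = sorted({w for d in DOCS for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
columns = [to_vector(d, index) for d in DOCS]
axes = concept_axes(columns, k=2)
doc_points = [project(c, axes) for c in columns]

# Concept query: score every document against the single word "motorcycle".
query = project(to_vector("motorcycle", index), axes)
scores = [cosine(query, p) for p in doc_points]

# Fold-in: a new feed item gets coordinates from the existing axes.
new_item = project(to_vector("riding vehicle frame", index), axes)
```

In this toy corpus the third text scores close to 1 against the query even though it never uses the word "motorcycle", while the knee-implant texts score near 0. Only the final `project` call is needed to place a new item, which is why updates can be assigned quickly.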
LSI Advantages • Conceptual Search: concepts are matched, not key words • Easier to create searches by using chunks of text as search “terms” • No need to understand thesauri or Boolean operators • Integrated Content: databases, blogs, RSS, etc. • Multiple databases can be searched at once (similar to federated search, but different…) • Since the words are searched, no need to normalize indices or record structures of source data sets • Real-time content • The engine can rapidly assign new content to the existing vector space, enabling integration of current content with archival material • Language agnostic • Since all content is converted to values in the vector space, multiple languages can be searched and returned in a single result list
LSI Disadvantages • Precision: Matching concepts does not lead to the “one perfect article” • Multiple content types in one result set require robust filtering and refining functionality to minimize confusion • Default date order sorting can “overwhelm” a result list • Multilingual search is seductive, but requires a quality translation feature to get the best utility from the results • Can be difficult for the “Google generation” to grasp the concept of “concepts”
Why Use LSI? • Structured data: users tend not to care about metadata • Currency is king: users tend to focus on “real time” content (news sites, blogs) but periodicals can provide real value • Skills: not everyone is a librarian… actually, most aren’t • Tools: slow to learn, slower to change • Perspective: impatient with complexity
LSI Use Case: Executive Daily Brief (EDB) • Customizable monitoring and alert service • Supports non-librarian corporate uses: brand management, corporate intelligence, general counsel, IP management, etc. • Two types of Search • Content Analyst LLC’s patented Concept Search™ • EBSCO’s keyword search • Multiple content types • Premium business content (EBSCO structured content) • Newspapers • RSS feeds (blogs, news sites) • Licensed databases (USPTO, INSPEC, etc.) • Intranet repositories
Multiple Content Types and Search Methods • Users can set up folders, and monitor for content related conceptually (same meaning, but different words) to key words or article “examples” already in the folders • Users can search for immediate results that are related to words, articles, emails or external documents, using Concept Search or Key Word Search • Users can link to “advanced” key word search options, thesauri, and visual searching
Folders Are Determined by End Users • Users can add, delete or edit “alerts” (folders) as needed • Users put words, phrases, paragraphs, full articles, emails, MS Word docs, etc. into the folders. • EDB adds matches to the folders • Results for a folder appear when the folder is selected • Users can easily make a result into a “concept” (example) and put it into a folder
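The folder workflow above can be sketched roughly as follows. This is a hypothetical model, not the EDB code: the `Folder` class, its `threshold` value, and the word-overlap `similarity` function (a simple stand-in for a concept-search score) are all assumptions made for illustration.

```python
import math
from collections import Counter

def similarity(a, b):
    """Word-overlap cosine: a simple stand-in for a concept-search score."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(n * n for n in ca.values()))
            * math.sqrt(sum(n * n for n in cb.values())))
    return shared / norm if norm else 0.0

class Folder:
    """An alert folder: user-supplied examples plus the matches found for them."""

    def __init__(self, name, threshold=0.3):
        self.name = name
        self.threshold = threshold   # hypothetical relevance cut-off
        self.examples = []           # words, phrases, or full texts
        self.matches = []            # (score, text) pairs

    def add_example(self, text):
        self.examples.append(text)

    def offer(self, text):
        """Score incoming content against every example; keep it if the best score passes."""
        best = max((similarity(text, ex) for ex in self.examples), default=0.0)
        if best >= self.threshold:
            self.matches.append((best, text))
            return True
        return False

    def promote(self, position):
        """Turn an existing result into a new example for future monitoring."""
        self.add_example(self.matches[position][1])

folder = Folder("knee implants")
folder.add_example("knee implant recall")
kept = folder.offer("hospital reports knee implant recall after surgery")
skipped = folder.offer("quarterly earnings beat forecasts")
folder.promote(0)   # the kept result now also acts as an example
```

Swapping `similarity` for a real concept-space score would leave the folder logic unchanged, which is the point of keeping the scoring function separate.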
Structured Content in Familiar Layout • The full text is viewed in a pop up window • The user will link to the source (the article on EBSCOhost, news site, the RSS feed provider, licensed database or intranet file) • Users can email, save, print the document, or add it to their folder as a new example to be monitored
Linking to RSS Providers Simplifies Access • Selected RSS articles are viewed in a pop up window • The user links to the source
Results Are Refined, Interactively • Users can sort results by Date, Title, Publication and Relevance • Users can narrow results by Publication or Content Type • Users can delete previously read content, content of a specific relevance, or content published before a specific date
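A rough sketch of the sort, narrow, and delete operations listed above. The `Result` fields and function names are hypothetical, the sample records are invented, and "content of a specific relevance" is assumed here to mean content below a chosen relevance score.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Result:
    title: str
    publication: str
    content_type: str      # e.g. "journal", "rss", "patent"
    published: date
    relevance: float       # 0.0 to 1.0, higher is closer
    read: bool = False

def sort_results(results, key):
    """Sort by 'date', 'title', 'publication' or 'relevance'.
    Date and relevance run newest/highest first."""
    fields = {"date": "published", "title": "title",
              "publication": "publication", "relevance": "relevance"}
    attr = fields[key]
    return sorted(results, key=lambda r: getattr(r, attr),
                  reverse=key in ("date", "relevance"))

def narrow(results, publication=None, content_type=None):
    """Keep only results from one publication and/or of one content type."""
    return [r for r in results
            if (publication is None or r.publication == publication)
            and (content_type is None or r.content_type == content_type)]

def prune(results, drop_read=False, min_relevance=None, not_before=None):
    """Drop read items, items under a relevance score, or items older than a date."""
    return [r for r in results
            if not (drop_read and r.read)
            and (min_relevance is None or r.relevance >= min_relevance)
            and (not_before is None or r.published >= not_before)]

# Invented sample records, for illustration only.
results = [
    Result("Knee system launch", "Trade Weekly", "rss", date(2006, 6, 1), 0.62),
    Result("Prosthetic joint patent", "USPTO", "patent", date(2000, 7, 18), 0.91),
    Result("Implant market review", "Trade Weekly", "journal", date(2005, 3, 9), 0.35, read=True),
]
by_relevance = sort_results(results, "relevance")
patents_only = narrow(results, content_type="patent")
fresh = prune(results, drop_read=True, not_before=date(2005, 1, 1))
```

Sorting by relevance rather than date is one way to keep a burst of new items from crowding out closer matches.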
Alerts Controlled by End User • Users can set up email lists (groups and individuals) to automatically forward documents • Users can set a higher relevance threshold for shared documents than for their own inbox (only send the “best” articles to colleagues)
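The two-threshold idea can be sketched in a few lines. The function name, destination labels, and default threshold values are assumptions for illustration, not settings from the product.

```python
def route(score, inbox_threshold=0.4, share_threshold=0.8):
    """Return the destinations for a document with the given relevance score.

    The sharing threshold is assumed to be at least as strict as the inbox
    one, so anything forwarded to colleagues also lands in the user's inbox.
    """
    destinations = []
    if score >= inbox_threshold:
        destinations.append("inbox")
    if score >= share_threshold:
        destinations.append("colleagues")
    return destinations

moderate = route(0.5)   # good enough for the user, not for the mailing list
strong = route(0.9)     # forwarded to the mailing list as well
weak = route(0.1)       # goes nowhere
```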
LSI Use Case: • Find deliberately obscured patents • Compare prior art to current research • Monitor pending patents • Search patents in native languages • USPTO • European Patent Organization • Japan Patent Office • Expose patent search to more staff • Bench scientists • Competitive intelligence • Risk managers
Sneak Peek: EBSCO Patent Monitor • In development – Fall 2006 release • Use Concept Searching to identify “conceptually related patents” • Enable cross-database searching • Patents (various sources) • Published STM literature • Proprietary research & intranets
Searching on “motorcycle” finds patents that do not include the term “motorcycle”
Patent #6,085,857 does not contain the word “motorcycle”, but it sure looks like one… (aka “motorcycle”)
Running a concept search on the patent abstract creates an “Instant Context list”. These terms are found in the USPTO database and relate to “saddle-type riding vehicles.” Users can search the USPTO database to find those patents, or they can research the individuals to see who else is an expert…
The terms and names on the Instant Context list can indicate the true nature of the patent… Shinobu Tsutsumikoshi is a developer at Suzuki...
Search using a press release on the new Maxim Knee System and get hundreds of related patents…
US Patent #6,090,144 is about prosthetic knees even though the Maxim press release never used the term “prosthesis”
Finding Stuff: The Dead Mouse Test • LSI, key words, proximity, etc… • The real question is not which mousetrap works better… • …just: did we kill the mouse?
Thank You Joe Tragert Director, Market Development EBSCO Publishing O: +800-653-2726 ext. 661 E: jtragert@epnet.com