Presentation Transcript


  1. Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou Research & Development Thomson Legal & Regulatory – West Group St. Paul, Minnesota 55123 USA {Jack.Conrad,Peter.Jackson}@WestGroup.com

  2. Outline
  • Terminology: background to the vocabulary, including that used in the title
  • Overview: of our operational environment and overall problem space
  • Research Contributions (Novelty of Investigation): aspects of the problem that haven't been explored before, especially with respect to scale and production systems
  • Corpora Statistics: the data sets used, namely those listed for the next items
  • Experimental Set-up
  • Phase 1: Actual Physical Resources
  • Phase 2: Acquired Logical Resources
  • Performance Evaluation: we'll compare the effectiveness of each approach on each data set
  • Conclusions: what conclusions we're able to draw
  • Future Work: new directions this work may be taking
  28th International VLDB '02 — J. Conrad

  3. Terminology
  • Database Selection
    – Given O(10K) DBs composed of textual documents, we need to effectively and efficiently help users narrow an information search and home in on the most relevant materials available in the system
  • Actual Physical Resources
    – There exist O(1K) underlying physical DBs that can be leveraged to reduce the dimensionality of the problem; these are organized around internal criteria such as publication year, hardware system, etc.
    – We have access to the complete term distributions associated with these DBs, and wanted to convince ourselves that we could first get reasonable results at this level
  • Acquired Logical Resources
    – We can re-architect the underlying DBs along domain- and user-centric content types (e.g., Region, Topic, Doc-type, etc.)
    – We can then characterize these "logical" DBs by profiling them with different sampling techniques, either random or query-based (a short sketch of both follows below)
  28th International VLDB '02 — J. Conrad
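
A minimal Python sketch of the two profiling strategies just mentioned (random and query-based sampling), assuming documents are plain strings; the function names and the hypothetical search_fn interface are illustrative, not part of the system described here:

  import random
  from collections import Counter

  def random_sample_profile(collection_docs, sample_size=300):
      """Build a term-frequency profile from a random subset of documents.

      Callan found that on the order of 300 sampled documents suffices to
      characterize a collection, hence the default used here.
      """
      profile = Counter()
      sample = random.sample(collection_docs, min(sample_size, len(collection_docs)))
      for doc in sample:
          profile.update(doc.lower().split())
      return profile

  def query_based_sample_profile(search_fn, probe_queries, docs_per_query=10):
      """Build a profile by probing the collection with queries and pooling
      the returned documents. search_fn(query, k) is a hypothetical handle
      to the collection's own search engine."""
      profile = Counter()
      for q in probe_queries:
          for doc in search_fn(q, docs_per_query):
              profile.update(doc.lower().split())
      return profile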

  4. Overview
  • Operational Environment (Westlaw)
    – Over 15,000 databases, each consisting of thousands of documents
    – Over one million U.S. attorneys as users, plus thousands of others in the UK, Canada, Australia, ...
    – O(100K) queries, i.e., several hundred thousand, are submitted to the Westlaw system each day; we require our users to submit a DB ID
  • Motivations for Re-architecting the System
    – Showcasing thousands of DBs has typically been a competitive advantage
    – A segment of today's users prefers global search environments
    – Simplify the activity of narrowing scope for online research
    – Organize data along user- and domain-centric rather than hardware- or maintenance-centric lines, primarily concentrating on areas of law and business
  • Toolkit Approach to DBs and DB Selection Tools
    – Diverse mechanisms for focusing on relevant information, each optimized for a particular level of granularity
  28th International VLDB '02 — J. Conrad

  5. Contributions of Research
  • Represent O(10,000) DBs
  • DBs can contain O(100,000) documents
  • Collection sizes vary by several orders of magnitude
  • Documents can appear in more than one DB
  • DBs cumulatively in the TB, not GB, range; the work reported here involves between 2 and 3 TB
  • Docs represent a real, not simulated, domain
  • Implemented in an actual production environment
  28th International VLDB '02 — J. Conrad

  6. Westlaw Architectural Issues: Physical vs. "Logical" Databases
  [Diagram: O(1000) physical databases (e.g., Case_Law, Statutes, WestNews, Analytical, Regulatory) mapped onto O(100) logical databases, an order-of-magnitude difference]
  • Traditionally, data for the Westlaw system were physically stored in silos dictated by internal considerations, that is, those that facilitated storage and maintenance (publication year, aggregate content type, or source), rather than by categories that made sense to system users in the legal domain, such as legal jurisdiction (region), legal practice area, or document type (e.g., congressional legislation, treatises, jury verdicts, etc.)
  • Re-architecting the WL repository around such user-centric categories to obtain our logical data sets was our primary objective
  • In the diagram, the three columns labeled in red (Jurisdiction: Fed., State, Local, Int'l; Legal Practice Area; Doc-Type) represent the three primary bases for segmentation; the rows labeled in blue are residual sub-groupings resulting from this strategy
  28th International VLDB '02 — J. Conrad

  7. Corpora Statistics
  [Table: corpus statistics for the two test sets, covering roughly 40% and 90% of WL respectively]
  • For the physical collections, each document participates in the profile, so the profile dictionary is essentially the entire dictionary
  • For the logical collections, profiles are built via sampling; Callan found that on the order of 300 sampled documents suffices
  • The sampled profiles cover roughly 25% and 50% of the complete dictionary
  28th International VLDB '02 — J. Conrad

  8. Alternative Scoring Models
  • Scoring: CORI 1-2-3
    – tf-idf based, representing df-icf
    – absent terms are given a default belief probability
    – Engine: WIN (Bayesian Inference Network)
  • Scoring: Language Model
    – occurrence based, via df + cf
    – smoothing techniques are used for absent terms
    – Engine: statistical term / concept probabilities
  • Data (both models): Collection Profiles
    – Complete term distributions (Phase 1)
    – Random and query-based sampled term distributions (Phase 2)
  28th International VLDB '02 — J. Conrad

  9. tf * idf Scoring: Cori_Net3
  • The belief p(w_i | c_j) in collection c_j due to observing term w_i is determined by
    d_b + (1 – d_b) * T * I
    where d_b is the minimum belief component when term w_i occurs in collection c_j
  • Similar to Cori_Net2, but normalized without layered variables
  • Typically the tf-type expression T is normalized by df_max; here we introduce K instead, which was inspired by experiments in document retrieval (our K is different from anything Callan or others have used; they have a set of parameters that are successively wrapped around each other)
  • I is the collection-retrieval equivalent of normalized inverse document frequency (idf)
  28th International VLDB '02 — J. Conrad
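
To make the belief formula concrete, here is a minimal Python sketch; the K normalizer below follows Callan's published CORI formulation (the talk notes that Cori_Net3 uses its own K, which is not reproduced here), and the default d_b of 0.4 is an assumed, commonly cited value:

  import math

  def cori_belief(df, cf, cw, avg_cw, num_collections, d_b=0.4):
      """Belief p(w_i | c_j) that collection c_j satisfies query term w_i.

      df: documents in this collection containing the term
      cf: number of collections containing the term
      cw, avg_cw: word count of this collection and the mean over collections
      d_b: minimum (default) belief component when the term occurs at all
      """
      K = 50 + 150 * (cw / avg_cw)          # length-based normalizer (assumed form)
      T = df / (df + K)                     # tf-like component over documents
      # icf-like component: the collection-level analogue of normalized idf
      I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
      return d_b + (1.0 - d_b) * T * I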

  10. Language Modeling
  • An LM based only on a profile document may face sparse-data problems when the probability of a word w given a profile 'doc' is 0 (an unobserved event), so it can be useful to extend the original document model with a db model
  • Weighted Sum Approach (Additive Model)
    – The additive model helps by leveraging extra evidence from the complete collection of profiles; the interpolation weight λ is of course between 0 and 1
    – By summing in the contribution of a word at the db level, we can mitigate the uncertainty associated with sparse data in the non-additive model
  • Query Treated as a Sequence of Terms (Independent Events)
    – Each term is viewed as a separate event, and the query represents the joint event; this permits duplicate terms and phrasal expressions
  28th International VLDB '02 — J. Conrad
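
To make the additive model concrete, here is a minimal Python sketch that treats each collection profile as a term-frequency dictionary and uses the pool of all profiles as the db-level model; the interpolation weight lam is an illustrative value, not the setting used in this work:

  import math

  def lm_collection_score(query_terms, profile_tf, background_tf, lam=0.5):
      """Score one collection for a query with a weighted-sum (additive) LM.

      profile_tf: term counts in this collection's profile "document"
      background_tf: term counts pooled over all profiles (the db-level model)
      """
      p_len = sum(profile_tf.values()) or 1
      b_len = sum(background_tf.values()) or 1
      score = 0.0
      # The query is a sequence of independent term events, so log-probabilities
      # are summed; the background term keeps an unseen word from zeroing out
      # the whole product.
      for w in query_terms:
          p_profile = profile_tf.get(w, 0) / p_len
          p_background = background_tf.get(w, 0) / b_len
          p = lam * p_profile + (1.0 - lam) * p_background
          if p > 0:
              score += math.log(p)
      return score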

  11. Test Queries and Relevance Judgments
  • Actual user submissions to the DBS application
    – Phase 1 (Physical Collections): 250 queries, mean length 8.0 terms
    – Phase 2 (Logical Collections): 100 queries, mean length 8.8 terms
  • Complete Relevance Judgments
    – Provided by domain experts before the experiments were run
    – Followed training exercises to establish consistency
  • Mean Positive Relevance Judgments per Query
    – Phase 1 (Physical Collections): 17.0
    – Phase 2 (Logical Collections): 9.1
  • Why a different query set for Phase 2? We wanted queries that were less general and more specific, with fewer positive relevance judgments per query
  28th International VLDB '02 — J. Conrad

  12. Retrieval Experiments
  • It is important to point out that our initial experiments were at the database level
  • Test Parameters (some of the variables we examined):
    – 100 physical DBs vs. 128 logical DBs
    – For logical DB profiles: query-based vs. random sampling
    – Queries with phrasal concepts vs. terms only
    – Stemmed vs. unstemmed terms
    – Scaling vs. none (i.e., global frequency reduction, inspired by noise handling in speech-recognition experiments)
    – Minimum term-frequency thresholds
  • Performance Metrics (a sketch of the second one follows below):
    – Standard precision at 11-point recall
    – Precision at N-database cut-offs
  • We'll see some examples of these next
  28th International VLDB '02 — J. Conrad
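
The precision-at-N-database-cut-offs metric is easy to state in code; this sketch assumes a ranked list of database IDs and a set of judged-relevant IDs per query, with illustrative cut-off values:

  def precision_at_db_cutoffs(ranked_dbs, relevant_dbs, cutoffs=(1, 5, 10, 20)):
      """Fraction of the top-N returned databases judged relevant.

      ranked_dbs: database IDs in the order the selection algorithm returned them
      relevant_dbs: IDs judged relevant for this query by the domain experts
      Returns a dict mapping each cut-off N to its precision.
      """
      relevant = set(relevant_dbs)
      return {n: sum(1 for db in ranked_dbs[:n] if db in relevant) / n
              for n in cutoffs}

  # Per-query values are then averaged over the query set
  # (250 queries in Phase 1, 100 in Phase 2).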

  13. [Plot: Phase 1 results, averaged over 250 queries; essentially the best runs from both methods for this phase]
  • LM clearly outperforms CORI by more than 10% at the first recall points
  • This result is consistent with recent results in the document-retrieval domain

  14. [Plot: logical-collection results; the baseline is included in this case because it is relatively closer to that of the two techniques]
  • When we move to the logical collections, we see a reversal in this relative performance
  • The average precision of the two may be similar, but CORI is significantly better than the other LM results here (Rand_1000 and QBS 500+1000)

  15. [Plot: the final results plot]
  • Here we explore a special post-process lexical analysis of queries for jurisdictionally relevant content; when no such context is found, jurisdictionally biased collections are down-weighted (a sketch of this step follows below)
  • For results marked Lex, the process is applied only to queries with no jurisdictional clues
  • For results marked Lex+, the re-ranking is applied to all queries, but the DBs that match the lexical clues are left in their original ranks
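
A minimal sketch of what the down-weighting step could look like, assuming a ranked list of (db_id, score, is_jurisdictional) tuples and a small cue lexicon; both the data layout and the penalty value are assumptions for illustration, not details of the actual system:

  # Hypothetical cue lexicon; the real system draws on a domain lexicon
  # of jurisdictionally relevant terms.
  JURISDICTION_CUES = {"federal", "california", "texas", "ninth circuit"}

  def lexical_rerank(query, ranked_dbs, penalty=0.5):
      """Post-process re-ranking (the "Lex" variant): if the query contains no
      jurisdictional clue, down-weight jurisdictionally biased databases;
      otherwise leave the ranking unchanged."""
      q = query.lower()
      if any(cue in q for cue in JURISDICTION_CUES):
          return ranked_dbs
      rescored = [(db_id, score * penalty if is_juris else score, is_juris)
                  for db_id, score, is_juris in ranked_dbs]
      return sorted(rescored, key=lambda t: t[1], reverse=True)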

  16. Performance Evaluation
  • WIN using CORI scoring
    – Works better for logical collections than for physical collections
    – Best results come from randomly sampled DBs
  • Language Modeling with basic smoothing
    – Performs best for physical collections; less well for logical ones
    – Top results come from randomly sampled DBs
  • These sampling results do not agree with Callan's, but he was operating in a non-cooperating environment
  • Jurisdictional Lexical Analysis contributes more than 10% to average precision
    – As we saw, adding our post-process lexical analysis increased precision by over 10% at the top recall points
  28th International VLDB '02 — J. Conrad

  17. Document-level Relevance
  • We took 25% of our Phase 2 queries and ran them against the top 5 CORI-ranked DBs, then evaluated the top 20 documents (2,500 docs total); this is what resulted
  • The "On Point" category surpasses the next three categories combined
  28th International VLDB '02 — J. Conrad

  18. Conclusions
  • WIN using CORI scoring is more effective than the current LM for environments that harness database profiling via sampling
  • Language Modeling is more sensitive to sparse-data issues
  • Post-process Lexical Analysis contributes significantly to performance
  • Random-sampling profile creation outperforms query-based sampling in the WL environment
  28th International VLDB '02 — J. Conrad

  19. Future Work
  • Document Clustering
    – Basis for new categories of databases
    – May show promise for domains in which we know much less about the pre-existing document structure
  • Language Modeling
    – Harness robust smoothing techniques (simple, linear, smallest binomial, finite element, b-spline)
    – Measure the contribution to logical DB performance, competing with the high performance obtained with CORI
  • Actual Document-level Relevance
    – Expand the set of relevance judgments
    – Assess document scores based on both DB and document beliefs
  • Bi-modal User Analysis
    – Complete automation vs. user interaction in DBS
  28th International VLDB '02 — J. Conrad

  20. Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou Research & Development Thomson Legal & Regulatory – West Group St. Paul, Minnesota 55123 USA {Jack.Conrad,Peter.Jackson}@WestGroup.com

  21. Related Work
  • L. Gravano, et al., Stanford (VLDB 1995)
    – Presented the GlOSS system to assist in the DB selection task
    – Used 'Goodness' as the measure of effectiveness
  • J. French, et al., U. Virginia (SIGIR 1998)
    – Introduced metrics to evaluate DB selection systems
    – Began to compare the effectiveness of different methods
  • J. Callan, et al., UMass (SIGIR '95 and '99, CIKM 2000)
    – Developed the Collection Retrieval Inference Net (CORI)
    – Showed CORI was more effective than GlOSS, CVV, and others
  28th International VLDB '02 — J. Conrad

  22. Background
  • Exponential growth of data sets on the Web and in commercial enterprises
  • Limited means of narrowing the scope of searches to relevant databases
  • Application challenges in large, domain-specific operational environments
  • Need effective approaches that scale and deliver in focused production systems
  28th International VLDB '02 — J. Conrad

  23. Sample Results 28th International VLDB '02 — J. Conrad
