Human Expertise and Artificial Intelligence in Vertical Search. Peter Jackson & Khalid Al-Kofahi Corporate Research & Development. Horizontal versus Vertical Search. The Paradox of Search. The further you get from keyword indexing and retrieval, the harder it is to explain a search result
Human Expertise and Artificial Intelligence in Vertical Search Peter Jackson & Khalid Al-Kofahi Corporate Research & Development
The Paradox of Search • The further you get from keyword indexing and retrieval, the harder it is to explain a search result • Professional searchers demand transparency • Tool versus appliance • You need an ‘explanatory model’ that people can relate to and understand, even if it is actually just a cartoon of the real process • Examples: Basic PageRank, Collaborative Filtering • Such models don’t work so well in vertical domains • Links aren’t always endorsements • Sparsity of data in smaller communities
Recent Trends in Search • Fragmentation of ‘horizontal’ search • Media, location, demographics (Weber & Castillo, 2010) • More sophisticated models of user behavior • Post-click behaviors (Zhong, Wang, et al., 2010) • ‘Practical semantics’ versus Semantic Web • Maps as search results for local, micro-results • Incorporation of domain knowledge into search • Taxonomies, vocabularies, use cases, workflows
The Example of Legal Search • The completeness requirement • Recall as important as precision • Less redundancy than on the Web • The authority requirement • Court superiority, jurisdiction • Highly cited cases and statutes • Supersession by statute or regulation • The multi-topical nature of documents • A case may cover many points of law but be cited for only one • Citations can be negative as well as positive per topic • These factors also apply to scientific documents
Expert Search • In many verticals, there are at least two sources of expertise available for enhancing search • Editors and authors, who generate useful metadata • Users, who generate clickstreams and other data • Editorial value addition improves recall especially • Helps find both fat neck and long tail documents on a topic • Aggregate user behavior mostly improves precision • Power users find the most relevant and important documents • The model of expert search enables and explains the portfolio of results, rather than individual results
Sources of Evidence: Authors & Editors • [Figure: Burger King Corp. v. Rudzewicz shown with its headnotes and key numbers (KN), its text, and citations linking it to other cases; labels in the figure include 17201, 3 (A), 28 (B), 35, 4 (A), 5 (B), 205,310, 5 (A), 19 (B)] • Issue: Long arm jurisdiction • 12 A (Key cases) • 54 B (Highly Relevant)
Sources of Evidence: Authors & Editors • [Figure: Burger King Corp. v. Rudzewicz linked to sets of cases through ALR, CJS, and AMJUR, and through its headnote and key number pairs (HN1 KN1, HN2 KN2, HN3 KN2, …, HN35 KN14)] • Another set of related cases
Sources of Evidence: Users (I) • [Figure: Sessions 1 to N, each containing queries (Query 1 to Query N) and actions (Click, Print, KeyCite) that lead to cases, including Burger King Corp. v. Rudzewicz] • Link query language to document language via click, print, and cite checking behaviors • Identify documents that are co-clicked, co-printed, etc., with the Burger King case across user sessions
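The co-click idea on this slide can be sketched as a simple co-occurrence count over sessions. This is a minimal illustration, not the production method: the session layout, the function name `co_engaged`, and the document IDs are all hypothetical, and every action type (click, print, KeyCite) is treated alike.

```python
from collections import Counter

def co_engaged(sessions, target):
    """Count how many sessions each document shares with `target`.

    `sessions` is a list of document-ID lists (hypothetical layout); a
    document counts once per session however often it was acted on.
    """
    counts = Counter()
    for docs in sessions:
        unique = set(docs)
        if target in unique:
            counts.update(unique - {target})
    return counts

# Hypothetical sessions: documents a user clicked, printed, or KeyCited.
sessions = [
    ["burger_king", "doc_a", "doc_b"],
    ["burger_king", "doc_a", "doc_a"],
    ["doc_c"],
]
related = co_engaged(sessions, "burger_king")
for doc, n in related.most_common():
    print(doc, n)
```

Counting once per session keeps a single heavy user from dominating; in practice the raw counts would probably need normalizing by each document's overall popularity.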
Sources of Evidence: Users (II) • [Figure: Sessions 1 to N, each with queries and click and print actions that lead to Burger King Corp. v. Rudzewicz] • In the last 3 months • User actions: 10417 • Total sessions: 9758 • Original breach of contract and trademark infringement case turned into a civil procedure case about jurisdiction on appeal • Queries (with counts as shown): "personal jurisdiction" 176 • "minimum contacts" 50 • "forum selection clause" 39 • "personal jurisdiction" 39 • "forum non conveniens" 32 • "choice of law" 29
AI & The Ranking Problem • Supervised Machine Learning (Ranker SVM) • Iteratively retrieve and rank documents • Incorporate all available cues: text similarity, classifications, citations, user behavior and query logs • All of this requires lots of data! • Training & Validation • Gold data: hand-crafted research reports covering a variety of legal issues • Report contains an issue statement, multiple queries, all seminal, highly relevant documents, some relevant docs • > 100K documents judged against ~400 legal issues • System was also tested by an independent 3rd party
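To make the supervised ranking step concrete, here is a minimal sketch of the pairwise idea behind ranking SVMs: within each legal issue, every better-judged document should outscore every worse-judged one by a margin. It is a plain linear ranker trained with hinge loss and gradient steps, not the system described on the slide. The three features (text similarity, citation signal, user-behavior signal), the grades, and all values are hypothetical.

```python
def pairwise_differences(examples):
    """Build x_better - x_worse for every pair judged under the same issue.

    `examples` is a list of (issue_id, features, grade) tuples.
    """
    pairs = []
    for qi, xi, yi in examples:
        for qj, xj, yj in examples:
            if qi == qj and yi > yj:
                pairs.append([a - b for a, b in zip(xi, xj)])
    return pairs

def train_ranker(pairs, epochs=200, lr=0.1, reg=0.01):
    """Minimize L2-regularized hinge loss max(0, 1 - w.d) over the pairs."""
    w = [0.0] * len(pairs[0])
    for _ in range(epochs):
        for d in pairs:
            margin = sum(wi * di for wi, di in zip(w, d))
            for k in range(len(w)):
                # The hinge term contributes only while the pair violates the margin.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical gold data: (issue, [text_sim, citation_signal, user_signal], grade)
examples = [
    ("issue1", [0.9, 0.8, 0.7], 2),
    ("issue1", [0.6, 0.3, 0.4], 1),
    ("issue1", [0.2, 0.1, 0.0], 0),
    ("issue2", [0.8, 0.9, 0.5], 2),
    ("issue2", [0.3, 0.2, 0.1], 0),
]
w = train_ranker(pairwise_differences(examples))
ranked = sorted(examples, key=lambda e: score(w, e[1]), reverse=True)
```

Training on differences within an issue is what lets graded judgments from different research reports be pooled: only the relative order inside each issue matters, never the absolute score.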
Hadoop for Big Data Processing • At launch, query logs contained ~2 billion records • Queries & user actions • Relied on a Hadoop cluster to • Run Extract, Transform, and Load (ETL) processes • Cluster similar queries together • Extract, normalize, and collate citation contexts • Dramatic improvement in processing times • From tens of hours to tens of minutes
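The map, shuffle, and reduce pattern behind a job like "cluster similar queries together" can be sketched in plain Python. This does not use the Hadoop API, and the normalization rule (lowercase, drop punctuation, sort tokens) is an assumed, deliberately crude stand-in for whatever similarity measure the real job applied. The record layout and function names are hypothetical.

```python
import re
from collections import defaultdict

def map_record(record):
    """Map step: emit (normalized query key, 1) for one log record."""
    tokens = re.findall(r"[a-z0-9]+", record["query"].lower())
    yield " ".join(sorted(tokens)), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce step: total the records that share a normalized key."""
    return key, sum(values)

def run_job(records):
    pairs = [kv for record in records for kv in map_record(record)]
    return dict(reduce_group(k, vs) for k, vs in shuffle(pairs).items())

# Hypothetical log records.
log = [
    {"query": '"personal jurisdiction"'},
    {"query": "Personal Jurisdiction"},
    {"query": "jurisdiction, personal"},
    {"query": "minimum contacts"},
]
counts = run_job(log)
print(counts)
```

Because each record is mapped independently and each key is reduced independently, the same logic spreads across as many cores as the cluster offers, which is where the drop from hours to minutes would come from.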
Cluster Configuration: Queries • 8 machines, each with 16 cores • Only 14 cores/machine were available for processing • Giving a total of 112 cores • Block size of 64 MB • Each core processes one block at a time • Cluster can process 7 GB at each step • Latest cluster is twice the size: 224 cores • Almost 1 TB of memory and over 1 PB of storage
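The capacity figures on this slide follow from simple arithmetic, reproduced below as a check. The machine, core, and block numbers come from the slide; the 700 GB input in the usage line is purely illustrative, and 1 GB is taken as 1024 MB.

```python
MACHINES = 8
USABLE_CORES_PER_MACHINE = 14   # 16 cores each, 14 available for processing
BLOCK_MB = 64                   # one block per core at a time

cores = MACHINES * USABLE_CORES_PER_MACHINE   # 112 cores in total
per_step_mb = cores * BLOCK_MB                # 7168 MB handled per step
per_step_gb = per_step_mb / 1024              # 7.0 GB per step

def steps_for(input_gb, step_mb=per_step_mb):
    """Number of full-cluster steps needed to read `input_gb` of input."""
    input_mb = input_gb * 1024
    return -(-input_mb // step_mb)   # integer ceiling division

print(cores, per_step_gb, steps_for(700))   # 700 GB is an illustrative input size
```

Doubling the cluster to 224 cores doubles the per-step figure to 14 GB under the same block size.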
The Power of Expert Search • Leverages expertise of community: authors, editors, & users • We know why documents are linked • We know exactly who our users are • Metadata, authority & aggregated user data all contribute to relevance, importance & popularity • Can still benefit from Power Law phenomena so common on the Web • Can exploit data parallelism to achieve the same kind of scale as horizontal search
Lessons Learned • Vertical search is not just about search • It’s about findability • Includes navigation, recommendations, clustering, faceted classification, etc. • It’s about satisfying a set of well-understood tasks • Usually on enhanced content • Usually for expert customers • Leveraging human value addition is key • None of the human actors set out to improve search • Difficult to design complete solution upfront • Need platform for experimentation and validation at scale
Questions? • A relevant paper is downloadable from http://labs.thomsonreuters.com