Dynamic Faceted Search for Discovery-driven Analysis

Dynamic Faceted Search for Discovery-driven Analysis Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman CIKM’08 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling Date: 2008/12/18

Outline • Introduction • Terminology and Problem Statement • Measure of “Interestingness” • Implementing Dynamic Faceted Search • Evaluation • Conclusion and Future work

Introduction • Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration • To preserve browsing consistency, facets selected for navigation tend to be “static” • When browsing online catalogs, the navigational facets are single-dimensional only

Introduction • Propose a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems • From a potentially large search result, the paper aims to automatically and dynamically discover a small set of facets and values that are deemed most “interesting” to a user

Terminology and Problem Statement • Defn 1. • A repository D is a collection of documents • Each document is composed of some free text and one or more <facet: value> pairs • Given a value f in facet F, we call <F: f> an instance of F • All unique values associated with a facet F form the domain of F

Terminology and Problem Statement • Defn 2. • Organize the domain of these facets into a facet hierarchy • Each node in the hierarchy stores a <facet: value> pair • A node <F1: f1> is the parent of another node <F2: f2> if for each document, F2 = f2 implies F1 = f1
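The parent condition in Defn 2 can be checked directly against the documents. The following is a minimal Python sketch, assuming each document is a dict that maps a facet name to a single value; the sample documents and the helper name `is_parent` are hypothetical and not taken from the paper.

```python
def is_parent(docs, parent, child):
    """Return True if, in every document, child=(F2, f2) implies parent=(F1, f1)."""
    (parent_facet, parent_value), (child_facet, child_value) = parent, child
    return all(
        d.get(parent_facet) == parent_value
        for d in docs
        if d.get(child_facet) == child_value
    )

# Hypothetical documents with a location hierarchy (country above city).
docs = [
    {"country": "USA", "city": "San Jose"},
    {"country": "USA", "city": "Seattle"},
    {"country": "Canada", "city": "Vancouver"},
]

print(is_parent(docs, ("country", "USA"), ("city", "Seattle")))     # True
print(is_parent(docs, ("country", "Canada"), ("city", "Seattle")))  # False
```

Note that the check is vacuously true when no document carries the child value, so a real system would likely also require the child instance to occur at least once.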

Terminology and Problem Statement • Defn 3. • Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2…” • The result of q is denoted by Dq • Includes the set of documents having the specified keywords • Satisfying all constraints on selected facets

Terminology and Problem Statement • Defn 4. • Given a query q, define a facet summary for a facet set F1, …, Fm as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq • fi is an instance of facet Fi • A(f1, …, fm) is an aggregate of documents in Dq that contain all these facet instances
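As a concrete illustration of Defns 3 and 4, the sketch below computes a facet summary with count as the aggregate A. It is a toy under stated assumptions: documents are Python dicts with a `text` field and a `facets` dict, keyword matching is a plain substring test, and the names `run_query` and `facet_summary` are hypothetical.

```python
from collections import Counter

def run_query(docs, keyword, constraints):
    """Dq: documents that contain the keyword and satisfy every facet constraint."""
    return [
        d for d in docs
        if keyword.lower() in d["text"].lower()
        and all(d["facets"].get(f) == v for f, v in constraints.items())
    ]

def facet_summary(dq, facet_set):
    """Tuples (f1, ..., fm, count) over Dq for the facets named in facet_set."""
    counts = Counter(
        tuple(d["facets"][f] for f in facet_set)
        for d in dq
        if all(f in d["facets"] for f in facet_set)
    )
    return [key + (n,) for key, n in counts.most_common()]

# Hypothetical repository.
docs = [
    {"text": "Distributed query processing", "facets": {"venue": "VLDB", "decade": "1990s"}},
    {"text": "Distributed transactions", "facets": {"venue": "VLDB", "decade": "1990s"}},
    {"text": "Distributed mining", "facets": {"venue": "SIGMOD", "decade": "2000s"}},
    {"text": "Index structures", "facets": {"venue": "VLDB", "decade": "1990s"}},
]

dq = run_query(docs, "distributed", {})
print(facet_summary(dq, ["venue", "decade"]))
# [('VLDB', '1990s', 2), ('SIGMOD', '2000s', 1)]
```

Passing a single facet in `facet_set` gives the usual one-dimensional facet counts; passing several gives the multi-dimensional summaries that the problem statement asks to rank.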

Terminology and Problem Statement • Problem Definition: • Given a repository of documents with n facets, a query q, and two integers K1 and K2, select K1 facet sets and a facet summary for each with up to K2 tuples that are the most “interesting” to a user

Measure of “Interestingness” • Interestingness: How surprising an actual aggregated value is, given a certain expectation

Measure of “Interestingness”*Setting the Expectation • For a given set of facet values f1, …, fm from F1, …, Fm: • CD(f1, …, fm): the count of the number of documents with all those facet values in D • Cq(f1, …, fm): the count of the number of documents with all those facet values in Dq • E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm) • Three ways to set the expectation: natural, navigational, ad hoc

Measure of “Interestingness”*Setting the Expectation • Natural: • For an individual facet instance <F: f>: (uniformity assumption) • For an instance f1, …, fm of a facet set: (independence assumption)
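Under the uniformity assumption, a natural reading is that a facet value keeps the same share of documents in Dq as it has in D. Under the independence assumption, the joint share of several facet values is the product of their individual shares. Written in the notation above, with |D| and |Dq| denoting the sizes of the repository and of the query result (symbols introduced here for illustration), a sketch of that reading is given below. It is an assumption, not a formula taken from the slides.

```latex
% Uniformity: the share of f in D_q is expected to equal its share in D
E[C_q(f)] = C_D(f) \cdot \frac{|D_q|}{|D|}

% Independence: the joint share is the product of the individual shares in D_q
E[C_q(f_1, \ldots, f_m)] = |D_q| \cdot \prod_{i=1}^{m} \frac{C_q(f_i)}{|D_q|}
```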

Measure of “Interestingness”*Setting the Expectation • Navigational: • Ad hoc: • User can tell the system to set expectation based on an arbitrary query q of the user’s choice • Set the count for each facet value proportionally based on the distribution of the result of q

Measure of “Interestingness”*Measuring Degree of Interestingness • Single facet instance: • By evaluating it with respect to a scenario in which its associated count is generated by random sampling • The smaller the probability of observing the count under random sampling, the more interesting the facet instance

Measure of “Interestingness”*Measuring Degree of Interestingness • p-value: • Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query • Also suppose • The interestingness of that facet value vis-à-vis the query: the probability that in a random sample of size Q there will be at least q documents with that facet value • This probability follows a hypergeometric distribution, which can be approximated by a normal distribution or a Poisson distribution
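The tail probability described on this slide can be computed exactly for small inputs. The sketch below sums the hypergeometric probabilities for "at least q" using only the Python standard library; the function name `hypergeom_p_value` is hypothetical, and a production system would presumably use the normal or Poisson approximation mentioned on the slide when R and Q are large.

```python
from math import comb

def hypergeom_p_value(r, R, q, Q):
    """P[X >= q], where X is the number of documents carrying the facet value
    in a random sample of Q documents drawn without replacement from R
    documents, r of which carry the facet value."""
    total = comb(R, Q)
    upper = min(r, Q)
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total

# A facet value found in 100 of 1000 documents overall and in 30 of the
# 100 documents returned by a query: a small p-value, hence "interesting".
p_surprising = hypergeom_p_value(r=100, R=1000, q=30, Q=100)

# The same facet value found in only 10 of the 100 returned documents,
# which is what random sampling would produce on average.
p_expected = hypergeom_p_value(r=100, R=1000, q=10, Q=100)

print(p_surprising < p_expected)  # True
```

Smaller values mean the observed count is harder to explain by random sampling, which matches the ranking rule on the previous slide.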

Measure of “Interestingness”*Measuring Degree of Interestingness • The whole facet: • For each facet F, we consider the p-values of only the k most interesting values in F • , replace  • The final measure: • MaxWeight: assign 1 to w1 and 0 to the rest • AvgWeight: assign each wi an equal weight • HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
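The three weighting schemes can be sketched as follows. This assumes, as an illustration only, that the final measure is a weighted sum of the k smallest p-values of a facet, so that a lower score means a more interesting facet. It also assumes that a facet with fewer than k values is scored on the values it has. The function names are hypothetical.

```python
def top_k_p_values(p_values, k):
    """The k smallest p-values, i.e. the k most interesting facet values."""
    return sorted(p_values)[:k]

def max_weight(p_values, k):
    """w1 = 1 and all other weights 0: only the most interesting value counts."""
    return top_k_p_values(p_values, k)[0]

def avg_weight(p_values, k):
    """Equal weights: the mean of the top-k p-values."""
    top = top_k_p_values(p_values, k)
    return sum(top) / len(top)

def hybrid_weight(p_values, k):
    """Average of the MaxWeight and AvgWeight scores."""
    return (max_weight(p_values, k) + avg_weight(p_values, k)) / 2

# Hypothetical p-values for the values of one facet.
p_values = [0.5, 0.001, 0.9, 0.2]
scores = {
    "MaxWeight": max_weight(p_values, k=2),
    "AvgWeight": avg_weight(p_values, k=2),
    "HybridWeight": hybrid_weight(p_values, k=2),
}
print(scores)
```

MaxWeight rewards a facet with one very surprising value, AvgWeight rewards a facet whose top values are all fairly surprising, and HybridWeight sits between the two.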

Implementing Dynamic Faceted Search • Solr: indexes facets without storing them • Enumerates every facet instance <F: f> from the index and intersects its posting list with Dq • From the intersected set, it derives the count on facet value f • Caches each posting list to a bitset • If the bitset is dense: bitmap • Otherwise: a hash map of document IDs
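The counting strategy on this slide can be illustrated with a small Python sketch. It is not Solr's code or API: the density threshold, the use of a Python int as the bitmap, a `frozenset` standing in for the hash-based representation, and all function names are assumptions made for illustration.

```python
def make_docset(doc_ids, num_docs, dense_threshold=0.25):
    """Cache a posting list as a bitmap when dense, else as a hash set of IDs."""
    if len(doc_ids) / num_docs >= dense_threshold:
        bitmap = 0
        for d in doc_ids:
            bitmap |= 1 << d
        return ("bitmap", bitmap)
    return ("hash", frozenset(doc_ids))

def intersection_count(docset, dq_ids, dq_bitmap):
    """Count the documents shared by a cached posting list and Dq."""
    kind, data = docset
    if kind == "bitmap":
        return bin(data & dq_bitmap).count("1")
    return len(data & dq_ids)

def facet_counts(postings, dq_ids, num_docs):
    """Intersect Dq with the posting list of every facet instance."""
    dq_bitmap = 0
    for d in dq_ids:
        dq_bitmap |= 1 << d
    cache = {value: make_docset(ids, num_docs) for value, ids in postings.items()}
    return {value: intersection_count(ds, dq_ids, dq_bitmap) for value, ds in cache.items()}

# Hypothetical posting lists for a "venue" facet over 10 documents.
postings = {"VLDB": [0, 1, 2, 3, 4, 5], "TODS": [7]}
dq_ids = {1, 2, 7, 9}
counts = facet_counts(postings, dq_ids, num_docs=10)
print(counts)  # {'VLDB': 2, 'TODS': 1}
```

The sketch makes both costs visible: a representation must be chosen per posting list, and every facet instance costs one intersection with Dq.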

Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 1: it has to choose a threshold that decides the representation of the bitset • Fix: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code

Implementing Dynamic Faceted Search • WAH • There are 2 types of words: • Literal words: a verbatim representation of 31 bits • Fill words: encode, in 30 bits, the length of a run of all 0’s or all 1’s • A bitmap is broken into groups of 31 bits first and then converted into a sequence of literal and fill words • Operations on bitmaps such as intersection can be performed on WAH code directly without decoding
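The word layout above can be made concrete with a small encoder and decoder. The sketch assumes 32-bit words in which the top bit marks a fill word and the next bit holds the fill value, leaving 30 bits for the run length counted in 31-bit groups. The slide states only the 31-bit and 30-bit sizes, so the exact bit positions are an assumption. Intersection directly on the compressed words is not shown.

```python
GROUP = 31                        # payload bits per group
LITERAL_MASK = (1 << GROUP) - 1   # a group of 31 one-bits
FILL_FLAG = 1 << 31               # top bit set: fill word
FILL_BIT = 1 << 30                # fill value (all 0's or all 1's)
MAX_RUN = (1 << 30) - 1           # 30-bit run length, counted in groups

def wah_encode(bits):
    """Encode a list of 0/1 values into WAH words (Python ints below 2**32)."""
    padded = bits + [0] * (-len(bits) % GROUP)
    words = []
    for i in range(0, len(padded), GROUP):
        group = 0
        for b in padded[i:i + GROUP]:
            group = (group << 1) | b
        if group in (0, LITERAL_MASK):
            fill_bit = FILL_BIT if group else 0
            last = words[-1] if words else 0
            same_fill = (last & FILL_FLAG) and (last & FILL_BIT) == fill_bit
            if same_fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1          # extend the current run by one group
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(group)               # literal word: top bit is 0
    return words

def wah_decode(words, n_bits):
    """Expand WAH words back into the first n_bits bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> s) & 1 for s in range(GROUP - 1, -1, -1))
    return bits[:n_bits]

# 62 zeros, one mixed group, then 62 ones: five groups become three words.
bits = [0] * 62 + [1, 0, 1] + [0] * 28 + [1] * 62
words = wah_encode(bits)
print(len(bits), len(words))  # 155 3
```

Long runs of identical bits, which are common in sparse posting lists, collapse into a single fill word, so no density threshold has to be chosen up front.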

Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance • Fix: reduce the number of intersections by building a directory structure called a bitset tree on top of the bitsets of a facet

Implementing Dynamic Faceted Search • Building and Using a Bitset Tree • Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null> • Then divide all entries into groups of size s • For each group, we generate a leaf node holding all entries in that group
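The leaf-level construction above can be sketched in Python. The slide describes only the leaf entries and their grouping. The upper levels used here (one parent entry per node, holding the union of that node's bitsets) and the pruning search are assumptions about how such a directory would be used, not the paper's stated algorithm. Bitsets are plain Python ints, a node is a list of entries, and each entry carries a facet-value label in addition to the <b, child> pair.

```python
def union(entries):
    """Bitwise OR of the bitsets of all entries in a node."""
    result = 0
    for bitset, _child, _value in entries:
        result |= bitset
    return result

def build_bitset_tree(facet_bitsets, s):
    """Build the tree bottom-up; returns the root node (a list of entries)."""
    if s < 2:
        raise ValueError("group size s must be at least 2")
    # Leaf entries <b, null>, with the facet value kept for reporting.
    entries = [(b, None, value) for value, b in facet_bitsets.items()]
    while len(entries) > s:
        groups = [entries[i:i + s] for i in range(0, len(entries), s)]
        # Assumed upper level: one entry per node, pointing at that node.
        entries = [(union(g), g, None) for g in groups]
    return entries

def facet_counts(node, dq):
    """Count |b AND Dq| per facet value, skipping subtrees that miss Dq."""
    counts = {}
    for bitset, child, value in node:
        hit = bitset & dq
        if hit == 0:
            continue                      # nothing below this entry intersects Dq
        if child is None:
            counts[value] = bin(hit).count("1")
        else:
            counts.update(facet_counts(child, dq))
    return counts

# Hypothetical bitsets for five author values over five documents.
facet_bitsets = {"a": 0b00011, "b": 0b00100, "c": 0b01000, "d": 0b10000, "e": 0b00001}
root = build_bitset_tree(facet_bitsets, s=2)
result = facet_counts(root, dq=0b00101)
print(result)  # {'a': 1, 'b': 1, 'e': 1}
```

In this example the node holding `c` and `d` is skipped after a single intersection with its union bitset, which is the saving the directory is meant to provide.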

Evaluation*Setup • DBLP • Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years • It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, number of citations per paper • Uses the title of each paper as text for keyword searches • Used for the user survey

Evaluation*Setup • Patent • Has about 1.8 million U.S. patents from the past 30 years • 16 facets organized into 10 hierarchies • Used for performance evaluation

Evaluation*Result from a User Survey • Performed tests on 3 keyword queries • 2 are provided by the authors: “distributed”, “mining” • Users pick the 3rd keyword • 1 based on natural • 2 based on navigational • 1 used the complete repository • 1 used the previous query

Evaluation*Result from a User Survey

Evaluation*Result from a User Survey • Our dynamic approach also received some negative feedback • Overall, the feedback for the natural expectation is neutral • Different ways of aggregating the degree of interestingness • HybridWeight(7) > MaxWeight(6) > AvgWeight(2)

Evaluation*Performance Results • Environment: • Implemented in Java • 3GHz P4 desktop machine with 1GB memory • A single disk drive, running Linux • Version: • simple: inverted index • Solr • compressed: improves Solr by WAH code • tree: improves Solr by bitset trees • compressed-tree: both WAH and bitset tree on Solr

Evaluation*Performance Results • Scaling with Data Size • Run a query that matches 25,000 docs using tree • Break the total time into search time & summary computation time

Evaluation*Performance Results

Conclusion and Future Work • Develop a novel dynamic faceted search system • support OLAP-style discovery-driven analysis • on a large set of structured and unstructured data • Propose an intuitive and effective way of measuring “interestingness” • Propose a novel navigational method of setting a user’s expectation

Conclusion and Future Work • Incorporate user feedback in facet selection • How to extend the aggregates to functions other than count • Sum, average on some numerical measures • How to support dynamic faceted search in a distributed environment
