Dynamic Faceted Search for Discovery-driven Analysis. Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman CIKM’08 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling Date: 2008/12/18.
Dynamic Faceted Search for Discovery-driven Analysis Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman CIKM’08 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling Date: 2008/12/18
Outline • Introduction • Terminology and Problem Statement • Measure of “Interestingness” • Implementing Dynamic Faceted Search • Evaluation • Conclusion and Future work
Introduction • Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration • To preserve browsing consistency, facets selected for navigation tend to be “static” • When browsing online catalogs, the navigational facets are single-dimensional only
Introduction • The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems • Goal: from a potentially large search result, automatically and dynamically discover a small set of facets and values that are deemed most “interesting” to a user
Terminology and Problem Statement • Defn 1. • A repository D is a collection of documents • Each document is composed of some free text and one or more <facet: value> pairs • Given a value f in facet F, we call <F:f> an instance of F • All unique values associated with a facet F form the domain of F
Terminology and Problem Statement • Defn 2. • Organize the domain of these facets into a facet hierarchy • Each node in the hierarchy stores a <facet: value> pair • A node <F1: f1> is the parent of another node <F2: f2> if for each document, F2 = f2 implies F1 = f1
Terminology and Problem Statement • Defn 3. • Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2…” • The result of q is denoted by Dq • Dq is the set of documents that contain the specified keywords and satisfy all constraints on the selected facets
Terminology and Problem Statement • Defn 4. • Given a query q, define a facet summary for a facet set F1, …, Fm as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq • fi is an instance of facet Fi • A(f1, …, fm) is an aggregate of documents in Dq that contain all these facet instances
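A facet summary with a count aggregate can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the document layout (a dict mapping facet names to lists of values) and the function name `facet_summary` are assumptions made for the example, and each value is assumed to appear at most once per document.

```python
from collections import Counter
from itertools import product

def facet_summary(docs, facets):
    """Return tuples <f1, ..., fm, count> over the result set `docs`.

    docs   : list of dicts, facet name -> list of values (assumed layout)
    facets : the facet set F1, ..., Fm to summarize
    """
    counts = Counter()
    for doc in docs:
        # A document may carry several values per facet, so every
        # combination of its values counts as one matching tuple.
        value_lists = [doc.get(f, []) for f in facets]
        for combo in product(*value_lists):
            counts[combo] += 1
    # Most frequent tuples first; the aggregate A is a plain count here.
    return [combo + (n,) for combo, n in counts.most_common()]

# Hypothetical result set Dq
dq = [
    {"venue": ["VLDB"], "year": [2006]},
    {"venue": ["VLDB"], "year": [2007]},
    {"venue": ["SIGMOD"], "year": [2007]},
    {"venue": ["VLDB"], "year": [2007]},
]
summary = facet_summary(dq, ["venue", "year"])
```

Documents that lack one of the requested facets contribute no tuple, which matches the requirement that a document contain all the facet instances.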
Terminology and Problem Statement • Problem Definition: • Given a repository of documents with n facets, a query q, 2 integers K1 & K2 • select K1 facet sets and a facet summary for each with up to K2 tuples that are the most “interesting” to a user
Measure of “Interestingness” • Interestingness: How surprising an actual aggregated value is, given a certain expectation
Measure of “Interestingness”*Setting the Expectation • For a given set of facet values f1, …, fm from F1, …, Fm: • CD(f1, …, fm): the number of documents with all those facet values in D • Cq(f1, …, fm): the number of documents with all those facet values in Dq • E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm) • Three ways to set the expectation: natural, navigational, ad hoc
Measure of “Interestingness”*Setting the Expectation • Natural: • For an individual facet instance <F:f>: (uniformity assumption) • For an instance f1, …, fm of a facet set: (independence assumption)
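One plausible form of the two natural expectations is sketched below. It assumes that the uniformity assumption scales the repository count by the fraction of documents the query returns, and that the independence assumption multiplies the per-facet fractions. The notation |D| and |Dq| for the sizes of the repository and of the query result is introduced here for illustration.

```latex
E[C_q(f)] = C_D(f) \cdot \frac{|D_q|}{|D|}
\qquad
E[C_q(f_1,\dots,f_m)] = |D_q| \cdot \prod_{i=1}^{m} \frac{E[C_q(f_i)]}{|D_q|}
```

Under this reading, a facet value that covers 10% of the repository is expected to cover 10% of any query result, and deviations from that share are what count as surprising.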
Measure of “Interestingness”*Setting the Expectation • Navigational: • Ad hoc: • User can tell the system to set expectation based on an arbitrary query q of the user’s choice • Set the count for each facet value proportionally based on the distribution of the result of q
Measure of “Interestingness”*Measuring Degree of Interestingness • Single facet instance: • By evaluating it with respect to a scenario in which its associated count is generated by random sampling • The smaller the probability of observing the count under random sampling, the more interesting the facet instance
Measure of “Interestingness”*Measuring Degree of Interestingness • p-value: • Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query • Also suppose • The interestingness of that facet value vis-à-vis the query: the probability that in a random sample of size Q there will be at least q documents with that facet value • This probability follows the hypergeometric distribution, which can be approximated by a normal or Poisson distribution
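The p-value described above is the upper tail of a hypergeometric distribution and can be computed exactly for small inputs. The sketch below uses only the standard library; the function name `hypergeom_p_value` is an assumption, and a real system would likely switch to the normal or Poisson approximation for large R and Q.

```python
from math import comb

def hypergeom_p_value(r, R, q, Q):
    """Probability that a random sample of Q documents (out of R) contains
    at least q documents with a facet value that occurs r times overall."""
    upper = min(r, Q)          # cannot draw more hits than exist or than sampled
    total = comb(R, Q)         # number of equally likely samples of size Q
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total

# Hypothetical numbers: a value seen in 10 of 1000 documents shows up
# in 5 of the 50 documents returned by a query.
p = hypergeom_p_value(r=10, R=1000, q=5, Q=50)
```

A smaller `p` means the observed count is less likely under random sampling, so the facet value is more interesting.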
Measure of “Interestingness”*Measuring Degree of Interestingness • The whole facet: • For each facet F, we consider the p-values of only the k most interesting values in F • , replace • The final measure: • MaxWeight: assign 1 to w1 and 0 to the rest • AvgWeight: assign each wi an equal weight • HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
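The three weighting schemes can be illustrated as follows. This is a sketch under stated assumptions, not the paper's formula: the per-value score is taken to be -log(p) so that smaller p-values score higher, the facet score is a weighted sum over the k best values, and p-values are clamped at a tiny floor to avoid log(0). The name `facet_score` is hypothetical.

```python
from math import log

def facet_score(p_values, k, scheme="hybrid"):
    """Aggregate the k most interesting values of one facet into one score."""
    # Assumption: interestingness of a single value is -log(p).
    # The clamp at 1e-300 is also an assumption, used only to avoid log(0).
    scores = sorted((-log(max(p, 1e-300)) for p in p_values), reverse=True)[:k]
    if not scores:
        return 0.0
    max_score = scores[0]                     # MaxWeight: w1 = 1, rest 0
    avg_score = sum(scores) / len(scores)     # AvgWeight: equal weights
    if scheme == "max":
        return max_score
    if scheme == "avg":
        return avg_score
    if scheme == "hybrid":                    # HybridWeight: mean of the two
        return (max_score + avg_score) / 2
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical p-values for the values of one facet
p_vals = [0.5, 0.01, 0.1]
```

MaxWeight rewards a facet with one very surprising value, AvgWeight rewards a facet whose top values are all moderately surprising, and HybridWeight sits between the two.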
Implementing Dynamic Faceted Search • Solr: indexes facets without storing them • Enumerates every facet instance <F: f> from the index and intersects its posting list with Dq • From the intersected set, it derives the count on facet value f • Caches each posting list to a bitset • If the bitset is dense: bitmap • Otherwise: a hash map of document IDs
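The counting strategy described above, one intersection per facet instance, can be sketched with plain Python sets standing in for the cached bitsets. This is an illustration of the idea only and does not use Solr's API; `facet_counts` and the sample data are assumptions.

```python
def facet_counts(posting_lists, dq):
    """Count, for every facet instance, how many documents of Dq carry it.

    posting_lists : dict mapping (facet, value) -> set of document IDs
    dq            : set of document IDs matching the query
    """
    counts = {}
    for instance, postings in posting_lists.items():
        # One intersection per facet instance: the cost the slides call out.
        n = len(postings & dq)
        if n:
            counts[instance] = n
    return counts

# Hypothetical index and query result
index = {
    ("venue", "VLDB"): {1, 2, 5, 8},
    ("venue", "SIGMOD"): {3, 4},
    ("year", 2007): {2, 3, 8},
}
result = facet_counts(index, dq={2, 3, 8, 9})
```

Because every instance is visited, the work grows with the number of distinct facet values, even when most of them do not occur in Dq.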
Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 1: has to choose a threshold that decides the representation of the bitset • represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code
Implementing Dynamic Faceted Search • WAH • There are 2 types of words: • Literal words: a verbatim representation of 31 bits • Fill words: encode, in 30 bits, the length of a run of all 0’s or all 1’s • A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words • Operations on bitmaps such as intersection can be performed on WAH code directly without decoding
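The encoding described above can be sketched as follows. The bit layout is an assumption based on the common 32-bit WAH format: the top bit distinguishes fill from literal words, the next bit of a fill word holds the fill value, and the low 30 bits count how many 31-bit groups the run covers. The function names are hypothetical, and intersection on the compressed form is not shown.

```python
GROUP = 31
LITERAL_MASK = (1 << GROUP) - 1   # a group consisting of 31 ones
FILL_FLAG = 1 << 31               # top bit set: this word is a fill word
FILL_BIT = 1 << 30                # fill value (run of 0s or run of 1s)
MAX_RUN = (1 << 30) - 1           # low 30 bits: run length in groups

def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))   # zero-pad the last group
        value = 0
        for b in chunk:
            value = (value << 1) | b
        if value in (0, LITERAL_MASK):
            fill_bit = FILL_BIT if value else 0
            last = words[-1] if words else None
            same_run = (last is not None and last & FILL_FLAG
                        and (last & FILL_BIT) == fill_bit
                        and (last & MAX_RUN) < MAX_RUN)
            if same_run:
                words[-1] = last + 1                 # extend the current run
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(value)                      # literal word, top bit 0
    return words

def wah_decode(words, length):
    """Expand WAH words back into the first `length` bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> s) & 1 for s in range(GROUP - 1, -1, -1))
    return bits[:length]

# 3 groups of zeros, one mixed group, 2 groups of ones
bitmap = [0] * 93 + [1, 0, 1] + [0] * 28 + [1] * 62
encoded = wah_encode(bitmap)
```

In this example, 186 bits collapse into three words, which is why the threshold between a dense and a sparse representation no longer has to be chosen.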
Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance • reduce the number of intersections by building a directory structure called bitset tree on top of the bitsets of a facet
Implementing Dynamic Faceted Search • Building and Using a Bitset Tree • Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null> • Then divide all entries into groups of size s • For each group, we generate a leaf node holding all entries in that group
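A bitset tree along these lines can be sketched with Python integers as bitsets. The slide describes only the leaf level, so the upper levels here are assumptions: each parent entry holds the union (bitwise OR) of the bitsets in its child node, and a subtree is skipped when that union does not intersect Dq. Entries also carry a label so that counts can be reported, which extends the <b, null> pair from the slide.

```python
def build_bitset_tree(instance_bitsets, s):
    """Build a tree of nodes, each node a list of (bitset, child, label).

    instance_bitsets : dict mapping a facet-instance label -> int bitset
    s                : group size (entries per node), must be at least 2
    """
    if s < 2:
        raise ValueError("group size s must be at least 2")
    # Leaf entries <b, null>, plus the label of the facet instance.
    entries = [(b, None, label) for label, b in instance_bitsets.items()]
    while True:
        nodes = [entries[i:i + s] for i in range(0, len(entries), s)]
        if len(nodes) <= 1:
            return nodes[0] if nodes else []
        entries = []
        for node in nodes:
            union = 0
            for b, _, _ in node:
                union |= b            # assumption: parent stores OR of children
            entries.append((union, node, None))

def count_with_tree(node, dq):
    """Count documents of `dq` per facet instance, pruning empty subtrees."""
    counts = {}
    for b, child, label in node:
        hit = b & dq
        if hit == 0:
            continue                  # nothing below this entry can match
        if child is None:
            counts[label] = bin(hit).count("1")
        else:
            counts.update(count_with_tree(child, dq))
    return counts

# Hypothetical facet with five values; bit i set means document i has the value
bitsets = {"a": 0b0000011, "b": 0b0000101, "c": 0b0001000,
           "d": 0b0110000, "e": 0b1000000}
tree = build_bitset_tree(bitsets, s=2)
```

With a selective query, most upper-level intersections come back empty, so far fewer than one intersection per facet instance is needed.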
Evaluation*Setup • DBLP • Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years • It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, number of citations per paper • The title of each paper is used as the text for keyword searches • Used for the user survey
Evaluation*Setup • Patent • Has about 1.8 million U.S. patents from the past 30 years • 16 facets organized into 10 hierarchies • Used for performance evaluation
Evaluation*Result from a User Survey • Performed tests on 3 keyword queries • 2 are provided by the authors: “distributed”, “mining” • Users pick the 3rd keyword • 1 based on natural • 2 based on navigational • 1 used the complete repository • 1 used the previous query
Evaluation*Result from a User Survey • Our dynamic approach also received some negative feedback • Overall, the feedback for the natural expectation is neutral • Different ways of aggregating the degree of interestingness • HybridWeight(7) > MaxWeight(6) > AvgWeight(2)
Evaluation*Performance Results • Environment: • Implemented in Java • 3GHz P4 desktop machine with 1GB memory • A single disk drive, running Linux • Version: • simple: inverted index • Solr • compressed: improves Solr by WAH code • tree: improves Solr by bitset trees • compressed-tree: both WAH and bitset tree on Solr
Evaluation*Performance Results • Scaling with Data Size • Run a query that matches 25,000 docs using tree • Break the total time into search time & summary computation time
Conclusion and Future Work • Develop a novel dynamic faceted search system • support OLAP-style discovery-driven analysis • on a large set of structured and unstructured data • Propose an intuitive and effective way of measuring “interestingness” • Propose a novel navigational method of setting a user’s expectation
Conclusion and Future Work • Incorporate user feedback in facet selection • How to extend the aggregates to functions other than count • Sum, average on some numerical measures • How to support dynamic faceted search in a distributed environment