This research paper explores the problem of efficiently ranking and covering overlapping sources when answering top-k queries. The authors propose a greedy solution to the cover problem and an approximate solution to estimate the ranking. The experimental results are promising, and future work is discussed.
Answering TopK Queries in the Presence of Overlapping Sources IIWeb 2007 Louiqa Raschid, Yao Wu University of Maryland Maria Esther Vidal Universidad Simon Bolivar Jens Bleiholder, Felix Naumann Hasso Plattner Institut, Potsdam Supported in part by the German Research Society (DFG grant no. NA 432) and NSF Grants IIS02228479 and IIS0430915.
Motivation • Biological Web • Other applications • Problem Definition • Cover(TopK) • Rank to produce TopK • Solutions • Greedy solution to Cover(TopK). • Approximate solution EstimateRank for TopK • Experimental results • Future work IIWeb 2007, LR
Paths through the biological Web • Life science sources, e.g., NCBI/NIH. • The BioFast Project has created a data warehouse of nodes and edges. • Data is updated continuously and sources also support additional features, e.g., annotations, so the current objects must be downloaded. • Query: “Return all publications in PubMed that are linked to an OMIM entry that is related to the disease cancer”. [Figure: sources OMIM (Genes), Protein, Nucleotide (Sequences), PubMed (Citations).]
Choosing paths • Multiple sources or source paths. • Benefit: result cardinality, reduced by overlap. • Determine quality or rank to identify TopK. • Cost: number of calls to sources or paths; number of downloaded objects. [Figure: source paths P1 to P5 over the sources Om, Pr, Nu, and Pu, overlapping in the publication space.]
Document / Data Space • Take home message • Interesting challenge. • Applicable to many II domains. • Several optimization problems. IIWeb 2007, LR
Challenges • Set of target objects = TO. • Query Evaluation costs: • Evaluating a source path may require visiting multiple sources. • Intermediate objects and links may have to be downloaded. • Objects in TO have to be downloaded. • Overlap of paths: An object in TO may be reached by different source paths. • Compute rank to determine TopK: • Suppose that the objects and links between them are locally stored. • Computing the exact ranking for TO may require visiting all paths and could be expensive. • Importance/authority flow based metrics such as PageRank, PathCount, … • Other quality metrics. IIWeb 2007, LR
Two related problems • Problem Cover(TopK): Find a minimal set of sources or source paths (minimal can mean least number or least cost) so that the TopK objects in TO, denoted TOk, are reached. • Cover(TopK) assumes that TOk is known a priori. • Problem Rank: Determine the exact ranking for some ranking metric. • Problem EstimateRank: Estimate the ranking so that there is a high correlation between exact and approximate ranking.
Motivation • Biological Web • Other applications • Problem Definition • Cover(TopK) • Rank • Solutions • Greedy solution to Cover(TopK). • Approximate solution EstimateRank • Experimental results for Cover(TopK). • Future work IIWeb 2007, LR
Cover(TopK) IP formulation • Given a collection of sources or source paths S = {s1, s2, …, sm} with cost c(si). • Given a world of target objects Z = {z1, z2, …, zn}. • TOk: set of TopK target objects. • xi = 1 iff source or source path si is in the solution. • yj = 1 iff object zj is in the solution. • tj = 1 iff object zj is in TOk. [Two formulations of the form “Minimize … subject to … for all j” were shown as images; the second adds a constraint “for all i”. The expressions themselves are not preserved in this transcript.]
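For orientation only, one way to write a weighted cover problem with the variables defined on this slide is the standard set-cover integer program below. This is an assumption based on the textbook formulation, not a recovery of the slide's own expressions, which are missing from the transcript.

```latex
\begin{align*}
\text{Minimize}\quad   & \sum_{i=1}^{m} c(s_i)\, x_i \\
\text{subject to}\quad & \sum_{i \,:\, z_j \in s_i} x_i \;\ge\; y_j && \text{for all } j \\
                       & y_j \;\ge\; t_j                            && \text{for all } j \\
                       & x_i,\, y_j \in \{0, 1\}
\end{align*}
```

Under this sketch, the second constraint forces every TopK object into the solution, and the first forces at least one selected source or source path to contain each such object.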
GreedyCover(TopK) • Rank each source or source path si by the ratio Ri = Counti / Costi in descending order. • Counti is the number of objects in si that occur in TOk. • Costi is the cost to access a source or source path, or a computation cost. • Pick the source or source path st with the largest ratio Ri. • Adjust the ratios of the remaining paths and re-rank. • Continue choosing sources or paths until TOk is covered.
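The greedy steps can be sketched as follows. This is a minimal illustration, assuming the ratio is the number of still-uncovered TopK objects in a path divided by its cost (the slide's ratio formula did not survive in the transcript); the function name `greedy_cover` and the toy paths are hypothetical.

```python
def greedy_cover(paths, costs, top_k):
    """Pick source paths until every TopK object is covered.

    paths: dict mapping path name -> set of objects reached by that path
    costs: dict mapping path name -> positive access cost
    top_k: set of TopK target objects (assumed known a priori)
    """
    uncovered = set(top_k)
    remaining = dict(paths)
    chosen = []
    while uncovered:
        # Re-rank the remaining paths: the ratio only counts TopK objects
        # that no previously chosen path has covered (assumed adjustment).
        best, best_ratio = None, 0.0
        for name, objects in remaining.items():
            ratio = len(objects & uncovered) / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("TopK objects cannot be covered by the given paths")
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen


# Toy data, invented for illustration only.
paths = {
    "P1": {"a", "b", "c"},
    "P2": {"c", "d"},
    "P3": {"d", "e"},
    "P4": {"a", "e"},
}
costs = {"P1": 1.0, "P2": 1.0, "P3": 1.0, "P4": 1.0}
top_k = {"a", "b", "d", "e"}
selection = greedy_cover(paths, costs, top_k)
```

With unit costs the sketch reduces to the classic greedy set-cover heuristic; ties are broken by insertion order of `paths`.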
Experimental Study Dataset: • Biological database gathered from NCBI/NIH • 26 million objects. • 19.4 million links. • Typical query may have 5 to 14 sources or source paths. • Ten queries • Target object set ranges from 10,000 to 30,000 objects. Experiments: • SunBlade 1000 1GB RAM • System implemented in Java 1.4.2 Ranking Metric for EstimateRank: • Path Count IIWeb 2007, LR
Greedy versus Optimal for Cover(TopK) Find all PubMed entries that are reached from NCBI protein entries that contain the keyword "aging". Suppose the cost = the cost of accessing sources. The greedy solution typically picks more paths than the optimal solution and is inefficient. IIWeb 2007, LR
Greedy versus Optimal for Cover(TopK) Query: Find all PubMed objects that are reached from NCBI protein objects that contain the keyword "aging". Suppose the cost = the query execution time for a path query in the data warehouse. Greedy is sometimes comparable to optimal. (The warehouse design minimizes execution costs.) IIWeb 2007, LR
Greedy versus Optimal for Cover(TopK) Greedy solution covers almost the same total number of objects as the optimal solution if we ignore overlap. IIWeb 2007, LR
Greedy versus Optimal for Cover(TopK) Target objects are stored in remote sources and are updated. Must download objects for *real* query processing. Greedy often has an overhead; it downloads many overlap objects. • Greedy can perform well. • It may contact many sources. • It may download many objects. IIWeb 2007, LR
Motivation • Biological Web • Other applications • Problem Definition • Cover(TopK) • Rank to produce TopK • Solutions • Greedy solution to Cover(TopK). • Approximate solution EstimateRank • Experimental results • Metric independent sampling • Metric dependent sampling (Path Count) • Future work IIWeb 2007, LR
EstimateRank • Consider the biological Web. • Consider a navigational query. • Consider a (result) graph RG, a metric M, and a benefit score bj for object zj . • Next consider a sampled graph RG’ which is a subset of nodes and edges of RG. • Efficiently estimate the benefit score est_ben(zj, RG’, M) for object zj in TO so that the relative error of estimating the benefit meets some confidence level α, i.e., high confidence. Sampling termination conditions are determined by overall characteristics of the population, e.g., mean and variance. When the characteristics of the population are unavailable, different methods are used to estimate these statistics. IIWeb 2007, LR
Metric independent sampling • The only assumption made is that the value of the quality score is obtained from a distribution. • Does not assume any additional knowledge of the properties of the metric used to determine a quality score. • TO’ is a sample of TO such that each zj’ is randomly chosen from TO without replacement. • To ensure that the estimation of the quality score / rank meets the desired convergence bounds, the sample size m is defined using the Chernoff bound. IIWeb 2007, LR
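The slide does not give the bound it uses, so the following is only a sketch under an assumption: quality scores are normalized to [0, 1] and the sample size comes from the Hoeffding form of the Chernoff bound, m ≥ ln(2/δ) / (2ε²), where ε is the absolute error and 1 − δ the confidence. The function name `sample_size` is hypothetical.

```python
import math


def sample_size(error, confidence):
    """Sample size from a Hoeffding-style bound (assumed form).

    Guarantees |sample mean - true mean| <= error with probability
    at least `confidence`, for scores bounded in [0, 1].
    """
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * error ** 2))


m = sample_size(error=0.1, confidence=0.95)
```

The bound depends only on the error and confidence, not on the size of TO, which is why a sample much smaller than the query answer can suffice when the assumptions hold.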
Metric dependent sampling • Can consider any metric, e.g., PageRank or ObjectRank. • We illustrate using a simpler metric, Path Count. • Path count for target object o, PC(o), is the number of paths that reach o in the result graph. [Figure: four-layer result graph (Layer 1 to Layer 4) with nodes d, e, and q; PC(d) = 6, PC(e) = 2, PC(q) = 3.]
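Exact Path Count on a layered result graph can be computed layer by layer from back links. The sketch below assumes each first-layer object starts exactly one path; the graph, node names, and the function `path_counts` are invented for illustration and do not reproduce the slide's figure.

```python
def path_counts(layers, back_links):
    """Exact Path Count for every node of a layered result graph.

    layers: list of lists of node names, ordered from first to last layer
    back_links: dict mapping node -> list of nodes in the previous layer
                that link to it
    """
    # Assumption: every first-layer object is the start of one path.
    pc = {node: 1 for node in layers[0]}
    for layer in layers[1:]:
        for node in layer:
            # Paths reaching a node = sum of paths reaching its back links.
            pc[node] = sum(pc[prev] for prev in back_links.get(node, ()))
    return pc


# Toy three-layer graph (hypothetical).
layers = [["a1", "a2"], ["b1", "b2"], ["c1"]]
back_links = {"b1": ["a1", "a2"], "b2": ["a2"], "c1": ["b1", "b2"]}
pc = path_counts(layers, back_links)
```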
Sampling to estimate Path Count • BLj,k-1 is the back link set at level k-1 for node j at level k. • Path Count iterates over the elements in the back link set. • Objects in BLj,k-1 are sampled without replacement. • The estimated Path Count of object j = (s/m) * Card(BLj,k-1), where s is the sum of the Path Count values of the sampled objects taken from BLj,k-1, and m is the size of the sample. • Different sampling techniques are used to sample BLj,k-1 and to determine when to stop sampling.
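The estimator (s/m) * Card(BLj,k-1) can be sketched with a fixed sample size. The stopping rule is deliberately left out here, since the slide delegates it to the sampling technique; the function name `estimate_path_count` and the input values are hypothetical.

```python
import random


def estimate_path_count(back_link_pcs, sample_size, rng=None):
    """Estimate the Path Count of a node from a sample of its back link set.

    back_link_pcs: list of Path Count values, one per object in the
                   back link set of the node
    sample_size:   number of back link objects to sample (m)
    """
    rng = rng or random.Random()
    card = len(back_link_pcs)
    if card == 0:
        return 0.0
    m = min(sample_size, card)
    # random.Random.sample draws without replacement.
    sample = rng.sample(back_link_pcs, m)
    s = sum(sample)
    return (s / m) * card


# Sampling the whole back link set reproduces the exact Path Count.
full = estimate_path_count([2, 1, 3, 4], sample_size=4, rng=random.Random(0))
```

When all back link objects carry the same Path Count, any sample size gives the exact answer; the estimate degrades as the values become more skewed.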
Sampling Techniques • Adaptive Sampling [Lipton, Naughton, Schneider 1990]: • Mean and Variance are approximated using some upper bounds. • Product of the maximal indegree of the links used in the result graph is considered as upper bound. • Double Sampling [Hou, Ozsoyoglu, Dogdu 1991]: • In a first stage of the sampling, a small portion of the data is sampled to estimate mean and variance. • Sequential Sampling [Haas, Swami 1992]: • Mean and variance are recomputed during each iteration of the sampling. IIWeb 2007, LR
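To make the idea of sequential sampling concrete, here is a rough sketch in which the mean and variance are recomputed after every draw. The stopping rule (a normal-approximation confidence interval whose half-width falls below a relative error) is an assumption chosen for illustration and is not taken from the cited work; `sequential_estimate` and its parameters are hypothetical names.

```python
import math
import random


def sequential_estimate(values, rel_error=0.01, z=2.576, min_samples=2, rng=None):
    """Estimate sum(values) by sampling without replacement.

    Stops once z * sqrt(var / n) <= rel_error * |mean| (assumed rule);
    z = 2.576 corresponds to a 99% two-sided normal interval.
    Returns (estimated total, number of objects sampled).
    """
    rng = rng or random.Random()
    order = list(values)
    rng.shuffle(order)  # a random order gives sampling without replacement
    n, mean, m2 = 0, 0.0, 0.0
    for v in order:
        # Welford's online update of mean and sum of squared deviations.
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n >= max(2, min_samples):
            var = m2 / (n - 1)
            if z * math.sqrt(var / n) <= rel_error * abs(mean):
                break
    return mean * len(order), n


total, used = sequential_estimate([5] * 100, rng=random.Random(0))
```

Double sampling would differ only in when the statistics are computed: once, from a small first-stage sample, rather than after every draw.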
Experimental Study Sampling Parameters • b: upper bound of the Path Count score = 0.25* • Confidence level α = 0.99 • Error = 0.01 • Sample Percentage in the First Phase of Double Sampling: 2% of the population • Report on the correlation of the estimated Path Count versus exact Path Count.
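The slides do not say which correlation coefficient is reported. Assuming Pearson's coefficient, the comparison of estimated versus exact Path Count could be computed as in this sketch; the score lists are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / (norm_x * norm_y)


# Hypothetical exact and estimated Path Count scores.
exact = [6, 2, 3, 10]
estimated = [5.5, 2.5, 3.0, 9.0]
r = pearson(exact, estimated)
```

If the interest is in the ranking rather than the scores themselves, a rank correlation such as Spearman's would be the natural alternative.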
Metric Independent Solution – good for quality scores • High correlation if sample size is close to actual size of the query answer. • From low to moderate degree of correlation if sample size is less than 50% of the actual size of the query answer. • Performance is dependent on sufficient sampling.
Adaptive Sampling PC Predictive capacity • High correlation between estimated PC scores and actual PC scores. • Predictive capacity does not seem to be severely affected by uniformity of actual PC scores. • Marked degree of correlation (0.80) if PC scores are not uniformly distributed. Average correlation: 0.91
Double Sampling PC Predictive capacity • The predictive capacity may be affected by uniformity of actual PC scores. • High correlation if actual PC scores are uniformly distributed. • Moderate degree of correlation (0.46) if PC scores are not uniformly distributed. Average correlation: 0.81
Sequential Sampling PC Predictive capacity • High correlation between estimated PC scores and actual PC scores in all the queries. • Predictive capacity does not seem to be affected by uniformity of actual PC scores. • Best overall performance. Average correlation: 0.95 Estimate Rank • Metric independent and metric dependent estimation. • Possible to estimate with high confidence (e.g., sequential sampling).
Sampling Time versus PC Evaluation Time • PC evaluation time is larger than the time of almost all the sampling strategies.
Future Work • Consider more complex queries. • Extend to PageRank, ObjectRank, and other quality metrics. • Extend cost model to consider actual query execution times on remote sources. • Extensive experiments for sampling. • Extend to other application domains, e.g., Intranets, bibliographic data collections, scientific datasets, etc.