Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection • Panagiotis G. Ipeirotis & Luis Gravano
Outline • Introduction • Background • Focused Probing for Content Summary Construction • Exploiting Topic Hierarchies for Database Selection • Experiments: Data and Metrics • Experimental Results • Conclusion and Future Work
Introduction • Search engines create their indexes by spidering or crawling Web pages • Hidden Web sources store their content in searchable databases that crawlers cannot easily reach
An Example… • Searching the medical database CANCERLIT (www.cancer.gov) with the query [lung and cancer] returns 68,430 matches. • Searching Google with the query [“lung and cancer” site:www.cancer.gov] returns only 23 matches: most of the CANCERLIT content is invisible to crawler-based search engines
Metasearchers • Tools for searching Hidden-Web data sources • Rely on statistical content summaries of each database • Perform three main tasks: • Database Selection • Query Translation • Result Merging
This Paper Presents • An algorithm to derive content summaries from “uncooperative” databases • A Database Selection Algorithm that exploits: • Extracted content summaries • Hierarchical classification of the databases
Background: Database Selection Algorithms • Assumption: query words are independently distributed over database documents. • The answer to a query is the set of all documents that satisfy its Boolean expression. • Deficiency: these algorithms require content summaries that are accurate and up to date, which uncooperative databases do not provide.
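For illustration, here is a minimal sketch of how a flat selection algorithm can exploit the word-independence assumption to estimate how many documents in a database match a conjunctive query, in the spirit of bGlOSS (one of the flat algorithms used later in the experiments). The function name and the toy summary are invented for this example.

```python
# Hypothetical illustration (bGlOSS-style): estimate how many documents in a
# database match a conjunctive query, assuming query words are independently
# distributed over the documents. The summary data below is invented.

def estimate_matches(query_words, doc_count, doc_freq):
    """Estimate the number of documents containing all query words."""
    estimate = float(doc_count)
    for w in query_words:
        # P(a random document contains w) is approximated by df(w) / |D|
        estimate *= doc_freq.get(w, 0) / doc_count
    return estimate

# Toy content summary for an imaginary 10,000-document database.
summary = {"doc_count": 10_000, "doc_freq": {"lung": 4_000, "cancer": 6_500}}
print(estimate_matches(["lung", "cancer"], summary["doc_count"], summary["doc_freq"]))
# -> 2600.0 estimated matches; a metasearcher ranks databases by such scores.
```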
Uniform Probing for Content Summary Construction • Extracts a document sample from a given database D and computes the document frequency of each observed word w in the sample, SampleDF(w)
The Algorithm: • Step 1: Start with an empty content summary, where SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary. • Step 2: Pick a word and send it as a query to database D. • Step 3: Retrieve the top-k documents returned. • Step 4: If the number of retrieved documents exceeds a pre-specified threshold, stop; otherwise continue the sampling process by returning to Step 2.
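A minimal sketch of this query-based sampling loop, assuming a hypothetical search(word, k) interface that returns up to k (doc_id, text) results from database D. The learned flag roughly distinguishes the two variants described next (probe words drawn from retrieved documents vs. from an external dictionary only); all names and defaults are illustrative.

```python
# Sketch of query-based sampling for content-summary construction.
import random
from collections import Counter

def sample_content_summary(search, dictionary, k=4, max_docs=300,
                           max_queries=500, learned=False):
    sample_df = Counter()          # SampleDF(w): sampled documents containing w
    seen_docs = set()
    candidates = list(dictionary)  # pool of probe words
    queries_sent = 0
    while len(seen_docs) < max_docs and queries_sent < max_queries:
        word = random.choice(candidates)       # Step 2: pick a probe word
        queries_sent += 1
        for doc_id, text in search(word, k):   # Step 3: top-k documents
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            words = set(text.lower().split())
            sample_df.update(words)            # each document counted once per word
            if learned:                        # learn new probe words from the sample
                candidates.extend(words)
    return sample_df                           # Step 4: stop at the threshold
```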
Two Versions of the Algorithm • RS-Ord: RandomSampling-OtherResource (probe words drawn from an external dictionary) • RS-Lrd: RandomSampling-LearnedResource (probe words drawn from the documents retrieved so far)
Deficiencies: • ActualDF(w) for each word w is not revealed by the sampling process • RS-Ord tends to produce inefficient executions in which it repeatedly issues queries that match no documents
Database Classification • Rationale: queries closely associated with a topical category retrieve mainly documents about that category • Place the database in a classification scheme according to the number of matches it returns for such topically focused queries
Automation and Hierarchical Classification • Automates classification using queries derived automatically from a rule-based document classifier. • A rule-based classifier is a set of logical rules defining classification decisions. • jordan AND bulls --> Sports, hepatitis --> Health • Apply this principle recursively to create a hierarchical classifier.
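A small sketch of turning classifier rules into query probes, assuming a hypothetical count_matches(query) interface that returns only the number of matches reported by the database (no documents are retrieved). The rules shown are the slide's examples.

```python
# Turn rule antecedents into conjunctive query probes and tally match counts.

rules = [
    (["jordan", "bulls"], "Sports"),   # jordan AND bulls --> Sports
    (["hepatitis"], "Health"),         # hepatitis --> Health
]

def probe_category_counts(count_matches, rules):
    """Send each rule antecedent as a query probe and sum the reported
    match counts per category."""
    counts = {}
    for words, category in rules:
        query = " AND ".join(words)
        counts[category] = counts.get(category, 0) + count_matches(query)
    return counts
```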
Focused Probing • Sends query probes and extracts the number of matches without retrieving any documents. • Calculates two metrics for each subcategory Ci: Coverage(Ci), the estimated number of documents in the database about Ci, and Specificity(Ci), the estimated fraction of the database's documents that are about Ci. • If Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into category Ci
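A minimal sketch of the coverage/specificity test, built on per-category match counts such as those produced by the probing sketch above. The coverage threshold Tc = 10 mirrors the experimental setup described later; the specificity threshold value and the simple normalization used here are illustrative assumptions.

```python
# Classify a database into the categories whose coverage and specificity
# both exceed the thresholds Tc and Ts.

def classify(category_counts, t_coverage=10, t_specificity=0.4):
    total = sum(category_counts.values()) or 1
    chosen = []
    for category, matches in category_counts.items():
        coverage = matches              # Coverage(Ci): estimated #documents about Ci
        specificity = matches / total   # Specificity(Ci): estimated fraction about Ci
        if coverage >= t_coverage and specificity >= t_specificity:
            chosen.append(category)
    return chosen
```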
Authors’ Algorithm • Exploits the topic hierarchy • Produces a document sample that: • Is topically representative of the database contents • Yields an accurate content summary efficiently
Content-Summary Construction • Steps of the algorithm: • Query the database using focused probing to: • Retrieve a document sample. • Generate a preliminary content summary • Categorize the database. • Estimate the absolute frequencies of the words retrieved from the database.
Building Content Summaries from Extracted Documents • ActualDF(w): • The actual number of documents in the database that contain word w. • The algorithm knows this number only if [w] is a single-word query probe that was issued to the database. • SampleDF(w): • The number of documents in the extracted sample that contain word w.
Estimating Absolute Document Frequencies • Use Mandelbrot’s law, f(r) = P (r + p)^(-B), which relates a word’s frequency f to its rank r, to estimate the unknown ActualDF(·) frequencies. • Sort the words in descending order of their SampleDF(·) frequencies to obtain their ranks ri. • Focus on the words with known ActualDF(·) frequencies. • Find the P, B, and p parameter values that best fit these data points. • Estimate ActualDF(wi) for every word wi with unknown ActualDF(wi) as P (ri + p)^(-B)
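A sketch of this fitting step, assuming SciPy is available: Mandelbrot's law is fitted (via scipy.optimize.curve_fit) to the words whose ActualDF is known from single-word probes, and then used to extrapolate ActualDF for the remaining words. The initial parameter guesses are arbitrary.

```python
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, p, B):
    # f(r) = P * (r + p) ** (-B)
    return P * (r + p) ** (-B)

def estimate_actual_df(sample_df, known_actual_df):
    # Rank words by descending SampleDF (rank 1 = most frequent in the sample).
    ranked = sorted(sample_df, key=sample_df.get, reverse=True)
    ranks = {w: i + 1 for i, w in enumerate(ranked)}

    # Fit P, p, B on the words whose ActualDF is known from query probes.
    known_words = [w for w in ranked if w in known_actual_df]
    xs = np.array([ranks[w] for w in known_words], dtype=float)
    ys = np.array([known_actual_df[w] for w in known_words], dtype=float)
    (P, p, B), _ = curve_fit(mandelbrot, xs, ys, p0=(ys.max(), 1.0, 1.0), maxfev=10_000)

    # Extrapolate ActualDF for words seen only in the sample.
    return {w: known_actual_df.get(w, mandelbrot(ranks[w], P, p, B)) for w in ranked}
```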
Creating Content Summaries for Topic Categories • Example: • “metastasis” did not appear in any of the documents sampled from CANCERLIT during probing • Cancer-BACUP, classified under “Cancer”, has a high ActualDFest(metastasis) = 3,569 • Convey this information by associating with category “Cancer” a content summary obtained by merging the summaries of all databases under this category • In the merged summary, ActualDFest(w) is the sum of the document frequencies of w over the databases under the category
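A minimal sketch of the merging step; the database names and frequencies below are taken from the slide's example or invented for illustration.

```python
# Build a category-level content summary by summing the (estimated) document
# frequencies of the databases classified under that category.
from collections import Counter

def merge_summaries(db_summaries):
    """db_summaries: database name -> {word: ActualDF_est(word)}."""
    category_summary = Counter()
    for summary in db_summaries.values():
        category_summary.update(summary)
    return category_summary

cancer_dbs = {
    "CANCERLIT":    {"lung": 68_430, "cancer": 91_000},   # invented counts
    "Cancer-BACUP": {"metastasis": 3_569, "cancer": 12_000},
}
print(merge_summaries(cancer_dbs)["metastasis"])  # 3569, even though CANCERLIT's
                                                  # sample never surfaced the word
```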
Selecting Databases Hierarchically: Algorithm • Inputs: a query Q, the number of databases to select K, and the top category C • Steps:
HierSelect(Query Q, Category C, int K)
1: Use a flat database selection algorithm to assign a score for Q to each subcategory of C
2: if there is a subcategory of C with a non-zero score
3:   Pick the subcategory Cj with the highest score
4:   if NumDBs(Cj) >= K            // Cj has enough databases
5:     return HierSelect(Q, Cj, K)
6:   else                          // Cj does not have enough databases
7:     return DBs(Cj) ∪ FlatSelect(Q, C - Cj, K - NumDBs(Cj))
8: else                            // no subcategory of C has a non-zero score
9:   return FlatSelect(Q, C, K)
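A runnable sketch of HierSelect, with a simple Category structure and pluggable flat_select / score_subcategory functions standing in for the flat algorithms (e.g., CORI or bGlOSS). The data layout and helper names are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list = field(default_factory=list)   # subcategories of this node
    databases: list = field(default_factory=list)  # all databases under this node

def hier_select(query, category, k, flat_select, score_subcategory):
    # Step 1: score each subcategory of C for the query; track the best one.
    best, best_score = None, 0.0
    for child in category.children:
        score = score_subcategory(query, child)
        if score > best_score:
            best, best_score = child, score

    if best is None:                                 # Steps 8-9: no non-zero score
        return flat_select(query, category.databases, k)

    if len(best.databases) >= k:                     # Steps 4-5: enough databases
        return hier_select(query, best, k, flat_select, score_subcategory)

    # Steps 6-7: take everything under Cj, fill the rest from C - Cj.
    rest = [db for db in category.databases if db not in best.databases]
    return best.databases + flat_select(query, rest, k - len(best.databases))
```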
Example: topic hierarchy traversal for database selection with the query [babe AND ruth], K = 3
Experiments: Data and Metrics • Evaluate two main sets of techniques: • 1. Content-summary construction techniques • 2. Database selection techniques • Evaluate the algorithms using two data sets: • Controlled Database Set • Web Database Set
Data Sets • Controlled Database Set • 500,000 newsgroup articles from 54 newsgroups • 81,000 articles to train the document classifiers over the 72-node topic hierarchy • 419,000 articles to build the set of Controlled Databases • The set contained 500 databases ranging in size from 25 to 25,000 documents.
Data Sets • Web Database Set • 50 real, web-accessible databases over which the authors had no control • Databases picked randomly from two directories of hidden-Web databases, InvisibleWeb and CompletePlanet
Content-Summary Construction • Tested variations of the Focused Probing technique against RS-Ord and RS-Lrd. • Focused Probing: • Evaluated configurations with different underlying document classifiers for query-probe creation. • Different values for the thresholds Ts and Tc: • Varied the specificity threshold Ts from 0 to 1 • Fixed the coverage threshold at Tc = 10.
Database Selection Effectiveness • Underlying database selection algorithm: the hierarchical algorithm • It relies on a “flat” database selection algorithm. • Chosen flat algorithms: CORI and bGlOSS • Both were adapted to work with category content summaries.
Database Selection Effectiveness • Content summaries: evaluated how the hierarchical database selection algorithm behaved over content summaries generated by the different techniques • Also studied the QPilot strategy, which exploits HTML links to characterize text databases.
Content Summary Quality • Metric: the content summaries’ coverage of the actual database vocabulary, measured by the ctf ratio: • ctf ratio = Σ_{w ∈ Tr} ActualDF(w) / Σ_{w ∈ Td} ActualDF(w) • Tr = set of terms in the content summary, Td = complete vocabulary of the database • Results: • Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd. • The coverage of the Focused Probing summaries increases for lower values of the specificity threshold Ts
Content Summary Quality • Correlation of word rankings: used the Spearman Rank Correlation Coefficient (SRCC) to measure how well a content summary orders words by frequency with respect to the actual word-frequency order in the database. • Result: the Focused Probing methods have higher SRCC values than RS-Ord and RS-Lrd.
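A sketch of both summary-quality metrics (the ctf ratio from the previous slide and the rank correlation), assuming the true ActualDF values are known, as they are for the Controlled set. scipy.stats.spearmanr stands in for the paper's exact SRCC computation; inputs are invented.

```python
from scipy.stats import spearmanr

def ctf_ratio(summary_df, actual_df):
    """Sum of ActualDF over the summary's words divided by the sum of
    ActualDF over the database's full vocabulary."""
    covered = sum(actual_df[w] for w in summary_df if w in actual_df)
    return covered / sum(actual_df.values())

def rank_correlation(summary_df, actual_df):
    """How well the summary orders its words relative to their true
    frequency order in the database."""
    common = [w for w in summary_df if w in actual_df]
    rho, _ = spearmanr([summary_df[w] for w in common],
                       [actual_df[w] for w in common])
    return rho
```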
Content Summary Quality - Efficiency • Focused Probing techniques on average retrieve one document per query sent • RS-Lrd retrieves about one document per two queries. • RS-Ord unnecessarily issues many queries that produce no document matches.
Content Summary Quality • The Focused Probing techniques produce significantly better-quality summaries than RS-Ord and RS-Lrd do, in terms of both vocabulary coverage and word-ranking preservation.
Database Selection Effectiveness • Methodology: • Used the Web set of real web-accessible databases • 50 queries from the TREC Web Track • Each database selection algorithm picked 3 databases per query • The top 5 documents were retrieved for each query from the selected databases • Human evaluators judged the relevance of each retrieved document to the query
Database Selection Effectiveness • Measured the precision of a technique for each query q as the fraction of the retrieved documents that the evaluators judged relevant to q. • Reported the average precision of the different database selection algorithms.
Database Selection Effectiveness • Analysis: • All the flat selection techniques suffer from the incomplete vocabulary coverage of the underlying probing-generated summaries. • QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.
Hierarchical vs. flat database selection • The hierarchical algorithm using CORI as its flat component achieves 50% higher precision than flat CORI over the same content summaries. • For bGlOSS, the improvement is 92%. • Reason: the topic hierarchy compensates for incomplete content summaries.
Hierarchical vs. flat database selection • Measured the fraction of times the hierarchical database selection algorithm picked, for a query, a database that • produced matches for the query • but was given a zero score by the flat database selection algorithm of choice.
Conclusion • Presented a novel and efficient method for constructing content summaries of web-accessible text databases • Presented a hierarchical database selection algorithm that exploits these content summaries • The algorithm uses the generated classification to produce accurate results even with imperfect content summaries.
Future Work • Alternative hierarchy-traversal techniques, for example “routing” queries to multiple categories when appropriate. • Examine the effect of absolute-frequency estimation on database selection. • Alternative methods for creating content summaries.