Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Agenda • The Hidden Web • Database selection algorithms • An algorithm to extract a document sample • Database classification • An algorithm for content summary construction • Estimating document frequencies (ActualDF(·) frequencies) • Database selection algorithms using categorization and content summaries • Experiments
The Hidden Web • Also known as the Deep Web • Most of the Web's information is buried far down on dynamically generated sites • Standard search engines never find it.
The Hidden Web • Search engines create their indexes by spidering or crawling Web pages. • To be discovered, a page must be static and linked to other pages. • Search engines cannot retrieve content in the Hidden Web.
The Hidden Web • Those pages do not exist until they are created dynamically as the result of a specific search. • Hidden Web sources store their content in searchable databases. • Those databases only produce results dynamically in response to a direct request. • Issuing direct queries "one at a time" is a laborious way to search.
The Size of the Hidden Web According to a study based on data collected between March 13 and 30, 2000: • Public information on the hidden Web is currently 400 to 550 times larger than the commonly defined World Wide Web. • Total quality content of the hidden Web is 1,000 to 2,000 times greater than that of the Web. • The hidden Web is the fastest-growing category of new information on the Internet.
Putting Those Findings in Perspective • The search engines with the largest indexes (Google, Northern Light, etc.) index up to 16% of the Web. • Because such search engines miss the hidden Web, Internet searchers who use them are searching only 0.03% of the pages available to them today.
10-yr. Growth Trends in Cumulative Original Information Content
The main Issues • An algorithm to derive content summaries from “uncooperative” databases by using “focused query probes” • A novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases.
Some Techniques We Will Discuss • A document sampling technique for text databases that results in higher quality database content summaries • A technique to estimate the absolute document frequencies of the words in the content summaries
Some Techniques We Will Discuss (2) • A database selection algorithm that proceeds hierarchically over a topical classification scheme. • Experimental evaluation of the new algorithms using both “controlled” databases and 50 real web-accessible databases.
An Example • Searching the medical database CANCERLIT (www.cancer.gov) with the query [lung and cancer] returns 68,430 matches • Searching Google with the query [“lung” and “cancer” site:www.cancer.gov] returns 23 matches • None of the returned pages corresponds to a database document.
An Example (2) • The results show that the valuable CANCERLIT content is not indexed by this search engine
Metasearchers • A metasearcher provides one-stop access to the information in text databases. • It performs three main tasks: • After receiving a query, it finds the best databases to evaluate the query (database selection) • It translates the query into a suitable form for each database (query translation) • It retrieves and merges the results from the different databases (result merging) and returns them to the user.
Database selection algorithms • Based on statistics that characterize each database’s content • These statistics are called content summaries • Content summaries usually include the document frequencies of the words that appear in the database
Database selection algorithms (2) • Content summaries provide sufficient information for the database selection component of a metasearcher to decide which databases are the most promising to evaluate a given query.
Database selection algorithms (3) • A database selection algorithm tries to find the best databases to evaluate a given query • It uses the document frequency of each word: the number of different documents that contain the word • It uses NumDocs: the number of documents stored in the database
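The two statistics named above can be sketched in a few lines of Python. This is only an illustration, assuming a database is available as a plain list of document strings and that words are split on whitespace; the function name `build_content_summary` and the dictionary layout are hypothetical, not taken from the slides.

```python
from collections import Counter


def build_content_summary(documents):
    """Return NumDocs and per-word document frequencies for a list of texts."""
    df = Counter()
    for text in documents:
        # A word counts once per document, however often it occurs in it.
        df.update(set(text.lower().split()))
    return {"num_docs": len(documents), "df": dict(df)}


# Toy documents, invented for illustration only.
docs = ["lung cancer study", "breast cancer", "heart study"]
summary = build_content_summary(docs)
print(summary["num_docs"], summary["df"]["cancer"])
```

A cooperative database could export such a summary directly; the sampling slides later in the deck address the case where it does not.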
An Example: bGlOSS (Boolean Glossary-of-Servers Server), a Flat Selection Algorithm • Documents are represented as words with position information. • Queries are expressions composed of words and connectives such as “and,” “or,” “not,” and proximity operations such as “within k words of.” • The answer to a query is the set of all the documents that satisfy the Boolean expression.
An Example: bGlOSS (Boolean Glossary-of-Servers Server) • Given the query [breast AND cancer], bGlOSS estimates that it matches $|C| \cdot \frac{df(\text{breast})}{|C|} \cdot \frac{df(\text{cancer})}{|C|} \cong 74{,}569$ documents in database CANCERLIT. • $|C|$: the number of documents in the database • $df(\cdot)$: the number of documents that contain a given word
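The estimate on this slide multiplies the database size by the fraction of documents containing each query word, which amounts to treating the words as independent. A minimal Python sketch of that arithmetic follows; the function name `bgloss_estimate` and the numbers are hypothetical and are not the CANCERLIT statistics.

```python
def bgloss_estimate(num_docs, dfs):
    """Estimate matches for an AND query as |C| * prod(df(w) / |C|)."""
    if num_docs <= 0:
        return 0.0
    estimate = float(num_docs)
    for df in dfs:
        estimate *= df / num_docs
    return estimate


# Invented content summary, for illustration only.
summary = {"num_docs": 1000, "df": {"breast": 200, "cancer": 500}}
query = ["breast", "cancer"]
dfs = [summary["df"].get(word, 0) for word in query]
print(bgloss_estimate(summary["num_docs"], dfs))
```

A query word that is missing from the content summary drives the estimate to zero, which is why incomplete summaries matter for database selection.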
Supplying the Content Summary • The metasearcher relies on the database to supply the content summary • If the databases do not report any detailed metadata about their contents, the metasearcher must rely on manually generated descriptions of the database contents. • This approach does not scale to thousands of text databases
STARTS – Stanford Protocol for Internet Retrieval and Search • An emerging protocol for Internet retrieval and search that facilitates the task of querying multiple document sources.
STARTS – Stanford Protocol for Internet Retrieval and Search (2) • The goal: to facilitate the three main tasks a metasearcher performs: • Choosing the best sources to evaluate a query • Evaluating the query at these sources • Merging the query results from these sources. • The protocol mainly deals with what information needs to be exchanged between sources and metasearchers
An Algorithm to Extract a Document Sample from a Given Database • SampleDF(w): the frequency of each observed word w in the sample, i.e., the number of sampled documents that contain w
An Algorithm to Extract a Document Sample from a Given Database (2) • Step 1: Start with an empty content summary, where SampleDF(w) = 0 for each word w, and a general, comprehensive word dictionary. • Step 2: Pick a word and send it as a query to database D. • Step 3: Retrieve the top-k documents returned. • Step 4: If the number of retrieved documents exceeds a prespecified threshold, stop; otherwise continue the sampling process by returning to Step 2.
Two Versions of the Algorithm • RS-Ord (RandomSampling-OtherResource) • Picks a random word from the dictionary for step 2. • RS-Lrd (RandomSampling-LearnedResource) • Selects the next query from among the words that have already been discovered during sampling.
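The sampling loop and its two query-selection variants can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: `ToyDatabase` is a hypothetical in-memory stand-in for a remote search interface, the `max_queries` cap is an added safeguard, and the loop stops once the sample reaches the threshold.

```python
import random


class ToyDatabase:
    """Hypothetical stand-in for a searchable text database."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, word, k):
        """Return ids of up to k documents containing the word."""
        hits = [i for i, d in enumerate(self.documents) if word in d.split()]
        return hits[:k]


def sample_database(db, dictionary, k=4, max_docs=300,
                    max_queries=1000, learned=False, seed=0):
    """Query-based sampling; learned=False mimics RS-Ord, True mimics RS-Lrd."""
    rng = random.Random(seed)
    sampled = set()   # ids of documents retrieved so far
    sample_df = {}    # word -> number of sampled documents containing it
    for _ in range(max_queries):
        if len(sampled) >= max_docs:
            break  # enough documents retrieved
        if learned and sample_df:
            # RS-Lrd: pick the next query among words already discovered.
            word = rng.choice(sorted(sample_df))
        else:
            # RS-Ord: pick a random word from the external dictionary.
            word = rng.choice(dictionary)
        for doc_id in db.search(word, k):
            if doc_id in sampled:
                continue
            sampled.add(doc_id)
            for w in set(db.documents[doc_id].split()):
                sample_df[w] = sample_df.get(w, 0) + 1
    return sample_df, len(sampled)


db = ToyDatabase(["lung cancer", "breast cancer", "heart disease"])
print(sample_database(db, ["cancer"], max_docs=2))
```

With `learned=True` the first query still comes from the dictionary, because no words have been discovered yet.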
More About the Algorithm • The actual frequency ActualDF(w) of each word w is not revealed by this process • The calculated document frequencies contain information only about the relative ordering of the words in the database.
More About the Algorithm (2) • Two databases with the same focus but differing significantly in size might be assigned similar content summaries. • A word picked at random from the dictionary is unlikely to occur in any document of an arbitrary database
Database Classification • One way to characterize the contents of a text database is to classify it in a hierarchy of topics according to the type of documents that it contains.
Database Classification (2) • A method to automate the classification of web-accessible databases based on the principle of “focused probing” • A rule-based document classifier: a set of logical rules defining classification decisions
Hierarchical Classification • Categories can be further divided into subcategories, • resulting in multiple levels of classifiers, one for each internal node of a classification hierarchy • It is possible to create a hierarchical classifier that will recursively divide the space into successively smaller topics.
Classifying a Database • The algorithm uses a hierarchical classification scheme. • It automatically maps rule-based document classifiers into queries, which are then used to probe and classify text databases.
Classifying a Database (2) • The algorithm provides a way to zoom in on the topics that are most representative of a given database’s contents • We can then exploit this for accurate and efficient content summary construction
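To make the idea of turning classifier rules into query probes concrete, here is a deliberately simplified Python sketch. The rules, the category names, and the "most matches wins" decision are all assumptions made for illustration; the slides do not state the actual decision criteria, and a real system would read match counts from the database's search interface rather than scanning documents.

```python
# Hypothetical rules: a set of words that must all occur -> a category.
RULES = [
    ({"ibm", "computers"}, "Computers"),
    ({"lung", "cancer"}, "Health"),
    ({"hepatitis"}, "Health"),
]


def count_matches(documents, words):
    """Stand-in for the number of matches a database reports for a probe."""
    return sum(1 for d in documents if words <= set(d.lower().split()))


def classify(documents, rules):
    """Issue one conjunctive probe per rule and total the matches per category."""
    totals = {}
    for words, category in rules:
        totals[category] = totals.get(category, 0) + count_matches(documents, words)
    best = max(totals, key=totals.get)
    return best, totals


docs = ["lung cancer treatment", "hepatitis vaccine",
        "ibm computers sale", "lung cancer trial"]
print(classify(docs, RULES))
```

In a hierarchical scheme the same step would be repeated with the rules of the chosen category's subcategories, which is how the probing zooms in on a database's main topics.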
Focused Probing for Content Summary Construction: Algorithm • The algorithm consists of two main steps: • Step 1: Query the database using focused probing in order to (a) retrieve a document sample, (b) generate a preliminary content summary, and (c) categorize the database. • Step 2: Estimate the absolute frequencies of the words retrieved from the database.
Generating a content summary for a database using focused query probing
Generating a content summary for a database using focused query probing(2)
Generating a content summary for a database using focused query probing(3)
Building Content Summaries from Extracted Documents • ActualDF(w): the actual number of documents in the database that contain word w. The algorithm knows this number only if [w] is a single-word query probe that was issued to the database. • SampleDF(w): the number of documents in the extracted sample that contain word w.
Building Content Summaries from Extracted Documents (2) • Retrieve the top-k documents returned by each query. • Compute SampleDF(w). • If a word w appears in document samples retrieved during later phases of the algorithm, all SampleDF(w) values are added together. • Keep track of the number of matches produced by each single-word query [w]: this is the ActualDF(w) frequency.
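The bookkeeping described on these two slides might look like the sketch below. The function name `record_probe` and the `state` dictionary are hypothetical; the match count is assumed to be whatever number the database reports for the probe.

```python
def record_probe(state, query_words, num_matches, retrieved_docs):
    """Update SampleDF and ActualDF after one query probe."""
    # Only a single-word probe reveals the true document frequency of its word.
    if len(query_words) == 1:
        state["actual_df"][query_words[0]] = num_matches
    # SampleDF accumulates over all documents retrieved so far.
    for text in retrieved_docs:
        for word in set(text.lower().split()):
            state["sample_df"][word] = state["sample_df"].get(word, 0) + 1


state = {"sample_df": {}, "actual_df": {}}
record_probe(state, ["cancer"], 68430, ["lung cancer", "breast cancer"])
record_probe(state, ["lung", "cancer"], 500, ["lung disease"])
print(state["actual_df"], state["sample_df"]["lung"])
```

The caller is assumed to pass only documents that were not retrieved by an earlier probe, so that no document is counted twice.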
Estimating Absolute Document Frequencies • Exploit the SampleDF(·) frequencies derived from the document sample to rank all observed words from most frequent to least frequent. • Exploit the ActualDF(·) frequencies derived from one-word query probes to potentially boost the document frequencies of “nearby” words w for which we only know SampleDF(w) but not ActualDF(w).
Focused Probing Technique for Content Summary Construction - Summary The technique • Estimates the absolute document frequency of the words in a database. • Automatically classifies the database in a hierarchical classification scheme along the way.
Estimating Unknown ActualDF(·) Frequencies • After probing, we know: • The rank of all observed words in the retrieved sample documents. • The actual frequencies of some of those words in the database.
Estimating Unknown ActualDF(·) Frequencies (2) • A relationship between the rank $r$ and the frequency $f$ of a word (Mandelbrot's law): • $f = P\,(r + p)^{-B}$ • $P$, $B$, and $p$ are parameters of the specific document collection
Estimating Unknown ActualDF(·) Frequencies (3) • Step 1: Sort words in descending order of their SampleDF(·) frequencies to determine the rank $r_i$ of each word $w_i$. • Step 2: Focus on words with known ActualDF(·) frequencies. Use the SampleDF-based ranks and ActualDF frequencies to find the $P$, $B$, and $p$ parameter values that best fit the data.
Estimating Unknown ActualDF(·) Frequencies (4) • Estimate ActualDF($w_i$) for all words $w_i$ with unknown ActualDF($w_i$) as $P\,(r_i + p)^{-B}$, where $r_i$ is the rank of word $w_i$ as computed in Step 1.
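The slides do not say how the best-fitting P, B, and p are found. One possible approach, shown here purely as an assumption, is to try a grid of p values and, for each, fit log f = log P - B log(r + p) by ordinary least squares. The function names and the grid are hypothetical.

```python
import math


def fit_mandelbrot(ranks, freqs, p_grid=None):
    """Fit f = P * (r + p) ** (-B) from (rank, known ActualDF) pairs.

    Needs at least two distinct ranks (starting at 1) and positive frequencies.
    """
    if p_grid is None:
        p_grid = [i * 0.5 for i in range(41)]  # candidate p values 0.0 .. 20.0
    ys = [math.log(f) for f in freqs]
    n = len(ranks)
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r in ranks]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), -slope, p)
    _, P, B, p = best
    return P, B, p


def estimate_actual_df(rank, P, B, p):
    """Estimated ActualDF for a word whose SampleDF-based rank is `rank`."""
    return P * (rank + p) ** (-B)


# Synthetic data generated from known parameters, for illustration only.
ranks = [1, 3, 5, 10, 20]
freqs = [1000.0 * (r + 2.0) ** -1.2 for r in ranks]
P, B, p = fit_mandelbrot(ranks, freqs)
print(round(B, 3), p, round(estimate_actual_df(7, P, B, p), 2))
```

Fitting in log space weights relative rather than absolute errors, which is a design choice of this sketch; a nonlinear least-squares routine would be a reasonable alternative.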
A Database Selection Algorithm that Exploits the Database Categorization and Content Summaries • “Propagate” the database content summaries to the categories of the hierarchical classification scheme • Use the content summaries of categories and databases to perform database selection hierarchically by zooming in on the most relevant portions of the topic hierarchy
Creating Content Summaries for Topic Categories • Assumption: Databases classified under similar topics tend to have similar vocabularies. • Problem: Database selection algorithms might produce inaccurate conclusions for queries with one or more words missing from relevant content summaries.
Creating Content Summaries for Topic Categories (2) • Solution: • Associate content summaries with the categories of the topic hierarchy used by the probing algorithm. • Treat each category as a large “database” and perform database selection hierarchically
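One simple way to treat a category as a large "database" is to add up the summaries of the databases classified under it. The Python sketch below assumes exactly that aggregation (summing NumDocs and per-word document frequencies); the slides do not spell out the propagation rule, and the function name and summary layout are hypothetical.

```python
def category_summary(db_summaries):
    """Merge database content summaries into one category-level summary."""
    total = {"num_docs": 0, "df": {}}
    for summary in db_summaries:
        total["num_docs"] += summary["num_docs"]
        for word, df in summary["df"].items():
            total["df"][word] = total["df"].get(word, 0) + df
    return total


# Invented summaries of two databases under the same category.
db_a = {"num_docs": 100, "df": {"lung": 50, "cancer": 80}}
db_b = {"num_docs": 200, "df": {"cancer": 90, "hepatitis": 30}}
print(category_summary([db_a, db_b]))
```

A word missing from one database's incomplete summary can still appear in the category summary through a sibling database, which is the effect the solution relies on.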
Selecting Databases Hierarchically • The algorithm chooses the best databases for a query. • By exploiting the database categorization, this hierarchical algorithm manages to compensate for the necessarily incomplete database content summaries produced by query probing
Selecting the K most specific databases for a query hierarchically
HierSelect(Query Q, Category C, int K)
1: Use a database selection algorithm to assign a score for Q to each subcategory of C
2: if there is a subcategory of C with a non-zero score
3:   Pick the subcategory Cj with the highest score
Selecting the K most specific databases for a query hierarchically (2)
4:   if NumDBs(Cj) >= K // Cj has enough databases
5:     return HierSelect(Q, Cj, K)
6:   else // Cj does not have enough databases
7:     return DBs(Cj) ∪ FlatSelect(Q, C - Cj, K - NumDBs(Cj))
8: else // no subcategory of C has a non-zero score
9:   return FlatSelect(Q, C, K)
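The HierSelect pseudocode can be turned into a small runnable Python sketch. Several pieces are assumptions made for illustration: the `Category` class, the bGlOSS-style `score` function (the pseudocode only says "a database selection algorithm"), category summaries built by summing database summaries, and a `flat_select` that simply returns the top-K databases by score.

```python
class Category:
    """Hypothetical node of the topic hierarchy."""

    def __init__(self, name, databases=(), children=()):
        self.name = name
        self.databases = list(databases)  # databases classified directly here
        self.children = list(children)

    def all_dbs(self):
        dbs = list(self.databases)
        for child in self.children:
            dbs.extend(child.all_dbs())
        return dbs


def summarize(dbs):
    """Category summary as the sum of its databases' summaries (assumption)."""
    total = {"num_docs": 0, "df": {}}
    for db in dbs:
        total["num_docs"] += db["num_docs"]
        for word, df in db["df"].items():
            total["df"][word] = total["df"].get(word, 0) + df
    return total


def score(query, summary):
    """bGlOSS-style estimate of matching documents (assumed scoring function)."""
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    estimate = float(n)
    for word in query:
        estimate *= summary["df"].get(word, 0) / n
    return estimate


def flat_select(query, dbs, k):
    return sorted(dbs, key=lambda db: score(query, db), reverse=True)[:k]


def hier_select(query, category, k):
    scored = [(score(query, summarize(c.all_dbs())), c) for c in category.children]
    scored = [(s, c) for s, c in scored if s > 0]
    if not scored:  # no subcategory has a non-zero score
        return flat_select(query, category.all_dbs(), k)
    _, best = max(scored, key=lambda pair: pair[0])
    best_dbs = best.all_dbs()
    if len(best_dbs) >= k:  # Cj has enough databases
        return hier_select(query, best, k)
    names = {db["name"] for db in best_dbs}
    rest = [db for db in category.all_dbs() if db["name"] not in names]
    return best_dbs + flat_select(query, rest, k - len(best_dbs))


# Invented hierarchy and summaries, for illustration only.
A = {"name": "A", "num_docs": 100, "df": {"lung": 50, "cancer": 80}}
B = {"name": "B", "num_docs": 100, "df": {"lung": 10, "cancer": 90}}
C = {"name": "C", "num_docs": 100, "df": {"heart": 90, "lung": 5}}
D = {"name": "D", "num_docs": 100, "df": {"computer": 90}}
root = Category("Root", children=[
    Category("Health", children=[Category("Cancer", [A, B]), Category("Heart", [C])]),
    Category("Computers", [D]),
])
print([db["name"] for db in hier_select(["lung", "cancer"], root, 2)])
```

Because the recursion only descends into a subcategory that still holds at least K databases, the sketch always returns K databases when the hierarchy contains that many.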