Automatic Evaluation of Search Engines - Project Presentation
Team members: Levin Boris, Laserson Itamar
Instructor: Gurevich Maxim
Introduction
• Search engines have become a very popular and important tool
• How can we compare different search engines?
• We need a set of tests that is absolute for all the engines
• One way: randomly sampling results from the search engines and then comparing the samples
• In this project we implement two algorithms for doing just that: Metropolis-Hastings (MH) and Maximum Degree (MD)
Background
• Bharat and Broder proposed a simple algorithm for uniformly sampling documents from a search engine's index
• The algorithm formulates "random" queries, submits them, and picks uniformly chosen documents from the result sets
• We present another sampler, the random walk sampler
• This sampler performs a random walk on a virtual graph defined over the documents
• The walk first produces biased samples - some documents are more likely to be sampled than others
• The two algorithms we implement correct this bias
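As an illustration, here is a minimal Java sketch of the Bharat-Broder sampling loop described above; the query pool and the search stub are our own placeholders for a real search engine interface, not the project's code:

import java.util.List;
import java.util.Random;

public class BharatBroderSketch {
    static final Random RAND = new Random();

    // Placeholder: a real implementation submits the query to a search
    // engine and returns the list of result URLs.
    static List<String> search(String query) {
        return List.of("http://example.org/a", "http://example.org/b");
    }

    // Formulate a "random" query, submit it, and pick a uniformly chosen
    // document from the result set. Documents that match many queries are
    // more likely to be returned - this is the bias discussed above.
    static String sampleDocument(List<String> queryPool) {
        String query = queryPool.get(RAND.nextInt(queryPool.size()));
        List<String> results = search(query);
        return results.get(RAND.nextInt(results.size()));
    }

    public static void main(String[] args) {
        System.out.println(sampleDocument(List.of("random query", "another query")));
    }
}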
Maximum Degree Algorithm - MD
Shown below is pseudocode for the accept function of the MD algorithm:
1: Function accept(P, C, x)
2: r_MD(x) := p(x) / (C · π(x))
3: toss a coin whose heads probability is r_MD(x)
4: return true if and only if the coin comes up heads
• The algorithm works by adding self loops to nodes
• The self loops cause the random walk to stay longer at these web pages (nodes)
• This corrects the bias in the trial distribution
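Below is a minimal Java sketch of the MD accept function. The method name and the representation of p and π as callbacks are our own assumptions, not the project's actual API:

import java.util.Random;
import java.util.function.ToDoubleFunction;

public class MDAcceptSketch {
    private static final Random RAND = new Random();

    // Accept node x with probability r_MD(x) = p(x) / (C * pi(x)), where
    // p(x) is the trial weight (e.g. deg_P(x), the number of queries that
    // match x) and pi(x) is the target weight (a constant for uniform
    // sampling). C must satisfy p(x) <= C * pi(x) for all x, so r_MD(x) <= 1.
    static boolean accept(ToDoubleFunction<String> p,
                          ToDoubleFunction<String> pi,
                          double C, String x) {
        double r = p.applyAsDouble(x) / (C * pi.applyAsDouble(x));
        return RAND.nextDouble() < r;   // the coin toss of line 3
    }
}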
Metropolis-Hastings Algorithm - MH
Shown below is pseudocode for the accept function of the MH algorithm, where deg_P(x) := |queries_P(x)| is the number of pool queries that document x matches:
1: Function accept(x, y)
2: r_MH(x,y) := min{ (π(y) · deg_P(x)) / (π(x) · deg_P(y)), 1 }
3: toss a coin whose heads probability is r_MH(x,y)
4: return true if and only if the coin comes up heads
• The algorithm gives preference to smaller documents by reducing the probability of stepping to large documents
• This fixes the bias toward large documents, which match a large number of phrases
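A corresponding minimal Java sketch of the MH accept function, again with assumed names; for a uniform target π the ratio reduces to deg_P(x) / deg_P(y):

import java.util.Random;

public class MHAcceptSketch {
    private static final Random RAND = new Random();

    // Accept a proposed step x -> y with probability
    // r_MH(x,y) = min{ (pi(y) * degP(x)) / (pi(x) * degP(y)), 1 }.
    // Large documents have a high degP, so steps into them are
    // accepted with lower probability.
    static boolean accept(double piX, double piY, int degPX, int degPY) {
        double r = Math.min((piY * degPX) / (piX * degPY), 1.0);
        return RAND.nextDouble() < r;
    }
}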
Project Description
The project consisted of the following stages:
• Introduction and learning to use the Yahoo interface
• Implementing the RW algorithms
• Designing the simulation framework
• Analyzing and displaying the results
Software Design Decisions
• The system class diagram
The web sampler - design and use
The WebSampler class is the main implementation of the two random walk algorithms. The basic flow of the mHRandomWalker function is:
• Initializing the system parameters
• Parsing shingles from the initial URL
• Sampling and finding the next URL
• Calculating the shingles for the next URL
• Deciding whether to stay at the current URL, using the acceptance probability function of the MH algorithm
• Calculating the similarity parameter (discussed later on)
• Writing the parameters to the StepInfo data structure
A Java sketch of this loop follows.
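The sketch below is a minimal illustration of this flow, not the project's actual code: the helper methods parseShingles and sampleNeighbor are placeholders for the real Yahoo-interface logic, and the nested StepInfo stand-in records only the fields the sketch needs.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class WebSamplerSketch {
    static final Random RAND = new Random();

    // Simplified stand-in for the project's StepInfo data structure.
    static class StepInfo {
        final String url;
        StepInfo(String url) { this.url = url; }
    }

    // Placeholder: the real code parses fixed-length phrases (shingles)
    // out of the document fetched from the URL.
    static List<String> parseShingles(String url) { return List.of(url); }

    // Placeholder: the real code submits a random shingle as a query to
    // the Yahoo interface and picks a random result URL.
    static String sampleNeighbor(List<String> shingles) {
        return shingles.get(RAND.nextInt(shingles.size()));
    }

    static List<StepInfo> mHRandomWalker(String initialUrl, int steps) {
        List<StepInfo> walk = new ArrayList<>();
        String current = initialUrl;
        List<String> shingles = parseShingles(current);      // shingles of initial URL
        for (int i = 0; i < steps; i++) {
            String next = sampleNeighbor(shingles);          // candidate next URL
            List<String> nextShingles = parseShingles(next); // its shingles
            // MH acceptance for a uniform target, using the shingle counts
            // as a proxy for deg_P: min{deg_P(x) / deg_P(y), 1}.
            double r = Math.min((double) shingles.size() / nextShingles.size(), 1.0);
            if (RAND.nextDouble() < r) {
                current = next;                              // step accepted
                shingles = nextShingles;
            }                                                // else: stay (self loop)
            walk.add(new StepInfo(current));                 // record this step
        }
        return walk;
    }
}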
The main sampler - design and use
• The mainSampler class is the main class for running the random walk simulation
• This class reads parameters from the command line
• It opens threads which run the MD or MH random walker function
• At the end of each run it calls the printXMLResults function to save the results
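A minimal sketch of such a driver; the argument order and the stubbed worker body are illustrative assumptions, not the project's exact code:

public class MainSamplerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Illustrative parameters: method ("MD"/"MH"), initial URL, step count, threads.
        String method = args.length > 0 ? args[0] : "MH";
        String initialUrl = args.length > 1 ? args[1] : "http://www.cnn.com";
        int steps = args.length > 2 ? Integer.parseInt(args[2]) : 1000;
        int numThreads = args.length > 3 ? Integer.parseInt(args[3]) : 4;

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                // Placeholder for the real walk: the actual class would call the
                // MD or MH random walker and collect StepInfo records, then save
                // them via printXMLResults("results_" + id + ".xml").
                System.out.printf("thread %d: %s walk from %s, %d steps%n",
                                  id, method, initialUrl, steps);
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();   // wait for all runs to finish
    }
}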
Auxiliary classes - design and use
Constants - this class holds a list of predefined parameters:
• phrasesLenght
• String url[] - an array of the initial URLs used in our simulations
• The index depth parameter
StepInfo
• A data structure we defined for holding all of the simulation parameters
SamplingData
• An auxiliary data structure that holds an array list of StepInfo
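Plausible shapes for these structures, sketched in Java; the field names here are guesses based on the parameters described elsewhere in the deck, not the real definitions:

import java.util.ArrayList;

// Hypothetical stand-in for the project's StepInfo data structure.
class StepInfo {
    String url;          // the URL the walk is at after this step
    double similarity;   // the similarity parameter for this step
    boolean accepted;    // whether the proposed step was accepted

    StepInfo(String url, double similarity, boolean accepted) {
        this.url = url;
        this.similarity = similarity;
        this.accepted = accepted;
    }
}

// Hypothetical stand-in for SamplingData: an array list of StepInfo.
class SamplingData {
    final ArrayList<StepInfo> steps = new ArrayList<>();  // one entry per walk step

    void add(StepInfo step) { steps.add(step); }
}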
The results analyzer - design and use
• The results analyzer is a module for reading our simulation data and presenting it
• Receives XML files with the simulation results
• Outputs data regarding similarity and other statistical RW parameters
• Computes and displays data regarding domain distributions
• Outputs various .csv files according to the result set needed
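A minimal sketch of the XML-to-CSV flow, assuming a hypothetical result format with one <step url="..." similarity="..."/> element per walk step (the real file layout may differ):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.File;
import java.io.PrintWriter;

public class ResultsAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        // Parse the simulation results file given on the command line.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        NodeList steps = doc.getElementsByTagName("step");
        try (PrintWriter out = new PrintWriter("similarity.csv")) {
            out.println("step,url,similarity");            // CSV header
            for (int i = 0; i < steps.getLength(); i++) {
                Element e = (Element) steps.item(i);
                out.printf("%d,%s,%s%n", i,
                           e.getAttribute("url"),
                           e.getAttribute("similarity"));
            }
        }
    }
}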
Designing the Simulation Framework
• Planning a series of simulations testing different parameters of the algorithms
• Considering "bottlenecks" such as the Yahoo daily query limit and hard disk space
• Measuring the effect of each parameter on the algorithm
• Running the simulations at the software lab on several computers at a time
Simulation Parameters
• Phrases length – the number of words parsed from the text
• Initial URL – the starting URL
• Method – MD or MH
Results – Similarity vs. number of steps (MH, starting URL: CNN)
Results – Similarity vs. number of steps (MH, at different initial URLs)
Conclusions
• The lower the phrase length, the lower the step at which similarity converges
• However, a shorter phrase length means a higher number of queries sent to the search engine
• So there is a trade-off between query efficiency and the total number of steps
• Optimal initial URLs (out of the 5 we measured): CNN, Technion
• The optimal method in terms of the total number of queries needed to reach similarity convergence is MH
• In terms of TV-distance, both methods show very similar results
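For reference, the TV (total variation) distance mentioned above is the standard measure of how far the sampling distribution p is from the uniform target π:

d_TV(p, π) = (1/2) · Σ_x |p(x) − π(x)|

A value near 0 means the sampler's output is close to uniform.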