
Policy Search for Focused Web Crawling


Presentation Transcript


  1. Policy Search for Focused Web Crawling. Charlie Stockman, NLP Group Lunch, August 28, 2003.

  2. Outline • Focused Crawling • Reinforcement Learning • Policy Search • Results • Discussion and Future Work

  3. Focused Crawling
  • Web crawling is the automated searching of the web, usually with the purpose of retrieving pages for a search engine.
  • Large search engines like Google attempt to visit as many text pages as possible, using strategies like breadth-first search.
  • Focused crawling aims at searching only the promising parts of the web where specific targets will be found.
  • Internet portals like Citeseer might use focused crawling because they aim at answering only domain-specific queries.

  4. Why Use Focused Crawling?
  • Why don't we just crawl the entire web?
    • There is a lot of data to save and index (3 billion plus pages, according to Google).
    • Crawling takes time, and you want to be able to revisit relevant pages that might change. (Our crawler has limited resources and crawls ~1 million pages a day. It would take us about 10 years to crawl the entire web.)
  • Why don't we just use Google to search for our targets?
    • Google won't allow you to make millions of queries a day to its search engine.

  5. Previous Approaches
  • Cho, Garcia-Molina, and Page (Computer Networks '98)
    • Uses Backlink and PageRank (weighted Backlink) metrics to decide which URLs to pursue.
    • Performed well when Backlink was used to measure the importance of a page.
    • Performed poorly for a similarity-based performance measure (e.g., whether the page was computer related).
  • Chakrabarti, Van den Berg, and Dom (Computer Networks '99)
    • Works on the assumption that good pages are grouped together.
    • Relies on a user-supplied hierarchical tree of categories, like Yahoo!'s.
    • Uses bag-of-words classification to put each downloaded page in a category.
    • Neighbors and children of this page are pursued depending on whether this category or one of its ancestors has been marked as good.
    • Places the burden on the user to provide this tree and example documents.
  • Rennie and McCallum - Cora portal for CS papers
    • Uses Reinforcement Learning.

  6. Reinforcement Learning
  • An RL task consists of:
    • a finite set of states, S
    • a finite set of actions, A
    • a transition function, T
    • a reward function, R
  • The goal of RL is to find a policy, π, that maximizes the expected sum of future rewards.
  [Figure: a small chain of states with rewards r=0, r=0, r=1.]
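The formulas on this slide were images and did not survive in the transcript. A minimal sketch of the standard MDP objective being described, assuming the usual notation (S, A, T, R, discount factor γ):

```latex
% An RL task is the tuple (S, A, T, R), with transition T(s' | s, a) and reward R(s, a).
% The goal is a policy \pi maximizing the expected discounted sum of future rewards:
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_t \right],
\qquad 0 \le \gamma < 1 .
```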

  7. Value Functions in RL
  • In the infinite-horizon model, the value of a state s is equal to the expected sum of discounted future rewards starting from s.
  • This value function can be defined recursively.
  • The optimal value function, V*, is defined as the value of a state under the optimal policy, π*.
  • It is also convenient to define the function Q*(s, a) to be the value of taking action a in state s and acting optimally from then on.
  • Here γ is a discount factor.
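The slide's equations were likewise images; these are the standard infinite-horizon definitions they presumably correspond to (my notation):

```latex
% Value of state s under policy \pi, with discount factor \gamma:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_t \;\middle|\; s_0 = s \right]

% Recursive (Bellman) form:
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a)
             \left[ R(s, a) + \gamma \, V^{\pi}(s') \right]

% Optimal value and action-value functions:
V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad
Q^{*}(s, a) = \sum_{s'} T(s' \mid s, a) \left[ R(s, a) + \gamma \, V^{*}(s') \right]
```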

  8. RL in Focused Crawling
  • Naïve approach:
    • States are web pages, actions are links, and the transition function is deterministic.
    • This is an inaccurate model of focused crawling.
  • Better approach (sketched in code below):
    • A state s is defined by the set of pages we have visited (and the set not visited).
    • Our set of actions, A, is the set of links we have seen but have not visited.
    • Our reward function, R, for an action a is positive if a downloads a target page, and negative or zero if not.
    • Our transition function, T, adds the page we are following to the set of seen pages, and the links on that page to our action set.
  • If we knew Q*, our optimal policy would simply be to pick the action with the highest Q-value.
  [Figure: visited pages, not-visited pages, and targets.]
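A minimal sketch of this "better" formulation in code. All names here (CrawlState, fetch_page, is_target, page.links) are hypothetical illustrations, not the authors' implementation:

```python
# Sketch of the crawl MDP described above (hypothetical names, not the original code).
from dataclasses import dataclass, field

@dataclass
class CrawlState:
    visited: set = field(default_factory=set)    # pages downloaded so far
    frontier: set = field(default_factory=set)   # links seen but not yet followed (the action set)

def step(state, link, fetch_page, is_target):
    """Follow one link: the action set is the frontier, the reward is
    positive for target pages, and the transition adds the new page's links."""
    page = fetch_page(link)                      # download the page behind the chosen link
    reward = 1.0 if is_target(page) else 0.0     # positive reward only for target pages
    state.visited.add(link)
    state.frontier.discard(link)
    # page.links is assumed to be the set of outgoing links on the downloaded page
    state.frontier.update(l for l in page.links if l not in state.visited)
    return state, reward
```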

  9. Problems with RL
  • Note that the Q function is specified as a table of values, one for each state-action pair.
  • In order to compute Q* we would have to iterate repeatedly over all of the states and actions.
    • Problem: our goal is precisely not to visit all web pages and links, and the web is too large for this even if we did.
  • We could use a function of the features of a state-action pair to approximate the true Q function, and use machine learning to learn this function during the crawl.
    • Problem: RL with value-function approximation is not proven to converge to the true values.
  • We could download a dataset, compute something close to the exact Q function, learn an approximation to it, and use the approximation function in our crawls.
    • This is what Rennie and McCallum do.

  10. Rennie and McCallum
  • They compute the Q-values associated with a near-optimal policy on a downloaded training set.
  • Their near-optimal policy selects actions from a state that lead to the closest target page.
    • This makes sense: you can always "jump" to any other page that you've seen.
  • They then train a Naïve Bayes classifier, using the binned Q-value of a link as the class, and the text, context, URL, etc. as features.
  [Figure: a near-optimal policy choosing between an immediate target and a "hub" page with several targets behind it. The sequence of rewards if you pursue the immediate reward first is {1,0,1,1,1,1}, while it is {0,1,1,1,1,1} if you go to the "hub" first.]
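A quick check of why the ordering in the figure matters under a discount factor γ ∈ (0, 1):

```latex
% Discounted returns of the two reward sequences from the figure:
R_{\text{immediate first}} = 1 + \gamma^{2} + \gamma^{3} + \gamma^{4} + \gamma^{5},
\qquad
R_{\text{hub first}} = \gamma + \gamma^{2} + \gamma^{3} + \gamma^{4} + \gamma^{5}

% Their difference is 1 - \gamma > 0, so taking the immediate reward first is strictly better,
% even though both orderings collect the same total undiscounted reward.
R_{\text{immediate first}} - R_{\text{hub first}} = 1 - \gamma
```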

  11. Rennie and McCallum Problems
  • Rennie and McCallum require that the user select and supply large training sets.
  • The Q function is binned.
  • The near-optimal policy is not optimal:
    • It misses the fact that some links lead to a single target k steps away, while others lead to multiple targets k steps away.
    • It chooses arbitrarily between two equidistant rewards. Thus, it could make the wrong decision if one of the pages has more rewards right behind it.

  12. Policy Search
  • Approximating the value function is difficult.
    • The exact value of a state is hard to approximate, and it is more information than we need.
  • Approximating the policy is easier.
    • All we need to learn is whether or not to take a certain action.
  • However:
    • The space of possible policies is large. For a policy that is a linear function of features, there are as many parameters as features.
    • We can't do hill-climbing in policy space.
  • Therefore:
    • We need, at the very least, the gradient of the performance function with respect to the parameters of the policy, so we know in which direction to search.
  • So:
    • We need a probabilistic model of focused crawling in which we can state the performance of a policy as a differentiable function of the policy parameters.

  13. Probabilistic Model
  • Joint probability distribution over random variables representing the status of each page at each iteration of the crawl.
    • N rows for the N pages in the web.
    • T columns for the T steps you travel away from the start page.
    • Each variable s_it can take the value 1, meaning that page i was visited before time t, or 0 otherwise.
  • A dynamic Bayesian network (DBN) provides a compact representation.
    • A node has parents in the previous column only; the parents are the nodes corresponding to the web pages that link to it, as well as itself (to ensure that a page remains visited).
    • Whether a page has been visited depends on whether its parents have been visited by the previous step.
  [Figure: an N-by-T grid of nodes, pages 1..N down the rows and time steps 1..T across the columns.]
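One plausible way to write the joint distribution this DBN encodes (my notation, not copied from the slides):

```latex
% Joint distribution over the N x T grid of binary visit indicators s_i^t,
% factored column by column as the DBN structure above implies:
P\!\left( s^{1:T} \right)
  = P\!\left( s^{1} \right)
    \prod_{t=2}^{T} \prod_{i=1}^{N}
      P\!\left( s_i^{t} \;\middle|\; \{\, s_j^{t-1} : j \in \mathrm{pa}(i) \,\} \right)
% where pa(i) is the set of pages linking to page i, plus page i itself.
```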

  14. Probability of Visiting a Page
  • The probability that a page i is visited, given that its parent j is visited, is represented as w_ij.
  • w_ij is calculated as a logistic regression function of the dot product of the parameters, θ, and the features of the link from j to i, F_ji.
  • We use a Noisy-OR conditional probability distribution over parents for efficiency of representation and computation.
  [Figure: node p3 at time t with parents p1, p3, and p4 at time t-1, connected by weights w31, w33, and w34.]
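In symbols (using θ for the parameter vector, which the transcript leaves blank), the quantities described above are presumably:

```latex
% Edge weight: logistic (sigmoid) function of the parameters and the link features from j to i.
w_{ij} = \sigma\!\left( \theta^{\top} F_{ji} \right)
       = \frac{1}{1 + \exp\!\left( -\theta^{\top} F_{ji} \right)}

% Noisy-OR conditional: page i is visited at time t unless every visited parent fails to cause a visit.
% Here u_j = 1 if parent j has been visited by time t-1, and u_j = 0 otherwise.
P\!\left( s_i^{t} = 1 \;\middle|\; u \right)
  = 1 - \prod_{j \in \mathrm{pa}(i)} \left( 1 - w_{ij} \right)^{u_j}
```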

  15. Probability of Visiting (cont.)
  • Noisy-OR approximation: the probability that page i is not visited in this time step is the product, over all parents j in the previous time step, of the probability that a visited parent j fails to cause i to be visited.
  • u_j = 1 if the parent j has been visited by the previous time step; u_j = 0 otherwise.
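A minimal sketch of a forward pass through this model, assuming marginal visit probabilities are propagated independently from one column to the next (an approximation assumed here; the transcript does not spell out the exact inference procedure). The data structures (`parents`, `link_features`, `start_prob`) are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_visit_probs(theta, parents, link_features, start_prob, T):
    """Propagate marginal visit probabilities p[i, t] through the N x T grid.
    parents[i]          -> list of pages j linking to page i, including i itself
    link_features[i, j] -> feature vector F_ji for the link from j to i
    start_prob          -> length-N vector, 1.0 for the start page(s), else 0.0
    """
    N = len(start_prob)
    p = np.zeros((N, T))
    p[:, 0] = start_prob
    for t in range(1, T):
        for i in range(N):
            # Noisy-OR over parents, using the marginals from the previous column.
            not_visited = 1.0
            for j in parents[i]:
                # Self-link weight fixed at 1 so a visited page stays visited (assumption).
                w_ij = 1.0 if j == i else sigmoid(theta @ link_features[(i, j)])
                not_visited *= 1.0 - w_ij * p[j, t - 1]
            p[i, t] = 1.0 - not_visited
    return p
```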

  16. Performance of Policy
  • The performance of a policy is computed at the last time step, T.
  • It is equal to the sum, over pages, of the probability of visiting each page multiplied by the reward of that page.
  • Possible rewards: r_i is positive if page i is a target; r_i is negative or zero if it is not.
  [Figure: the last column of the DBN with example rewards, e.g. page 1: 0.5, page 2: 0.2, page 3: 1, page 4: 0.3, page N: 0.4.]
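In symbols (my notation), the performance criterion described above would be:

```latex
% Expected reward collected by the end of the crawl (time step T):
J(\theta) = \sum_{i=1}^{N} r_i \, P\!\left( s_i^{T} = 1 \right)
% with r_i > 0 for target pages and r_i <= 0 otherwise.
```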

  17. Computing the Gradient
  • The gradient of the performance with respect to the policy parameters is obtained with the chain rule of partial derivatives.
  • One of its factors is the sensitivity of the performance function with respect to a specific node of the network.
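The chain-rule decomposition presumably has the following shape, with the three factors corresponding to slides 20, 19, and 18 respectively (my notation; the slide's equation was an image):

```latex
% Writing p_i^t := P(s_i^t = 1), the gradient decomposes over nodes and edge weights:
\frac{\partial J}{\partial \theta}
  = \sum_{t} \sum_{i} \sum_{j \in \mathrm{pa}(i)}
    \frac{\partial J}{\partial p_i^{t}}
    \cdot
    \frac{\partial p_i^{t}}{\partial w_{ij}}
    \cdot
    \frac{\partial w_{ij}}{\partial \theta}
% sensitivity of a node (slide 20), gradient of a marginal (slide 19), gradient of the policy (slide 18)
```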

  18. Gradient of the Policy
  • Computing this derivative is fairly trivial: the logistic function has a nice derivative.
  • We already have everything we need here:
    • We have the parameters, θ.
    • We have the features, F_ji.
    • And we have w_ij: we calculate it when we go forward through the network, creating all of the probabilities and getting a value for our policy.
  • Remember, w_ij is the same at every time step.
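The "nice derivative" of the logistic weight, written out (a standard result, not copied from the slide):

```latex
% With w_{ij} = \sigma(\theta^T F_{ji}), the derivative with respect to the parameters is:
\frac{\partial w_{ij}}{\partial \theta}
  = \sigma\!\left( \theta^{\top} F_{ji} \right)
    \left( 1 - \sigma\!\left( \theta^{\top} F_{ji} \right) \right) F_{ji}
  = w_{ij} \left( 1 - w_{ij} \right) F_{ji}
```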

  19. Gradient of a Marginal Probability
  • Calculating this derivative is less trivial, but it follows with a little bit of math.
  • Again, we have everything we need: we calculate all of the probabilities and weights during our forward run through the network.
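Under the marginal-propagation recursion assumed in the earlier sketch, p_i^t = 1 - Π_j (1 - w_ij p_j^{t-1}), the derivative works out as follows (my derivation under that assumption, not taken from the slide):

```latex
% Partial derivative of the marginal at (i, t) with respect to the edge weight w_{ij},
% holding the previous column's marginals fixed:
\frac{\partial p_i^{t}}{\partial w_{ij}}
  = p_j^{t-1} \prod_{k \in \mathrm{pa}(i),\, k \neq j} \left( 1 - w_{ik} \, p_k^{t-1} \right)
```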

  20. Computing the Sensitivity
  • For the last time step, T, this derivative is simple.
    • Recall the form of the performance function at time T; the sensitivity of a node there is just that page's reward.
  • For all other time steps:
    • Note that the i on the left-hand side has been changed to a j. This is because the sensitivity of a node j is determined by the sensitivities of its children, i, at the next time step.
  • Thus, when computing the gradient, we only have to keep the sensitivities for one layer back.
  [Figure: columns t-1, t, and t+1 of the network.]
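Written out under the same assumptions as above (my notation):

```latex
% Sensitivity at the final time step, from J = \sum_i r_i p_i^T:
\frac{\partial J}{\partial p_i^{T}} = r_i

% Sensitivity at earlier time steps, backpropagated from the children i of node j:
\frac{\partial J}{\partial p_j^{t-1}}
  = \sum_{i \,:\, j \in \mathrm{pa}(i)}
    \frac{\partial J}{\partial p_i^{t}} \cdot \frac{\partial p_i^{t}}{\partial p_j^{t-1}},
\qquad
\frac{\partial p_i^{t}}{\partial p_j^{t-1}}
  = w_{ij} \prod_{k \in \mathrm{pa}(i),\, k \neq j} \left( 1 - w_{ik} \, p_k^{t-1} \right)
```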

  21. We Have the Gradient!
  • Because we just have to go once forward and once backward through the network, this process has complexity O(N*T*P*F).
    • N ~= 50,000 (number of pages)
    • T ~= 10 (number of time steps)
    • P ~= 10 (number of parents)
    • F ~= 10,000 (number of features)
    • ~50 billion calculations
  • Training on a set of 15,000 pages with 60,000 features took about a minute.

  22. Results

  23. Results
  • The performance of policy search isn't yet beating that of Rennie and McCallum.
  • But we believe that we can improve performance vastly by adjusting the algorithm:
    • Larger training sets
    • Different parameterizations
    • Better optimization techniques
    • Better feature selection
    • Optimizing with respect to alternative performance metrics

  24. Discussion and Future Work
  • We still need to play around with it to see how much we can improve its performance.
  • It is still an offline algorithm; we would like an online algorithm of some sort.
    • A batch-processing algorithm where we crawl for a while, collecting a new data set, and then learn for a while.
    • We would have to add some kind of random exploration if that were the case.
    • Calculate the gradient and search over parameters online?

  25. Backup Slides

  26. Second Part of Sum

  27. Second Part of Sum (cont.)

  28. Second Part of Sum (cont.)

  29. Reinforcement Learning
