Crawling and Ranking

Explore HTML structure, tags, and validation for improved website ranking. Learn about HTTP, status codes, authentication, cookies, and crawling basics. Understand duplicates, ethics, and crawling algorithms.


Presentation Transcript


  1. Crawling and Ranking

  2. HTML (HyperText Markup Language) • Describes the structure and content of a (web) document • HTML 4.01: most common version, W3C standard • XHTML 1.0: XML-ization of HTML 4.01, minor differences • Validation (http://validator.w3.org/) against a schema. Checks the conformity of a Web page with respect to recommendations, for accessibility: • to all graphical browsers (IE, Firefox, Safari, Opera, etc.) • to text browsers (lynx, links, w3m, etc.) • to all other user agents including Web crawlers

  3. The HTML language • Text and tags • Tags define structure • Used for instance by a browser to lay out the document. • Header and Body

  4. HTML structure <!DOCTYPE html …> <html lang="en"> <head> <!-- Header of the document --> </head> <body> <!-- Body of the document --> </body> </html>

  5. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Example XHTML document</title> </head> <body> <p>This is a <a href="http://www.w3.org/">link to the W3C</a></p> </body> </html>

  6. Header • Appears between the tags <head> ... </head> • Includes meta-data such as language, encoding… • Also includes the document title • Used by (e.g.) the browser to decipher the body

  7. Body • Between <body> ... </body> tags • The body is structured into sections, paragraphs, lists, etc. <h1>Title of the page</h1> <h2>Title of a main section</h2> <h3>Title of a subsection</h3> . . . • <p> ... </p> define paragraphs • More block elements such as table, list…

  8. HTTP • Application protocol Client request: GET /MarkUp/ HTTP/1.1 Host: www.google.com Server response: HTTP/1.1 200 OK • Two main HTTP methods: GET and POST

  9. GET URL: http://www.google.com/search?q=BGU Corresponding HTTP GET request: GET /search?q=BGU HTTP/1.1 Host: www.google.com
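
The mapping from a URL to its GET request can be sketched in a few lines of Python. This is only an illustration: `build_get_request` is a hypothetical helper, not part of any library, and it ignores everything except the request line and the Host header.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the raw HTTP/1.1 GET request text for a URL (illustrative helper)."""
    parts = urlsplit(url)
    # The request target is the path plus the query string, if any.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Header lines end with CRLF; an empty line closes the header block.
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


request = build_get_request("http://www.google.com/search?q=BGU")
print(request)
```

The host name moves into the Host header, while the path and query string form the request target on the first line.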

  10. POST • Used for submitting forms POST /php/test.php HTTP/1.1 Host: www.bgu.ac.il Content-Type: application/x-www-form-urlencoded Content-Length: 100 …
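
As a rough sketch of how such a request is assembled, the Python snippet below encodes form fields and derives the Content-Length from the encoded body. The helper `build_post_request` and the field names `name` and `course` are assumptions made for the example; the slide does not show the actual form.

```python
from urllib.parse import urlencode


def build_post_request(host: str, path: str, fields: dict) -> str:
    """Return the raw text of a form-submission POST request (illustrative helper)."""
    # Form fields are sent in the body, encoded as key=value pairs joined by "&".
    body = urlencode(fields)
    headers = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        # Content-Length counts the bytes of the body, not the headers.
        f"Content-Length: {len(body.encode('ascii'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


# Hypothetical form fields, chosen only for illustration.
post = build_post_request("www.bgu.ac.il", "/php/test.php",
                          {"name": "toto", "course": "web search"})
print(post)
```

Unlike GET, the parameters do not appear in the URL; they travel in the request body after the blank line.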

  11. Status codes • An HTTP response always starts with a status line: the protocol version, a status code, and a human-readable message (e.g., HTTP/1.1 200 OK) • First digit indicates the class of the response: 1 Information 2 Success 3 Redirection 4 Client-side error 5 Server-side error
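
A crawler typically branches on the class of the response rather than on each individual code. A minimal sketch, assuming the first line of the response has the form "HTTP/1.1 200 OK"; `status_class` is a hypothetical helper:

```python
# Class names as listed on the slide, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "Information",
    2: "Success",
    3: "Redirection",
    4: "Client-side error",
    5: "Server-side error",
}


def status_class(status_line: str) -> str:
    """Map the first line of a response, e.g. 'HTTP/1.1 200 OK', to its class."""
    # Split into protocol version, numeric code, and the human-readable message.
    _version, code, *_message = status_line.split(" ", 2)
    return STATUS_CLASSES[int(code) // 100]


print(status_class("HTTP/1.1 200 OK"))
print(status_class("HTTP/1.1 404 Not Found"))
```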

  12. Authentication • HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc. • It can be used instead to transmit sensitive data GET ... HTTP/1.1 Authorization: Basic dG90bzp0aXRp
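
The token in the Authorization header above is not encrypted. In the Basic scheme it is just the Base64 encoding of "user:password", which is one reason to send it over HTTPS only. A small sketch (the helper name `decode_basic_auth` is made up for this example):

```python
import base64


def decode_basic_auth(header_value: str) -> tuple[str, str]:
    """Recover (user, password) from the value of a Basic Authorization header."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    # Base64 is an encoding, not encryption: anyone can reverse it.
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


credentials = decode_basic_auth("Basic dG90bzp0aXRp")
print(credentials)  # ('toto', 'titi')
```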

  13. Cookies • Key/value pairs, that a server asks a client to store and retransmit with each HTTP request (for a given domain name). • Can be used to keep information on users between visits • Often what is stored is a session ID • Connected, on the server side, to all session information
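
The session-ID pattern can be sketched with Python's standard `http.cookies` module. The cookie name `session_id`, its value, and the `sessions` dictionary are assumptions for illustration; a real server would generate unguessable IDs and keep sessions in a proper store.

```python
from http.cookies import SimpleCookie

# Server side: ask the client to store a session ID.
response_cookie = SimpleCookie()
response_cookie["session_id"] = "abc123"
response_cookie["session_id"]["path"] = "/"
set_cookie_header = response_cookie.output()
print(set_cookie_header)

# Server-side session store, keyed by session ID (toy data).
sessions = {"abc123": {"user": "toto"}}

# On a later request the client retransmits "Cookie: session_id=abc123".
request_cookie = SimpleCookie()
request_cookie.load("session_id=abc123")
session = sessions.get(request_cookie["session_id"].value)
print(session)
```

Only the identifier travels with each request; the information attached to it stays on the server.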

  14. Crawling

  15. Basics of Crawling • Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web • Basic crawling algorithm: 1. Start from a given URL or set of URLs 2. Retrieve and process the corresponding page 3. Discover new URLs (next slide) 4. Repeat on each found URL Problem: The web is huge!
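
The four steps can be sketched as follows. To keep the example self-contained it crawls an in-memory toy web instead of the network: `fetch` is any function that returns the HTML of a URL (or None), and `TOY_WEB`, `LinkExtractor`, and `crawl` are names invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)          # 1. start from a set of URLs
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)            # 2. retrieve the page
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)         # 3. discover new URLs
        for link in extractor.links:
            if link not in seen:     # 4. repeat on each URL not seen before
                seen.add(link)
                frontier.append(link)
    return visited


# Toy web used in place of real HTTP requests.
TOY_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}
order = crawl(["http://example.com/"], TOY_WEB.get)
print(order)
```

The `max_pages` cap is one crude answer to the "web is huge" problem; the `seen` set keeps the crawler from looping on pages that link back to each other.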

  16. Discovering new URLs • Browse the "internet graph" (following e.g. hyperlinks) • Site maps (sitemaps.org)

  17. The internet graph • At least 14.06 billion nodes = pages • At least 140 billion edges = links • Lots of "junk"

  18. Graph-browsing algorithms • Depth-first • Breadth-first • Combinations • Parallel crawling
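
The first two strategies differ only in which end of the frontier the next page is taken from. A sketch on a small hypothetical link graph (`LINKS` and `traverse` are made up for the example):

```python
from collections import deque


def traverse(graph, start, strategy="breadth"):
    """Visit nodes reachable from start, breadth-first or depth-first."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Queue (oldest first) gives breadth-first; stack (newest first) gives depth-first.
        node = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return order


# Hypothetical site: a home page, two sections, and their sub-pages.
LINKS = {"home": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
bfs_order = traverse(LINKS, "home", "breadth")
dfs_order = traverse(LINKS, "home", "depth")
print(bfs_order)  # ['home', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs_order)  # ['home', 'b', 'b1', 'a', 'a2', 'a1']
```

Breadth-first reaches every page close to the seed before going deeper, while depth-first follows one chain of links as far as it goes.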

  19. Duplicates • Identifying duplicates or near-duplicates on the Web to prevent multiple indexing • Trivial duplicates: same resource at the same canonicalized URL: http://example.com:80/toto http://example.com/titi/../toto • Exact duplicates: identification by hashing • Near-duplicates: (timestamps, tip of the day, etc.) more complex!
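
The first two cases can be sketched in Python. The rules in `canonicalize` (lower-case scheme and host, drop the default port, resolve ".." segments, drop the fragment) are one plausible choice rather than a complete canonicalization; for instance, `posixpath.normpath` also strips trailing slashes, which a real crawler may want to keep. SHA-256 is likewise just one possible hash.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Reduce trivially different URLs to one canonical form (simplified)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def content_fingerprint(body: bytes) -> str:
    """Exact duplicates have identical bytes, hence identical hashes."""
    return hashlib.sha256(body).hexdigest()


print(canonicalize("http://example.com:80/toto"))
print(canonicalize("http://example.com/titi/../toto"))
```

Both URLs from the slide reduce to the same string, so the second one never needs to be fetched; pages that reach the crawler under unrelated URLs can still be caught by comparing fingerprints.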

  20. Near-duplicate detection • Edit distance • Good measure of similarity • Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!). • Shingles: two documents are similar if they mostly share the same succession of k-grams
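
A minimal sketch of the shingle idea, under two assumptions that the slide does not fix: k-grams are taken over words (with k = 3), and "mostly share" is measured by the Jaccard coefficient of the two shingle sets. The sample documents are invented.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of word k-grams (shingles) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def resemblance(doc1: str, doc2: str, k: int = 3) -> float:
    return jaccard(shingles(doc1, k), shingles(doc2, k))


# Two pages that differ only in their "tip of the day", plus an unrelated one.
page_a = "the web is huge and crawlers must cope with it tip of the day eat fruit"
page_b = "the web is huge and crawlers must cope with it tip of the day drink water"
page_c = "completely unrelated text about cooking pasta at home tonight"

print(resemblance(page_a, page_b))
print(resemblance(page_a, page_c))
```

This sketch still compares documents pair by pair; it only illustrates the similarity measure, not how to make it scale to a large collection.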

  21. Crawling ethics • robots.txt at the root of a Web server • User-agent: * Allow: /searchhistory/ Disallow: /search • Per-page exclusion (de facto standard). <meta name="ROBOTS" content="NOINDEX,NOFOLLOW"> • Per-link exclusion (de facto standard). <a href="toto.html" rel="nofollow">Toto</a> • Avoid Denial of Service (DoS): wait between 100 ms and 1 s between two repeated requests to the same Web server
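
Python's standard library ships a robots.txt parser, so the rules above can be checked before each request. The sketch below feeds it the rules from the slide directly instead of downloading them; the host `example.com`, the agent name `toy-crawler`, and the helper `seconds_to_wait` are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "toy-crawler") -> bool:
    """True if the robots.txt rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


def seconds_to_wait(last_request_time: float, now: float, min_delay: float = 1.0) -> float:
    """How long to pause before contacting the same server again."""
    return max(0.0, last_request_time + min_delay - now)


print(allowed("http://example.com/searchhistory/today"))
print(allowed("http://example.com/search?q=BGU"))
print(seconds_to_wait(last_request_time=10.0, now=10.25))
```

A crawler would keep one `last_request_time` per server and sleep for `seconds_to_wait(...)` before each new request to it.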

  22. Overview • Crawl • Retrieve relevant documents • Can you guess how? • To define relevance, to find relevant docs.. • We will discuss later • Rank

  23. Ranking

  24. Why Ranking? • Huge number of pages • Huge even if we filter according to relevance • Keep only pages that include the keywords • A lot of the pages are not informative • And anyway it is impossible for users to go through 10K results

  25. When to rank? • Before retrieving results • Advantage: offline! • Disadvantage: huge set • After retrieving results • Advantage: smaller set • Disadvantage: online, user is waiting..

  26. How to rank? • Observation: links are very informative! • Not just for discovering new sites, but also for estimating the importance of a site • CNN.com has more links to it than my homepage… • Quality and Efficiency are key factors

  27. Authority and Hubness • Authority: a site is very authoritative if it receives many citations. Citations from important sites have more weight than citations from less-important sites. a(v) = the authority of v • Hubness: a good hub is a site that links to many authoritative sites. h(v) = the hubness of v

  28. HITS • Recursive dependency: a(v) = Σ h(u), summed over all links (u,v) into v; h(v) = Σ a(u), summed over all links (v,u) out of v • Normalize (when?) by the square root of the sum of squares of the authority / hubness values • Start by setting all values to 1 • We could also add bias • We can show that a(v) and h(v) converge
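
The iteration can be sketched as below. The slide leaves open when to normalize; this sketch normalizes after every update, which is one common choice, and runs a fixed number of rounds instead of testing for convergence. The edge list is a made-up example.

```python
import math


def normalize(scores):
    """Divide by the square root of the sum of squares."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    return {n: (x / norm if norm else 0.0) for n, x in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hubness) scores for a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}   # start by setting all values to 1
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of v: sum of the hubness of the pages linking to v.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        auth = normalize(new_auth)
        # Hubness of v: sum of the authority of the pages v links to.
        new_hub = {n: 0.0 for n in nodes}
        for v, u in edges:
            new_hub[v] += auth[u]
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical link graph: two hub pages citing two news sites.
EDGES = [("hub1", "cnn"), ("hub1", "bbc"), ("hub2", "cnn"), ("me", "hub1")]
auth, hub = hits(EDGES)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

On this graph the page cited by both hubs ends up as the top authority, and the page citing both news sites as the top hub.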

  29. HITS (cont.) • Works rather well if applied only on relevant web pages • E.g. pages that include the input keywords • The results are less satisfying if applied on the whole web • On the other hand, online ranking is a problem

  30. Google PageRank • Works offline, i.e. computes for every web-site a score that can then be used online • Extremely efficient and high-quality • The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]

  31. Random Surfer Model • Consider a "random surfer" • At each point chooses a link and clicks on it • A link is chosen with uniform distribution • A simplifying assumption.. • What is the probability of being, at a random time, at a web-page W?

  32. Recursive definition • If PageRank reflects the probability of being at a web page (PR(W) = P(W)), then PR(W) = PR(W1)* (1/O(W1))+…+ PR(Wn)* (1/O(Wn)) where W1, …, Wn are the pages that link to W and O(Wi) is the out-degree of Wi

  33. Problems • A random surfer may get stuck in one component of the graph • May get stuck in loops • “Rank Sink” Problem • Many Web pages have no inlinks/outlinks

  34. Damping Factor • Add some probability d for "jumping" to a random page • Now PR(W) = (1-d) * [PR(W1)* (1/O(W1))+…+ PR(Wn)* (1/O(Wn))] + d* 1/N Where N is the number of pages in the index

  35. How to compute PR? • Simulation • Analytical methods • Can we solve the equations?

  36. Simulation: A random surfer algorithm • Start from an arbitrary page • Toss a coin to decide if you want to follow a link or to randomly choose a new page • Then toss another coin to decide which link to follow / which page to go to • Keep a record of the frequency of the web pages visited
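
A sketch of this simulation on a three-page toy graph. Several details are assumptions: the jump probability d = 0.15, the fixed random seed, the number of steps, and the rule that a page without out-links always triggers a random jump.

```python
import random
from collections import Counter


def random_surfer(links, d=0.15, steps=50_000, seed=0):
    """Estimate visit frequencies by simulating a random surfer."""
    rng = random.Random(seed)
    pages = sorted(links)
    current = pages[0]                     # start from an arbitrary page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1               # keep a record of visited pages
        out = links[current]
        if rng.random() < d or not out:    # first coin: jump or follow a link?
            current = rng.choice(pages)    # jump to a uniformly random page
        else:
            current = rng.choice(out)      # second coin: which link to follow
    return {page: visits[page] / steps for page in pages}


# Toy graph: a and b both link to c, and c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"]}
frequencies = random_surfer(LINKS)
print(frequencies)
```

Page b has no incoming links, so the surfer only lands on it through random jumps and its frequency stays close to d/N.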

  37. Convergence • Not guaranteed without the damping factor! • (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component • Claim: with damping, the probability of getting stuck forever is 0 • More difficult claim: with damping, convergence is guaranteed

  38. Markov Chain Monte Carlo (MCMC) • A class of very useful algorithms for sampling a given distribution • We first need to know what is a Markov Chain

  39. Markov Chain • A finite or countably infinite state machine • We will consider the case of finitely many states • Transitions are associated with probabilities • Markovian property: given the present state, future choices are independent from the past

  40. MCMC framework • Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution • Perform a random walk on the MC, keeping track of the proportion of state visits • Discard samples made before “Mixing” • Return proportion as an approximation of the correct distribution

  41. Properties of Markov Chains • A Markov Chain defines a distribution on the different states (P(state) = probability of being in the state at a random time) • We want conditions on when this distribution is unique, and when a random walk will approximate it

  42. Properties • Periodicity • A state i has period k if any return to state i must occur in multiples of k time steps • Aperiodic: period = 1 for all states • Reducibility • An MC is irreducible if there is a probability 1 of (eventually) getting from every state to every state • Theorem: A finite-state MC has a unique stationary distribution if it is aperiodic and irreducible
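
For a finite chain, irreducibility amounts to every state being reachable from every other state through transitions of positive probability, which is easy to test. The sketch below checks this on small hand-made row-stochastic matrices and shows the effect of mixing in a uniform jump with probability d; the helper names and the value d = 0.15 are assumptions.

```python
def reachable(T, start):
    """States reachable from start via transitions with positive probability."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(T[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def is_irreducible(T):
    n = len(T)
    return all(len(reachable(T, i)) == n for i in range(n))


def damp(T, d):
    """Mix every row with a uniform jump taken with probability d."""
    n = len(T)
    return [[(1 - d) * p + d / n for p in row] for row in T]


two_islands = [[1.0, 0.0], [0.0, 1.0]]   # two states that never reach each other
two_cycle = [[0.0, 1.0], [1.0, 0.0]]     # irreducible, but every return takes 2 steps

print(is_irreducible(two_islands))
print(is_irreducible(two_cycle))
print(is_irreducible(damp(two_islands, 0.15)))
```

After damping, every entry of the matrix is positive, so every state can be reached from every state in a single step (irreducible) and each state can return to itself in one step (aperiodic).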

  43. Back to PageRank • The MC is on the graph with probabilities we have defined • MCMC is the random walk algorithm • Is the MC aperiodic? Irreducible? • Why?

  44. Problem with MCMC • In general no guarantees on convergence time • Even for those “nice” MCs • A lot of work on characterizing “nicer” MCs • That will allow fast convergence • In practice for the web graph it converges rather slowly • Why?

  45. A different approach • Reconsider the equation system PR(W) = (1-d) * [PR(W1)* (1/O(W1))+…+ PR(Wn)* (1/O(Wn))] + d* 1/N • A linear equation system!

  46. Transition Matrix
      T = ( 0     0.33  0.33  0.33 )
          ( 0     0     0.5   0.5  )
          ( 0.25  0.25  0.25  0.25 )
          ( 0     0     0     0    )
      Stochastic matrix

  47. Eigenvector! • PR (column vector) is the right eigenvector, with eigenvalue 1, of the stochastic transition matrix • I.e. the adjacency matrix normalized so that every column sums to 1 • The Perron-Frobenius theorem ensures that such a vector exists • Unique under the same assumptions as before

  48. Direct solution • Solving the equations set • Via e.g. Gaussian elimination • This is time-consuming • Observation: the matrix is sparse • So iterative methods work better here

  49. Power method • Start with some arbitrary rank vector R_0 • Compute R_i = A R_{i-1} • If we happen to get to the eigenvector we will stay there • Theorem: The process converges to the eigenvector! • Convergence is in practice pretty fast (~100 iterations)
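
A sketch of the power method for the damped equation PR(W) = (1-d) * [PR(W1)/O(W1) + … + PR(Wn)/O(Wn)] + d/N. Instead of building the matrix A explicitly, it stores only the out-links of each page, which is how the sparsity is exploited. The values d = 0.15, the stopping tolerance, and the uniform treatment of pages without out-links are assumptions of this sketch.

```python
def pagerank(links, d=0.15, tol=1e-10, max_iter=200):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # arbitrary starting vector R_0
    for _ in range(max_iter):
        new_rank = {p: d / n for p in pages}     # random-jump term d * 1/N
        for page, out in links.items():
            if out:
                # Each page passes (1-d) of its rank, split evenly over its links.
                share = (1 - d) * rank[page] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads its rank uniformly.
                share = (1 - d) * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank                          # R_i = A R_{i-1}
        if change < tol:
            break
    return rank


# Toy graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

For this toy graph the equations can also be solved by hand: PR(b) = d/N = 0.05, and substituting PR(a) = 0.85 * PR(c) + 0.05 into the equation for c gives PR(c) = 0.135 / 0.2775 ≈ 0.486, which the iteration reproduces.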

  50. Power method (cont.) • Every iteration is still “expensive” • But since the matrix is sparse it becomes feasible • Still, need a lot of tweaks and optimizations to make it work efficiently
