Crawling and Ranking
HTML (HyperText Markup Language) • Describes the structure and content of a (web) document • HTML 4.01: most common version, W3C standard • XHTML 1.0: XML-ization of HTML 4.01, minor differences • Validation (http://validator.w3.org/) against a schema: checks the conformity of a Web page with respect to the recommendations, so that it is accessible: • to all graphical browsers (IE, Firefox, Safari, Opera, etc.) • to text browsers (lynx, links, w3m, etc.) • to all other user agents, including Web crawlers
The HTML language • Text and tags • Tags define structure • Used for instance by a browser to lay out the document. • Header and Body
HTML structure
<!DOCTYPE html …>
<html lang="en">
  <head>
    <!-- Header of the document -->
  </head>
  <body>
    <!-- Body of the document -->
  </body>
</html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Example XHTML document</title>
  </head>
  <body>
    <p>This is a <a href="http://www.w3.org/">link to the W3C</a></p>
  </body>
</html>
Header • Appears between the tags <head> ... </head> • Includes meta-data such as language, encoding… • Also includes the document title • Used by (e.g.) the browser to decode and display the body
Body • Between <body> ... </body> tags • The body is structured into sections, paragraphs, lists, etc.:
<h1>Title of the page</h1>
<h2>Title of a main section</h2>
<h3>Title of a subsection</h3>
...
• <p> ... </p> defines a paragraph • More block elements such as tables, lists…
HTTP • Application protocol
Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com
Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com
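A minimal sketch of issuing this GET request with Python's standard http.client module; the actual response from www.google.com may be a redirect or a block page, so this only illustrates the request/response mechanics:

import http.client

# Open a connection and send the GET request from the slide;
# http.client adds the Host header automatically.
conn = http.client.HTTPConnection("www.google.com")
conn.request("GET", "/search?q=BGU")
response = conn.getresponse()
print(response.status, response.reason)   # status line, e.g. 200 OK or 302 Found
print(response.read()[:200])              # first bytes of the response body
conn.close()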
POST • Used for submitting forms
POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
Status codes • HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK) • First digit indicates the class of the response:
1: Information
2: Success
3: Redirection
4: Client-side error
5: Server-side error
Authentication • HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc. • It can be used instead of plain HTTP to transmit sensitive data • Example of HTTP Basic authentication (credentials are Base64-encoded in the Authorization header):
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
Cookies • Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name) • Can be used to keep information about users between visits • Often what is stored is a session ID • Connected, on the server side, to all session information
Basics of Crawling • Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web • Basic crawling algorithm (see the sketch below):
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: The web is huge!
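A minimal sketch of this loop in Python, using only the standard library; the function and parameter names (crawl, max_pages) are illustrative, and politeness delays, error handling and the robots.txt checks discussed below are omitted:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)                # URLs already discovered
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        html = urlopen(url).read().decode("utf-8", errors="replace")
        fetched += 1
        print("fetched", url)            # here the page would be processed / indexed
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:        # discover new URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)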
Discovering new URLs • Browse the "internet graph" (following e.g. hyperlinks) • Site maps (sitemap.org)
The internet graph • At least 14.06 billion nodes = pages • At least 140 billion edges = links • Lots of "junk"
Graph-browsing algorithms • Depth-first • Breadth-first • Combinations… • Parallel crawling
Duplicates • Identifying duplicates or near-duplicates on the Web to prevent multiple indexing • Trivial duplicates: same resource at the same canonicalized URL, e.g. http://example.com:80/toto and http://example.com/titi/../toto both canonicalize to http://example.com/toto • Exact duplicates: identification by hashing • Near-duplicates (timestamps, tip of the day, etc.): more complex!
Near-duplicate detection • Edit distance • A good measure of similarity • But does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!) • Shingles: two documents are considered similar if they mostly share the same k-grams (successions of k tokens)
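A small sketch of shingle-based comparison: represent each document by its set of word k-grams (shingles) and compare the sets with the Jaccard similarity; the shingle length k = 4 and the threshold 0.9 are illustrative choices, not values from the course:

def shingles(text, k=4):
    """Set of k-grams (k consecutive words) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def near_duplicates(doc1, doc2, k=4, threshold=0.9):
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold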
Crawling ethics • robots.txt at the root of a Web server, e.g.:
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard): <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard): <a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100ms to 1s between two successive requests to the same Web server
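A sketch of honoring robots.txt with Python's standard urllib.robotparser module; the user-agent name "my-crawler" and the example URLs are placeholders:

import time
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

url = "http://example.com/search"
if robots.can_fetch("my-crawler", url):
    print("allowed to crawl", url)       # the crawler would download the page here
else:
    print("disallowed by robots.txt", url)
time.sleep(1)                            # politeness delay between requests to the same host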
Overview • Crawl • Retrieve relevant documents • How? • To define relevance, to find relevant docs.. • Rank • How?
Relevance • Input: keyword (or set of keywords), "the web" • First question: how to define the relevance of a page with respect to a keyword? • Second question: how to store pages such that the relevant ones for a given keyword are easily retrieved?
Relevance definition • Boolean based on existence of a word in the document • Synonyms • Disadvantages? • Word count • Synonyms • Disadvantages? • Can we do better?
Storing pages • Offline pre-processing can help online search • Offline pre-processing includes stemming, stop-word removal… • As well as the creation of an index (e.g. an inverted index)
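A minimal sketch of such offline preprocessing and of building an inverted index (term -> set of documents containing it); the stop-word list and the crude suffix-stripping "stemmer" are illustrative stand-ins for real components:

from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def preprocess(text):
    tokens = [w.lower() for w in text.split()]
    tokens = [w for w in tokens if w not in STOP_WORDS]          # stop-word removal
    return [w[:-1] if w.endswith("s") else w for w in tokens]    # toy stemming

def build_index(documents):
    """documents: dict doc_id -> text; returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

docs = {1: "Crawling the web", 2: "Ranking web pages", 3: "The web graph"}
index = build_index(docs)
print(index["web"])   # -> {1, 2, 3}: documents relevant to the keyword "web"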
More advanced text analysis • N-grams • HMM language models • PCFG language models • We will discuss all that later in the course!
Why Ranking? • Huge number of pages • Huge even if we filter according to relevance • Keep only pages that include the keywords • A lot of the pages are not informative • And anyway it is impossible for users to go through 10K results
When to rank? • Before retrieving results • Advantage: offline! • Disadvantage: huge set • After retrieving results • Advantage: smaller set • Disadvantage: online, user is waiting..
How to rank? • Observation: links are very informative! • Not just for discovering new sites, but also for estimating the importance of a site • CNN.com has more links to it than my homepage… • Quality and Efficiency are key factors
Authority and Hubness • Authority: a site is very authoritative if it receives many citations. Citations from important sites have more weight than citations from less important sites. A(v) = the authority of v • Hubness: a good hub is a site that links to many authoritative sites. H(v) = the hubness of v
HITS • Recursive dependency:
a(v) = Σ h(u) over all edges (u, v)
h(v) = Σ a(u) over all edges (v, u)
• Normalize (when?) the authority / hubness values by the square root of the sum of their squares • Start by setting all values to 1 • We could also add bias • We can show that a(v) and h(v) converge
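A sketch of this iteration in Python; the graph is given as a dictionary mapping each node to the list of nodes it links to, and the fixed number of iterations (20) is an illustrative choice (one would normally iterate until convergence):

from math import sqrt

def hits(graph, iterations=20):
    nodes = list(graph)
    auth = {v: 1.0 for v in nodes}   # start by setting all values to 1
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all edges (u, v)
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        # h(v) = sum of a(u) over all edges (v, u)
        hub = {v: sum(auth[u] for u in graph[v]) for v in nodes}
        # normalize by the square root of the sum of squares
        a_norm = sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {v: x / h_norm for v, x in hub.items()}
    return auth, hub

authorities, hubs = hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]})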
HITS (cont.) • Works rather well if applied only on relevant web pages • E.g. pages that include the input keywords • The results are less satisfying if applied on the whole web • On the other hand, online ranking is a problem
Google PageRank • Works offline, i.e. computes for every web-site a score that can then be used online • Extremely efficient and high-quality • The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]
Random Surfer Model • Consider a "random surfer" • At each point chooses a link and clicks on it • A link is chosen with uniform distribution • A simplifying assumption.. • What is the probability of being, at a random time, at a web-page W?
Recursive definition • If PageRank reflects the probability of being at a web page (PR(W) = P(W)), then
PR(W) = PR(W1)* (1/O(W1)) + … + PR(Wn)* (1/O(Wn))
where W1, …, Wn are the pages that link to W and O(W) is the out-degree of W
Problems • A random surfer may get stuck in one component of the graph • May get stuck in loops • “Rank Sink” Problem • Many Web pages have no inlinks/outlinks
Damping Factor • Add some probability d for "jumping" to a random page • Now
PR(W) = (1-d) * [PR(W1)* (1/O(W1)) + … + PR(Wn)* (1/O(Wn))] + d * (1/N)
where N is the number of pages in the index
How to compute PR? • Simulation • Analytical methods • Can we solve the equations?
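One way to approximate the solution of these equations (a sketch, not necessarily the exact method used in the course) is to apply the damped formula above repeatedly until the values stabilize, starting from the uniform distribution; d = 0.15 and 50 iterations are illustrative choices:

def pagerank(graph, d=0.15, iterations=50):
    """graph: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one outlink (cf. the rank-sink problem)."""
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new_pr = {}
        for v in graph:
            # sum of PR(u) / O(u) over the pages u that link to v
            incoming = sum(pr[u] / len(graph[u]) for u in graph if v in graph[u])
            new_pr[v] = (1 - d) * incoming + d / n   # damped formula from above
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))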
Simulation: A random surfer algorithm • Start from an arbitrary page • Toss a coin to decide whether to follow a link or to jump to a randomly chosen page • Then toss another coin to decide which link to follow / which page to go to • Keep a record of the frequency of visits to each web page
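A sketch of this simulation, reusing the same graph representation as before; the jump probability d and the number of steps are illustrative:

import random
from collections import Counter

def random_surfer(graph, d=0.15, steps=100_000):
    pages = list(graph)
    visits = Counter()
    current = random.choice(pages)               # start from an arbitrary page
    for _ in range(steps):
        visits[current] += 1
        # first coin: jump to a random page (probability d) or follow a link;
        # also jump if the current page has no outlinks
        if random.random() < d or not graph[current]:
            current = random.choice(pages)
        else:
            # second coin: choose which link to follow
            current = random.choice(graph[current])
    return {v: visits[v] / steps for v in pages}  # visit frequencies approximate PageRank

print(random_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))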
Convergence • Not guaranteed without the damping factor! • (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component • Claim: with damping, the probability of getting stuck forever is 0 • More difficult claim: with damping, convergence is guaranteed
Markov Chain Monte Carlo (MCMC) • A class of very useful algorithms for sampling from a given distribution • We first need to know what a Markov Chain is
Markov Chain • A finite or countably infinite state machine • We will consider the case of finitely many states • Transitions are associated with probabilities • Markovian property: given the present state, future choices are independent from the past
MCMC framework • Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution • Perform a random walk on the MC, keeping track of the proportion of state visits • Discard samples made before “Mixing” • Return proportion as an approximation of the correct distribution
Properties of Markov Chains • A Markov Chain defines a distribution over its states (P(state) = probability of being in the state at a random time) • We want conditions under which this distribution is unique, and under which a random walk will approximate it
Properties • Periodicity • A state i has period k if any return to state i must occur in multiples of k time steps • Aperiodic: period = 1 for all states • Reducibility • An MC is irreducible if there is a probability 1 of (eventually) getting from every state to every state • Theorem: A finite-state MC has a unique stationary distribution if it is aperiodic and irreducible
Back to PageRank • The MC is the Web graph with the transition probabilities we have defined (link-following plus random jumps) • MCMC is the random walk algorithm • Is the MC aperiodic? Irreducible? • Why?
Problem with MCMC • In general no guarantees on convergence time • Even for those “nice” MCs • A lot of work on characterizing “nicer” MCs • That will allow fast convergence • In practice for the web graph it converges rather slowly • Why?