(A taste of) Data Management Over the Web
Web R&D
• The web has revolutionized our world
• Relevant research areas include databases, networks, security…
• Data structures and architecture, complexity, image processing, security, natural language processing, user interface design…
• Lots of research in each of these directions
• Specialized conferences for web research
• Lots of companies
• This course will focus on Web Data
Web Data
• The web has revolutionized our world
• Data is everywhere
  • Web pages, images, movies, social data, likes and dislikes…
• This constitutes great potential
• But also a lot of challenges
  • Web data is huge, unstructured, dirty…
• Just the ingredients of a fun research topic!
Ingredients
• Representation & Storage
  • Standards (HTML, HTTP), compact representations, security…
• Search and Retrieval
  • Crawling, inferring information from text…
• Ranking
  • What's important and what's not
  • Google PageRank, Top-K algorithms, recommendations…
Challenges
• Huge
  • Over 14 billion pages indexed by Google
• Unstructured
  • But we do have some structure, such as HTML links, friendships in social networks…
• Dirty
  • A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…
Course Goal
• Introducing a selection of fun topics in web data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how we approach them
Course outline
• Ranking: HITS and PageRank
• Data representation: XML; HTML
• Crawling
• Information Retrieval and Extraction, Wikipedia example
• Aggregating ranks and Top-K algorithms
• Recommendations, Collaborative Filtering for recommending movies in Netflix
• Other topics (time permitting): Deep Web, Advertisements…
• The course is partly based on: Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech
Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects studied, and some implementation
• Will be given next Monday
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
  • Keep only pages that include the keywords
• A lot of the pages are not informative
• And anyway it is impossible for users to go through 10K results
How to rank?
• Observation: links are very informative!
• Instead of a collection of Web pages, we have a Web graph!
• This is important for discovering new sites (see crawling), but also for estimating the importance of a site
  • CNN.com has more links to it than my homepage…
Authority and Hubness
• Authority: a site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less-important sites
  • A(v) = the authority of v
• Hubness reflects the importance of a site as a pointer to others. A good hub is a site that links to many authoritative sites
  • H(v) = the hubness of v
HITS (Kleinberg ’99)
• Recursive dependency:
  a(v) = Σ h(u) over all pages u that link to v
  h(v) = Σ a(u) over all pages u that v links to
• Normalize according to the sum of the authority / hubness values
• We can show that a(v) and h(v) converge
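A minimal Python sketch of these updates (the graph format, a dict mapping each page to the set of pages it links to, and the iteration count are illustrative assumptions, not part of the original slides):

    # HITS sketch: graph maps each node to the set of nodes it links to
    def hits(graph, iterations=50):
        nodes = list(graph)
        auth = {v: 1.0 for v in nodes}
        hub = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # a(v): sum of h(u) over all pages u that link to v
            auth_new = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
            # h(v): sum of a(u) over all pages u that v links to
            hub_new = {v: sum(auth_new[u] for u in graph[v]) for v in nodes}
            # normalize by the sum of the values, as on the slide
            a_sum = sum(auth_new.values()) or 1.0
            h_sum = sum(hub_new.values()) or 1.0
            auth = {v: auth_new[v] / a_sum for v in nodes}
            hub = {v: hub_new[v] / h_sum for v in nodes}
        return auth, hub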
Random Surfer Model
• Consider a "random surfer"
• At each point he chooses a link and clicks on it
• P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))
  where W1…Wn are the pages linking to W, and O(Wi) is the number of out-edges of Wi
Recursive definition
• PageRank reflects the probability of being at a web page: PR(W) = P(W)
• Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
• How to solve?
EigenVector!
• PR (a row vector) is the left eigenvector of the stochastic transition matrix
  • I.e., the adjacency matrix normalized so that every row sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
  • Unique if the matrix is irreducible
  • Can be guaranteed by small perturbations
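As a small illustration (the 3-page graph below is an assumed toy example), the rank vector can be read off as the eigenvector of the transposed row-stochastic matrix associated with eigenvalue 1:

    import numpy as np

    A = np.array([[0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
                  [1.0, 0.0, 0.0],   # page 1 links to page 0
                  [0.0, 1.0, 0.0]])  # page 2 links to page 1 (rows sum to 1)

    vals, vecs = np.linalg.eig(A.T)
    idx = np.argmin(np.abs(vals - 1.0))      # eigenvalue 1 (Perron–Frobenius)
    pr = np.real(vecs[:, idx])
    pr = pr / pr.sum()                       # normalize to a probability vector
    print(pr)                                # ranks of pages 0, 1, 2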
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• The "Rank Sink" problem
• Many Web pages have no outlinks
Damping Factor
• Add some probability d for "jumping" to a random page
• Now P(W) = (1-d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N)
  where N is the number of pages in the index
How to compute PR?
• Analytical methods
  • Can we solve the equations? In principle yes, but the matrix is huge!
  • Not a realistic solution for web scale
• Approximations
A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide if you want to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
• The frequency of each page converges to its PageRank
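A Monte Carlo sketch of this procedure (the graph format, step count, and jump probability are illustrative assumptions):

    import random
    from collections import Counter

    def random_surfer(graph, steps=100_000, jump_prob=0.15):
        pages = list(graph)
        visits = Counter()
        current = random.choice(pages)            # start from an arbitrary page
        for _ in range(steps):
            visits[current] += 1
            out_links = graph[current]
            # first coin: jump to a random page (also when there are no out-links)
            if not out_links or random.random() < jump_prob:
                current = random.choice(pages)
            else:
                # second coin: pick one of the out-links uniformly
                current = random.choice(list(out_links))
        total = sum(visits.values())
        return {page: count / total for page, count in visits.items()}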
Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri-1 · A
• If we happen to get to the eigenvector, we will stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
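A minimal power-method sketch with the damping factor of the previous slide folded in (numpy-based; A is assumed to be the row-stochastic transition matrix, with no dangling pages):

    import numpy as np

    def pagerank_power(A, d=0.15, iterations=100):
        n = A.shape[0]
        r = np.full(n, 1.0 / n)               # arbitrary starting rank vector R0
        for _ in range(iterations):
            r = (1 - d) * (r @ A) + d / n     # Ri = (1-d)·Ri-1·A + d·(1/N)
        return r / r.sum()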
Other issues
• Accelerating computation
• Distributed PageRank
• Mixed model (incorporating "static" importance)
• Personalized PageRank
HTML (HyperText Markup Language)
• Used for presentation
• Standardized by the W3C (1999)
• Describes the structure and content of a (web) document
• HTML is an open format
  • Can be processed by a variety of tools
HTTP
• Application protocol
• Client request:
  GET /MarkUp/ HTTP/1.1
  Host: www.google.com
• Server response:
  HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
• URL: http://www.google.com/search?q=BGU
• Corresponding HTTP GET request:
  GET /search?q=BGU HTTP/1.1
  Host: www.google.com
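For illustration, a similar request can be issued from Python's standard http.client module (a sketch; the response you get back depends on the server):

    import http.client

    conn = http.client.HTTPConnection("www.google.com")
    conn.request("GET", "/search?q=BGU")       # sends "GET /search?q=BGU HTTP/1.1" plus a Host header
    response = conn.getresponse()
    print(response.status, response.reason)    # e.g. 200 OK, or a redirect
    conn.close()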
POST
• Used for submitting forms
  POST /php/test.php HTTP/1.1
  Host: www.bgu.ac.il
  Content-Type: application/x-www-form-urlencoded
  Content-Length: 100
  …
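A corresponding POST sketch with http.client (the form field is hypothetical; only the shape of the request matters here):

    import http.client
    from urllib.parse import urlencode

    body = urlencode({"field": "value"})       # hypothetical form data
    headers = {"Content-Type": "application/x-www-form-urlencoded"}

    conn = http.client.HTTPConnection("www.bgu.ac.il")
    conn.request("POST", "/php/test.php", body=body, headers=headers)
    print(conn.getresponse().status)
    conn.close()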
Status codes
• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
  1 – Information
  2 – Success
  3 – Redirection
  4 – Client-side error
  5 – Server-side error
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
  • It can be used instead to transmit sensitive data
• HTTP Basic authentication sends credentials in a request header:
  GET ... HTTP/1.1
  Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information about users between visits
• Often what is stored is a session ID
  • Connected, on the server side, to all the session information
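A minimal sketch with the requests library (an assumed choice of HTTP client; the URLs are hypothetical): a Session object stores the cookies a server sets and retransmits them on later requests to the same domain.

    import requests

    session = requests.Session()
    session.get("https://example.com/login")     # the server may set a session-ID cookie
    print(session.cookies.get_dict())            # cookies kept on the client side
    session.get("https://example.com/profile")   # cookies sent back automatically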
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm (sketched in code below):
  1. Start from a given URL or set of URLs
  2. Retrieve and process the corresponding page
  3. Discover new URLs (next slide)
  4. Repeat on each found URL
• Problem: the Web is huge!
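A sketch of this basic algorithm (breadth-first, standard library only; link discovery is simplified to a regular expression over href attributes):

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)                  # 1. start from a given set of URLs
        seen = set(seed_urls)
        while queue and len(seen) <= max_pages:   # the Web is huge: bound the crawl
            url = queue.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except OSError:
                continue                          # 2. retrieve and process the page
            for link in re.findall(r'href="([^"]+)"', html):
                new_url = urljoin(url, link)      # 3. discover new URLs
                if new_url.startswith("http") and new_url not in seen:
                    seen.add(new_url)
                    queue.append(new_url)         # 4. repeat on each found URL
        return seen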
Discovering new URLs
• Browse the "internet graph" (following, e.g., hyperlinks)
• Referrer URLs
• Site maps (sitemap.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
  http://example.com:80/toto
  http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!
Near-duplicate detection
• Edit distance
  • A good measure of similarity
  • But does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams (see the sketch below)
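A minimal shingling sketch (k and the similarity threshold are illustrative choices, not values from the course):

    def shingles(text, k=5):
        # the set of word-level k-grams of a document
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def near_duplicates(doc1, doc2, threshold=0.9):
        # two documents are near-duplicates if they mostly share the same shingles
        return jaccard(shingles(doc1), shingles(doc2)) >= threshold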
Crawling ethics
• robots.txt at the root of a Web server:
  User-agent: *
  Allow: /searchhistory/
  Disallow: /search
• Per-page exclusion (de facto standard):
  <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
  <a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
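A small example of honoring robots.txt with Python's standard urllib.robotparser (the URLs are illustrative):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()                                    # fetch and parse the exclusion rules
    if rp.can_fetch("*", "https://www.example.com/search"):
        print("allowed: fetch it (and still wait between requests)")
    else:
        print("disallowed for this user-agent: skip it")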