(A taste of) Data Management Over the Web
Web R&D
• The web has revolutionized our world
• Relevant research areas include databases, networks, security…
• Data structures and architecture, complexity, image processing, security, natural language processing, user interface design…
• Lots of research in each of these directions
• Specialized conferences for web research
• Lots of companies
• This course will focus on Web Data
Web Data
• The web has revolutionized our world
• Data is everywhere
  • Web pages, images, movies, social data, likes and dislikes…
• This constitutes great potential
• But also a lot of challenges
  • Web data is huge, unstructured, dirty…
• Just the ingredients of a fun research topic!
Ingredients
• Representation & Storage
  • Standards (HTML, HTTP), compact representations, security…
• Search and Retrieval
  • Crawling, inferring information from text…
• Ranking
  • What's important and what's not
  • Google PageRank, Top-K algorithms, recommendations…
Challenges
• Huge
  • Over 14 billion pages indexed by Google
• Unstructured
  • But we do have some structure, such as HTML links, friendships in social networks…
• Dirty
  • A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…
Course Goal
• Introducing a selection of fun topics in web data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how we approach them
Course outline
• Ranking: HITS and PageRank
• Data representation: XML; HTML
• Crawling
• Information Retrieval and Extraction, Wikipedia example
• Aggregating ranks and Top-K algorithms
• Recommendations, Collaborative Filtering for recommending movies in Netflix
• Other topics (time permitting): Deep Web, Advertisements…
• The course is partly based on: Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech
Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects studied, and some implementation
• Will be given next Monday
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
  • Keep only pages that include the keywords
• A lot of the pages are not informative
• And anyway it is impossible for users to go through 10K results
How to rank?
• Observation: links are very informative!
• Instead of a collection of Web pages, we have a Web graph!
• This is important for discovering new sites (see crawling), but also for estimating the importance of a site
  • CNN.com has more links to it than my homepage…
Authority and Hubness
• Authority: a site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less-important sites
  • A(v) = the authority of v
• Hubness reflects the importance of a site as a pointer to others. A good hub is a site that links to many authoritative sites
  • H(v) = the hubness of v
HITS (Kleinberg ’99)
• Recursive dependency:
  a(v) = Σ h(u) over all pages u that link to v
  h(v) = Σ a(u) over all pages u that v links to
• Normalize according to the sum of the authority / hubness values
• We can show that a(v) and h(v) converge
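A minimal Python sketch of these updates (the graph format, a dict mapping each page to the set of pages it links to, and the iteration count are illustrative assumptions, not part of the original slides):

    # HITS sketch: graph maps each node to the set of nodes it links to
    def hits(graph, iterations=50):
        nodes = list(graph)
        auth = {v: 1.0 for v in nodes}
        hub = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # a(v): sum of h(u) over all pages u that link to v
            auth_new = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
            # h(v): sum of a(u) over all pages u that v links to
            hub_new = {v: sum(auth_new[u] for u in graph[v]) for v in nodes}
            # normalize by the sum of the values, as on the slide
            a_sum = sum(auth_new.values()) or 1.0
            h_sum = sum(hub_new.values()) or 1.0
            auth = {v: auth_new[v] / a_sum for v in nodes}
            hub = {v: hub_new[v] / h_sum for v in nodes}
        return auth, hub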
Random Surfer Model
• Consider a "random surfer"
• At each point he chooses a link and clicks on it
• P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))
  where W1…Wn are the pages linking to W, and O(Wi) is the number of out-edges of Wi
Recursive definition
• PageRank reflects the probability of being at a web page: PR(W) = P(W)
• Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
• How to solve?
EigenVector!
• PR (a row vector) is the left eigenvector of the stochastic transition matrix
  • I.e., the adjacency matrix normalized so that every row sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
  • Unique if the matrix is irreducible
  • Can be guaranteed by small perturbations
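As a small illustration (the 3-page graph below is an assumed toy example), the rank vector can be read off as the eigenvector of the transposed row-stochastic matrix associated with eigenvalue 1:

    import numpy as np

    A = np.array([[0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
                  [1.0, 0.0, 0.0],   # page 1 links to page 0
                  [0.0, 1.0, 0.0]])  # page 2 links to page 1 (rows sum to 1)

    vals, vecs = np.linalg.eig(A.T)
    idx = np.argmin(np.abs(vals - 1.0))      # eigenvalue 1 (Perron–Frobenius)
    pr = np.real(vecs[:, idx])
    pr = pr / pr.sum()                       # normalize to a probability vector
    print(pr)                                # ranks of pages 0, 1, 2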
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• The "Rank Sink" problem
• Many Web pages have no outlinks
Damping Factor
• Add some probability d for "jumping" to a random page
• Now P(W) = (1-d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N)
  where N is the number of pages in the index
How to compute PR?
• Analytical methods
  • Can we solve the equations? In principle yes, but the matrix is huge!
  • Not a realistic solution for web scale
• Approximations
A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide if you want to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
• The frequency of each page converges to its PageRank
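A Monte Carlo sketch of this procedure (the graph format, step count, and jump probability are illustrative assumptions):

    import random
    from collections import Counter

    def random_surfer(graph, steps=100_000, jump_prob=0.15):
        pages = list(graph)
        visits = Counter()
        current = random.choice(pages)            # start from an arbitrary page
        for _ in range(steps):
            visits[current] += 1
            out_links = graph[current]
            # first coin: jump to a random page (also when there are no out-links)
            if not out_links or random.random() < jump_prob:
                current = random.choice(pages)
            else:
                # second coin: pick one of the out-links uniformly
                current = random.choice(list(out_links))
        total = sum(visits.values())
        return {page: count / total for page, count in visits.items()}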
Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri-1 · A
• If we happen to get to the eigenvector, we will stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
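A minimal power-method sketch with the damping factor of the previous slide folded in (numpy-based; A is assumed to be the row-stochastic transition matrix, with no dangling pages):

    import numpy as np

    def pagerank_power(A, d=0.15, iterations=100):
        n = A.shape[0]
        r = np.full(n, 1.0 / n)               # arbitrary starting rank vector R0
        for _ in range(iterations):
            r = (1 - d) * (r @ A) + d / n     # Ri = (1-d)·Ri-1·A + d·(1/N)
        return r / r.sum()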
Other issues
• Accelerating computation
• Distributed PageRank
• Mixed model (incorporating "static" importance)
• Personalized PageRank
HTML (HyperText Markup Language)
• Used for presentation
• Standardized by the W3C (1999)
• Describes the structure and content of a (web) document
• HTML is an open format
  • Can be processed by a variety of tools
HTTP
• Application protocol
• Client request:
  GET /MarkUp/ HTTP/1.1
  Host: www.google.com
• Server response:
  HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
• URL: http://www.google.com/search?q=BGU
• Corresponding HTTP GET request:
  GET /search?q=BGU HTTP/1.1
  Host: www.google.com
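For illustration, a similar request can be issued from Python's standard http.client module (a sketch; the response you get back depends on the server):

    import http.client

    conn = http.client.HTTPConnection("www.google.com")
    conn.request("GET", "/search?q=BGU")       # sends "GET /search?q=BGU HTTP/1.1" plus a Host header
    response = conn.getresponse()
    print(response.status, response.reason)    # e.g. 200 OK, or a redirect
    conn.close()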
POST
• Used for submitting forms
  POST /php/test.php HTTP/1.1
  Host: www.bgu.ac.il
  Content-Type: application/x-www-form-urlencoded
  Content-Length: 100
  …
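A corresponding POST sketch with http.client (the form field is hypothetical; only the shape of the request matters here):

    import http.client
    from urllib.parse import urlencode

    body = urlencode({"field": "value"})       # hypothetical form data
    headers = {"Content-Type": "application/x-www-form-urlencoded"}

    conn = http.client.HTTPConnection("www.bgu.ac.il")
    conn.request("POST", "/php/test.php", body=body, headers=headers)
    print(conn.getresponse().status)
    conn.close()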
Status codes
• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
  1 – Information
  2 – Success
  3 – Redirection
  4 – Client-side error
  5 – Server-side error
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
  • It can be used instead to transmit sensitive data
• HTTP Basic authentication sends credentials in a request header:
  GET ... HTTP/1.1
  Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information about users between visits
• Often what is stored is a session ID
  • Connected, on the server side, to all the session information
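A minimal sketch with the requests library (an assumed choice of HTTP client; the URLs are hypothetical): a Session object stores the cookies a server sets and retransmits them on later requests to the same domain.

    import requests

    session = requests.Session()
    session.get("https://example.com/login")     # the server may set a session-ID cookie
    print(session.cookies.get_dict())            # cookies kept on the client side
    session.get("https://example.com/profile")   # cookies sent back automatically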
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm (sketched in code below):
  1. Start from a given URL or set of URLs
  2. Retrieve and process the corresponding page
  3. Discover new URLs (next slide)
  4. Repeat on each found URL
• Problem: the Web is huge!
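A sketch of this basic algorithm (breadth-first, standard library only; link discovery is simplified to a regular expression over href attributes):

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)                  # 1. start from a given set of URLs
        seen = set(seed_urls)
        while queue and len(seen) <= max_pages:   # the Web is huge: bound the crawl
            url = queue.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except OSError:
                continue                          # 2. retrieve and process the page
            for link in re.findall(r'href="([^"]+)"', html):
                new_url = urljoin(url, link)      # 3. discover new URLs
                if new_url.startswith("http") and new_url not in seen:
                    seen.add(new_url)
                    queue.append(new_url)         # 4. repeat on each found URL
        return seen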
Discovering new URLs
• Browse the "internet graph" (following, e.g., hyperlinks)
• Referrer URLs
• Site maps (sitemap.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
  http://example.com:80/toto
  http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!
Near-duplicate detection
• Edit distance
  • A good measure of similarity
  • But does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams (see the sketch below)
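A minimal shingling sketch (k and the similarity threshold are illustrative choices, not values from the course):

    def shingles(text, k=5):
        # the set of word-level k-grams of a document
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def near_duplicates(doc1, doc2, threshold=0.9):
        # two documents are near-duplicates if they mostly share the same shingles
        return jaccard(shingles(doc1), shingles(doc2)) >= threshold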
Crawling ethics
• robots.txt at the root of a Web server:
  User-agent: *
  Allow: /searchhistory/
  Disallow: /search
• Per-page exclusion (de facto standard):
  <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
  <a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
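A small example of honoring robots.txt with Python's standard urllib.robotparser (the URLs are illustrative):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()                                    # fetch and parse the exclusion rules
    if rp.can_fetch("*", "https://www.example.com/search"):
        print("allowed: fetch it (and still wait between requests)")
    else:
        print("disallowed for this user-agent: skip it")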