PageRank. A Search Engine. “obama”.
PageRank Model: Final Version • The Web: a directed graph • Vertices (pages) • Edges (links) • [Figure: example graph with vertices a, b, c, d, e, f]
Input Structure • 41.5 million edges • 5.4 million nodes • Line format: document-with-link document-linked
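A minimal sketch of how such edge pairs could be turned into the adjacency arrays that a ranking routine can iterate over. It assumes the page names have already been mapped to integers 0..n-1 (as in the dictionary-encoding step); the class and method names (`AdjacencySketch`, `toGraph`) are hypothetical and not part of the lab code.

```java
class AdjacencySketch {
    /**
     * Builds graph[i] = array of nodes linked from node i.
     * edges[k] = {source, target}; nodeCount is assumed to be known in advance.
     */
    static int[][] toGraph(int[][] edges, int nodeCount) {
        // First pass: count the outlinks of each node so arrays can be sized exactly.
        int[] outDegree = new int[nodeCount];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }
        int[][] graph = new int[nodeCount][];
        for (int i = 0; i < nodeCount; i++) {
            graph[i] = new int[outDegree[i]];
        }
        // Second pass: fill in the targets.
        int[] filled = new int[nodeCount];
        for (int[] e : edges) {
            graph[e[0]][filled[e[0]]++] = e[1];
        }
        return graph;
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 1}, {0, 2}, {2, 0} };
        int[][] graph = toGraph(edges, 4);
        for (int i = 0; i < graph.length; i++) {
            System.out.println(i + " -> " + java.util.Arrays.toString(graph[i]));
        }
    }
}
```

Primitive `int` arrays are used here because, at 41.5 million edges, boxed collections would cost several times more memory.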
Step 1. Dictionary Encode Links • Strings are difficult to fit in memory • Encode strings as OIDs (object IDs = integers) • Input line: http://es.dbpedia.org/resource/Ciencia_ficción http://es.dbpedia.org/resource/Robot • Output line: • 52673 • Dictionary: • http://es.dbpedia.org/resource/Ciencia_ficción … 52673 http://es.dbpedia.org/resource/Robot … • OIDCompress -i [folder]/page_links_es.tsv.gz -igz -o [folder]/page_links_es.oid.gz -ogz -d [folder]/page_links_es.dict.gz -dgz
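The idea behind dictionary encoding can be sketched in a few lines. This is not the OIDCompress tool itself: the class `DictionaryEncodeSketch` is hypothetical, it assumes tab-separated input lines, and it assigns OIDs in order of first appearance, which may differ from how the real tool numbers them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DictionaryEncodeSketch {
    // string -> OID, used while encoding
    private final Map<String, Integer> oids = new HashMap<>();
    // OID -> string, used later to decode results
    private final List<String> terms = new ArrayList<>();

    /** Returns the OID of a string, assigning the next free integer if it is new. */
    int encode(String term) {
        Integer oid = oids.get(term);
        if (oid == null) {
            oid = terms.size();
            oids.put(term, oid);
            terms.add(term);
        }
        return oid;
    }

    /** Reverse lookup: the string that was given this OID. */
    String decode(int oid) {
        return terms.get(oid);
    }

    /** Encodes one "source<TAB>target" line as "sourceOid<TAB>targetOid". */
    String encodeLine(String line) {
        String[] parts = line.split("\t");
        return encode(parts[0]) + "\t" + encode(parts[1]);
    }

    public static void main(String[] args) {
        DictionaryEncodeSketch dict = new DictionaryEncodeSketch();
        // \u00f3 is the accented o in Ciencia_ficcion, escaped to keep the source ASCII.
        String line = "http://es.dbpedia.org/resource/Ciencia_ficci\u00f3n"
                + "\t" + "http://es.dbpedia.org/resource/Robot";
        System.out.println(dict.encodeLine(line));
        System.out.println(dict.decode(1));
    }
}
```

Each distinct string is stored once in the dictionary, and every edge then costs two integers instead of two long URLs.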
Step 2. Write PageRank Algorithm • PageRankGraph.rankGraph(int[][] graph) • int[] out = graph[i]; • out contains the nodes linked from node i • it might be empty or null if node i doesn’t link to anything! • two rank vectors: rank[graph.length], nextRank[graph.length] • initial rank values set to 1d / graph.length • run ITERS iterations • compute the edge-invariant rank once per iteration (red and blue) • keep track of the sum of the ranks of nodes with no outlinks from the previous round • for each node (orange) • split its rank[] by the number of outlinks it has, and add the result to the nextRank[] of each node it links to • the sum of the ranks after each round should be very close to 1 • test with -i data/test-graph.txt -o data/test-data.txt
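The steps above can be sketched as follows. This is one possible reading of the slide, not the reference solution: the damping factor `D = 0.85` and `ITERS = 50` are assumed values (the slides give neither), the class name `PageRankSketch` is hypothetical, and the method returns the rank vector, which the slide does not specify.

```java
import java.util.Arrays;

class PageRankSketch {
    static final int ITERS = 50;   // assumed number of iterations
    static final double D = 0.85;  // assumed damping factor

    static double[] rankGraph(int[][] graph) {
        int n = graph.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1d / n);

        for (int iter = 0; iter < ITERS; iter++) {
            // Sum of the ranks of nodes with no outlinks in the previous round.
            double danglingSum = 0d;
            for (int i = 0; i < n; i++) {
                if (graph[i] == null || graph[i].length == 0) {
                    danglingSum += rank[i];
                }
            }

            // Edge-invariant part: the same for every node, so computed once per
            // iteration. It combines the random-jump share and the rank of
            // dangling nodes, spread evenly over all nodes.
            double base = (1d - D) / n + D * danglingSum / n;
            double[] nextRank = new double[n];
            Arrays.fill(nextRank, base);

            // Edge-dependent part: each node splits its damped rank over its outlinks.
            for (int i = 0; i < n; i++) {
                int[] out = graph[i];
                if (out == null || out.length == 0) {
                    continue;
                }
                double share = D * rank[i] / out.length;
                for (int target : out) {
                    nextRank[target] += share;
                }
            }
            rank = nextRank;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Small hypothetical graph; node 3 has no outlinks (null).
        int[][] graph = { {1, 2}, {2}, {0}, null };
        double[] rank = rankGraph(graph);
        double sum = 0d;
        for (double r : rank) {
            sum += r;
        }
        System.out.println(Arrays.toString(rank) + " sum=" + sum);
    }
}
```

Redistributing the rank of dangling nodes is what keeps the total at 1: without it, the rank held by nodes with no outlinks would leak away in every round.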
Step 3. Rank full data • Run ranking -i [folder]/page_links_es.oid.gz -igz -o [folder]/page_ranks_es.oid.gz -ogz • Sort by rank -i [folder]/page_ranks_es.oid.gz -igz -o [folder]/page_ranks_es_s.oid.gz -ogz • Decompress the file -d [folder]/page_links_es.dict.gz -dgz -i [folder]/page_ranks_es_s.oid.gz -igz -n 0 -o [folder]/page_ranks_es_s.tsv.gz -ogz
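What the sort and decode steps do can be illustrated in memory on a tiny example. The lab tools work on gzipped files, so this is only a sketch of the logic: the class `SortByRankSketch`, the descending order, and the sample values are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortByRankSketch {
    /** Returns the OIDs ordered from highest to lowest rank (rank[oid] is the score). */
    static Integer[] sortByRank(double[] rank) {
        Integer[] oids = new Integer[rank.length];
        for (int i = 0; i < oids.length; i++) {
            oids[i] = i;
        }
        // Arrays.sort on objects is stable, so ties keep their OID order.
        Arrays.sort(oids, Comparator.comparingDouble((Integer oid) -> rank[oid]).reversed());
        return oids;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary (OID -> page) and rank vector.
        String[] dictionary = { "pageA", "pageB", "pageC" };
        double[] rank = { 0.1, 0.6, 0.3 };
        for (int oid : sortByRank(rank)) {
            // Decoding: replace the OID with its original string.
            System.out.println(dictionary[oid] + "\t" + rank[oid]);
        }
    }
}
```

Sorting while the data is still OID-encoded keeps the records small; the strings are only restored at the very end.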
Course Marking • 45% for Weekly Labs (~3% a lab!) • 35% for Final Exam • 20% for Small Class Project
Class Project • Done in pairs (except Alejandro :P) • Goal: Use what you’ve learned to do something cool (basically) • Expected difficulty: More than a lab’s worth • But from scratch / without my help! • Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness • Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew! • Process: • Pair up (default random) by Wednesday • Decide on a topic (by June 9th) or let me assign one • If you need data or get stuck, I will (try to) help out • Deliverables: 10-minute presentation (June 23rd) & 4-page report