PageRank + Inverted Index: A Search Engine
Example query: “obama”
PageRank Model: Final Version
• The Web: a directed graph
• Vertices (pages)
• Edges (links)
[Figure: example directed graph with vertices a, b, c, d, e, f]
Input Structure
• 31.5 million edges
• 960,109 nodes
• Each line pairs two columns: document-with-link, document-linked
Step 0. Start Downloading Datasets
• http://aidanhogan.com/teaching/cc5212-1/mdp-lab9-data/
  • page_links_es_f.tsv.gz
  • wiki_abstracts_es.tsv.gz
• http://aidanhogan.com/teaching/cc5212-1/mdp-lab9.zip
Step 1. Dictionary Encode Links
• Strings are difficult to fit in memory
• Encode strings as OIDs (object IDs = integers)
• Input line: http://es.wikipedia.org/wiki/Ciencia_ficción http://es.wikipedia.org/wiki/Robot
• Output line: 52673
• Dictionary: http://es.wikipedia.org/wiki/Ciencia_ficción … 52673 http://es.wikipedia.org/wiki/Robot …
• Run OIDCompress:
  -i [folder]/page_links_es_f.tsv.gz -igz -o [folder]/page_links_es_f.oid.gz -ogz -d [folder]/page_links_es_f.dict.gz -dgz
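The idea behind dictionary encoding can be sketched in a few lines of plain Java. This is only an illustration, not the lab's OIDCompress: the class name OidDictionarySketch, its methods, and the choice to number OIDs from 0 in order of first appearance are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of dictionary encoding; NOT the lab's OIDCompress. */
class OidDictionarySketch {
    // string -> OID, and OID -> string (the list index is the OID)
    private final Map<String, Integer> oids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    /** Returns the OID of s, assigning the next free integer the first time s is seen. */
    int encode(String s) {
        Integer oid = oids.get(s);
        if (oid == null) {
            oid = strings.size(); // assumption: OIDs start at 0
            oids.put(s, oid);
            strings.add(s);
        }
        return oid;
    }

    /** Reverse lookup, as needed when decoding ranked OIDs back to page names. */
    String decode(int oid) {
        return strings.get(oid);
    }

    /** Encodes a whitespace-separated line of strings into tab-separated OIDs. */
    String encodeLine(String line) {
        String[] parts = line.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(encode(parts[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OidDictionarySketch dict = new OidDictionarySketch();
        System.out.println(dict.encodeLine(
            "http://es.wikipedia.org/wiki/Chile http://es.wikipedia.org/wiki/Robot"));
        System.out.println(dict.decode(1));
    }
}
```

Keeping both directions of the mapping means the same dictionary can later turn ranked OIDs back into readable page names.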
Step 2. Copy PageRank Code
• Copy PageRankGraph.java from mdp-lab8 to mdp-lab9 (same package)
  • Use your own code so that you are marked on it!
  • Marked out of 20 for this lab
• If you weren’t here last week, copy PageRankGraph.java from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
  • Marked out of 10 for this lab
Step 3. Rank and sort full data
• Run ranking (PageRankGraph.java)
  • 50 iterations: ITERS = 50
  -i [folder]/page_links_es_f.oid.gz -igz -o [folder]/page_ranks_es_f.oid.tsv.gz -ogz
• Sort ranks by rank score (SortByRank.java)
  -i [folder]/page_ranks_es_f.oid.tsv.gz -igz -o [folder]/page_ranks_es_f_s.oid.tsv.gz -ogz
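For orientation, ranking and sorting over OID-encoded edges might look roughly like the sketch below. It is not the lab's PageRankGraph.java or SortByRank.java. The damping factor of 0.85, the uniform redistribution of rank from pages without out-links, and all names other than ITERS are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical PageRank sketch over integer node IDs; NOT the lab's PageRankGraph. */
class PageRankSketch {
    static final int ITERS = 50;   // as on the slide
    static final double D = 0.85;  // assumed damping factor

    /** edges[i] = {source, target}; nodes are numbered 0..n-1. */
    static double[] rank(int n, int[][] edges) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int it = 0; it < ITERS; it++) {
            // Rank currently held by pages with no out-links.
            double dangling = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) {
                    dangling += rank[v];
                }
            }
            // Every page gets the random-jump share plus an equal share of dangling rank.
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - D) / n + D * dangling / n);
            // Each page splits its damped rank equally among its out-links.
            for (int[] e : edges) {
                next[e[1]] += D * rank[e[0]] / outDegree[e[0]];
            }
            rank = next;
        }
        return rank;
    }

    /** Returns node IDs ordered from highest to lowest rank. */
    static Integer[] sortByRank(double[] rank) {
        Integer[] ids = new Integer[rank.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> Double.compare(rank[b], rank[a]));
        return ids;
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 2}, {1, 2}, {2, 0} };
        double[] r = rank(3, edges);
        for (int id : sortByRank(r)) {
            System.out.println(id + "\t" + r[id]);
        }
    }
}
```

Working on arrays indexed by OID is what makes the dictionary encoding from Step 1 pay off: no strings are held in memory during the iterations.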
Step 4. Make Predictions & Bets Which will be the highest ranked articles in Wikipedia according to PageRank?
Step 5. Decode the ranks
• Decode the file (OIDDecompress.java)
  -d [folder]/page_links_es_f.dict.gz -dgz -i [folder]/page_ranks_es_f_s.oid.tsv.gz -igz -n 0 -o [folder]/page_ranks_es_f_s.tsv
• Open the output in a text editor and have a look
Step 6. Copy Inverted Index Code • Copy IndexTitleAndAbstract.java and SearchIndex.java from mdp-lab7 into mdp-lab9 (if you were here) • Otherwise grab them from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
Step 7. Rebuild Inverted Index
• Run IndexTitleAndAbstract.java
  -i [folder]/wiki_abstracts_es.tsv.gz -igz -o [folder]/es_wiki_index/
• Try searches using SearchIndex.java
• Copy the top 10 results for 5 searches, including ‘obama’ and ‘universidad’, into a text file somewhere
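As a reminder of what an inverted index stores, here is a toy version in plain Java. It is not the lab's IndexTitleAndAbstract.java or SearchIndex.java and does not use any indexing library. The class name TinyInvertedIndex, the lowercase tokenisation, and the integer document IDs are assumptions for the example, and it does no scoring at all.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy inverted index: term -> sorted set of document IDs. */
class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Splits text into lowercase terms and records docId under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the IDs of all documents containing the term (case-insensitive). */
    Set<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.<Integer>emptySet();
        }
        return Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Barack Obama, presidente");
        index.add(2, "Universidad de Chile");
        index.add(3, "Obama visita la Universidad");
        System.out.println(index.search("obama"));
        System.out.println(index.search("universidad"));
    }
}
```

A real search index also stores term statistics so that results can be scored and ordered, which is what makes the top 10 results for a query meaningful.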
Step 8. Add in the boost values
• Open BoostRanks.java
• Follow the board to write the code
• Run:
  -o [folder]/es_wiki_index/ -i [folder]/page_ranks_es_f_s.tsv
Step 9. Profit
• Re-run the same five queries as before over the boosted index and see if the results improve
• Lucene query syntax: http://www.lucenetutorial.com/lucene-query-syntax.html
Course Marking • 45% for Weekly Labs (~3% a lab!) • 35% for Final Exam • 20% for Small Class Project
Class Project
• Done in pairs (except Alejandro/Mauricio :P)
• Goal: Use what you’ve learned to do something cool (basically)
• Expected difficulty: More than a lab’s worth
  • But from scratch / without my help!
• Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
• Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
• Process:
  • Pair up (default random) by Wednesday
  • Decide on a topic (by June 9th) or let me assign one
  • If you need data or get stuck, I will (try to) help out
• Deliverables: 10-minute presentation (June 23rd) & 4-page report
  • 2 weeks!
Groups
Pairings:
• Catalina Espinoza and Felipe Quintanilla
• Eduardo Acha and Jaime Salas
• Francisca Concha and Nicolás Miranda
Lone agents:
• Alejandro Infante
• Mauricio Quezada
Topics
Let’s talk topics
• Catalina Espinoza and Felipe Quintanilla
• Eduardo Acha and Jaime Salas
• Francisca Concha and Nicolás Miranda
• Mauricio Quezada
Questions:
• What’s the idea?
• What will be the result of your project?
• How much data will you process/where will you source it?
• Which techniques from the class will you use?
• How cool is it?