
INF 141: Information Retrieval

INF 141: Information Retrieval. Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi. How to submit Answers For Assignment3.



Presentation Transcript


  1. INF 141: Information Retrieval Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi

  2. How to submit Answers For Assignment3 • Create a PDF file containing your answers to the general and extra credit questions. For the programming question, create a txt file containing the answers and a jar file containing the code. Put all files in a folder. Name the folder <StudentID>-<StudentID>-<StudentID>-Assignment03, zip it, and submit it to EEE (only one of the team members needs to submit the .zip file).

  3. Grading Assignment 1 & 2 • Come to my office at ICS1:408E (all team members) in one of the following time slots • Wednesday, Jan 19, 3:30 pm to 6 pm • Monday, Jan 24, 9 am to 12 pm • I will ask you to explain your algorithm and to run your code on a test input file. I might also ask some general questions related to these two assignments.

  4. Quiz 1 • Next week Jan 26, in the discussion class • Closed Book • All material covered in weeks 1, 2, 3.

  5. Assignment 3 • You can get it from http://www.ics.uci.edu/~sjavanma/IR/Assignments/Assignment3/ • Deadline Jan 30

  6. Crawler4j • http://code.google.com/p/crawler4j/ • Read the sample usage • Download • crawler4j-2.2.zip: unzip it and put the ‘lib’ folder in your Java project. • crawler4j-dependencies-lib.zip: unzip it and add the .jar files to the ‘lib’ folder. • Create a folder called ‘resources’ and put the .properties file in it.

  7. The ‘lib’ & ‘resources’ folder

  8. Main Classes • Create two classes https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/simple/ • Controller • MyCrawler

  9. Controller Class • At this point you should see some errors. Oops, forgot to import the .jar files!

  10. Add External Jars • Select all, press Open, and then press OK.

  11. Add Sources To The Classpath

  12. Controller:Setting The Parameters

  13. MyCrawler: Main Methods • shouldVisit(WebURL url) • Should I put this URL in the frontier or not? • visit(Page page) • How should I process this page coming from the head of the frontier? • page.getWebURL().getURL(); • page.getHTML(); • page.getText(); • page.getURLs(); • Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java

  14. An Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java
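To make the role of the two callbacks concrete without depending on the crawler4j jar, here is a minimal sketch of a frontier loop. Everything in it (`FrontierSketch`, `crawl`, the in-memory `links` map, and the `String`-based signatures) is a hypothetical stand-in and not crawler4j's API; it only illustrates when a `shouldVisit`-style check and a `visit`-style handler would be consulted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a crawler; NOT the crawler4j API.
class FrontierSketch {
    // In-memory "web": page URL -> outgoing links (assumed data for illustration).
    private final Map<String, List<String>> links;
    private final List<String> visited = new ArrayList<>();

    FrontierSketch(Map<String, List<String>> links) {
        this.links = links;
    }

    // Decides whether a discovered URL is put in the frontier.
    boolean shouldVisit(String url) {
        return url.startsWith("http://en.wikipedia.org/wiki/");
    }

    // Processes a page taken from the head of the frontier.
    void visit(String url) {
        visited.add(url);
    }

    // Breadth-first crawl from a seed; returns the URLs in visiting order.
    List<String> crawl(String seed) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visit(url);
            for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                // seen.add returns false for URLs that were already queued once.
                if (shouldVisit(out) && seen.add(out)) {
                    frontier.add(out);
                }
            }
        }
        return visited;
    }
}
```

In the real library the same two decisions are made in your `MyCrawler` class, with `WebURL` and `Page` objects instead of plain strings.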

  15. Content Articles

      public static boolean isArticle(String titlePartOfUrl) {
          if (titlePartOfUrl.startsWith("Image:")
                  || titlePartOfUrl.startsWith("Wikipedia:")
                  || titlePartOfUrl.startsWith("Category:")
                  || titlePartOfUrl.startsWith("Special:")
                  || titlePartOfUrl.startsWith("Image_talk:")
                  || titlePartOfUrl.startsWith("Portal:")
                  || titlePartOfUrl.startsWith("Wikipedia_talk:")
                  || titlePartOfUrl.startsWith("User:")
                  || titlePartOfUrl.startsWith("Template:")
                  || titlePartOfUrl.startsWith("Template_talk:")
                  || titlePartOfUrl.startsWith("Help:")
                  || titlePartOfUrl.startsWith("Talk:")
                  || titlePartOfUrl.startsWith("User_talk:")
                  || titlePartOfUrl.startsWith("Category_talk:")
                  || titlePartOfUrl.startsWith("Media:")
                  || titlePartOfUrl.startsWith("MediaWiki:")
                  || titlePartOfUrl.startsWith("File:")
                  || titlePartOfUrl.startsWith("MediaWiki_Talk:")) {
              return false;
          }
          return true;
      }

      http://en.wikipedia.org/wiki/Bing_search_engine
      http://en.wikipedia.org/wiki/Category:Bing
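The check takes only the title part of the URL, so the caller has to strip the prefix first. The sketch below runs the two example URLs from this slide through the check. The class name `WikiUrlFilter`, the helper `titlePartOf`, and the assumption that every page URL starts with `http://en.wikipedia.org/wiki/` are illustrative and not taken from the assignment; the prefix list is the one from the slide, kept in a list rather than a chain of `||`.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper; every name except isArticle is an assumption.
class WikiUrlFilter {
    private static final String WIKI_PREFIX = "http://en.wikipedia.org/wiki/";

    // Same namespace prefixes as on the slide.
    private static final List<String> NON_ARTICLE_PREFIXES = Arrays.asList(
        "Image:", "Wikipedia:", "Category:", "Special:", "Image_talk:", "Portal:",
        "Wikipedia_talk:", "User:", "Template:", "Template_talk:", "Help:", "Talk:",
        "User_talk:", "Category_talk:", "Media:", "MediaWiki:", "File:", "MediaWiki_Talk:");

    // Returns the part after /wiki/, or null if the URL does not have the assumed form.
    static String titlePartOf(String url) {
        return url.startsWith(WIKI_PREFIX) ? url.substring(WIKI_PREFIX.length()) : null;
    }

    // True when the title does not start with any non-article namespace prefix.
    static boolean isArticle(String titlePartOfUrl) {
        for (String prefix : NON_ARTICLE_PREFIXES) {
            if (titlePartOfUrl.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://en.wikipedia.org/wiki/Bing_search_engine",
            "http://en.wikipedia.org/wiki/Category:Bing"
        };
        for (String url : urls) {
            String title = titlePartOf(url);
            System.out.println(url + " -> " + (title != null && isArticle(title)));
        }
    }
}
```

The first URL is accepted as a content article; the second is rejected because its title starts with `Category:`.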

  16. Main Questions To Answer • How to count unique terms, and with what data structure? • How to write the results to file(s)? • How to solve concurrency problems that might occur? • Static • Synchronized • AtomicInteger
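One possible answer to the first and third questions, offered as a sketch rather than the expected solution: keep the counts in a `ConcurrentHashMap` whose values are `AtomicInteger`s, so several crawler threads can update them without explicit locks. The class `TermCounter` and its methods are hypothetical names; in a crawler, a single instance would typically be held in a static field so that all crawler threads share it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shared term counter; names are illustrative.
class TermCounter {
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Safe to call from several threads at once.
    void add(String term) {
        counts.computeIfAbsent(term, t -> new AtomicInteger()).incrementAndGet();
    }

    int count(String term) {
        AtomicInteger c = counts.get(term);
        return c == null ? 0 : c.get();
    }

    int uniqueTerms() {
        return counts.size();
    }

    // Demo helper: each thread adds every given term `repeats` times.
    static TermCounter countInParallel(int threadCount, int repeats, String... terms) {
        TermCounter counter = new TermCounter();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                for (int r = 0; r < repeats; r++) {
                    for (String term : terms) {
                        counter.add(term);
                    }
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        TermCounter c = countInParallel(2, 1000, "search", "engine");
        System.out.println(c.uniqueTerms() + " unique terms, search=" + c.count("search"));
    }
}
```

A plain `HashMap<String, Integer>` updated from several threads without synchronization can lose increments, which is the kind of concurrency problem this slide refers to.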

  17. Example: IO • I have only one file and all threads (crawlers) write to it • Each thread has its own file and I merge all files when the threads (crawlers) are done.
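The second option (one file per thread, merged at the end) can be sketched as follows. This is an illustration under stated assumptions: the class `PerThreadOutput`, the file names `part-N.txt` and `merged.txt`, the temporary directory, and the dummy line contents are all made up for the example, and plain threads stand in for crawler instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "one file per thread, merge at the end".
class PerThreadOutput {

    // Writes this worker's lines to its own file; no lock is needed
    // because no other thread touches the same file.
    static void writePart(Path part, int workerId, int lineCount) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < lineCount; i++) {
            lines.add("worker " + workerId + " line " + i);
        }
        try {
            Files.write(part, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Starts the workers, waits for all of them, then merges the part files in order.
    static List<String> writeAndMerge(int workers, int linesPerWorker) {
        try {
            Path dir = Files.createTempDirectory("crawl-out");
            List<Path> parts = new ArrayList<>();
            List<Thread> threads = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int id = w;
                final Path part = dir.resolve("part-" + id + ".txt");
                parts.add(part);
                Thread t = new Thread(() -> writePart(part, id, linesPerWorker));
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) {
                t.join(); // merge only after every worker is done
            }
            Path merged = dir.resolve("merged.txt");
            for (Path part : parts) {
                Files.write(merged, Files.readAllLines(part, StandardCharsets.UTF_8),
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
            return Files.readAllLines(merged, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndMerge(3, 2));
    }
}
```

The trade-off against the single-file option is an extra merge step in exchange for not having to coordinate writes while the crawlers are running.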

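  18. Sample Code Snippet

      // Needed at the top of MyCrawler.java:
      import java.io.FileNotFoundException;
      import java.io.PrintStream;

      // One output stream, shared by all crawler instances because it is static.
      private static PrintStream out;

      // Opened once, when the class is loaded.
      static {
          try {
              out = new PrintStream("/home/sara/Wikipedia2/Train-Test-Features/testUserPageStatus.txt");
          } catch (FileNotFoundException e) {
              e.printStackTrace();
          }
      }

      public MyCrawler() {
      }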
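When a single stream like the static `PrintStream` above is shared by all crawler threads, a record that spans several lines can be interleaved with another thread's output. One way to avoid that, sketched here with hypothetical names (`SharedOutput`, `writeRecord`, `demo`) and an in-memory stream instead of a real file, is to put the whole record behind a `synchronized` method.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical wrapper around one shared stream; names are illustrative.
class SharedOutput {
    private final PrintStream out;

    SharedOutput(PrintStream out) {
        this.out = out;
    }

    // synchronized: a two-line record is written as a unit, so records
    // from different threads cannot interleave.
    synchronized void writeRecord(String url, int termCount) {
        out.println("URL: " + url);
        out.println("TERMS: " + termCount);
    }

    // Runs several writer threads against an in-memory stream and returns what was written.
    static String demo(int threadCount, int recordsPerThread) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        SharedOutput shared = new SharedOutput(new PrintStream(buffer, true));
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    shared.writeRecord("page-" + id + "-" + i, i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo(2, 2));
    }
}
```

The order of records across threads is still not deterministic; only the lines within one record are guaranteed to stay together.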
