Agenda
• Overview of the project
• Resources
CS172 Project
• First phase: crawling
• Second phase: indexing and ranking
Phase 1 Options
• Web data
  • You need to come up with your own crawling strategy
• Twitter data
  • You can use a third-party library for the Twitter Streaming API
  • Still needs some web crawling
Crawling: the basic loop
• Frontier (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis)
1. getNext(): take the next URL from the Frontier and download the contents of the page
2. Parse the downloaded file to extract the links on the page
3. Clean and normalize the extracted links
4. addAll(List<URL>): store the extracted links in the Frontier
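To make the loop concrete, here is a minimal Java sketch. The class and method names (Crawler, download, extractLinks, cleanAndNormalize) are illustrative placeholders rather than anything the project requires; the following slides sketch how the placeholders can be filled in.

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class Crawler {

    // The Frontier: URLs waiting to be downloaded.
    private final Queue<URL> frontier = new ArrayDeque<>();
    // URLs already added to the Frontier, so each page is crawled only once.
    private final Set<String> seen = new HashSet<>();

    public void crawl(URL seed) {
        frontier.add(seed);
        seen.add(seed.toExternalForm());

        while (!frontier.isEmpty()) {
            URL current = frontier.poll();                      // 1. getNext()
            String html = download(current);                    // 1. download the page contents
            List<String> rawLinks = extractLinks(html);         // 2. parse out the links
            for (String link : rawLinks) {
                URL cleaned = cleanAndNormalize(current, link); // 3. clean and normalize
                if (cleaned != null && seen.add(cleaned.toExternalForm())) {
                    frontier.add(cleaned);                      // 4. store in the Frontier
                }
            }
        }
    }

    // Placeholders: the following slides sketch possible implementations.
    private String download(URL url) { throw new UnsupportedOperationException(); }
    private List<String> extractLinks(String html) { throw new UnsupportedOperationException(); }
    private URL cleanAndNormalize(URL base, String link) { throw new UnsupportedOperationException(); }
}
```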
2. Parsing HTML to extract links
• Raw HTML is what you will see when you download a page; notice the HTML tags you need to parse to find the links.
2. Parsing the HTML file
• Write your own parser. One suggestion: parse the HTML file as XML, using one of two parsing methods:
  • SAX (Simple API for XML)
  • DOM (Document Object Model)
• Use an existing library:
  • JSoup (http://jsoup.org/) – can also be used to download the page
  • HTML Parser (http://htmlparser.sourceforge.net/)
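For example, a minimal link-extraction sketch with JSoup (assuming the jsoup jar is on your classpath; the class name LinkExtractor is just illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Download a page with JSoup and return the absolute URLs of all links on it.
    public static List<String> extractLinks(String pageUrl) throws IOException {
        Document doc = Jsoup.connect(pageUrl).get();   // download and parse the page
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {      // every <a> tag with an href attribute
            links.add(a.attr("abs:href"));             // "abs:" resolves relative links against the page URL
        }
        return links;
    }
}
```

Note that the "abs:" prefix already resolves relative URLs for you; if you write your own parser instead, you have to do that normalization yourself, as described on the following slides.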
2. Parsing the HTML file
• Things to think about:
  • How do you handle malformed HTML? A browser can still display it, but how does your parser deal with it?
3. Clean extracted URLs
• Some URL entries you will see while crawling www.cs.ucr.edu:
  • /intranet/
  • /inventthefuture.html
  • systems.engr.ucr.edu
  • news/e-newsletter.html
  • http://www.engr.ucr.edu/sendmail.html
  • http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
  • /faculty/
  • /
  • /about/
  • #main
  • http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
3. Clean extracted URLs: what to avoid
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
• Bookmarks (#main): bookmarks should be stripped off
• Self paths: /
• Avoid downloading PDFs or images
  • /news/GraphenePublicationsIndex.pdf
  • It's OK to download them, but you cannot parse them
• Take care of invalid characters in URLs
  • Space: www.cs.ucr.edu/vagelis hristidis
  • Ampersand: www.cs.ucr.edu/vagelis&hristidis
  • These characters should be encoded, otherwise you will get a MalformedURLException
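A rough sketch of these cleaning rules (the class name and the exact list of skipped file extensions are illustrative; you will likely need to extend them for what your crawler encounters; duplicates are handled by the seen set in the crawler loop):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlFilter {

    // Returns a cleaned absolute URL string, or null if the link should be skipped.
    public static String clean(String link) {
        // Strip bookmarks such as "#main".
        int hash = link.indexOf('#');
        if (hash >= 0) link = link.substring(0, hash);

        // Skip empty links and self paths like "/".
        if (link.isEmpty() || link.equals("/")) return null;

        // Encode characters that would otherwise cause a MalformedURLException.
        link = link.replace(" ", "%20");

        try {
            URL url = new URL(link);
            if (!url.getProtocol().equals("http")) return null;   // parse only http links
            String path = url.getPath().toLowerCase();
            if (path.endsWith(".pdf") || path.endsWith(".jpg")
                    || path.endsWith(".png") || path.endsWith(".gif")) {
                return null;                                       // downloadable, but not parseable
            }
            return url.toExternalForm();
        } catch (MalformedURLException e) {
            return null;   // relative URLs need to be normalized first (see the next slides)
        }
    }
}
```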
Normalize links found on the page
• Relative URLs: these URLs have no host address
• E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
  • Case 1: /find_people.php
    • A "/" at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case)
  • Case 2: all
    • No "/" means the path is relative to the current path
• Normalize them (respectively) to:
  • www.cs.ucr.edu/find_people.php
  • www.cs.ucr.edu/faculty/all
Clean extracted URLs: the different parts of a URL
• Example: http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533
  • Protocol: http
  • Port: 8080
  • Host: www.pe.com
  • Path: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
  • Query: ssimg=532988
  • Bookmark: #ssStory533
java.net.URL
• Has methods that can separate the different parts of a URL:
  • getProtocol: http
  • getHost: www.pe.com
  • getPort: -1 (no explicit port in this URL)
  • getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
  • getQuery: ssimg=532988
  • getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
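A quick sketch of those calls on the example URL:

```java
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.pe.com/local-news/riverside-county/riverside/"
                + "riverside-headlines-index/20120408-riverside-ucr-develops-"
                + "sensory-detection-for-smartphones.ece?ssimg=532988");

        System.out.println(url.getProtocol()); // http
        System.out.println(url.getHost());     // www.pe.com
        System.out.println(url.getPort());     // -1 (no explicit port in this URL)
        System.out.println(url.getPath());     // the path part of the URL
        System.out.println(url.getQuery());    // ssimg=532988
        System.out.println(url.getFile());     // path + "?" + query
    }
}
```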
Normalizing with java.net.URL
• You can normalize URLs with simple string manipulations together with methods from the java.net.URL class.
• A snippet for normalizing the "Case 1" root-relative (and "Case 2" path-relative) URLs follows below.
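A minimal sketch, assuming the two cases from the previous slide; the class name UrlNormalizer is illustrative, and the two-argument java.net.URL constructor does the actual resolution:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlNormalizer {

    // Resolve a link found on a page against that page's URL.
    // Handles Case 1 (root-relative, "/find_people.php") and Case 2 (path-relative, "all").
    public static String normalize(URL page, String link) throws MalformedURLException {
        URL resolved = new URL(page, link);   // resolves the link the same way a browser would
        return resolved.toExternalForm();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Note the trailing slash: without it, "all" would resolve relative to the parent directory.
        URL page = new URL("http://www.cs.ucr.edu/faculty/");
        System.out.println(normalize(page, "/find_people.php")); // http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize(page, "all"));              // http://www.cs.ucr.edu/faculty/all
    }
}
```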
Crawler Ethics
• Some websites don't want crawlers swarming all over them. Why?
  • It increases load on the server
  • Private websites
  • Dynamic websites
  • …
Crawler Ethics
• How does the website tell you (the crawler) what, if anything, is off limits?
• Two options:
  • Site-wide restrictions: robots.txt
  • Webpage-specific restrictions: meta tags
Crawler Ethics: robots.txt
• A file called "robots.txt" in the root directory of the website
• Example: http://www.about.com/robots.txt
• Format:
  User-Agent: <crawler name>
  Disallow: <paths you must not follow>
  Allow: <paths you may follow>
Crawler Ethics: robots.txt
• What should you do? Before starting on a new website:
  • Check if robots.txt exists.
  • If it does, download it and parse all inclusions and exclusions for the "generic crawler", i.e., User-Agent: *
  • Don't crawl anything in the exclusion list, including its sub-directories
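A simplified sketch of such a check (it only honors Disallow lines in the User-Agent: * section and ignores Allow and wildcard rules, so treat it as a starting point rather than a complete robots.txt parser):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxt {

    // Paths disallowed for the generic crawler (User-Agent: *).
    private final List<String> disallowed = new ArrayList<>();

    public RobotsTxt(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream()))) {
            boolean genericSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    genericSection = line.substring(11).trim().equals("*");
                } else if (genericSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        } catch (Exception e) {
            // No robots.txt (or fetch failed): treat nothing as disallowed.
        }
    }

    // True if the crawler is allowed to fetch this path.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;   // covers sub-directories as well
        }
        return true;
    }
}
```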
Crawler Ethics: webpage-specific meta tags
• Some webpages have one of the following meta-tag entries:
  • <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
  • <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  • <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
  • INDEX or NOINDEX
  • FOLLOW or NOFOLLOW
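If you use JSoup, checking for these directives could look like the following sketch (the class name MetaRobots is illustrative):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRobots {

    // Returns the content of the robots meta tag (e.g. "NOINDEX, FOLLOW"), or "" if the page has none.
    public static String robotsDirective(Document doc) {
        for (Element meta : doc.select("meta[name]")) {
            if (meta.attr("name").equalsIgnoreCase("robots")) {
                return meta.attr("content").toUpperCase();
            }
        }
        return "";   // no robots meta tag: no page-specific restriction
    }

    public static boolean mayFollow(Document doc) {
        return !robotsDirective(doc).contains("NOFOLLOW");
    }

    public static boolean mayIndex(Document doc) {
        return !robotsDirective(doc).contains("NOINDEX");
    }
}
```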
Twitter data collection
• Collect tweets through the Twitter Streaming API
• See https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema
• Rate limit: you get up to 1% of the whole Twitter traffic, i.e., about 4.3M tweets per day (about 2GB)
• You need a Twitter account for this; check https://dev.twitter.com/
Third-party library: Twitter4j for Java
• You can also find libraries that support other languages
• Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
Important fields
• You should save at least the following fields:
  • Text
  • Timestamp
  • Geolocation
  • User of the tweet
  • Links
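A Twitter4j sketch that reads these fields from the sample stream (you still need to register an app on dev.twitter.com and supply your OAuth keys, e.g., in a twitter4j.properties file; the listener below just prints the fields, while your collector would write them to disk):

```java
import twitter4j.*;

public class TweetCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // The fields you should save for each tweet:
                System.out.println(status.getText());                  // text
                System.out.println(status.getCreatedAt());             // timestamp
                System.out.println(status.getGeoLocation());           // geolocation (may be null)
                System.out.println(status.getUser().getScreenName());  // user of the tweet
                for (URLEntity url : status.getURLEntities()) {
                    System.out.println(url.getExpandedURL());          // links (crawl these later)
                }
            }
            @Override public void onDeletionNotice(StatusDeletionNotice notice) {}
            @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) {}
            @Override public void onScrubGeo(long userId, long upToStatusId) {}
            @Override public void onStallWarning(StallWarning warning) {}
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });
        stream.sample();   // the ~1% random sample of public tweets
    }
}
```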
Crawl links in tweets
• Tweets may contain links, which may point to useful information, e.g., news articles
• After collecting the tweets, use a separate process to crawl those links
• Web crawling is slower than consuming the stream, so you may not want to crawl a link right after you receive the tweet