Agenda
• Overview of the project
• Resources
CS172 Project
• First phase: crawling
• Second phase: indexing and ranking
Phase 1 Options
• Web data
  • You need to come up with your own crawling strategy
• Twitter data
  • You can use a third-party library for the Twitter Streaming API
  • Still needs some web crawling
Crawling: the basic loop
• Frontier (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis)
1. getNext(): take the next URL from the Frontier and download the contents of the page
2. Parse the downloaded file to extract the links on the page
3. Clean and normalize the extracted links
4. addAll(List<URL>): store the extracted links in the Frontier
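To make the loop concrete, here is a minimal Java sketch. The class and method names (Crawler, download, extractLinks, cleanAndNormalize) are illustrative placeholders rather than anything the project requires; the following slides sketch how the placeholders can be filled in.

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class Crawler {

    // The Frontier: URLs waiting to be downloaded.
    private final Queue<URL> frontier = new ArrayDeque<>();
    // URLs already added to the Frontier, so each page is crawled only once.
    private final Set<String> seen = new HashSet<>();

    public void crawl(URL seed) {
        frontier.add(seed);
        seen.add(seed.toExternalForm());

        while (!frontier.isEmpty()) {
            URL current = frontier.poll();                      // 1. getNext()
            String html = download(current);                    // 1. download the page contents
            List<String> rawLinks = extractLinks(html);         // 2. parse out the links
            for (String link : rawLinks) {
                URL cleaned = cleanAndNormalize(current, link); // 3. clean and normalize
                if (cleaned != null && seen.add(cleaned.toExternalForm())) {
                    frontier.add(cleaned);                      // 4. store in the Frontier
                }
            }
        }
    }

    // Placeholders: the following slides sketch possible implementations.
    private String download(URL url) { throw new UnsupportedOperationException(); }
    private List<String> extractLinks(String html) { throw new UnsupportedOperationException(); }
    private URL cleanAndNormalize(URL base, String link) { throw new UnsupportedOperationException(); }
}
```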
2. Parsing HTML to extract links
• Raw HTML is what you will see when you download a page; notice the HTML tags you need to parse to find the links.
2. Parsing the HTML file
• Write your own parser. One suggestion: parse the HTML file as XML, using one of two parsing methods:
  • SAX (Simple API for XML)
  • DOM (Document Object Model)
• Use an existing library:
  • JSoup (http://jsoup.org/) – can also be used to download the page
  • HTML Parser (http://htmlparser.sourceforge.net/)
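For example, a minimal link-extraction sketch with JSoup (assuming the jsoup jar is on your classpath; the class name LinkExtractor is just illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Download a page with JSoup and return the absolute URLs of all links on it.
    public static List<String> extractLinks(String pageUrl) throws IOException {
        Document doc = Jsoup.connect(pageUrl).get();   // download and parse the page
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {      // every <a> tag with an href attribute
            links.add(a.attr("abs:href"));             // "abs:" resolves relative links against the page URL
        }
        return links;
    }
}
```

Note that the "abs:" prefix already resolves relative URLs for you; if you write your own parser instead, you have to do that normalization yourself, as described on the following slides.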
2. Parsing the HTML file
• Things to think about:
  • How do you handle malformed HTML? A browser can still display it, but how does your parser deal with it?
3. Clean extracted URLs
• Some URL entries you will see while crawling www.cs.ucr.edu:
  • /intranet/
  • /inventthefuture.html
  • systems.engr.ucr.edu
  • news/e-newsletter.html
  • http://www.engr.ucr.edu/sendmail.html
  • http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
  • /faculty/
  • /
  • /about/
  • #main
  • http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
3. Clean extracted URLs: what to avoid
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
• Bookmarks (#main): bookmarks should be stripped off
• Self paths: /
• Avoid downloading PDFs or images
  • /news/GraphenePublicationsIndex.pdf
  • It's OK to download them, but you cannot parse them
• Take care of invalid characters in URLs
  • Space: www.cs.ucr.edu/vagelis hristidis
  • Ampersand: www.cs.ucr.edu/vagelis&hristidis
  • These characters should be encoded, otherwise you will get a MalformedURLException
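A rough sketch of these cleaning rules (the class name and the exact list of skipped file extensions are illustrative; you will likely need to extend them for what your crawler encounters; duplicates are handled by the seen set in the crawler loop):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlFilter {

    // Returns a cleaned absolute URL string, or null if the link should be skipped.
    public static String clean(String link) {
        // Strip bookmarks such as "#main".
        int hash = link.indexOf('#');
        if (hash >= 0) link = link.substring(0, hash);

        // Skip empty links and self paths like "/".
        if (link.isEmpty() || link.equals("/")) return null;

        // Encode characters that would otherwise cause a MalformedURLException.
        link = link.replace(" ", "%20");

        try {
            URL url = new URL(link);
            if (!url.getProtocol().equals("http")) return null;   // parse only http links
            String path = url.getPath().toLowerCase();
            if (path.endsWith(".pdf") || path.endsWith(".jpg")
                    || path.endsWith(".png") || path.endsWith(".gif")) {
                return null;                                       // downloadable, but not parseable
            }
            return url.toExternalForm();
        } catch (MalformedURLException e) {
            return null;   // relative URLs need to be normalized first (see the next slides)
        }
    }
}
```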
Normalize links found on the page
• Relative URLs: these URLs have no host address
• E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
  • Case 1: /find_people.php
    • A "/" at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case)
  • Case 2: all
    • No "/" means the path is relative to the current path
• Normalize them (respectively) to:
  • www.cs.ucr.edu/find_people.php
  • www.cs.ucr.edu/faculty/all
Clean extracted URLs: the different parts of a URL
• Example: http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533
  • Protocol: http
  • Port: 8080
  • Host: www.pe.com
  • Path: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
  • Query: ssimg=532988
  • Bookmark: #ssStory533
java.net.URL
• Has methods that can separate the different parts of a URL:
  • getProtocol: http
  • getHost: www.pe.com
  • getPort: -1 (no explicit port in this URL)
  • getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
  • getQuery: ssimg=532988
  • getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
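A quick sketch of those calls on the example URL:

```java
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.pe.com/local-news/riverside-county/riverside/"
                + "riverside-headlines-index/20120408-riverside-ucr-develops-"
                + "sensory-detection-for-smartphones.ece?ssimg=532988");

        System.out.println(url.getProtocol()); // http
        System.out.println(url.getHost());     // www.pe.com
        System.out.println(url.getPort());     // -1 (no explicit port in this URL)
        System.out.println(url.getPath());     // the path part of the URL
        System.out.println(url.getQuery());    // ssimg=532988
        System.out.println(url.getFile());     // path + "?" + query
    }
}
```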
Normalizing with java.net.URL
• You can normalize URLs with simple string manipulations together with methods from the java.net.URL class.
• A snippet for normalizing the "Case 1" root-relative (and "Case 2" path-relative) URLs follows below.
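A minimal sketch, assuming the two cases from the previous slide; the class name UrlNormalizer is illustrative, and the two-argument java.net.URL constructor does the actual resolution:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlNormalizer {

    // Resolve a link found on a page against that page's URL.
    // Handles Case 1 (root-relative, "/find_people.php") and Case 2 (path-relative, "all").
    public static String normalize(URL page, String link) throws MalformedURLException {
        URL resolved = new URL(page, link);   // resolves the link the same way a browser would
        return resolved.toExternalForm();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Note the trailing slash: without it, "all" would resolve relative to the parent directory.
        URL page = new URL("http://www.cs.ucr.edu/faculty/");
        System.out.println(normalize(page, "/find_people.php")); // http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize(page, "all"));              // http://www.cs.ucr.edu/faculty/all
    }
}
```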
Crawler Ethics
• Some websites don't want crawlers swarming all over them. Why?
  • It increases load on the server
  • Private websites
  • Dynamic websites
  • …
Crawler Ethics
• How does the website tell you (the crawler) what, if anything, is off limits?
• Two options:
  • Site-wide restrictions: robots.txt
  • Webpage-specific restrictions: meta tags
Crawler Ethics: robots.txt
• A file called "robots.txt" in the root directory of the website
• Example: http://www.about.com/robots.txt
• Format:
  User-Agent: <crawler name>
  Disallow: <paths you must not follow>
  Allow: <paths you may follow>
Crawler Ethics: robots.txt
• What should you do? Before starting on a new website:
  • Check if robots.txt exists.
  • If it does, download it and parse all inclusions and exclusions for the "generic crawler", i.e., User-Agent: *
  • Don't crawl anything in the exclusion list, including its sub-directories
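A simplified sketch of such a check (it only honors Disallow lines in the User-Agent: * section and ignores Allow and wildcard rules, so treat it as a starting point rather than a complete robots.txt parser):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxt {

    // Paths disallowed for the generic crawler (User-Agent: *).
    private final List<String> disallowed = new ArrayList<>();

    public RobotsTxt(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream()))) {
            boolean genericSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    genericSection = line.substring(11).trim().equals("*");
                } else if (genericSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        } catch (Exception e) {
            // No robots.txt (or fetch failed): treat nothing as disallowed.
        }
    }

    // True if the crawler is allowed to fetch this path.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;   // covers sub-directories as well
        }
        return true;
    }
}
```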
Crawler Ethics: webpage-specific meta tags
• Some webpages have one of the following meta-tag entries:
  • <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
  • <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  • <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
  • INDEX or NOINDEX
  • FOLLOW or NOFOLLOW
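If you use JSoup, checking for these directives could look like the following sketch (the class name MetaRobots is illustrative):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRobots {

    // Returns the content of the robots meta tag (e.g. "NOINDEX, FOLLOW"), or "" if the page has none.
    public static String robotsDirective(Document doc) {
        for (Element meta : doc.select("meta[name]")) {
            if (meta.attr("name").equalsIgnoreCase("robots")) {
                return meta.attr("content").toUpperCase();
            }
        }
        return "";   // no robots meta tag: no page-specific restriction
    }

    public static boolean mayFollow(Document doc) {
        return !robotsDirective(doc).contains("NOFOLLOW");
    }

    public static boolean mayIndex(Document doc) {
        return !robotsDirective(doc).contains("NOINDEX");
    }
}
```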
Twitter data collection
• Collect tweets through the Twitter Streaming API
• See https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema
• Rate limit: you get up to 1% of the whole Twitter traffic, i.e., about 4.3M tweets per day (about 2GB)
• You need a Twitter account for this; check https://dev.twitter.com/
Third-party library: Twitter4j for Java
• You can also find libraries that support other languages
• Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
Important fields
• You should save at least the following fields:
  • Text
  • Timestamp
  • Geolocation
  • User of the tweet
  • Links
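A Twitter4j sketch that reads these fields from the sample stream (you still need to register an app on dev.twitter.com and supply your OAuth keys, e.g., in a twitter4j.properties file; the listener below just prints the fields, while your collector would write them to disk):

```java
import twitter4j.*;

public class TweetCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // The fields you should save for each tweet:
                System.out.println(status.getText());                  // text
                System.out.println(status.getCreatedAt());             // timestamp
                System.out.println(status.getGeoLocation());           // geolocation (may be null)
                System.out.println(status.getUser().getScreenName());  // user of the tweet
                for (URLEntity url : status.getURLEntities()) {
                    System.out.println(url.getExpandedURL());          // links (crawl these later)
                }
            }
            @Override public void onDeletionNotice(StatusDeletionNotice notice) {}
            @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) {}
            @Override public void onScrubGeo(long userId, long upToStatusId) {}
            @Override public void onStallWarning(StallWarning warning) {}
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });
        stream.sample();   // the ~1% random sample of public tweets
    }
}
```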
Crawl links in tweets
• Tweets may contain links, which may point to useful information, e.g., news articles
• After collecting the tweets, use a separate process to crawl those links
• Web crawling is slower than consuming the stream, so you may not want to crawl a link right after you receive the tweet