BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database. 2013 Open Seminar Series 6 Open Geospatial Informatics Cheng-Ying Liu (Sean) bermuda@citi.sinica.edu.tw http://bermuda.citi.sinica.edu.tw
WHAT IS BIG DATA ? In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. 《Wikipedia Big data》 Source: http://en.wikipedia.org/wiki/Big_data
WHAT IS BIG DATA ? • In 2001, Doug Laney used the 3V model to describe Big Data • Volume: amount of data • Velocity: speed of data in and out • Variety: range of data types and sources • Veracity: truth or fact of data (a fourth V added by later authors, not part of Laney's original three)
WHAT IS BIG DATA ? • In 2012, Gartner updated the definition • Still advocates the 3V model for describing data • Big data requires new forms of processing to enable: • Enhanced decision making • Insight discovery • Process optimization
HOW BIG IS BIG DATA ? • Beyond the ability of commonly used tools • From a few dozen terabytes (1 TB = 10^12 bytes) to many petabytes (1 PB = 10^15 bytes) • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day
NEW TECHNOLOGY FOR BIG DATA • Hadoop • Developed by the Apache Software Foundation • Derived from Google's MapReduce & Google File System • Able to process petabyte-scale data • NoSQL (Not Only SQL) • Relational databases are not suitable for all cases • NoSQL is a new choice for non-relational databases • Adopted by Google, Facebook, Twitter, etc.
WHAT IS TWITTER? • The fastest, simplest way to communicate • More than 140M active users • Majority of traffic comes from mobile • 60% of users are outside the U.S. • More than 400M twitter.com visitors • More than 400M tweets/day (peak: 25K/sec) • 1,000 employees (majority in San Francisco) • 50% of employees are engineers • Expected to reach nearly $1 billion in global ad revenue in 2014 (eMarketer estimate)
TWITTER HISTORY • Evan Williams on the genesis of Twitter, ICWSM, April 2007: • A side project started from Jack Dorsey’s idea Oct, 2006 • Wanted a ubiquitous status message • A community of people answering the question “what are you doing?” • Exploded at SXSW, SF earthquakes (2011) • Good for collective “backchanneling” • High “Ambient intimacy” • Huge API usage was unexpected, as was the rise of the @ sign for replies
HOW BIG IS TWITTER ? Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
IT’S NOT JUST BIG! IT’S FRESH! Source: http://xkcd.com/723/
TWITTER TOWN HALL July 6, 2011
TWITTER STATS Mapping the global Twitter heartbeat: The geography of Twitter, May 2013 Source: http://www.sgi.com/go/twitter/images/hires/figure4.png
TWITTER STATS Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.
TWITTER ACCOUNT • Register a Twitter account (required)
REGISTER A TWITTER APPLICATION • Twitter developer web site: https://dev.twitter.com/ • Select “My applications”
REGISTER A TWITTER APPLICATION • Click “Create a new application” Application List
REGISTER A TWITTER APPLICATION • Fill in the required information
REGISTER A TWITTER APPLICATION • Agree to the developer rules and complete the CAPTCHA
REGISTER A TWITTER APPLICATION • Go back to application list and click your application • Click “Settings”
REGISTER A TWITTER APPLICATION • Select “Read, Write and Access direct messages” • Click “Update this Twitter application’s settings”
REGISTER A TWITTER APPLICATION • Click “Create my access token”
REST API Source: https://dev.twitter.com/docs/streaming-apis
STREAMING API Source: https://dev.twitter.com/docs/streaming-apis
TWEET CRAWL API Source: https://dev.twitter.com/docs/api/1.1 Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits
tmhOAuth LIBRARY • Website: https://github.com/themattharris/tmhOAuth • $ git clone https://github.com/themattharris/tmhOAuth.git • Current Version 0.8.2 • Author: Matt Harris @themattharris • Goals: • Support OAuth 1.0A • Use authorization headers instead of query string or POST parameters • Allow uploading of images • Provide enough information to assist with debugging
CRAWLING WITH REST API • Create an OAuth object that holds the authentication tokens • Set the parameters for the API call • Use the Twitter REST API to obtain tweets
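The "set parameters" and "obtain tweets" steps can be sketched without touching the network. The sketch below is in Python rather than the PHP tmhOAuth library the slides use, and it only builds the request for the v1.1 search endpoint; `build_search_request` and `next_since_id` are hypothetical helper names, and signing and sending the request are left out.

```python
from urllib.parse import urlencode

# Search endpoint of the v1.1 REST API that the slides link to.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_request(query, count=100, since_id=None, lang=None):
    """Return (url, params) for one search call; nothing is sent here."""
    # The search API returns at most 100 tweets per request.
    params = {"q": query, "count": min(count, 100)}
    if since_id is not None:
        params["since_id"] = since_id  # only tweets newer than this id
    if lang is not None:
        params["lang"] = lang
    return SEARCH_URL + "?" + urlencode(params), params


def next_since_id(statuses, previous=None):
    """Highest tweet id seen so far, to pass as since_id on the next poll."""
    ids = [status["id"] for status in statuses]
    if previous is not None:
        ids.append(previous)
    return max(ids) if ids else None


if __name__ == "__main__":
    url, _ = build_search_request("big data", since_id=10)
    print(url)
```

Carrying `since_id` forward between polls is one simple way to avoid fetching the same tweets twice; the slides do not say which paging strategy the original crawler used.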
CRAWLING WITH STREAMING API • Create an OAuth object that holds the authentication tokens • Set the parameters for the API call • Open a persistent connection to the Twitter server
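Once the streaming connection is open, the crawler receives a long-lived response in which each message is a JSON document on its own line, with blank lines sent as keep-alives. A minimal Python sketch of the reading loop follows (the slides use PHP; `iter_tweets` is a hypothetical name, and the connection itself is replaced by any iterable of lines so the logic can be tried offline).

```python
import json


def iter_tweets(lines):
    """Yield tweet dicts from an iterable of raw stream lines."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line:
            continue  # blank keep-alive line
        try:
            message = json.loads(line)
        except ValueError:
            continue  # partial or corrupt line; skip it
        # Non-tweet messages (e.g. limit or delete notices) lack these keys.
        if isinstance(message, dict) and "id" in message and "text" in message:
            yield message


if __name__ == "__main__":
    sample = [b'{"id": 1, "text": "hello"}\r\n', b"\r\n"]
    for tweet in iter_tweets(sample):
        print(tweet["id"], tweet["text"])
```

Filtering on the presence of `id` and `text` is an assumption made for this sketch; a production crawler would probably also want to log the skipped messages.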
WHAT IS OAuth ? • OAuth = Open Authorization • What is OAuth: • An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications. • OAuth endpoints: • Request token URL • Authorize URL • Access token URL
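Under OAuth 1.0A, every request carries a signature computed from the HTTP method, the URL, and the request parameters. The Python sketch below shows the HMAC-SHA1 signing step using only the standard library; it is not the tmhOAuth code from the slides, the function names are hypothetical, and nonce and timestamp generation are assumed to have been done by the caller.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """RFC 3986 encoding: only unreserved characters stay literal."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted 'key=value' pairs."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature for one request."""
    # The signing key joins both secrets with '&', even if one is empty.
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Placeholder URL and secrets, for illustration only.
    print(sign("GET", "https://api.example.com/x", {"a": "1"}, "consumer", "token"))
```

The resulting value is sent as `oauth_signature` in the Authorization header, which matches the library goal of using authorization headers instead of query string or POST parameters.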
SEARCH PARAMETERS (REST) Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets
SEARCH PARAMETERS (STREAMING) Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter
CRAWLING EFFICIENCY • Duration: May 6th to June 30th in 2012 (55 days) • REST API • Maximum TPS: 450 × 100 / (15 × 60) = 50 (tweets/sec), i.e. 450 requests per 15-minute window at up to 100 tweets per request • Streaming API • Randomly returns tweets containing a specific search keyword • The total volume never exceeds 1% of the full public stream
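The "Maximum TPS" figures on this slide appear to combine a rate limit of 450 requests per 15-minute window with up to 100 tweets per request. Under that reading, which is an interpretation of the slide rather than something it states outright, the arithmetic and the resulting pacing between requests can be checked with a few lines of Python (function names are hypothetical):

```python
def max_tweets_per_second(requests_per_window, tweets_per_request, window_seconds):
    """Upper bound on throughput if every request returns a full page."""
    return requests_per_window * tweets_per_request / window_seconds


def seconds_between_requests(requests_per_window, window_seconds):
    """Even spacing that uses the whole quota without exceeding it."""
    return window_seconds / requests_per_window


WINDOW = 15 * 60  # one rate-limit window, in seconds

if __name__ == "__main__":
    print(max_tweets_per_second(450, 100, WINDOW))  # 50.0
    print(seconds_between_requests(450, WINDOW))    # 2.0
```

The 50 tweets/sec figure is a ceiling: it is reached only when every request returns 100 new tweets, so a crawler polling a quiet keyword will see far less.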
SINGLE NODE CRAWLING TYPE • Guidelines for single-node crawling: • Each stream must authenticate itself • Total data size seems bounded (i.e. the number of tweets delivered to a crawler is limited) • Avoid connecting to the Twitter server aggressively • Crawling with different Twitter accounts is recommended [Figure: a single Tweet Crawler receiving Tweets Streaming A, B, C, … from the Twitter Server]
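One common way to avoid connecting aggressively is to wait longer after each failed attempt. The Python sketch below shows exponential backoff with a cap; the starting delay, factor, and ceiling are illustrative assumptions rather than values taken from the slides, and `connect` and `sleep` are passed in so the logic can be exercised without a network.

```python
import itertools


def backoff_delays(initial=5.0, factor=2.0, maximum=320.0):
    """Yield an endless series of delays that grows up to a ceiling."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def connect_with_backoff(connect, sleep, max_attempts=5):
    """Call connect() until it succeeds, sleeping longer after each failure."""
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(next(delays))


if __name__ == "__main__":
    print(list(itertools.islice(backoff_delays(), 8)))
```

In a real crawler `sleep` would be `time.sleep` and `connect` would open the streaming connection for one account.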
MULTI-NODE CRAWLING TYPE • Guidelines for multi-node crawling: • Automatically check connection status • Automatically update the database's summary information • Design the crawl program with good log-file reporting • Design a good database schema for distributed access [Figure: the Twitter Server sends Tweets Streaming A and Tweets Streaming B to separate Tweet Crawlers]
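Checking connection status automatically can be as simple as tracking when each crawler last received data. The Python sketch below is one possible shape for such a monitor; the class name, the 90-second threshold, and the summary fields are all assumptions made for illustration, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class StreamMonitor:
    """Tracks activity on one stream and flags it when it goes quiet."""

    def __init__(self, stall_after=90.0, clock=time.monotonic):
        self.stall_after = stall_after  # seconds of silence before a stall
        self.clock = clock
        self.last_seen = clock()
        self.received = 0

    def record(self, count=1):
        """Call whenever tweets (or keep-alives, with count=0) arrive."""
        self.last_seen = self.clock()
        self.received += count

    def is_stalled(self):
        return self.clock() - self.last_seen > self.stall_after

    def summary(self):
        """Values a node could write to its log or to a summary table."""
        return {
            "received": self.received,
            "idle_seconds": self.clock() - self.last_seen,
            "stalled": self.is_stalled(),
        }


if __name__ == "__main__":
    monitor = StreamMonitor()
    monitor.record()
    print(monitor.summary())
```

Each node could write `summary()` to its log file at a fixed interval and reconnect when `is_stalled()` turns true.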
SETTING ENVIRONMENT • Install packages • # apt-get install php5 php5-curl • # apt-get install mysql-client mysql-server • # apt-get install phpmyadmin • Select Apache2 as the web server when installing phpmyadmin
SETTING ENVIRONMENT • Create the database and table for Tweet crawling • Create a *.sql file that defines the database schema • Change directory to where that file is located • # mysql -h {$HOST} -u {$USER} -p{$PASSWORD} • mysql> \. {$SQL_FILE}
SETTING ENVIRONMENT • Check the database with phpMyAdmin • Open a browser and go to http://localhost/phpmyadmin • Select the database and check its structure
CRAWLING REAL-TIME TWEETS • Connect to the database • Save each Tweet into the database
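The two steps above can be sketched end to end. To keep the example self-contained, it uses Python's built-in `sqlite3` module as a stand-in for the MySQL server the slides install, and the `tweets` table layout is a hypothetical schema, not the one from the seminar. With MySQL the same idea would be expressed with `INSERT IGNORE` through a MySQL client library.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical minimal schema; the primary key rejects duplicate tweets.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    created_at TEXT    NOT NULL,
    text       TEXT    NOT NULL
)
"""


def parse_created_at(value):
    """Turn 'Wed Aug 27 13:08:45 +0000 2008' into 'YYYY-MM-DD HH:MM:SS' (UTC)."""
    parsed = datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def save_tweet(conn, tweet):
    """Insert one tweet; return True if stored, False if already present."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO tweets (id, user_id, created_at, text) "
        "VALUES (?, ?, ?, ?)",
        (
            tweet["id"],
            tweet["user"]["id"],
            parse_created_at(tweet["created_at"]),
            tweet["text"],
        ),
    )
    conn.commit()
    return cursor.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # connect to the database
    conn.execute(SCHEMA)
    sample = {
        "id": 1,
        "user": {"id": 42},
        "created_at": "Wed Aug 27 13:08:45 +0000 2008",
        "text": "hello",
    }
    print(save_tweet(conn, sample))
```

Using the tweet id as the primary key means a crawler that reconnects and receives overlapping data does not store the same tweet twice.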