BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database

2013 Open Seminar Series 6 Open Geospatial Informatics Cheng-Ying Liu (Sean) bermuda@citi.sinica.edu.tw http://bermuda.citi.sinica.edu.tw

BIG Data & Twitter

WHAT IS BIG DATA ? In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. 《Wikipedia Big data》 Source: http://en.wikipedia.org/wiki/Big_data

WHAT IS BIG DATA ? • In 2001, Doug Laney used the 3V model to describe Big Data • Volume: amount of data • Velocity: speed of data in and out • Variety: range of data types and sources • Veracity (a fourth V added later by others): trustworthiness of the data

WHAT IS BIG DATA ? • In 2012, Gartner updated the definition • Still advocates the 3V model for describing data • Requires new forms of processing to enable: • Enhanced decision making • Insight discovery • Process optimization

HOW BIG IS BIG DATA ? • Beyond the ability of commonly used software tools to handle • A few dozen terabytes (10^7) to many petabytes (10^8) • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day

NEW TECHNOLOGY FOR BIG DATA • Hadoop • Developed by the Apache Software Foundation • Derived from Google's MapReduce and Google File System • Able to process petabyte-scale data • NoSQL (Not Only SQL) • Relational databases are not suitable for all cases • NoSQL is a newer choice for non-relational data stores • Adopted by Google, Facebook, Twitter, etc.

WHAT IS TWITTER? • The fastest, simplest way to communicate • More than 140M active users • Majority of traffic comes from mobile • 60% of users are outside the U.S. • More than 400M twitter.com visitors • More than 400M tweets/day (peak: 25K/sec) • 1,000 employees (majority in San Francisco) • 50% of employees are engineers • Expected to reach nearly $1 billion in global ad revenue in 2014 (eMarketer forecast)

TWITTER HISTORY • Evan Williams on the genesis of Twitter, ICWSM, April 2007: • A side project started from Jack Dorsey’s idea Oct, 2006 • Wanted a ubiquitous status message • A community of people answering the question “what are you doing?” • Exploded at SXSW, SF earthquakes (2011) • Good for collective “backchanneling” • High “Ambient intimacy” • Huge API usage was unexpected, as was the rise of the @ sign for replies

HOW BIG IS TWITTER ? Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html

IT’S NOT JUST BIG! IT’S FRESH! Source: http://xkcd.com/723/

WHAT IS TWEET ?

TWITTER TOWN HALL July 6, 2011

TWITTER STATS Mapping the global Twitter heartbeat: The geography of Twitter, May 2013 Source: http://www.sgi.com/go/twitter/images/hires/figure4.png

TWITTER STATS

TWITTER STATS Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.

TWITTER STATS

Twitter Dev

TWITTER ACCOUNT • Register a Twitter account (required)

REGISTER A TWITTER APPLICATION • Twitter developer web site: https://dev.twitter.com/ • Select “My applications”

REGISTER A TWITTER APPLICATION • Click “Create a new application” • [Screenshot: application list]

REGISTER A TWITTER APPLICATION • Fill in the required information (three fields, marked 1 to 3 on the screenshot)

REGISTER A TWITTER APPLICATION • Agree to the developer rules and complete the CAPTCHA (steps 1 and 2 on the screenshot)

REGISTER A TWITTER APPLICATION • Go back to application list and click your application • Click “Settings”

REGISTER A TWITTER APPLICATION • Select “Read, Write and Access direct messages” • Click “Update this Twitter application’s settings”

REGISTER A TWITTER APPLICATION • Click “Create my access token”

REGISTER A TWITTER APPLICATION

Twitter API Resource

REST API Source: https://dev.twitter.com/docs/streaming-apis

STREAMING API Source: https://dev.twitter.com/docs/streaming-apis

TWEET CRAWL API Source: https://dev.twitter.com/docs/api/1.1 Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits

tmhOAuth LIBRARY • Website: https://github.com/themattharris/tmhOAuth • $ git clone https://github.com/themattharris/tmhOAuth.git • Current version: 0.8.2 • Author: Matt Harris @themattharris • Goals: • Support OAuth 1.0A • Use authorization headers instead of query string or POST parameters • Allow uploading of images • Provide enough information to assist with debugging

CRAWLING WITH REST API • Create an OAuth object that contains the authentication tokens • Set the parameters for the API call • Use the Twitter REST API to obtain tweets
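The third step usually means paging backwards through search results. The sketch below is a minimal illustration in Python (not the tmhOAuth/PHP code used in the seminar). It assumes a hypothetical `fetch_page(params)` callable that signs and sends one search request and returns a list of tweet dicts; `fake_fetch` is an in-memory stand-in so the paging logic can run without network access. The parameter names `q`, `count`, `result_type`, and `max_id` are those of the search endpoint.

```python
from typing import Callable, Dict, List

# Hypothetical signature: fetch_page(params) returns the list of tweet dicts
# for one search request. A real crawler would sign and send the request here.
FetchPage = Callable[[Dict[str, str]], List[dict]]


def crawl_search(fetch_page: FetchPage, query: str, max_pages: int = 5) -> List[dict]:
    """Page backwards through search results using max_id."""
    params = {"q": query, "count": "100", "result_type": "recent"}
    collected: List[dict] = []
    for _ in range(max_pages):
        page = fetch_page(dict(params))
        if not page:
            break  # no older tweets left
        collected.extend(page)
        # Ask only for tweets strictly older than the oldest one seen so far.
        oldest_id = min(tweet["id"] for tweet in page)
        params["max_id"] = str(oldest_id - 1)
    return collected


def fake_fetch(params):
    """In-memory stand-in for the Twitter server: ten tweets, four per page."""
    ids = [i for i in range(10, 0, -1) if i <= int(params.get("max_id", 10))]
    return [{"id": i, "text": "tweet %d" % i} for i in ids[:4]]


tweets = crawl_search(fake_fetch, "big data")
print([t["id"] for t in tweets])
```

Subtracting one from the oldest ID before the next request avoids fetching the boundary tweet twice.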

CRAWLING WITH STREAMING API • Create an OAuth object that contains the authentication tokens • Set the parameters for the API call • Open a long-lived connection to the Twitter server
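Once the connection is open, the stream arrives as newline-delimited JSON, with blank keep-alive lines and occasional non-tweet notices mixed in. The following Python sketch shows one way to filter that stream; it assumes the lines have already been read from the connection (the `sample` list is made-up data standing in for the socket), and the rule "a tweet has both `id` and `text`" is a simplifying assumption.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield tweet dicts from newline-delimited JSON stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank keep-alive line
        try:
            message = json.loads(line.decode("utf-8"))
        except ValueError:
            continue  # truncated or malformed line; skip it
        # Status messages carry both "id" and "text"; notices such as
        # {"limit": ...} or {"delete": ...} do not.
        if isinstance(message, dict) and "id" in message and "text" in message:
            yield message


# Made-up sample data standing in for bytes read from the connection.
sample = [
    b'{"id": 1, "text": "hello"}\r\n',
    b"\r\n",
    b'{"limit": {"track": 5}}\r\n',
    b'{"id": 2, "text": "big data"}\r\n',
    b'{"id": 3, "te',
]
print([t["id"] for t in iter_tweets(sample)])  # [1, 2]
```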

WHAT IS OAuth ? • OAuth = Open Authorization • What is OAuth: • An open protocol that allows secure API authorization in a simple, standard way from desktop and web applications. • OAuth endpoints: • Request token URL • Authorize URL • Access token URL
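Every signed OAuth 1.0a request carries an HMAC-SHA1 signature. The Python sketch below shows how that signature is typically derived; it is an illustration, not the tmhOAuth implementation. The secrets are placeholders, and the example omits the `oauth_*` parameters (consumer key, nonce, timestamp, and so on) that a real request must also include in `params`.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value: str) -> str:
    """Percent-encode per RFC 3986 (only unreserved characters stay as-is)."""
    return quote(value, safe="")


def signature_base_string(method: str, url: str, params: dict) -> str:
    """Join method, encoded URL, and encoded sorted parameters with '&'."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join("%s=%s" % pair for pair in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = "%s&%s" % (pct(consumer_secret), pct(token_secret))
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("ascii"), base.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


url = "https://api.twitter.com/1.1/search/tweets.json"
params = {"q": "big data", "count": "10"}  # oauth_* parameters omitted for brevity
print(signature_base_string("get", url, params))
# Placeholder secrets, for illustration only.
print(sign("GET", url, params, "consumer-secret-placeholder", "token-secret-placeholder"))
```

The token secret is empty when requesting a request token, which is why `sign` defaults it to an empty string.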

NORMAL SEARCH OPERATORS

SEARCH PARAMETERS (REST) Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets

SEARCH PARAMETERS (STREAMING) Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter

WHAT DOES A TWEET LOOK LIKE?

CRAWLING EFFICIENCY • Duration: May 6th to June 30th, 2012 (55 days) • REST API • Maximum TPS: (450 × 100) / (15 × 60) = 50 tweets/sec • Streaming API • Randomly returns tweets containing a specific search keyword • The total quantity never exceeds 1% of the entire public tweet stream
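The REST figure can be read as a rate-limit calculation: 450 requests per 15-minute window, each returning up to 100 tweets. The Python sketch below works through that arithmetic under this reading, and also estimates the streaming cap by applying the 1% limit to the "more than 400M tweets/day" figure quoted earlier in the deck. Both are upper bounds, not measured results.

```python
def rest_max_tps(requests_per_window: int, tweets_per_request: int, window_minutes: int) -> float:
    """Upper bound on tweets per second under a windowed request limit."""
    return requests_per_window * tweets_per_request / (window_minutes * 60)


def streaming_cap_tps(total_tweets_per_day: float, cap_fraction: float = 0.01) -> float:
    """Tweets per second if a stream is capped at a fraction of all tweets."""
    return total_tweets_per_day * cap_fraction / 86400


rest_tps = rest_max_tps(450, 100, 15)
stream_tps = streaming_cap_tps(400e6)  # 400M tweets/day, from the earlier slide

print(rest_tps)                     # 50.0
print(round(stream_tps, 1))         # 46.3
print(int(rest_tps * 86400 * 55))   # 237600000, REST upper bound over 55 days
```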

LARGE-SCALE CRAWLING

Twitter + MySQL

SINGLE NODE CRAWLING TYPE • Guideline for single node crawling: • Each stream needs to authenticate itself • Total data size seems bounded (i.e., the number of tweets delivered to a crawler is limited) • Avoid connecting to the Twitter server too aggressively • Crawling with different Twitter accounts is recommended • [Diagram: Tweets Streaming A, B, C, … between the Twitter Server and one Tweet Crawler]
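One common way to avoid hammering the server is to wait longer after each failed reconnect. The Python sketch below computes such a schedule with exponential backoff; the starting delay of 5 seconds and the ceiling of 320 seconds are assumed values for illustration, not figures from the slides.

```python
def backoff_delays(initial: float, maximum: float, attempts: int) -> list:
    """Delay in seconds before each reconnect attempt: double, then cap."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay *= 2
    return delays


# Assumed values: start at 5 s, never wait longer than 320 s.
print(backoff_delays(5, 320, 8))  # [5, 10, 20, 40, 80, 160, 320, 320]
```

A crawler would sleep for the next delay in the list after each failed attempt and reset to the first delay once a connection succeeds.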

MULTI-NODE CRAWLING TYPE • Guideline for multi-node crawling: • Automatically check connection status • Automatically update database summary information • Design the crawler with a good log-file reporting function • Design a good database schema for distributed access • [Diagram: Tweets Streaming A and Tweets Streaming B, each between the Twitter Server and its own Tweet Crawler]
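The first and third guidelines can be combined in a small watchdog that logs a warning when a stream goes quiet. The Python sketch below is one possible shape for it; the class name `StreamWatchdog`, the logger name, and the 90-second timeout are all assumptions made for illustration. Time is passed in explicitly so the logic can be exercised without waiting.

```python
import logging

logger = logging.getLogger("tweet_crawler")


class StreamWatchdog:
    """Flags a stream as stalled when no data arrives within `timeout` seconds."""

    def __init__(self, timeout: float, now: float):
        self.timeout = timeout
        self.last_seen = now

    def data_received(self, now: float) -> None:
        """Record the time of the most recent line from the stream."""
        self.last_seen = now

    def is_stalled(self, now: float) -> bool:
        """Return True, and log a warning, if the stream has been silent too long."""
        stalled = now - self.last_seen > self.timeout
        if stalled:
            logger.warning("no data for %.0f s; reconnect needed", now - self.last_seen)
        return stalled


watchdog = StreamWatchdog(timeout=90, now=0)  # 90 s is an assumed threshold
watchdog.data_received(now=30)
print(watchdog.is_stalled(now=100))  # False: 70 s since last data
print(watchdog.is_stalled(now=121))  # True: 91 s since last data
```

In a running crawler, `now` would come from `time.monotonic()`, and each node would write these warnings to its own log file.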

DESIGN TWEET TABLE

SETTING ENVIRONMENT • Install packages • # apt-get install php5 php5-curl • # apt-get install mysql-client mysql-server • # apt-get install phpmyadmin • Select Apache2 as the web server when installing phpMyAdmin

SETTING ENVIRONMENT • Create the database and table for tweet crawling • Create a *.sql file that defines the database schema • Change to the directory containing that file • # mysql -h {$HOST} -u {$USER} -p{$PASSWORD} • mysql> \. {$SQL_FILE}

SETTING ENVIRONMENT • Check the database with phpMyAdmin • Open a browser and go to http://localhost/phpmyadmin • Select the database and check its structure

CRAWLING REAL-TIME TWEETS • Connect to the database • Save each tweet into the database
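The save step amounts to flattening each tweet's JSON into one table row. The Python sketch below illustrates this under stated assumptions: the `tweets` table and its columns are a hypothetical schema (the table design shown in the seminar is not reproduced here), and the `%s` placeholders follow the style used by common Python MySQL drivers. No database connection is made, so the row conversion can be checked on its own.

```python
from datetime import datetime

# Hypothetical schema, for illustration only.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS tweets (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    created_at DATETIME NOT NULL,
    user_id BIGINT UNSIGNED NOT NULL,
    screen_name VARCHAR(32) NOT NULL,
    text VARCHAR(255) NOT NULL,
    lang VARCHAR(8)
) DEFAULT CHARSET=utf8mb4
"""

COLUMNS = ("id", "created_at", "user_id", "screen_name", "text", "lang")

# INSERT IGNORE skips tweets whose id is already stored (duplicates across streams).
INSERT_SQL = "INSERT IGNORE INTO tweets (%s) VALUES (%s)" % (
    ", ".join(COLUMNS),
    ", ".join(["%s"] * len(COLUMNS)),
)


def tweet_to_row(tweet: dict) -> tuple:
    """Flatten a tweet dict into a tuple matching COLUMNS."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return (
        tweet["id"],
        created.strftime("%Y-%m-%d %H:%M:%S"),  # DATETIME literal, in UTC
        tweet["user"]["id"],
        tweet["user"]["screen_name"],
        tweet["text"],
        tweet.get("lang"),
    )


# Made-up sample tweet using Twitter's created_at format.
sample_tweet = {
    "id": 123,
    "created_at": "Wed Aug 27 13:08:45 +0000 2008",
    "text": "hello",
    "lang": "en",
    "user": {"id": 7, "screen_name": "sean"},
}
print(INSERT_SQL)
print(tweet_to_row(sample_tweet))
```

With a DB-API driver, the pair would typically be passed as `cursor.execute(INSERT_SQL, tweet_to_row(tweet))`, followed by a commit.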
