Twitter, Big Data, and Other Ramblings

Twitter, Big Data, and Other Ramblings Robert Dittmer

Perspective on those V words • Volume- 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem. • Velocity- How do you analyze thousands of data points in real time? SQL Server sure isn’t going to do that. • Variety- Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External • Think about what goes into Amazon recommending you a book or movie • Veracity- It all means nothing if it’s not at least somewhat clean

What do you do with a Tweet? • Sentiment Analysis is assigning a numerical value to a word • Positive, Negative, Neutral connotation • Methods for performing Sentiment Analysis • “Dumb” Method- Break down text into individual words and compare with a sentiment dictionary. AKA “Bag of Words” • “Smart” Method- Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context • Example Tweet • “The Apple iPad sucks. The new Google Nexus 7 is awesome!”
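The “Dumb” method can be sketched in a few lines of Python. This is only a minimal illustration: the two-word dictionary and the function names `tokenize` and `bag_of_words_score` are made up here, and a real sentiment dictionary would hold thousands of scored words.

```python
import re

# Hypothetical toy dictionary; a real one would hold thousands of scored words.
SENTIMENT = {"sucks": -1, "awesome": 1}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words_score(text, dictionary=SENTIMENT):
    """Sum the dictionary value of every word; unknown words count as 0."""
    return sum(dictionary.get(word, 0) for word in tokenize(text))

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
print(bag_of_words_score(tweet))  # prints 0
```

The example Tweet scores 0 because “sucks” and “awesome” cancel out, even though it is strongly negative about one product and strongly positive about another. Telling those apart is what the “Smart” method needs context for.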

Collecting Tweets • Twitter streams Tweets over a long-lived HTTP connection (the Streaming API), which is separate from its RESTful API • Steps to start streaming your own Tweets • Go to dev.twitter.com and create an application • Generate your OAuth credentials • Find an open-source Twitter library • Tweepy (Python) • Tweetinvi (C#) • Plug your credentials in and modify the example

The Tweet, the Whole Tweet, and Nothing but the Tweet • JSON Format (Key-Value Pair) • Notable Fields • ID • CreatedAt • Text • Entities • Hashtags • URLs • Latitude, Longitude

What does a Tweet look like? • {"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"WestDevon","attributes":{},"country_code":"GB","url":"http://api.twitter.com/1/geo/id/0613276b16c0d59f.json","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"https://si0.twimg.com/profile_banners/503242961/1354459551","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":"https://si0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://a0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","follow_request_sent":null,"url":"http://www.facebook.com/vickipolard","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":"http://a0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
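Pulling the notable fields out of a Tweet like this one takes only the standard `json` module. The sketch below uses a trimmed copy of the sample Tweet; the helper name `notable_fields` is made up, and the `text` and `expanded_url` keys inside hashtag and URL entities are assumed from the Twitter entity format, since the sample Tweet has no hashtags or URLs to confirm them.

```python
import json

# Trimmed copy of the sample Tweet; most fields are omitted for brevity.
raw = '''{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}'''

def notable_fields(tweet):
    """Flatten the notable fields of a parsed Tweet into one dict."""
    entities = tweet.get("entities") or {}
    point = tweet.get("coordinates")  # null when the Tweet is not geotagged
    # "coordinates" lists longitude first, then latitude (the reverse of "geo").
    lon, lat = point["coordinates"] if point else (None, None)
    return {
        "id": tweet["id"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        # Assumed entity keys: "text" for hashtags, "expanded_url" for URLs.
        "hashtags": [h.get("text") for h in entities.get("hashtags", [])],
        "urls": [u.get("expanded_url") for u in entities.get("urls", [])],
        "latitude": lat,
        "longitude": lon,
    }

fields = notable_fields(json.loads(raw))
print(fields["text"], fields["latitude"], fields["longitude"])
```

Note that the sample stores the same point twice in opposite orders: `geo` is latitude then longitude, while `coordinates` is longitude then latitude.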

My Tweet Collection • Collected for roughly one month • Lots of trial and error • Originally used Tweepy, but ran into errors • Switched to Tweetinvi and it worked • About 68 million Tweets • Apple • Amazon • Google • Microsoft • Netflix • Tesla • Ford (Probably should have used a different car company)

Yahoo! Finance Detour • Use an HTTP request to get stock data • http://finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG+MSFT+YHOO+NFLX+AMZN+TSLA+F&f=snb2b3opl1t1d1 • Create a metric with stock data and compare the sentiment of a company to its performance
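Here is a rough sketch of building that request URL and parsing the CSV that comes back. Several things are assumptions: the column names are my reading of the `f=` codes, the sample row is a placeholder with made-up numbers, and the percent-change metric is just one possible choice. As far as I know this CSV endpoint has since been retired, so the sketch parses a string instead of making a live request.

```python
import csv
import io
from urllib.parse import urlencode

# Column names are an interpretation of the f= codes, not an official list.
FIELD_CODES = {
    "s": "symbol", "n": "name", "b2": "ask_realtime", "b3": "bid_realtime",
    "o": "open", "p": "previous_close", "l1": "last_trade_price",
    "t1": "last_trade_time", "d1": "last_trade_date",
}

def build_quote_url(symbols, codes):
    """Join symbols with '+' and concatenate the field codes."""
    query = urlencode({"s": "+".join(symbols), "f": "".join(codes)}, safe="+")
    return "http://finance.yahoo.com/d/quotes.csv?" + query

def parse_quotes(csv_text, codes):
    """Turn each CSV row into a dict keyed by the assumed column names."""
    columns = [FIELD_CODES[code] for code in codes]
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(csv_text))]

codes = ["s", "n", "b2", "b3", "o", "p", "l1", "t1", "d1"]
symbols = ["AAPL", "GOOG", "MSFT", "YHOO", "NFLX", "AMZN", "TSLA", "F"]
print(build_quote_url(symbols, codes))

# Placeholder row with made-up numbers, only to show the parsing step.
sample = '"XYZ","Example Corp",10.10,10.00,9.50,9.40,10.05,"4:00pm","5/2/2013"\n'
quote = parse_quotes(sample, codes)[0]
# One possible metric: percent change from the open to the last trade.
last, opened = float(quote["last_trade_price"]), float(quote["open"])
change_pct = (last - opened) / opened * 100
print(quote["symbol"], round(change_pct, 2))
```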

Big Data (and regular data) Tools • Talend Open Studio • Hadoop • SAP HANA

Talend Open Studio • Open Source ETL Tool • Built on Eclipse • Data Quality and Format Issues • Even though I saved Tweets in delimited format, issues remained • Iterated through all 12,736 files with 5,000 Tweets each • Verified each row against a schema • Mapped to different output files • Tweet (Fact table) • Tracks • User Mentions • Hashtags • URLs • Demo Time!
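The “verify each row against a schema” step was done in Talend, but the idea can be sketched in plain Python. Everything here is hypothetical: the three-column schema, the pipe delimiter, and the two bad rows are invented for illustration and are not the actual job.

```python
import csv

# Hypothetical schema: (column name, converter that raises ValueError on bad input).
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def validate_rows(lines, delimiter="|"):
    """Split delimited rows into (good, rejected) by checking them against SCHEMA."""
    good, rejected = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != len(SCHEMA):  # wrong column count, e.g. a stray delimiter
            rejected.append(row)
            continue
        try:
            good.append({name: cast(value) for (name, cast), value in zip(SCHEMA, row)})
        except ValueError:  # a value did not convert to the expected type
            rejected.append(row)
    return good, rejected

lines = [
    "330043889589288960|Thu May 02 19:39:29 +0000 2013|Iron man 3 was awesome =)",
    "not-a-number|Thu May 02 19:40:00 +0000 2013|bad id",
    "1|Thu May 02 19:41:00 +0000 2013|stray | delimiter in text",
]
good, rejected = validate_rows(lines)
print(len(good), len(rejected))  # prints 1 2
```

The third row shows why delimited Tweets still cause trouble: free text can contain the delimiter itself.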

Hadoop Overview • Based on the Hadoop Distributed File System and MapReduce • MapReduce is a way of parallelizing code using batch processing • Map finds the data you’re looking for • Reduce aggregates that data (count, sum, average) • Embarrassingly parallel processing • Each server in a Hadoop cluster is referred to as a Node • NameNode • DataNode • Blocks of data are replicated to three nodes • Extremely fault tolerant
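The Map and Reduce steps can be imitated on a single machine to show the shape of the computation. This is a toy sketch and not Hadoop code: the function names, the tracked companies, and the three sample Tweets are all invented, and a real cluster would run the map calls in parallel across DataNodes.

```python
from collections import defaultdict
from itertools import chain

COMPANIES = {"apple", "google", "tesla"}

def map_phase(tweet_text):
    """Map: emit a (company, 1) pair for each tracked company mentioned."""
    words = set(tweet_text.lower().split())
    return [(company, 1) for company in sorted(COMPANIES & words)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values for one key (here, a count)."""
    return key, sum(values)

tweets = [
    "Apple and Google both announced phones",
    "my tesla is great",
    "google maps again",
]
mapped = chain.from_iterable(map_phase(t) for t in tweets)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # prints {'apple': 1, 'google': 2, 'tesla': 1}
```

Each `map_phase` call depends only on its own Tweet, which is what makes the work embarrassingly parallel.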

More Hadoop • Open-source technology • Cloudera vs. Hortonworks • Intel, IBM, MapR, Amazon EMR • Cloudera and Hortonworks are the two biggest faces of Hadoop • Intel actively contributes to optimize it for Xeon Processors • IBM and MapR also involved • Big companies and entities use it

Hadoop Projects • Hive • Data Warehouse on top of Hadoop • Uses HiveQL (essentially SQL with a few extras) to query data • Abstracts MapReduce processes • Has an ODBC connector that lets it talk to anything that talks to databases • Pig • Uses a language called Pig Latin to analyze data • Data flow language abstracts MapReduce for easy use by data analysts • HBase • Billions of rows and millions of columns • Distributed column data store

Hadoop Trivia Time • Who created Hadoop? • Why is it called Hadoop? • Who developed the concept of MapReduce? • What does Facebook Messenger use to store its data? • Who created Hive? • What is Accumulo and who created it?

2nd Generation Hadoop • Much faster than previous versions • Hive 0.12 is up to 50X faster than previous versions • Hortonworks Stinger project aims for 100X performance improvement • Projects like Spark are moving towards real-time analysis • In-memory cluster computing • Stream processing with routines written in Python and Scala • Shark is an implementation of Hive using Spark instead of MapReduce

Hadoop Sentiment Analysis • Used the “Dumb” method of Sentiment Analysis • Import the data into HDFS and create Hive tables • Tweet • Sentiment Dictionary • Explode words in each tweet to create a view with TweetID and Word • Join with the Sentiment Dictionary on the word to get sentiment value • Demo Time!
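The explode-and-join steps were done in Hive, where splitting a Tweet into one row per word is typically written with `LATERAL VIEW explode(split(...))`. The same logic can be sketched in Python to show what the query computes. The sample Tweets, the tiny dictionary, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two Hive tables.
tweets = {1: "iron man 3 was awesome", 2: "this phone sucks", 3: "awesome awesome day"}
sentiment_dictionary = {"awesome": 1, "sucks": -1}

def explode(tweets):
    """Yield one (tweet_id, word) row per word, like the exploded view."""
    for tweet_id, text in tweets.items():
        for word in text.lower().split():
            yield tweet_id, word

def score_tweets(tweets, dictionary):
    """Inner-join exploded words with the dictionary and sum per Tweet."""
    totals = defaultdict(int)
    for tweet_id, word in explode(tweets):
        if word in dictionary:  # words missing from the dictionary drop out
            totals[tweet_id] += dictionary[word]
    return dict(totals)

scores = score_tweets(tweets, sentiment_dictionary)
print(scores)  # prints {1: 1, 2: -1, 3: 2}
```

Because this behaves like an inner join, a Tweet with no dictionary words gets no row at all rather than a score of 0; a left join would be needed to keep it.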

SAP HANA • In-Memory, Column-Store database • Loads all data into main memory • Analyze billions of rows with sub-second response time • Column-store table structure • Allows for much better compression and parallelization than row-store • Used for real-time analytics • Available as an on-premises appliance or a cloud-based VM

Why is SAP HANA Awesome? • Column-stores are naturally very good at parallelization • In-Memory means no waiting on disk I/O, and memory is still hundreds of times faster than SSD • Feature rich • Text analytics • Predictive Analysis Library • Application Server • It is an actual database and does everything a database does • Demo Time!

SAP HANA Sentiment Analysis • Sentiment is calculated when creating a full-text index on the text of the tweet • Creates a sentiment value for each tweet • Analyze by many different dimensions • Aggregate sentiment by hour • Demo Time!
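The hourly aggregation was done inside SAP HANA, but the idea is easy to show in Python once each Tweet has a sentiment value. In this sketch the format string matches the `created_at` style of the sample Tweet shown earlier; the three scored Tweets and the function name are invented, and parsing the day and month names assumes an English locale.

```python
from collections import defaultdict
from datetime import datetime

# Matches timestamps such as "Thu May 02 19:39:29 +0000 2013".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"

def average_sentiment_by_hour(scored_tweets):
    """Group (created_at, sentiment) pairs by hour and average each group."""
    buckets = defaultdict(list)
    for created_at, sentiment in scored_tweets:
        stamp = datetime.strptime(created_at, TWITTER_TIME)
        hour = stamp.replace(minute=0, second=0)  # truncate to the hour
        buckets[hour].append(sentiment)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}

# Made-up sentiment values, only to exercise the grouping.
scored = [
    ("Thu May 02 19:39:29 +0000 2013", 1),
    ("Thu May 02 19:50:00 +0000 2013", -1),
    ("Thu May 02 20:05:00 +0000 2013", 1),
]
hourly = average_sentiment_by_hour(scored)
for hour, average in hourly.items():
    print(hour.isoformat(), average)
```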

Other Text Analysis Options • Python Natural Language Toolkit • Analyze parts of speech and context • Should be possible to integrate with Hadoop (The Google did not help)

Other Big Data Problems • A GE engine on a transatlantic flight generates 2TB of sensor data • There are four engines on a 747 • What does the LHC at CERN do with the 15 petabytes of data it creates annually? • How does the NSA store a yottabyte of data? • How does a small online gaming company analyze its customer base to increase retention and margins?

How is Sentiment Analysis Being Used? • Companies ingest their social media feeds into these systems • If a Tweet or Facebook post meets certain criteria, an automated or human response can be requested

Hot vs. Cold Data • Hot data is the recent data you are most interested in • Keep this data in SAP HANA for real-time processing • Archive it after a period of time: 1 month, 3 months, 6 months, etc. • Cold data is your historical data • Data warehouses that can handle massive volumes of data are EXPENSIVE!!!! • Use Hadoop and Hive as your data warehouse • The only cost is the hardware • Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
