Twitter, Big Data, and Other Ramblings Robert Dittmer
Perspective on those V words • Volume: 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem. • Velocity: How do you analyze thousands of data points in real time? SQL Server sure isn’t going to do that. • Variety: Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External • Think about what goes into Amazon recommending you a book or movie • Veracity: It all means nothing if the data isn’t at least somewhat clean
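The volume figure can be turned into a rough rate with some back-of-the-envelope arithmetic. This is only a sketch: the 30-day month is an assumption (the slides say "roughly one month"), so the per-second number is an order-of-magnitude estimate.

```python
# Back-of-the-envelope estimate of the full Twitter stream from the 1% sample.
SAMPLE_TWEETS = 68_000_000   # about 68 million Tweets collected (from the slides)
MULTIPLIER = 100             # the sample was roughly 1% of the stream
DAYS = 30                    # assumption: "roughly one month" taken as 30 days

full_stream_tweets = SAMPLE_TWEETS * MULTIPLIER
tweets_per_second = full_stream_tweets / (DAYS * 24 * 60 * 60)

print(f"{full_stream_tweets:,} Tweets per month")
print(f"about {tweets_per_second:,.0f} Tweets per second")
```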
What do you do with a Tweet? • Sentiment Analysis assigns a numerical value to a word • Positive, Negative, Neutral connotation • Methods for performing Sentiment Analysis • “Dumb” Method: Break down text into individual words and compare each with a sentiment dictionary. AKA “Bag of Words” • “Smart” Method: Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context • Example Tweet • “The Apple iPad sucks. The new Google Nexus 7 is awesome!”
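A minimal sketch of the “Dumb” (bag-of-words) method, applied to the example Tweet. The tiny dictionary and its scores are hypothetical placeholders, not a real sentiment lexicon, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical mini sentiment dictionary: word -> score (illustrative values only).
SENTIMENT = {"awesome": 1, "great": 1, "sucks": -1, "terrible": -1}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words_score(text):
    """Sum the dictionary score of every word; unknown words count as 0."""
    return sum(SENTIMENT.get(word, 0) for word in tokenize(text))

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
print(bag_of_words_score(tweet))  # "sucks" (-1) and "awesome" (+1) cancel out
```

With this dictionary the Tweet nets out to neutral even though it is negative about one product and positive about another, which is the kind of case the context-aware “Smart” method is meant to handle.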
Collecting Tweets • Twitter streams Tweets over a long-lived HTTP connection (its Streaming API, separate from the REST API) • Steps to start streaming your own Tweets • Go to dev.twitter.com and create an application • Generate your OAuth credentials • Find an open-source Twitter library • Tweepy (Python) • Tweetinvi (C#) • Plug your credentials in and modify the example
The Tweet, the Whole Tweet, and Nothing but the Tweet • JSON Format (Key-Value Pair) • Notable Fields • ID • CreatedAt • Text • Entities • Hashtags • URLs • Latitude, Longitude
What does a Tweet look like? • {"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"WestDevon","attributes":{},"country_code":"GB","url":"http://api.twitter.com/1/geo/id/0613276b16c0d59f.json","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"https://si0.twimg.com/profile_banners/503242961/1354459551","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":"https://si0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://a0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","follow_request_sent":null,"url":"http://www.facebook.com/vickipolard","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":"http://a0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
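Pulling the notable fields out of a Tweet like the one above might look like the following sketch. The JSON is a trimmed copy of the example, keeping only the fields used. Note that in the example "geo" lists latitude first while "coordinates" lists longitude first. The assumption that each hashtag entity carries its tag in a "text" key is not visible in the example (its hashtag list is empty).

```python
import json
from datetime import datetime

# Trimmed copy of the example Tweet; only the fields used below are kept.
raw = """{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}"""

tweet = json.loads(raw)

# created_at is an English-language timestamp with a numeric UTC offset.
created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

# "coordinates" is ordered [longitude, latitude]; guard against a missing value.
point = tweet.get("coordinates")
longitude, latitude = point["coordinates"] if point else (None, None)

# Assumption: each hashtag entity stores its tag under a "text" key.
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]

print(tweet["id"], created.isoformat(), latitude, longitude, hashtags)
```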
My Tweet Collection • Collected for roughly one month • Lots of trial and error • Originally used Tweepy, but ran into errors • Switched to Tweetinvi and it worked • About 68 million Tweets • Apple • Amazon • Google • Microsoft • Netflix • Tesla • Ford (Probably should have used a different car company)
Yahoo! Finance Detour • Use an HTTP request to get stock data • http://finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG+MSFT+YHOO+NFLX+AMZN+TSLA+F&f=snb2b3opl1t1d1 • Create a metric with stock data and compare the sentiment of a company to its performance
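One possible way to compare sentiment with performance is to correlate daily mean sentiment with daily price change. The sketch below uses Pearson correlation as an assumed choice of metric; the slides do not say which metric was actually used. All numbers are placeholders for illustration, not real quotes or results.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def daily_returns(closes):
    """Percent change from each closing price to the next."""
    return [(b - a) / a * 100 for a, b in zip(closes, closes[1:])]

# Placeholder numbers for illustration only; not real quotes or real results.
closes = [100.0, 102.0, 101.0, 104.0, 103.0]
mean_sentiment = [0.4, -0.1, 0.5, -0.2]   # one value per day after the first

print(round(pearson(mean_sentiment, daily_returns(closes)), 3))
```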
Big Data (and regular data) Tools • Talend Open Studio • Hadoop • SAP HANA
Talend Open Studio • Open Source ETL Tool • Built on Eclipse • Data Quality and Format Issues • Even though I saved Tweets in delimited format, issues remained • Iterated through all 12,736 files with 5,000 Tweets each • Verified each row against a schema • Mapped to different output files • Tweet (Fact table) • Tracks • User Mentions • Hashtags • URLs • Demo Time!
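The "verify each row against a schema" step can be sketched outside Talend as well. This is not the Talend job from the talk; the schema, the "|" delimiter, and the sample rows are hypothetical stand-ins.

```python
import csv
import io

# Hypothetical schema: (column name, converter that raises ValueError on bad input).
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def validate(row):
    """Return a typed dict if the row matches the schema, else None."""
    if len(row) != len(SCHEMA):
        return None
    try:
        return {name: convert(value) for (name, convert), value in zip(SCHEMA, row)}
    except ValueError:
        return None

def split_rows(lines, delimiter="|"):
    """Route each delimited row to the accepted list or the rejects list."""
    accepted, rejected = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        record = validate(row)
        if record is None:
            rejected.append(row)
        else:
            accepted.append(record)
    return accepted, rejected

sample = io.StringIO(
    "330043889589288960|Thu May 02 19:39:29 +0000 2013|Iron man 3 was awesome =)\n"
    "not-a-number|Thu May 02 19:40:00 +0000 2013|bad id\n"
    "330043889589288961|missing text column\n"
)
accepted, rejected = split_rows(sample)
print(len(accepted), "accepted,", len(rejected), "rejected")
```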
Hadoop Overview • Based on the Hadoop Distributed File System (HDFS) and MapReduce • MapReduce is a way of parallelizing code using batch processing • Map finds the data you’re looking for • Reduce aggregates that data (count, sum, average) • Embarrassingly parallel processing • Each server in a Hadoop cluster is referred to as a Node • NameNode • DataNode • Blocks of data are replicated to three nodes by default • Extremely fault tolerant
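The Map and Reduce roles described above can be illustrated on a single machine. This is a toy sketch of the programming model only, not Hadoop code. The third sample Tweet is made up, and the substring matching is deliberately naive (for example, "ford" would also match "afford").

```python
from collections import defaultdict

COMPANIES = ["apple", "amazon", "google", "microsoft", "netflix", "tesla", "ford"]

def map_phase(tweet_text):
    """Map: emit (company, 1) for each tracked company mentioned in the Tweet."""
    text = tweet_text.lower()
    return [(company, 1) for company in COMPANIES if company in text]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values for one key (here, a count)."""
    return key, sum(values)

tweets = [
    "The Apple iPad sucks. The new Google Nexus 7 is awesome!",
    "Iron man 3 was awesome =)",
    "Watching Netflix on my Apple laptop",   # made-up sample
]

pairs = [pair for text in tweets for pair in map_phase(text)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each call to `map_phase` depends only on its own Tweet, which is why the map step can be spread across many nodes.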
More Hadoop • Open-source technology • Cloudera vs. Hortonworks • Intel, IBM, MapR, Amazon EMR • Cloudera and Hortonworks are the two biggest faces of Hadoop • Intel actively contributes to optimize it for Xeon Processors • IBM and MapR also involved • Big companies and entities use it
Hadoop Projects • Hive • Data Warehouse on top of Hadoop • Uses HiveQL (essentially SQL with a few extras) to query data • Abstracts MapReduce processes • Has an ODBC connector that allows it to talk to anything that talks to databases • Pig • Uses a language called Pig Latin to analyze data • Data flow language that abstracts MapReduce so data analysts can use it easily • HBase • Billions of rows and millions of columns • Distributed column data store
Hadoop Trivia Time • Who created Hadoop? • Why is it called Hadoop? • Who developed the concept of MapReduce? • What does Facebook Messenger use to store its data? • Who created Hive? • What is Accumulo and who created it?
2nd Generation Hadoop • Much faster than previous versions • Hive 0.12 is up to 50X faster than previous versions • Hortonworks Stinger project aims for 100X performance improvement • Projects like Spark are moving towards real-time analysis • In-memory cluster computing • Stream processing with routines written in Python and Scala • Shark is an implementation of Hive using Spark instead of MapReduce
Hadoop Sentiment Analysis • Used the “Dumb” method of Sentiment Analysis • Import the data into HDFS and create Hive tables • Tweet • Sentiment Dictionary • Explode words in each tweet to create a view with TweetID and Word • Join with the Sentiment Dictionary on the word to get sentiment value • Demo Time!
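The explode-and-join steps above can be mimicked locally. In this sketch SQLite stands in for Hive, the "explode" is done in Python, and the table names, column names, and dictionary scores are all hypothetical; it is not the query from the demo.

```python
import re
import sqlite3

tweets = [
    (1, "The Apple iPad sucks. The new Google Nexus 7 is awesome!"),
    (2, "Iron man 3 was awesome =)"),
]
dictionary = [("awesome", 1), ("sucks", -1)]   # hypothetical scores

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweet_word (tweet_id INTEGER, word TEXT)")
db.execute("CREATE TABLE sentiment_dictionary (word TEXT PRIMARY KEY, score INTEGER)")

# "Explode": one (tweet_id, word) row per word in each Tweet.
db.executemany(
    "INSERT INTO tweet_word VALUES (?, ?)",
    [(tid, w) for tid, text in tweets for w in re.findall(r"[a-z']+", text.lower())],
)
db.executemany("INSERT INTO sentiment_dictionary VALUES (?, ?)", dictionary)

# Join on the word and sum the scores per Tweet; unmatched words contribute nothing.
rows = db.execute(
    """
    SELECT tw.tweet_id, COALESCE(SUM(d.score), 0) AS sentiment
    FROM tweet_word AS tw
    LEFT JOIN sentiment_dictionary AS d ON d.word = tw.word
    GROUP BY tw.tweet_id
    ORDER BY tw.tweet_id
    """
).fetchall()
print(rows)
```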
SAP HANA • In-Memory, Column-Store database • Loads all data into main memory • Analyze billions of rows with sub-second response time • Column-store table structure • Allows for much better compression and parallelization than row-store • Used for real-time analytics • Available as an on-premises appliance or a cloud-based VM
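Why a column layout compresses well can be shown with a toy example. Run-length encoding is used here as one common column-store technique; this sketch makes no claim about how SAP HANA implements compression internally, and the rows are made up.

```python
from itertools import groupby

# Row-store layout: each record is stored together (made-up rows).
rows = [
    ("Apple", "en", 1), ("Apple", "en", -1), ("Apple", "en", 1),
    ("Google", "en", 1), ("Google", "en", 0),
]

# Column-store layout: each column is stored contiguously.
columns = {
    name: [row[i] for row in rows]
    for i, name in enumerate(["company", "lang", "sentiment"])
}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

print(run_length_encode(columns["company"]))
print(sum(columns["sentiment"]))   # an aggregate touches only one column
```

Repeated values sit next to each other in a column, so long runs collapse into a few pairs, and an aggregate over one column never has to read the others.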
Why is SAP HANA Awesome? • Column-stores are naturally very good at parallelization • In-Memory means no waiting on disk I/O, and memory is still hundreds of times faster than SSD • Feature rich • Text analytics • Predictive Analysis Library • Application Server • It is an actual database and does everything a database does • Demo Time
SAP HANA Sentiment Analysis • Sentiment is calculated when creating a full-text index on the text of the tweet • Creates a sentiment value for each tweet • Analyze by my different dimensions • Aggregate sentiment by hour • Demo Time!
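Aggregating sentiment by hour, once each Tweet has a sentiment value, could be sketched as follows. This runs in plain Python rather than in SAP HANA, and the sentiment values and the second and third timestamps are placeholders.

```python
from collections import defaultdict
from datetime import datetime

# (created_at, sentiment) pairs; the sentiment values are placeholders.
scored = [
    ("Thu May 02 19:39:29 +0000 2013", 1),
    ("Thu May 02 19:55:10 +0000 2013", -1),
    ("Thu May 02 20:05:00 +0000 2013", 1),
]

def hourly_average(scored_tweets):
    """Average sentiment per hour bucket, keyed by the start of the hour."""
    buckets = defaultdict(list)
    for created_at, sentiment in scored_tweets:
        ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(sentiment)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}

for hour, average in hourly_average(scored).items():
    print(hour.isoformat(), average)
```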
Other Text Analysis Options • Python Natural Language Toolkit • Analyze parts of speech and context • Should be possible to integrate with Hadoop (The Google did not help)
Other Big Data Problems • A GE engine on a transatlantic flight generates 2TB of sensor data • There are four engines on a 747 • What does the LHC at CERN do with the 15 petabytes of data it creates annually? • How does the NSA store a yottabyte of data? • How does a small online gaming company analyze its customer base to increase retention and margins?
How is Sentiment Analysis Being Used? • Companies ingest their social media feeds into these systems • If a Tweet or Facebook post meets certain criteria, an automated or human response can be requested
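A rule of that kind might look like the sketch below. The criteria (a sentiment threshold and a follower count) and their default values are hypothetical; real systems would define their own.

```python
def route(sentiment, follower_count, negative_threshold=-0.5, influencer_followers=10_000):
    """Decide how to respond to a post; the thresholds are hypothetical."""
    if sentiment <= negative_threshold and follower_count >= influencer_followers:
        return "human"        # escalate unhappy, high-reach authors to a person
    if sentiment <= negative_threshold:
        return "automated"    # send a canned support reply
    return "none"             # no response requested

print(route(-0.8, 25_000))
print(route(-0.8, 149))
print(route(0.6, 149))
```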
Hot vs. Cold Data • Hot data is the recent data you are most interested in • Keep this data in SAP HANA for real-time processing • Archive it after a period of time: 1 month, 3 months, 6 months, etc. • Cold data is your historical data • Data warehouses that can handle massive volumes of data are EXPENSIVE!!!! • Use Hadoop and Hive as your data warehouse • The only cost is the hardware • Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
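The hot/cold split boils down to comparing each record's age against an archive window. A minimal sketch, assuming a 90-day window (roughly the "3 months" option above) and a hypothetical record shape with a "created" timestamp:

```python
from datetime import datetime, timedelta, timezone

def split_hot_cold(records, now, hot_window=timedelta(days=90)):
    """Split records into hot (inside the window) and cold (older than it)."""
    cutoff = now - hot_window
    hot = [r for r in records if r["created"] >= cutoff]
    cold = [r for r in records if r["created"] < cutoff]
    return hot, cold

now = datetime(2013, 6, 1, tzinfo=timezone.utc)   # fixed "today" for the example
records = [
    {"id": 1, "created": datetime(2013, 5, 2, tzinfo=timezone.utc)},
    {"id": 2, "created": datetime(2012, 2, 25, tzinfo=timezone.utc)},
]

hot, cold = split_hot_cold(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])
```

In the setup described above, the hot list would stay in SAP HANA and the cold list would be archived to Hadoop and Hive.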