Twitter, Big Data, and Other Ramblings

Twitter, Big Data, and Other Ramblings. Robert Dittmer. Perspective on those V words. Volume- 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem.


Presentation Transcript


  1. Twitter, Big Data, and Other Ramblings Robert Dittmer

  2. Perspective on those V words • Volume- 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem. • Velocity- How do you analyze thousands of points of data in real-time? SQL Server sure isn’t going to do that. • Variety- Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External • Think about what goes into Amazon recommending you a book or movie • Veracity- It all means nothing if it’s not at least somewhat clean

  3. What do you do with a Tweet? • Sentiment Analysis is assigning a numerical value to a word • Positive, Negative, Neutral connotation • Methods for performing Sentiment Analysis • “Dumb” Method- Break down text into individual words and compare with a sentiment dictionary. AKA “Bag of Words” • “Smart” Method- Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context • Example Tweet • “The Apple iPad sucks. The new Google Nexus 7 is awesome!”
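
A minimal sketch of the "Dumb" (Bag of Words) method, applied to the example Tweet above. The two-entry dictionary and the function name `bag_of_words_score` are assumptions made for illustration; a real sentiment dictionary would hold thousands of scored words.

```python
import re

# Hypothetical sentiment dictionary (word -> score); real ones are far larger.
SENTIMENT = {"sucks": -1, "awesome": 1}

def bag_of_words_score(text):
    """Sum the dictionary score of every word; unknown words count as 0."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(SENTIMENT.get(word, 0) for word in words)

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
print(bag_of_words_score(tweet))  # 0: the negative and positive words cancel out
```

The score of 0 shows the weakness of this method: it cannot tell that the negative word belongs to Apple and the positive one to Google, which is the kind of context the "Smart" method is meant to capture.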

  4. Collecting Tweets • Twitter uses a RESTful service to stream Tweets • Steps to start streaming your own Tweets • Go to dev.twitter.com and create an application • Generate your OAuth credentials • Find an open-source Twitter library • Tweepy (Python) • Tweetinvi (C#) • Plug your credentials in and modify the example

  5. The Tweet, the Whole Tweet, and Nothing but the Tweet • JSON Format (Key-Value Pair) • Notable Fields • ID • CreatedAt • Text • Entities • Hashtags • URLs • Latitude, Longitude

  6. What does a Tweet look like? • {"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"WestDevon","attributes":{},"country_code":"GB","url":"http://api.twitter.com/1/geo/id/0613276b16c0d59f.json","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"https://si0.twimg.com/profile_banners/503242961/1354459551","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":"https://si0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://a0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","follow_request_sent":null,"url":"http://www.facebook.com/vickipolard","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":"http://a0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
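
Pulling the notable fields out of a Tweet like the one above could look roughly like this. The sample is a trimmed copy of that Tweet. The helper name `extract_fields` is hypothetical, and the `text` and `expanded_url` keys inside hashtag and URL entities are assumptions, since the sample's entity lists are empty. In this sample the `coordinates` field lists longitude before latitude, while `geo` uses the opposite order.

```python
import json
from datetime import datetime

# Trimmed copy of the sample Tweet, keeping only the fields discussed here.
raw = '''{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}'''

def extract_fields(tweet):
    """Flatten the notable fields of one parsed Tweet into a plain dict."""
    coords = tweet.get("coordinates")  # null when the Tweet carries no location
    lon, lat = coords["coordinates"] if coords else (None, None)
    return {
        "id": tweet["id"],
        "created_at": datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"),
        "text": tweet["text"],
        # Assumed entity keys; not visible in the sample because its lists are empty.
        "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
        "urls": [u["expanded_url"] for u in tweet["entities"]["urls"]],
        "latitude": lat,
        "longitude": lon,
    }

fields = extract_fields(json.loads(raw))
print(fields["id"], fields["created_at"].isoformat(), fields["latitude"], fields["longitude"])
```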

  7. My Tweet Collection • Collected for roughly one month • Lots of trial and error • Originally used Tweepy, but ran into errors • Switched to Tweetinvi and it worked • About 68 million Tweets • Apple • Amazon • Google • Microsoft • Netflix • Tesla • Ford (Probably should have used a different car company)

  8. Yahoo! Finance Detour • Use an HTTP request to get stock data • http://finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG+MSFT+YHOO+NFLX+AMZN+TSLA+F&f=snb2b3opl1t1d1 • Create a metric with stock data and compare the sentiment about each company with its stock performance
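
The request URL can be assembled from a list of ticker symbols, sketched below. The host is assumed to be finance.yahoo.com, as the slide title suggests, and `build_quotes_url` is a hypothetical helper. The sketch only builds the URL and sends no request, because nothing here confirms the endpoint is still reachable.

```python
from urllib.parse import urlencode

# Assumed host and path, following the URL pattern on the slide.
BASE_URL = "http://finance.yahoo.com/d/quotes.csv"

def build_quotes_url(symbols, fields="snb2b3opl1t1d1"):
    """Join the symbols with '+' and append the field codes taken from the slide."""
    query = urlencode({"s": "+".join(symbols), "f": fields}, safe="+")
    return f"{BASE_URL}?{query}"

url = build_quotes_url(["AAPL", "GOOG", "MSFT", "YHOO", "NFLX", "AMZN", "TSLA", "F"])
print(url)
```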

  9. Big Data (and regular data) Tools • Talend Open Studio • Hadoop • SAP HANA

  10. Talend Open Studio • Open Source ETL Tool • Built on Eclipse • Data Quality and Format Issues • Even though I saved Tweets in delimited format, issues remained • Iterated through all 12,736 files with 5000 tweets each • Verified each row against a schema • Mapped to different output files • Tweet (Fact table) • Tracks • User Mentions • Hashtags • URLs • Demo Time!

  11. Hadoop Overview • Based on the Hadoop Distributed File System and MapReduce • MapReduce is a way of parallelizing code using batch processing • Map finds the data you’re looking for • Reduce aggregates that data (count, sum, average) • Embarrassingly parallel processing • Each server in a Hadoop cluster is referred to as a Node • NameNode • DataNode • Blocks of data are replicated to three nodes • Extremely fault tolerant
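
The Map and Reduce roles can be imitated on a single machine, as in this sketch. It is not Hadoop code: the function names are hypothetical, the `shuffle` step stands in for the grouping the framework does between the two phases, and the third Tweet is made up for illustration.

```python
from collections import defaultdict

TRACKED = {"apple", "google", "tesla"}  # a subset of the tracked company names

def map_tweet(text):
    """Map: find the data of interest, emitting (company, 1) for each mention."""
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in TRACKED:
            yield word, 1

def shuffle(pairs):
    """Group the emitted values by key before they reach the reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: aggregate all values for one key (here, a count)."""
    return key, sum(values)

tweets = ["The Apple iPad sucks.", "The new Google Nexus 7 is awesome!", "Apple vs Google"]
pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
counts = dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
print(counts)  # {'apple': 2, 'google': 2}
```

Each call to `map_tweet` depends only on its own Tweet, which is why the work splits so easily across nodes.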

  12. More Hadoop • Open-source technology • Cloudera vs. Hortonworks • Intel, IBM, MapR, Amazon EMR • Cloudera and Hortonworks are the two biggest faces of Hadoop • Intel actively contributes to optimize it for Xeon Processors • IBM and MapR also involved • Big companies and entities use it

  13. Hadoop Projects • Hive • Data Warehouse on top of Hadoop • Uses HiveQL (essentially SQL with a few extras) to query data • Abstracts MapReduce processes • Has an ODBC connector to allow it to talk to anything that talks to databases • Pig • Uses a language called Pig Latin to analyze data • Data flow language abstracts MapReduce for easy use by data analysts • HBase • Billions of rows and millions of columns • Distributed column data store

  14. Hadoop Trivia Time • Who created Hadoop? • Why is it called Hadoop? • Who developed the concept of MapReduce? • What does Facebook Messenger use to store its data? • Who created Hive? • What is Accumulo and who created it?

  15. 2nd Generation Hadoop • Much faster than previous versions • Hive 0.12 is up to 50X faster than previous versions • Hortonworks Stinger project aims for 100X performance improvement • Projects like Spark are moving towards real-time analysis • In-memory cluster computing • Stream processing with routines written in Python and Scala • Shark is an implementation of Hive using Spark instead of MapReduce

  16. Hadoop Sentiment Analysis • Used the “Dumb” method of Sentiment Analysis • Import the data into HDFS and create Hive tables • Tweet • Sentiment Dictionary • Explode words in each tweet to create a view with TweetID and Word • Join with the Sentiment Dictionary on the word to get sentiment value • Demo Time!
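
The explode-and-join steps can be tried locally with an in-memory SQLite database standing in for Hive. This is a sketch, not the HiveQL used in the demo: the table and column names are assumptions, the exploded words go into a plain table rather than a view, and the dictionary has only two entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (tweet_id INTEGER, text TEXT)")
conn.execute("CREATE TABLE sentiment_dictionary (word TEXT, score INTEGER)")
conn.execute("CREATE TABLE tweet_word (tweet_id INTEGER, word TEXT)")

conn.executemany("INSERT INTO tweet VALUES (?, ?)", [
    (1, "Iron man 3 was awesome =)"),
    (2, "The Apple iPad sucks"),
])
conn.executemany("INSERT INTO sentiment_dictionary VALUES (?, ?)",
                 [("awesome", 1), ("sucks", -1)])

# "Explode": one (tweet_id, word) row for every word in every Tweet.
for tweet_id, text in conn.execute("SELECT tweet_id, text FROM tweet").fetchall():
    conn.executemany("INSERT INTO tweet_word VALUES (?, ?)",
                     [(tweet_id, word) for word in text.lower().split()])

# Join on the word to pick up its sentiment value, then total per Tweet.
scores = conn.execute("""
    SELECT tw.tweet_id, SUM(sd.score)
    FROM tweet_word AS tw
    JOIN sentiment_dictionary AS sd ON sd.word = tw.word
    GROUP BY tw.tweet_id
    ORDER BY tw.tweet_id
""").fetchall()
print(scores)  # [(1, 1), (2, -1)]
```

Because this is an inner join, a Tweet with no dictionary words drops out of the result; a left join would keep it with a NULL score.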

  17. SAP HANA • In-Memory, Column-Store database • Loads all data into main memory • Analyze billions of rows with sub-second response time • Column-store table structure • Allows for much better compression and parallelization than row-store • Used for real-time analytics • Available as an on-premises appliance or cloud-based VM
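
One way to see why a column layout compresses well is run-length encoding, sketched here on made-up rows. This illustrates a generic column-store technique and is not a description of how SAP HANA compresses data internally.

```python
from itertools import groupby

# Made-up rows: (tweet id, tracked company, language).
rows = [
    (1, "Apple", "en"), (2, "Apple", "en"), (3, "Apple", "en"),
    (4, "Google", "en"), (5, "Google", "en"), (6, "Tesla", "en"),
]

def run_length_encode(values):
    """Collapse each run of identical neighbouring values into (value, run length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# Column-store layout: the values of each column are stored next to each other.
ids, companies, langs = zip(*rows)
print(run_length_encode(companies))  # [('Apple', 3), ('Google', 2), ('Tesla', 1)]
print(run_length_encode(langs))      # [('en', 6)]
```

In a row layout the repeated values are separated by the other fields of each row, so runs like these never form.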

  18. Why is SAP HANA Awesome? • Column-stores are naturally very good at parallelization • In-Memory means no waiting on IO from disks and is still hundreds of times faster than SSD • Feature rich • Text analytics • Predictive Analysis Library • Application Server • It is an actual Database and does everything a database does • Demo Time

  19. SAP HANA Sentiment Analysis • Sentiment is calculated when creating a full-text index on the text of the tweet • Creates a sentiment value for each tweet • Analyze by many different dimensions • Aggregate sentiment by hour • Demo Time!
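
Aggregating sentiment by hour, once every Tweet has a score, might look like this outside of HANA. The scores and the second and third timestamps are made up, and `average_by_hour` is a hypothetical helper; HANA would do the equivalent with a SQL aggregate.

```python
from collections import defaultdict
from datetime import datetime

# (created_at, sentiment score) pairs; only the first timestamp is from the sample Tweet.
scored = [
    ("Thu May 02 19:39:29 +0000 2013", 1),
    ("Thu May 02 19:52:10 +0000 2013", -1),
    ("Thu May 02 20:05:00 +0000 2013", 1),
]

def average_by_hour(scored_tweets):
    """Bucket the scores by the hour of created_at and average each bucket."""
    buckets = defaultdict(list)
    for created_at, score in scored_tweets:
        timestamp = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
        buckets[timestamp.replace(minute=0, second=0)].append(score)
    return {hour: sum(values) / len(values) for hour, values in sorted(buckets.items())}

hourly = average_by_hour(scored)
for hour, average in hourly.items():
    print(hour.isoformat(), average)
```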

  20. Other Text Analysis Options • Python Natural Language Toolkit • Analyze parts of speech and context • Should be possible to integrate with Hadoop (The Google did not help)

  21. Other Big Data Problems • A GE engine on a transatlantic flight generates 2TB of sensor data • There are four engines on a 747 • What does the LHC at CERN do with the 15 petabytes of data it creates annually? • How does the NSA store a yottabyte of data? • How does a small online gaming company analyze its customer base to increase retention and margins?

  22. How is Sentiment Analysis Being Used? • Companies ingest their social media feeds into these systems • If a Tweet or Facebook post meets certain criteria, an automated or human response can be requested

  23. Hot vs. Cold Data • Hot data is the recent data you are most interested in • Keep this data in SAP HANA for real-time processing • Archive it after a period of time: 1 month, 3 months, 6 months, etc. • Cold data is your historical data • Data warehouses that can handle massive volumes of data are EXPENSIVE!!!! • Use Hadoop and Hive as your data warehouse • The only cost is the hardware • Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
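
The hot/cold split comes down to comparing each record's age with an archive window. A minimal sketch, assuming a 30-day window and records that carry a `created_at` timestamp; `split_hot_cold` is a hypothetical helper, and actually moving data between SAP HANA and Hadoop is outside its scope.

```python
from datetime import datetime, timedelta, timezone

def split_hot_cold(records, now, hot_window=timedelta(days=30)):
    """Return (hot, cold): records inside the window stay hot, older ones go cold."""
    cutoff = now - hot_window
    hot = [record for record in records if record["created_at"] >= cutoff]
    cold = [record for record in records if record["created_at"] < cutoff]
    return hot, cold

now = datetime(2013, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": datetime(2013, 5, 20, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2013, 2, 25, tzinfo=timezone.utc)},
]
hot, cold = split_hot_cold(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])  # [1] [2]
```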
