
Update By: Brian Klug, Li Fan



Presentation Transcript


  1. Update By: Brian Klug, Li Fan
     • Presentation Overview:
       • API we plan to use (Syntax and commands)
       • Obtainable Data Types (Location, Text, Time, User, Reply)
       • Infrastructure (Hardware, Storage Req’s, Design)
       • Tentative Work Plan (Timeline and Schedule)

  2. API: Streaming API
     • Enables near-real-time access to a subset of public Twitter statuses.
     • Currently in alpha test.
     • Access to further restricted resources is extremely limited and granted only after acceptance of an additional TOS document.
     • We have applied for credentials which grant us access to these increased resources (namely a larger sampling, more statuses).
     • http://apiwiki.twitter.com/Streaming-API-Documentation
     • Features of the Streaming API:
       • Continual connection that streams statuses over HTTP. It stays open indefinitely and requires only basic authentication for the most basic access level.
       • Output data is in XML or JSON format, both of which are easy to parse.
       • Can focus on certain tracking predicates that, when specific enough, return all occurrences in the full Firehose stream.
       • E.g. "track=basketball,football,baseball,footy,soccer", saved in a file named tracking (curl's -d @tracking reads the POST body from that file). Execute: curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
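
As a rough illustration of the tracking predicate on this slide, the sketch below builds the form-encoded POST body that the curl command reads from the tracking file. The function names `build_track_body` and `write_tracking_file` are hypothetical helpers, not part of any Twitter library, and the code only prepares the request body; it does not contact the server.

```python
from urllib.parse import urlencode


def build_track_body(keywords):
    """Return a form-encoded POST body such as 'track=a,b,c'."""
    # Drop empty entries and surrounding whitespace (an assumption about
    # how a keyword list would be cleaned up before use).
    cleaned = [k.strip() for k in keywords if k.strip()]
    # safe="," keeps commas literal, matching the example on the slide.
    return urlencode({"track": ",".join(cleaned)}, safe=",")


def write_tracking_file(keywords, path="tracking"):
    """Write the body to the file that `curl -d @tracking` would read."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_track_body(keywords))


if __name__ == "__main__":
    print(build_track_body(["basketball", "football", "baseball", "footy", "soccer"]))
```

Keeping the body in a file mirrors the slide's curl invocation, so the keyword list can be changed without touching the command line.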

  3. Streaming API data
     • Example data:
       {"truncated":false,"text":"@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not busy.","favorited":false,"in_reply_to_screen_name":"FreedomProject","source":"<a href=\"http://www.tweetdeck.com/\" rel=\"nofollow\">TweetDeck</a>","created_at":"Fri Nov 20 06:37:58 +0000 2009","in_reply_to_user_id":20688076,"in_reply_to_status_id":5882468251,"geo":null,"user":{"favourites_count":0,"verified":false,"notifications":null,"profile_text_color":"34da43","time_zone":"Tijuana","profile_link_color":"e98907","description":"I'm a Robot created in Mexican soil, therefore my name is Mexican Robot","profile_background_image_url":"http://a3.twimg.com/profile_background_images/4329659/d2e513deb84e6fdc10de6ac70ef2f637f8f62f26.jpg","created_at":"Mon Dec 22 07:34:02 +0000 2008","profile_sidebar_fill_color":"b03636","profile_background_tile":false,"location":"Surfin' tubular Innernetwaves","following":null,"profile_sidebar_border_color":"050e61","protected":false,"profile_image_url":"http://a3.twimg.com/profile_images/515614231/jessicaavvy_normal.png","statuses_count":946,"followers_count":59,"name":"MexicanRobot","friends_count":173,"screen_name":"MexicanRobot","id":18303131,"geo_enabled":false,"utc_offset":-28800,"profile_background_color":"000000","url":"http://sharkwithwheels.webs.com"},"id":5882552501}
     • Data Classes:
       • Who the message is in response to, if anyone
       • Client user agent
       • Location-tagged (geo-aware) data, if any
       • Time of creation and time zone of poster
       • Information about avatar, background, profile
       • User metrics: Statuses posted, Followers, Friends
       • User description: short user-defined string
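
To show how the data classes listed on this slide map onto the JSON, here is a minimal sketch that parses a trimmed copy of the example status. The field names come from the example itself; the helper `extract_data_classes` and the shape of its result are assumptions made for illustration.

```python
import json
import re
from datetime import datetime

# Trimmed copy of the example status from the slide (most profile fields omitted).
SAMPLE = r'''{"text":"@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not busy.",
"in_reply_to_screen_name":"FreedomProject",
"source":"<a href=\"http://www.tweetdeck.com/\" rel=\"nofollow\">TweetDeck</a>",
"created_at":"Fri Nov 20 06:37:58 +0000 2009","geo":null,
"user":{"time_zone":"Tijuana","utc_offset":-28800,"statuses_count":946,
"followers_count":59,"friends_count":173,"screen_name":"MexicanRobot"},
"id":5882552501}'''


def extract_data_classes(raw):
    """Pull the slide's data classes out of one JSON status line."""
    status = json.loads(raw)
    user = status["user"]
    return {
        # Who the message is in response to, if anyone
        "reply_to": status.get("in_reply_to_screen_name"),
        # Client user agent: strip the HTML anchor around the client name
        "client": re.sub(r"<[^>]+>", "", status["source"]),
        # Geo data is null unless the poster enabled it
        "geo": status.get("geo"),
        # Time of creation (timezone-aware) and the poster's time zone
        "created_at": datetime.strptime(
            status["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "time_zone": user.get("time_zone"),
        "utc_offset": user.get("utc_offset"),
        # User metrics: statuses posted, followers, friends
        "metrics": (
            user["statuses_count"],
            user["followers_count"],
            user["friends_count"],
        ),
    }


if __name__ == "__main__":
    for key, value in extract_data_classes(SAMPLE).items():
        print(key, "=", value)
```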

  4. Infrastructure
     • Streaming API expected volume: 3-4 million entries/day
     • Storage Consideration:
       • Average total JSON example output size: ~1400 characters
       • Messages are UTF-8; we’ll assume most characters are 1 byte
       • 1400 characters/message * 1 byte/character * 3.5 million messages/day = 4.9 billion bytes/day ≈ 4.56 GiB/day
       • 1 year ≈ 1.6 TiB (about 1.8 TB)
     • Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
     • May require additional public IP addresses depending on rate limits, and additional servers depending on load
     • Download first, parse later
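
The storage estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the slide's own assumptions (about 1400 one-byte characters per message, 3.5 million messages per day, a 365-day year); the constant names are illustrative.

```python
CHARS_PER_MESSAGE = 1400        # average JSON size from the slide
BYTES_PER_CHAR = 1              # assumes mostly single-byte UTF-8 characters
MESSAGES_PER_DAY = 3_500_000    # midpoint of the 3-4 million/day estimate

bytes_per_day = CHARS_PER_MESSAGE * BYTES_PER_CHAR * MESSAGES_PER_DAY
bytes_per_year = bytes_per_day * 365

gib_per_day = bytes_per_day / 2**30     # binary gigabytes
tib_per_year = bytes_per_year / 2**40   # binary terabytes
tb_per_year = bytes_per_year / 10**12   # decimal terabytes

if __name__ == "__main__":
    print(f"{bytes_per_day:,} bytes/day = {gib_per_day:.2f} GiB/day")
    print(f"1 year = {tib_per_year:.2f} TiB ({tb_per_year:.2f} TB)")
```

The slide's figures of 4.56 and 1.6 match the binary units (GiB and TiB); in decimal units the same volume is 4.9 GB/day.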

  5. Tentative Timeline
     • Work Plan:
       • Continue investigating the use of RSS to download status updates from further in the past than the 15,000 we are allowed to go back using the Streaming API
       • 1-2 weeks: test our environment and make sure everything is working well
       • Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test
       • Await possible response from Twitter regarding access to additional restricted resources (even higher-rate Firehose)
       • 2 weeks to explore how to parse the content into a DB, and whether this can realistically be done in real time in another process
       • Additional time for data mining, research topics, etc.
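
One possible way to make the download loop resistant to downtime is to reconnect with exponential backoff while storing raw lines for later parsing ("download first, parse later"). The sketch below is an assumption about how that could look, not the authors' method: `connect` and `store` are hypothetical callables supplied by the caller, and the delay values are arbitrary.

```python
import time
from typing import Callable, Iterable, Iterator


def backoff_delays(initial: float = 1.0, cap: float = 240.0) -> Iterator[float]:
    """Yield exponentially growing wait times, capped at `cap` seconds."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


def download_with_reconnect(
    connect: Callable[[], Iterable[str]],
    store: Callable[[str], None],
    max_failures: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Store every raw line from the stream, reconnecting after failures.

    Returns the number of lines stored. Gives up after `max_failures`
    consecutive failures without a successful read in between.
    """
    stored = 0
    failures = 0
    delays = backoff_delays()
    while failures < max_failures:
        try:
            for line in connect():
                store(line)  # raw line only; parsing happens later
                stored += 1
                # A successful read resets the failure count and the backoff.
                failures = 0
                delays = backoff_delays()
            return stored  # the stream ended cleanly
        except OSError:
            # ConnectionError and socket timeouts are subclasses of OSError.
            failures += 1
            sleep(next(delays))
    return stored
```

In use, `connect` would open the HTTP stream and yield one status per line, and `store` would append that line to a file on disk; both are left to the caller so the retry logic can be exercised without a network.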
