Update By: Brian Klug, Li Fan

Presentation Overview:
• API we plan to use (Syntax and commands)
• Obtainable Data Types (Location, Text, Time, User, Reply)
• Infrastructure (Hardware, Storage Req's, Design)
• Tentative Work Plan (Timeline and Schedule)
API: Streaming API

• Enables near real-time access to a subset of public Twitter statuses.
• Currently in alpha test.
• Access to further restricted resources is extremely limited and is granted only after acceptance of an additional TOS document.
• We have applied for credentials that grant access to these increased resources (namely a larger sampling and more statuses).
• http://apiwiki.twitter.com/Streaming-API-Documentation

Features of the streaming API:
• A continual connection that streams statuses over HTTP. It is held open indefinitely, and the most basic access level requires only basic authentication.
• Output data is in XML or JSON format, both of which are easy to parse.
• Can track certain predicates that, when specific enough, return all matching occurrences from the full Firehose stream.
• E.g., with a file named "tracking" containing "track=basketball,football,baseball,footy,soccer", execute:
  curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
  (A Python sketch of the same connection follows below.)
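For illustration, a minimal Python sketch of the same filtered connection. The endpoint, tracking predicate, and placeholder credentials come from the curl example above; the use of the third-party requests library is our own assumption, not something the slides prescribe.

  import json
  import requests  # third-party HTTP library; our assumption, not from the slides

  # Endpoint and tracking predicate from the curl example; credentials are placeholders.
  STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"
  TRACK = "basketball,football,baseball,footy,soccer"

  # Open a long-lived streaming connection using basic authentication.
  resp = requests.post(
      STREAM_URL,
      data={"track": TRACK},
      auth=("AnyTwitterUser", "Password"),  # placeholder credentials
      stream=True,  # keep the connection open and read statuses as they arrive
  )

  # Each non-empty line of the response is one JSON-encoded status.
  for line in resp.iter_lines():
      if line:
          status = json.loads(line)
          print(status.get("text"))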
Streaming API data

Example data (one raw JSON status, reproduced verbatim):

{"truncated":false,"text":"@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not busy.","favorited":false,"in_reply_to_screen_name":"FreedomProject","source":"<a href=\"http://www.tweetdeck.com/\" rel=\"nofollow\">TweetDeck</a>","created_at":"Fri Nov 20 06:37:58 +0000 2009","in_reply_to_user_id":20688076,"in_reply_to_status_id":5882468251,"geo":null,"user":{"favourites_count":0,"verified":false,"notifications":null,"profile_text_color":"34da43","time_zone":"Tijuana","profile_link_color":"e98907","description":"I'm a Robot created in Mexican soil, therefore my name is Mexican Robot","profile_background_image_url":"http://a3.twimg.com/profile_background_images/4329659/d2e513deb84e6fdc10de6ac70ef2f637f8f62f26.jpg","created_at":"Mon Dec 22 07:34:02 +0000 2008","profile_sidebar_fill_color":"b03636","profile_background_tile":false,"location":"Surfin' tubular Innernetwaves","following":null,"profile_sidebar_border_color":"050e61","protected":false,"profile_image_url":"http://a3.twimg.com/profile_images/515614231/jessicaavvy_normal.png","statuses_count":946,"followers_count":59,"name":"MexicanRobot","friends_count":173,"screen_name":"MexicanRobot","id":18303131,"geo_enabled":false,"utc_offset":-28800,"profile_background_color":"000000","url":"http://sharkwithwheels.webs.com"},"id":5882552501}

Data Classes (a parsing sketch follows below):
• Who the message is in reply to, if anyone
• Client user agent (source)
• Geo-tagged location data, if any
• Time of creation and the poster's time zone
• Information about avatar, background, and profile
• User metrics: statuses posted, followers, friends
• User description: a short user-defined string
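A minimal sketch of pulling these data classes out of one status line. The field names are taken from the example record above; the extract helper and its return layout are our own illustrative choices.

  import json

  def extract(raw_line):
      """Pull the data classes listed above out of one raw JSON status line."""
      status = json.loads(raw_line)
      user = status["user"]
      return {
          "reply_to": status["in_reply_to_screen_name"],  # who the message answers, if anyone
          "source": status["source"],                     # client user agent (e.g. TweetDeck)
          "geo": status["geo"],                           # geo-tagged location, if any
          "created_at": status["created_at"],             # time of creation
          "time_zone": user["time_zone"],                 # poster's time zone
          "statuses": user["statuses_count"],             # user metrics
          "followers": user["followers_count"],
          "friends": user["friends_count"],
          "description": user["description"],             # short user-defined string
      }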
Infrastructure

• Streaming API expected volume: 3-4 million entries/day
• Storage considerations:
  • Average size of the example JSON output: ~1,400 characters
  • Messages are UTF-8; we'll assume most characters take 1 byte, so ~1,400 bytes/message
  • 1,400 bytes/message * 3.5 million messages/day ≈ 4.9 GB/day (~4.56 GiB/day)
  • 1 year ≈ 1.6 TiB (about 1.8 TB); see the arithmetic sketch below
• Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
• May require additional public IP addresses depending on rate limits, and additional servers depending on load
• Download first, parse later
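The sizing estimate is simple arithmetic; this snippet just reproduces the figures above (1,400 bytes/message at the 3.5 million/day midpoint of the expected volume):

  # Back-of-the-envelope storage estimate from the figures above.
  BYTES_PER_MESSAGE = 1400    # average JSON status size; ~1 byte/character in UTF-8
  MESSAGES_PER_DAY = 3.5e6    # midpoint of the expected 3-4 million entries/day

  bytes_per_day = BYTES_PER_MESSAGE * MESSAGES_PER_DAY
  gib_per_day = bytes_per_day / 1024**3
  tib_per_year = gib_per_day * 365 / 1024

  print(f"{gib_per_day:.2f} GiB/day")    # ~4.56 GiB/day
  print(f"{tib_per_year:.2f} TiB/year")  # ~1.63 TiB/year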
Tentative Timeline

Work Plan:
• Continue investigating using RSS to download status updates from further in the past than the 15,000 statuses the streaming API allows us to go back.
• 1-2 weeks: test our environment and make sure everything is working well.
• Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test (a reconnection sketch follows below).
• Await a possible response from Twitter regarding access to additional restricted resources (an even higher-rate firehose).
• 2 weeks to explore how to parse the content into a DB, and whether this can realistically be done in real time in a separate process.
• Additional time for data mining, research topics, etc.
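One way to make the download loop resistant to Twitter downtime is to reconnect with exponential backoff, appending raw lines to disk ("download first, parse later") so parsing can happen in a separate process. A sketch under those assumptions; the function name, backoff constants, and use of the requests library are our own, not from the slides.

  import time
  import requests  # third-party HTTP library; an assumption on our part

  STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"

  def download_forever(track, auth, out_path):
      """Append raw status lines to a file, reconnecting with backoff on failure."""
      backoff = 1  # seconds; doubled after each failure, capped at 4 minutes
      while True:
          try:
              resp = requests.post(
                  STREAM_URL,
                  data={"track": track},
                  auth=auth,
                  stream=True,
                  timeout=90,  # treat a stalled connection as a failure
              )
              with open(out_path, "ab") as out:
                  for line in resp.iter_lines():
                      if line:
                          out.write(line + b"\n")  # store unparsed: download first, parse later
                          backoff = 1  # reset once data is flowing again
          except requests.RequestException:
              # Twitter downtime or a dropped connection: wait, then retry.
              time.sleep(backoff)
              backoff = min(backoff * 2, 240)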