70 likes | 208 Views
Description of Twitter Data Streamed Using Twitter4j. Streamed data Vs. Query database. Two methods to get the data we need : S treamed from Twitter (This is what we have chosen) Advantage: It is free and takes no time to get.
E N D
Streamed data Vs. Query database Two methods to get the data we need : • Streamed from Twitter (This is what we have chosen) • Advantage: It is free and takes no time to get. • Con: Subject to availability--twitter only release part of its stream data to personal accounts but more to those who with organization accounts. However, we do not know exactly how much data does twitter release to the public. • Query their database • Advantage: Data is more complete than streamed data • Con: The query takes too much time and it is not free.
Crawler Tools • The API available Twitter REST API is provided by twitter. Now it is version 1.1. It provides functions for users to connect to Twitter’s API server and get streamed data. • The Tool we used Twitter4j is what we used to get the data. It is an open sourced software that implements many of Twitter API’s functions. The author is Yusuke Yamamoto. The following is the web link: http://twitter4j.org/en/index.html
Twitter4j Functions • ConfigurationBuilder() Build a configuration object with OAuth key, OAuth secret, AccessToken, and AccessToken secret which all need to be obtained by registering with twitter through OAuth authentication system. We registered a personal account through this system • TwitterStreamFactory() Takes the authentication information as input and establishes connection with the steaming server. • StatusListener() This is the event catcher and it catches any incoming stream and store it in a Status object
Twitter4j Functions Continued • Status() This creates Status objects that contains user name, user ID, status ID, text content of the status, time created, what kind of devices the user used to post the status, geographical location of the user when the status is posted( only if user choose to release it), how many times this status has been retweeted, user of the status and some other information. • User() This creates user object that returns the information about the user who posted the status
The Data We have crawled the data for 3 days. The data are arranged into 3 types, all in txt format:
A little side note on the raw data The raw data is about 17 times bigger in size when compared to the Sina data: Each txt file contains 4000 entries and take around 9Mb disk space while each Sina Data’s file contains 70000 entries and is about 8Mb . We can download about 6Gb data per day, so we will have many files created to keep those data. The file numbers will result in slow transfer of file as well as processing the data for research later on.