Learn how to use the Facebook API to scrape data in R, including obtaining an API key, creating an OAuth token, and using functions from the Rfacebook package.
Scraping Facebook via API in R. Shashank Hebbar, Ph.D. Student, Analytics and Data Science, Kennesaw State University
What is an API? API is short for Application Programming Interface. It is a way of accessing the functionality of one program from inside another. Instead of performing an action through an interface made for humans, such as a point-and-click GUI, a program can perform that action automatically through the API. Today, "API" usually refers to web APIs built on HTTP, the same protocol that web servers and browsers use to exchange data.
API identification and authorization. API key (aka token): a key identifies the user and lets the provider track and control how the API is being used (guarding against malicious use). A key is often obtained by supplying basic information (e.g. name, email) to the organization, which in return issues a multi-character key. OAuth is an authorization framework that provides credentials as proof of access to certain information. Many APIs are open to the public and only require an API key; however, some APIs require authorization to access account data (think personal Facebook and Twitter accounts). R has an extensive list of packages that hook API data feeds into R. Many of them are listed in the CRAN Web Technologies task view: https://cran.r-project.org/web/views/WebTechnologies.html.
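To make the API key idea above concrete, here is a minimal base R sketch of attaching a key to a request URL. The endpoint, the environment variable name MY_API_KEY, and the query parameter name api_key are all hypothetical; real APIs document their own names, and some expect the key in a header instead.

```r
# Build a request URL that carries an API key as a query parameter.
# MY_API_KEY and the "api_key" parameter name are hypothetical placeholders.
build_request_url <- function(base_url, params, key_var = "MY_API_KEY") {
  # Read the key from the environment so it never appears in the script
  key <- Sys.getenv(key_var)
  if (!nzchar(key)) {
    stop("Set the ", key_var, " environment variable first.")
  }
  params$api_key <- key
  # URL-encode each value so spaces and symbols survive the request
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}
```

Keeping the key in an environment variable (for example via a user-level .Renviron file) avoids committing it to shared code.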
Facebook API: register a new application. From the Facebook Developers site, click Apps at the top of the page to go to the application dashboard. Click the "Create New App" button near the top. Once you complete the verification process, your application is created. Note down the App ID and App Secret.
Create an OAuth token for the Facebook R session. fbOAuth creates a long-lived OAuth access token that enables R to make authenticated calls to the Facebook API.
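A minimal sketch of the token step, assuming the Rfacebook package is installed and that the placeholder strings are replaced with the App ID and App Secret from the developer dashboard. The file name fb_oauth is an arbitrary choice. fbOAuth opens a browser window for authorization, so this is meant for an interactive session.

```r
library(Rfacebook)

# Placeholders: substitute the App ID and App Secret noted earlier.
app_id <- "YOUR_APP_ID"
app_secret <- "YOUR_APP_SECRET"

# Opens a browser window so the app can be authorized.
fb_oauth <- fbOAuth(app_id = app_id, app_secret = app_secret)

# Save the token so later sessions can reuse it without re-authorizing.
save(fb_oauth, file = "fb_oauth")

# In a later session, restore it with:
# load("fb_oauth")
```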
Functions from the Rfacebook package
• getLikes(user, n = n, token): extracts the list of pages a Facebook user has liked, with page IDs. Arguments: user: user name or ID; n: number of liked pages to return for the user.
• searchPages(string, token, n = n): searches for pages matching a string/keyword. Arguments: string: any search string; n: number of pages to return.
• getPage(page, token, n = n): extracts a list of posts from a public Facebook page.
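The three functions above might be called as follows. This is a sketch that assumes fb_oauth is a token object previously created with fbOAuth. The search string and the values of n are arbitrary illustrations. Facebook has changed its Graph API permissions over time, so some of these calls may return less data than described, or none.

```r
library(Rfacebook)

# Assumes `fb_oauth` is a valid token created earlier with fbOAuth().

# Pages liked by the authenticated user ("me"), up to 50
likes <- getLikes(user = "me", n = 50, token = fb_oauth)

# Up to 20 pages matching a keyword (the keyword is just an example)
pages <- searchPages(string = "data science", token = fb_oauth, n = 20)

# The 100 most recent posts from a public page
posts <- getPage(page = "humansofnewyork", token = fb_oauth, n = 100)

# Each call returns a data frame; inspect the first rows
head(posts)
```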
Analyzing data from a Facebook page. For example, assume that we're interested in learning how the Facebook page Humans of New York has become popular, and what type of audience it has. The first step is to retrieve a data frame with information about all its posts. Using this data frame, it is relatively straightforward to visualize how the popularity of Humans of New York has grown exponentially over time.
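One way to sketch the popularity analysis described above is to average likes per post by month. The helper below is an illustration, not the presenter's code. It assumes the data frame has a created_time column formatted like "2014-05-01T12:00:00+0000" and a likes_count column, as in getPage() output; the name monthly_mean_likes is made up here.

```r
# Average likes per post for each calendar month.
# `posts` is assumed to have `created_time` and `likes_count` columns.
monthly_mean_likes <- function(posts) {
  # Parse Facebook-style timestamps as GMT date-times
  when <- as.POSIXct(posts$created_time,
                     format = "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT")
  posts$month <- format(when, "%Y-%m")
  # One row per month with the mean number of likes
  out <- aggregate(likes_count ~ month, data = posts, FUN = mean)
  out[order(out$month), ]
}

# Example use (requires a valid token, here called fb_oauth):
# page <- Rfacebook::getPage("humansofnewyork", token = fb_oauth, n = 5000)
# trend <- monthly_mean_likes(page)
# plot(as.Date(paste0(trend$month, "-01")), trend$likes_count, type = "l",
#      xlab = "Month", ylab = "Mean likes per post")
```

The same aggregation can be repeated for comments_count or shares_count to compare different measures of engagement.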
Other API packages in R. Some of the popular packages are:
• blsAPI for pulling U.S. Bureau of Labor Statistics data
• rnoaa for pulling NOAA climate data
• rtimes for pulling data from multiple APIs offered by the New York Times
The rnoaa package allows users to request climate data from multiple data sets through the National Climatic Data Center API. Unlike blsAPI, the rnoaa package requires you to have an API key. To request a key, go to https://www.ncdc.noaa.gov/cdo-web/token and provide your email; a key will immediately be emailed to you.
What if there is no package for that API? Although numerous R API packages are available and cover a wide range of data, you may eventually want to use an organization's API for which no R package exists. This is where httr comes in. httr was developed by Hadley Wickham to make working with web APIs easy. One of its most commonly used functions is GET(). We use GET() to access an API, pass it request parameters, and receive a response. httr is designed to map closely to the underlying HTTP protocol. There are two important parts to HTTP: the request (the data sent to the server) and the response (the data sent back from the server).
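The request/response pattern can be sketched with httr as below. The helper name fetch_json and the example URL are hypothetical, and the sketch assumes the API returns JSON (parsing JSON with httr also needs the jsonlite package installed).

```r
# Send a GET request and return the parsed JSON body.
# Nothing here is specific to a real API; the caller supplies the URL.
fetch_json <- function(url, query = list()) {
  # The request: URL plus query parameters
  resp <- httr::GET(url, query = query)
  # Turn HTTP errors (4xx/5xx) into R errors
  httr::stop_for_status(resp)
  # The response: parse the body as JSON into R lists
  httr::content(resp, as = "parsed", type = "application/json")
}

# Hypothetical usage:
# result <- fetch_json("https://api.example.com/v1/items",
#                      query = list(q = "facebook", limit = 10))
```

Checking the status before parsing keeps a failed request from silently producing an empty or misleading result.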