By Chris Orwa,
Junior Data Scientist, iHub Research
The R statistical computing software provides various packages for capturing data from different sources. In this blog, I’ll describe how to use R to collect data from Twitter. But first, let me provide a brief background on how to access Twitter data.
There are three levels to accessing Twitter data:
- Via the Search API, which returns the most popular tweets. This is the data you get when you do a manual search on the Twitter website, and in this mode only a short historical window (a few days) can be queried.
- Via the Streaming API, which provides real-time access to Twitter data but only to a sample of all tweets. The Streaming API is suitable when you just want to get a feel for an unfolding event.
- Via the Twitter Firehose, which returns all tweets, both historical and real time, on given keywords. Unfortunately, only a few companies have this type of access.
To get started, first register an app on the Twitter developer website https://dev.twitter.com/ and fill in the details.
After successful registration of the app, click on the app and you should see a table similar to the one below under ‘OAuth settings’.
|Access level||Read-only|
|Request token URL||https://api.twitter.com/oauth/request_token|
|Access token URL||https://api.twitter.com/oauth/access_token|
|Sign in with Twitter||No|
These settings will be used as variables in R so it is best to copy and save them in a text file.
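One way to load such a file back into R is sketched below. The file layout (one `name=value` pair per line) and the names `consumer_key`, `consumer_secret`, `request_url` and `access_url` are assumptions made for illustration, and the key values are placeholders rather than real credentials. The sketch writes a sample file to a temporary location so it can be run as-is.

```r
# Hypothetical settings file: one "name=value" pair per line.
settings_file <- tempfile(fileext = ".txt")
writeLines(c(
  "consumer_key=YOUR_CONSUMER_KEY",
  "consumer_secret=YOUR_CONSUMER_SECRET",
  "request_url=https://api.twitter.com/oauth/request_token",
  "access_url=https://api.twitter.com/oauth/access_token"
), settings_file)

# Read the pairs back and split each line at the first "=".
raw_lines <- readLines(settings_file)
parts <- strsplit(raw_lines, "=", fixed = TRUE)

# Build a named list; re-join the value in case it contains "=" itself.
settings <- setNames(
  lapply(parts, function(p) paste(p[-1], collapse = "=")),
  vapply(parts, function(p) p[1], character(1))
)

print(settings$request_url)
```

In practice you would point `settings_file` at your own text file instead of a temporary one, and keep that file out of any shared folder since it holds your secret key.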
Now you can fire up R and install the twitteR package with the command
install.packages("twitteR")
and load it into the workspace with the command
require(twitteR)
Proceed with the code as in the diagram below, referring to the information you copied to the text file.
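For readers following along in text, the credential setup can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code from the diagram: it assumes the ROAuth package (which supplies `OAuthFactory`) is installed, the key and secret are placeholders, and the variable names other than `twitCred` are my own. The authorize URL is the one Twitter prints during the handshake.

```r
# Placeholders: substitute the consumer key and secret shown for your own app.
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"

# URLs from the OAuth settings table, plus the authorize URL.
request_url <- "https://api.twitter.com/oauth/request_token"
access_url  <- "https://api.twitter.com/oauth/access_token"
auth_url    <- "https://api.twitter.com/oauth/authorize"

# OAuthFactory lives in ROAuth; only build the object if the package is available.
if (requireNamespace("ROAuth", quietly = TRUE)) {
  twitCred <- ROAuth::OAuthFactory$new(
    consumerKey    = consumer_key,
    consumerSecret = consumer_secret,
    requestURL     = request_url,
    accessURL      = access_url,
    authURL        = auth_url
  )
} else {
  message("ROAuth is not installed; run install.packages(\"ROAuth\") first.")
}
```

Creating the object does not contact Twitter yet; that only happens at the handshake step.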
That’s pretty much standard code! Note that the consumer key and consumer secret should be the ones provided for your app (for this blog, I’ve used mine). The next part is quite tricky and gave me some headaches.
After providing the credentials via the OAuthFactory$new() function, a handshake has to be initiated between your app and the Twitter server. A handshake in computing is a preliminary exchange between two systems that sets the rules of the communication. In this case, it runs over a connection secured with digital certificates (SSL certificates): R has to verify the certificate sent by the Twitter server before Twitter can acknowledge the app and settle the type of information to be communicated.
The more straightforward and faster way around this is to first download the certificate with the R code:
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
Then proceed with the code below to initiate the handshake.
twitCred$handshake(cainfo="cacert.pem")
Success at this stage should be in the form of a request for a PIN, in a textual format that reads:
To enable the connection, please direct your web browser to: https://api.twitter.com/oauth/authorize?oauth_token=kxzyNUke8nBprcClN4BTipXqgWKKn27Xf7We1qPJZE
Copy and paste the URL into your browser. This is what I got.
Type the PIN back in at the prompt below as a reply.
When complete, record the PIN given to you and provide it here.
We are almost there! The next step is to register the credentials using the line below.
registerTwitterOAuth(twitCred)
The function returns TRUE when all is well.
At this point I felt home was just a stone’s throw away, only to be hit with an error while trying to use the searchTwitter() function.
 "SSL certificate problem, verify that the CA cert is OK. Details:\nerror:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
Error in twInterfaceObj$doAPICall(cmd, params, "GET", ...) :
Error: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
If this happened to you too, do not worry, I have the antidote. Set the SSL certificates globally using the code below.
# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
Hooray!!! We’ve done it! Now we have the power to search, collect and analyze tweets.
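As a small safeguard, it can help to confirm that a certificate bundle was actually found before setting the option, because `system.file()` returns an empty string when the package or file is missing. The sketch below is one way to do that; falling back to a `cacert.pem` in the working directory is my own assumption, based on the file downloaded earlier, and the variable name `ca_bundle` is hypothetical.

```r
# Look up the CA bundle that ships with RCurl.
# system.file() returns "" if RCurl or the file cannot be found.
ca_bundle <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Assumed fallback: the cacert.pem downloaded into the working directory.
if (!nzchar(ca_bundle) && file.exists("cacert.pem")) {
  ca_bundle <- normalizePath("cacert.pem")
}

if (nzchar(ca_bundle)) {
  # Apply the bundle to every RCurl request made in this session.
  options(RCurlOptions = list(cainfo = ca_bundle))
} else {
  message("No CA bundle found; install RCurl or download cacert.pem first.")
}
```

If the message appears, the SSL error is likely to come back, so it is worth fixing before calling any search functions.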
How about we do something interesting? Let’s download tweets from a trending topic and graph the highest contributors to the topic. To do this, download and install the ggplot2 and plyr packages. Using the searchTwitter() function in the twitteR package, I captured 1000 tweets from the trending topic #TvYa13Million and graphed them.
Here is the code that did all the magic:
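[PS: I didn't exit my R session; this code is a continuation of the above]
library(plyr); library(ggplot2)  # ldply() comes from plyr, ggplot() from ggplot2
TvTweets <- searchTwitter("#TvYa13Million", n = 1000)  # fetch up to 1000 tweets
users <- ldply(TvTweets, function(x) x$screenName)  # one row per tweet; the handle lands in column V1
# Handles are discrete values, so count them with geom_bar(); recent ggplot2 versions reject geom_histogram() here
ggplot(users, aes(x = V1)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylab("Count of tweets using #TvYa13Million") + xlab("Twitter handle")
And there you go: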
You now have the power to capture tweets and analyze them!
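If you would rather see the top contributors as a small table than read them off the plot, the counts can be tabulated with base R. The sketch below is illustrative only: the handles are made-up stand-ins for the screen names extracted from the tweets, and `top_n` is a hypothetical cut-off.

```r
# Made-up handles standing in for the screen names pulled from the tweets.
handles <- c("user_a", "user_b", "user_a", "user_c", "user_a", "user_b")

# Count tweets per handle and sort from most to least active.
counts <- sort(table(handles), decreasing = TRUE)

# Keep only the top contributors (hypothetical cut-off).
top_n <- 2
top <- head(counts, top_n)

# Convert to a plain data frame for printing or further analysis.
top_df <- data.frame(
  handle = names(top),
  tweets = as.integer(top),
  stringsAsFactors = FALSE
)
print(top_df)
```

With real data, the `handles` vector would be replaced by the column of screen names built from the search results.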