Analyzing Twitter data with R (part 3: Cleaning & Organizing the Data)

Aug 30, 2015

In the previous parts we explained how to set up access to Twitter's API and how to import tweets with a simple R command. In this third part we will organize and clean the imported data into a good, reusable format.

So let's pick up where we left off. With the authentication process finished, we now want to import some tweets, and the searchTwitter command from the twitteR package will do the job for us. We need to create a new object to hold the data we are about to import and assign it the result of our searchTwitter call.

In our example, we will query Twitter for tweets containing the hashtag #Tunisia, import 1000 English tweets, and preview the imported data with the head() function:
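A minimal sketch of that call (searchTwitter() takes the query string, the number of tweets n, and a language code; Tun_Tweets is the object name used throughout this post):

```r
library(twitteR)

# Query Twitter for 1000 English tweets containing the hashtag #Tunisia
Tun_Tweets <- searchTwitter("#Tunisia", n = 1000, lang = "en")

# Preview the first few imported tweets
head(Tun_Tweets)
```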

As you can see, the head() function gave us a preview of the Tun_Tweets object. We have 1000 tweets with all the information related to them, such as the creation date, the user's name, the URL of the tweet, and a lot of other data that can serve us later on.
In the preview we can see the user's name followed by the tweet's text and the tweet's URL.
For our little project, the only data we will probably need is the text of the tweets.

Let's take a look at the structure of the Tun_Tweets object created by the searchTwitter function:
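Something along these lines should do it (searchTwitter() returns a list of status objects, so we inspect the structure of the first one):

```r
# Inspect the fields stored in a single tweet (a twitteR status object)
str(head(Tun_Tweets, 1))
```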


As you can see, a good amount of data is imported with each tweet, from the user's name to the latitude and longitude of the tweet, and much more.
If you are interested in the full list of fields extracted, you can check the Twitter documentation.


Now comes the fun part: shaping the data into a format we can use for the final task, a basic text-mining exercise.
To accomplish this, we will need the tm package.

But before that, we need to put our tweets into a data frame:
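The twitteR package provides twListToDF() for exactly this conversion (the data frame name Tun_df is an assumption of this sketch):

```r
# Convert the list of status objects into a data frame
Tun_df <- twListToDF(Tun_Tweets)

# Check that the content of the data is unchanged
str(Tun_df)
```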


We made sure the data kept the same content with the str() function. We now have a data frame with 1000 observations and 16 columns, ready to be transformed into a corpus with the tm package:
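A sketch of the corpus construction (Corpus() and VectorSource() come from the tm package; Tun_corpus is an assumed name, carried over into the sketches below):

```r
library(tm)

# Build a corpus from the text column, the only data we need
Tun_corpus <- Corpus(VectorSource(Tun_df$text))
```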

Now the transformations!
For our project we will perform 5 basic transformations, or data-cleaning tasks:

  1. Removing extra white spaces
  2. Converting all the text to lowercase
  3. Removing stop-words
  4. Removing punctuation symbols
  5. Removing numbers

Those are all the transformations we will need before moving on to our next step: making a nice wordcloud!

As usual, here is the code for these five transformations.
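A sketch of those five steps with tm's tm_map() and the package's standard transformers, applied to the assumed Tun_corpus object from above:

```r
# 1. Remove extra white space
Tun_corpus <- tm_map(Tun_corpus, stripWhitespace)

# 2. Convert all the text to lowercase
Tun_corpus <- tm_map(Tun_corpus, content_transformer(tolower))

# 3. Remove English stop-words
Tun_corpus <- tm_map(Tun_corpus, removeWords, stopwords("english"))

# 4. Remove punctuation symbols
Tun_corpus <- tm_map(Tun_corpus, removePunctuation)

# 5. Remove numbers
Tun_corpus <- tm_map(Tun_corpus, removeNumbers)
```

Note that lowercasing before removing stop-words matters here, since stopwords("english") is a lowercase word list.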