Mining Twitter

Collecting data from Twitter

Collecting data is always an important step for data miners, and when that data comes from Twitter it becomes even more interesting for text miners.
We will introduce what the R package twitteR offers for collecting data from Twitter.

Establishing a connection with Twitter

First of all, we need to open a secure connection with Twitter. To do so, please refer to the page R OAuth for TwitteR.

Searching Twitter

# Let's start by loading the library and our saved credentials
> library(twitteR)
> load("twitteR_credentials")
> registerTwitterOAuth(twitCred)

# Say we are interested in tweets about Continental Airlines, which tweets under the handle @United. We will collect 1,000 tweets.
> un.tweets = searchTwitter('@United',n=1000, cainfo="cacert.pem")
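searchTwitter() also accepts optional filters; a sketch below, assuming your version of twitteR supports the lang, since and until arguments (the dates are just placeholders):

```r
library(twitteR)

# Restrict the search to English tweets within a date range
# (adjust the example dates to your needs)
un.recent <- searchTwitter('@United', n = 1000, lang = 'en',
                           since = '2012-01-01', until = '2012-01-31',
                           cainfo = "cacert.pem")
```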

# Let's answer a few basic questions
# How many tweets have we collected?
> length(un.tweets)

# Pick out tweet number 500
> tweet500 = un.tweets[[500]]

# Show only the text from tweet number 500
> tweet500$getText()

# Show only the user from tweet number 500
> tweet500$getScreenName()
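Instead of pulling fields out one by one, the whole list of status objects can be flattened into a data frame with twitteR's twListToDF() helper:

```r
library(twitteR)

# Convert the list of status objects into a data frame; columns such as
# text, screenName and created are included (exact columns depend on
# the twitteR version)
un.df <- twListToDF(un.tweets)
str(un.df)   # overview of the resulting columns
```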

# To go further in our example we will load a new package (plyr) which contains tools for splitting, applying and combining data
> library(plyr)

# Get the text from every tweet in our data set as a character vector
> un.text = laply(un.tweets, function(t)t$getText())

# Now we only have text entries in our data set un.text.
# Show the first 5 entries
> head(un.text, 5)

# Another interesting way to work with Twitter data is to start from the trending topics.
# This function returns the top 30 trending topics for each day of the week, starting with yesterday.
> tr <- getTrends('weekly', as.character(Sys.Date()-1))

# After that, extract data about one specific trend via the searchTwitter function
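A minimal sketch of that last step, assuming getTrends() returns a data frame with a name column in this version of twitteR (check str(tr) for yours):

```r
library(twitteR)

# Take the first trending topic and collect tweets about it
# (assumes tr has a 'name' column; this may differ between versions)
first.trend  <- tr$name[1]
trend.tweets <- searchTwitter(first.trend, n = 100, cainfo = "cacert.pem")
length(trend.tweets)
```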

Mining the text

Once we have our data set in R, it is time to mine the text, for instance with the functions included in the R package tm. Let's get started.
> library(tm)

# Let's apply some transformations to our Twitter data set
# Build a Corpus
> data.corpus <- Corpus(VectorSource(un.text))

# Convert to lowercase
> data.corpus <- tm_map(data.corpus, tolower)

# remove punctuation
> data.corpus <- tm_map(data.corpus, removePunctuation)
# remove stop words; otherwise very common words would dominate the frequency counts
> some_stopwords <- c(stopwords('english'))
> data.corpus <- tm_map(data.corpus, removeWords, some_stopwords)
# build a term-document matrix from a corpus
> data.dtm <- TermDocumentMatrix(data.corpus)

# Some commands to view corpora data
# inspect the term-document matrix
> data.dtm

A term-document matrix (3053 terms, 1000 documents)

Non-/sparse entries: 9931/3043069
Sparsity : 100%
Maximal term length: 152
Weighting : term frequency (tf)

# View data after transformations
> inspect(data.corpus[1:1000])

# View one single entry
> data.corpus[[116]]

# In the term-document matrix, inspect the first 10 terms and the first 10 documents
> inspect(data.dtm[1:10,1:10])

# Functions with basic statistics
# inspect the most popular words (terms appearing at least 30 times)
> findFreqTerms(data.dtm, lowfreq=30)

[1] "agent"  "airline"  "airlines"  "call"  "cant"       
[6] "change"  "continental"  "customer"  "day"  "flight"     
[11] "flights"  "flying"  "help"  "hharteveldt"  "hold"       
[16] "lost"  "miles"  "minutes"  "phone"  "rwang0"     
[21] "seat"  "service"  "site"  "thanks"  "the"        
[26] "time"  "travel"  "united"  "website"
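Beyond a frequency threshold, we can also rank every term by its total frequency; a minimal sketch using base R on the term-document matrix built above:

```r
# Sum each term's counts across all documents and sort decreasingly
term.freq <- sort(rowSums(as.matrix(data.dtm)), decreasing = TRUE)
head(term.freq, 10)   # the 10 most frequent terms
```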

# Find terms associated with "continental" with a correlation of at least 0.2
> findAssocs(data.dtm, "continental", 0.2) 

At this point we have extracted the most frequently used words in tweets about Continental Airlines... cool, isn't it? (~_~;)
