Using twitteR to see, what german press secretary tweets about

Find the HTML-slides here, and the .Rmd-file that was used to generate here.

How to deal with .Rmd-files, see here

What this is about

These are my first steps to play around with the interface from R to twitter, using the twitteR-package.

We will load the latest 1500 (maximum the API allows) tweets from the user @RegSprecher, who is the spokesman of the German government and run some analysis, like:

  • What devices does he use to tweet?
  • What is the frequency of his tweets, and on which days was he most active?
  • Is there any particular pattern about when he tweets?
  • What is he tweeting about?

Load data

## text ## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher

Which device does he use?

  • Mr. Seibert seems to love his iPad: The vast majority of tweets are issued by the iPad-App of twitter….

Analysis of frequency

  • The tweets are summarized by day, and the bars show the amount of tweets per day
  • There were some heavy days in mid-March and beginning of May, with over 40 tweets per day
  • Normal “twee-rate” seems to be between 3 and 10 per day
  • There are even some “idle” days, where he did not tweet at all – holidays?

At what time of day is he tweeting?

  • The left schart shows on which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
  • The righ chart shows this more clearly: Most tweets happen between 10:00 and 14:00. Then it is lunchbreak, and around 16:00 Mr Seibert get’s up to speed again, before his tweeting activity decays. Only at 20:00, there is another small spike. But the afternoon activity is not as pronounced as the morning activity.

What is he tweeting about?

## [1] "bpa" "bundesregierung" "deu" ## [4] "fragreg" "für" "kanzlerin" ## [7] "mehr" "merkel" "neue" ## [10] "über" "uhr" 
## merkel kanzlerin ## 1.00 0.73 
  • The most frequent terms are shown above ; they occur each more than 30 times.
  • Interestingly, there is no mention of “financial crisis” or other political terms – Mr Seibert seems to focus on announcing “presentations” (präs) or press conferences (bpa) or Q&A sessions (fragreg).
  • We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…
  • The dendrogram shows other associations: Terms relating to press conferences are close (“fragreg”, “uhr” etc).
  • Interestingly, the term “DEU” (=Germany) is associated with “mehr” (=more). So, whatever is said, it is also more with Germany 😉

The Code below:

## @knitr setup
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set( = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004...

## @knitr device
# Code from vignette of twitteR-package
  sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
  sources <- gsub("", "", sources)
  sources <- strsplit(sources, ">")
  sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))

## @knitr freq
  ggplot() +
 geom_bar(aes(x = created),data=regsprecher.tweets.df,binwidth = 86400.0) +
 scale_y_continuous(name = 'Frequency, # tweets/day') +
 scale_x_datetime(name = 'Date',breaks = date_breaks(),labels = date_format(format = '%Y-%b'))

## @knitr time
plot1 <- ggplot() + geom_point(aes(x=created, y=hour(created)), data=regsprecher.tweets.df, alpha=0.5) +scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
 geom_bar(aes(x = hour(created)),data=regsprecher.tweets.df,binwidth = 1.0) +
 scale_x_continuous(name = 'Hour of day',breaks = c(c(0,6,10,12,8,14,16,18,20,
 22,2,4,24)),limits = c(0,24)) +
 scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower) 
# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the document-term matrix
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
mydata.df <-
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
fit <- hclust(d, method="ward")
plot(fit) # display dendogram?

Created by Pretty R at