R talks to Weka about Data Mining

R provides us with excellent resources to mine data, and there are some good overviews of them out there.

And there are other tools out there for data mining, like Weka.

Weka has a GUI, can also be driven from the command line via Java, and ships with a large variety of algorithms. If, for whatever reason, you do not find the algorithm you need implemented in R, Weka might be the place to go. And the RWeka package marries R and Weka.

I am an expert neither in R, nor in Weka, nor in data mining. But I happen to play around with them, and I’d like to share a starter on how to work with them. There is good documentation out there (e.g. Open-Source Machine Learning: R Meets Weka or RWeka Odds and Ends), but sometimes you want to document your own steps and ways of working, and that is what I do here.

So, I want to build a classification model for the iris dataset, based on a tree classifier. The choice is the C4.5 algorithm, which I did not find implemented in any standard R package.

UPDATE: “Z” commented below that the C5.0 algorithm is implemented in the C50 package. And indeed, Max Kuhn gave a nice presentation about it at UseR! 2013, which you can find here.

We want to predict the species of a flower based on its attributes, namely sepal and petal width and length. The three species we have are “setosa”, “versicolor” and “virginica”. A short summary is given below.

Prediction with J48 (aka C4.5)

We first take a look at the data and then load the RWeka package.

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

library(RWeka)

We now build the classifier, and this works with the J48(.)-function:

iris_j48 <- J48(Species ~ ., data = iris)
iris_j48

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  5
## 
## Size of the tree :   9

summary(iris_j48)

## 
## === Summary ===
## 
## Correctly Classified Instances         147               98      %
## Incorrectly Classified Instances         3                2      %
## Kappa statistic                          0.97  
## Mean absolute error                      0.0233
## Root mean squared error                  0.108 
## Relative absolute error                  5.2482 %
## Root relative squared error             22.9089 %
## Coverage of cases (0.95 level)          98.6667 %
## Mean rel. region size (0.95 level)      34      %
## Total Number of Instances              150     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  50  0  0 |  a = setosa
##   0 49  1 |  b = versicolor
##   0  2 48 |  c = virginica

# plot(iris_j48) 
library(partykit)
plot(as.party(iris_j48)) # we use the partykit-package for nice plotting.

[plot of chunk build_J48: the J48 tree plotted via partykit]

We can assign the model to an object; printing that object gives us the tree in Weka output style, summary(.) gives us the summary of the classification on the training set (again, in Weka style), and plot(.) plots it, while plotting via as.party(.) from the partykit package gives us an even nicer plot (thanks to Z for his comment below).
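Since the stated goal was to predict the species of a flower, here is a small, hedged sketch of how the fitted model can be applied to new observations: RWeka classifiers come with a predict() method, and type = "probability" returns class probabilities instead of labels. The three rows used here are simply taken from the training data for illustration.

# A small sketch: predict the species for a few observations with the fitted tree
new_flowers <- iris[c(1, 51, 101), ]                            # rows reused from the training data, for illustration only
predict(iris_j48, newdata = new_flowers)                        # predicted class labels
predict(iris_j48, newdata = new_flowers, type = "probability")  # class probabilities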

Evaluation in Weka

Well, we used the whole dataset now for training, but we actually might want to perform cross-validation. This can be done like this:

eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE, 
    seed = 1, class = TRUE)
eval_j48

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances         144               96      %
## Incorrectly Classified Instances         6                4      %
## Kappa statistic                          0.94  
## Mean absolute error                      0.035 
## Root mean squared error                  0.1586
## Relative absolute error                  7.8705 %
## Root relative squared error             33.6353 %
## Coverage of cases (0.95 level)          96.6667 %
## Mean rel. region size (0.95 level)      33.7778 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
##                  0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
##                  0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
## Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 47  3 |  b = versicolor
##   0  2 48 |  c = virginica

We see slightly worse results now, as one would expect.
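If you prefer to compare the training and cross-validated accuracy programmatically rather than by reading the Weka output, the evaluation object should expose a details vector (this is an assumption based on my reading of the RWeka documentation, so treat the sketch accordingly):

# Hedged sketch: extract the percentage of correctly classified instances
# (assumes the evaluation object has a 'details' component with a 'pctCorrect' entry)
eval_j48$details["pctCorrect"]   # about 96 % under 10-fold CV, vs. 98 % on the training data above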

Using Weka-controls

We used the standard options for the J48 classifier, but Weka allows more. You can access these with the WOW function:

WOW("J48")

## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##  Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##  Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##  Number of arguments: 1.
## -B      Use binary splits only.
## -S      Don't perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##  Number of arguments: 1.

If, for example, we want to use a tree with minimum 10 instances in each leaf, we change the command as follows:

j48_control <- J48(Species ~ ., data = iris, control = Weka_control(M = 10))
j48_control

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7: versicolor (54.0/5.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  3
## 
## Size of the tree :   5

And you see the tree is different (it simply does not grow as deep as the previous one).

Building cost-sensitive classifiers

You might want to include a cost matrix, i.e. you want to penalize certain wrong classifications (see here). If you think that, for example, misclassifying a versicolor is particularly harmful, you can penalize such a classification easily: you just have to choose a different classifier, namely Weka’s cost-sensitive classifier:

csc <- CostSensitiveClassifier(Species ~ ., data = iris, control = Weka_control(`cost-matrix` = matrix(c(0, 
    10, 0, 0, 0, 0, 0, 10, 0), ncol = 3), W = "weka.classifiers.trees.J48", 
    M = TRUE))

But you have to tell the cost-sensitive classifier that you want to use J48 as the base algorithm, and you have to give it the cost matrix you want to apply, namely a matrix of the form

matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    0    1
## [3,]    0    0    0

where (with the factor 10 used in the call above) you penalize “versicolor” being falsely classified as one of the others.

And again we evaluate on 10-fold CV:

eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE, 
    seed = 1, class = TRUE)
eval_csc

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances          98               65.3333 %
## Incorrectly Classified Instances        52               34.6667 %
## Kappa statistic                          0.48  
## Mean absolute error                      0.2311
## Root mean squared error                  0.4807
## Relative absolute error                 52      %
## Root relative squared error            101.9804 %
## Coverage of cases (0.95 level)          65.3333 %
## Mean rel. region size (0.95 level)      33.3333 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.070    0.875      0.980    0.925      0.887    0.955     0.864     setosa
##                  0.980    0.450    0.521      0.980    0.681      0.517    0.765     0.518     versicolor
##                  0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     virginica
## Weighted Avg.    0.653    0.173    0.465      0.653    0.535      0.468    0.740     0.572     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   1 49  0 |  b = versicolor
##   6 44  0 |  c = virginica

and we see that the “versicolors” are now better predicted (only one wrong, compared to 3 with the plain J48 earlier). But this comes at the expense of “virginica”, which is now never predicted at all: all 50 virginica instances are misclassified (6 as setosa, 44 as versicolor), compared to only 2 errors before.

Alright, this is just a short starter. I suggest you check out the very good introductions I referred to earlier to explore the full wealth of RWeka… Have fun!

Change fonts in ggplot2, and create xkcd style graphs

Installing and changing fonts in your plots is now easy with the extrafont package.

There is an excellent tutorial on the extrafont GitHub site; still, I will briefly demonstrate how it worked for me.

First, install the package and load it.

install.packages("extrafont");
library(extrafont)

You can now import the desired system fonts (at the moment only TrueType fonts):

font_import(pattern="[C/c]omic")
font_import(pattern="[A/a]rial")

The pattern argument just specifies the fonts to be imported; if you leave it out, the function will search automatically and import all fonts (see the help for font_import in R).

You can now look at the fonts loaded to be used with R:

fonts()
fonttable()

Alright, if you want to use the fonts to display graphs on screen (which you normally want to do, e.g. in the plots window in RStudio), you have to load the fonts into the Windows device:

loadfonts(device="win")

Now we can just plot a graph. I will use the ggplot2 package to do so (this is taken nearly verbatim from the GitHub site mentioned earlier):

library(ggplot2)
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() +
 ggtitle("Fuel Efficiency of 32 Cars") +
  xlab("Weight (x1000 lb)") + ylab("Miles per Gallon") +
  theme(text=element_text(size=16, family="Comic Sans MS"))

Let’s now draw a chart using the xkcd-style. This has been debated on stackoverflow, and the guys there have come up with great examples! I basically just take their approach and modify it slightly to draw a graph on US GDP (gross domestic product) development. I download the data from the St. Louis Fed website, using the excellent quantmod package.
But first I need to get the font in xkcd style, and I’ve chosen to download it from here.
You now just have to install it by right-clicking on the .ttf file and choosing Install.

font_import(pattern="[H/h]umor")
library(quantmod)
getSymbols("GDPC1", src="FRED")data <- data.frame(time=index(GDPC1), GDP=GDPC1)

Now the data is downloaded and the font is installed.
Next step: Create a ggplot2-theme that has xkcd “properties” – and this is what I’ve basically copied from stackoverflow. Note: Unlike on stackoverflow, I basically use only the font. I do not jitter the line, since it is real data that is “jittered” anyway.
The rest is just ggplot2 plotting:

### XKCD theme
theme_xkcd <- theme(panel.background = element_rect(fill="white"),
  #axis.ticks = element_line(colour=NA),
  panel.grid = element_line(colour="white"),
  #axis.text.y = element_text(colour=NA),
  axis.text.x = element_text(colour="black"),
  text = element_text(size=16, family="Humor Sans"))
### Plot the chart
p <- ggplot(data=data, aes(x=time, y=GDPC1)) +
  geom_line(colour="red", size=1) +
  ggtitle("Development of US gross domestic product") +
  theme_xkcd   # theme_xkcd is a theme object, so it is added without parentheses
p

If you now want to save this chart as a PDF, you need another step: use ggsave. But you also need to embed the fonts in the PDF, and for that you have to install Ghostscript. Tell R where Ghostscript sits on your machine, and then just embed the fonts:

ggsave("font_ggplot.pdf", plot=p,  width=12, height=4)
## needed for Windows - make sure YOU have the correct path for your machine:
Sys.setenv(R_GSCMD = "C:\\Program Files (x86)\\gs\\gs9.06\\bin\\gswin32c.exe")
embed_fonts("font_ggplot.pdf")

There might be some warnings issued, but usually it works.

And finally: I do not take any credit for this work; it’s basically the developers of the extrafont package and Mark Bulling on stackoverflow who made all this possible. Any remaining errors are mine.

UPDATE (2013-07-14): I just realized there is the xkcd package available on CRAN, which makes all of this a lot easier!
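For completeness, here is a rough, untested sketch of what the same kind of chart could look like with that package; it assumes the package’s theme_xkcd() and xkcdaxis() helpers and an installed Humor Sans font (registered via the extrafont steps above):

# Untested sketch using the CRAN xkcd package (assumptions noted above)
library(xkcd)
library(ggplot2)
xrange <- range(mtcars$wt)
yrange <- range(mtcars$mpg)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Fuel Efficiency of 32 Cars") +
  xkcdaxis(xrange, yrange) +   # hand-drawn-looking axes
  theme_xkcd()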

LOESS regression with R

The other day, I came across a small problem: I was investigating a dataset, and the variables clearly showed non-linear behaviour.
Consequently, I could not apply the classical linear regression.
An approach to solve this kind of problem is LOESS regression, which stands for locally weighted scatterplot smoothing. More information can be found here.

In the following, we will work through an artificial example to make the concept clear and show how to apply it in R.

We first create a data frame with two variables, a and y, which are non-linearly related. The plot shows this relationship.

We perform two regressions, one linear and one loess.
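For reference, these are the calls that produce the two summaries below; they are repeated from the full code listing at the end of this post.

# Repeated from the full listing at the end of the post
y <- seq(from = 1, to = 10, length.out = 100)
a <- y^3 + y^2 + rnorm(100, mean = 0, sd = 30)

linreg <- lm(y ~ a)     # classical linear regression
summary(linreg)

loess <- loess(y ~ a)   # LOESS fit with the default span of 0.75
summary(loess)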

## 
## Call:
## lm(formula = y ~ a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.303 -0.635  0.294  0.792  1.635 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.100335   0.136374    22.7   <2e-16 ***
## a           0.007645   0.000305    25.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.973 on 98 degrees of freedom
## Multiple R-squared: 0.865,   Adjusted R-squared: 0.864 
## F-statistic:  630 on 1 and 98 DF,  p-value: <2e-16

## Call:
## loess(formula = y ~ a)
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 4.56 
## Residual Standard Error: 0.587 
## Trace of smoother matrix: 4.98 
## 
## Control settings:
##   normalize:  TRUE 
##   span       :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2

The scatterplot clearly shows the better fit from the loess regression.
The summaries show one difference between linear and LOESS regression: the LOESS regression does not report coefficients, since it is a local regression. The classical measure of goodness of fit is R².
The linear model has an R² of 0.865. For the LOESS fit we have to calculate R² ourselves, following this proposal.
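That calculation is just the squared correlation between the observed values and the LOESS fit (a pseudo-R²); the same lines appear in the full code listing below.

# Pseudo-R² for the LOESS fit: squared correlation of observed vs. fitted values
hat <- predict(loess)
(r_sq_loess <- cor(y, hat)^2)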

## [1] 0.9528

The R² from the LOESS fit is 0.953, which is very good and better than the R² from the linear regression.
Also, the plot of residuals vs. fitted/predicted values indicates a much better fit of the LOESS regression compared to the linear regression (the residual plot of the linear regression shows clear structure, which we do not want); the QQ plot shows some issues in the tails.

But what do we do with it now? Ultimately, we probably want to predict some values from this exercise.
The predict function helps us with this.

##     a linreg loess
## 1  10  3.177 2.265
## 2 400  6.158 7.133
## 3 900  9.981 9.365

Alright, this was a quick introduction on how to apply the LOESS regression.

And here is the full code:

## @knitr create_data
y <- seq(from=1, to=10, length.out=100)
a <- y^3 +y^2  + rnorm(100,mean=0, sd=30)
data <- data.frame(a=a, y=y)
plot(y=y, x=a)

## @knitr linreg
linreg <- lm(y~a)
summary(linreg)

loess <- loess(y~a)
summary(loess)

scatter.smooth(data)
abline(linreg, col="blue")

## @knitr loess_fit
hat <- predict(loess)
plot(y~a)
lines(a[order(a)], hat[order(a)], col="red") # order the fitted values by a so the pairs stay aligned
(r_sq_loess <- cor(y, hat)^2)

## @knitr fit_loess
par(mfrow=c(2,2))
plot(linreg$fitted, linreg$residuals, main="classical linear regression")
plot(loess$fitted, loess$residuals, main="LOESS")
# normal probability plot
qqnorm(linreg$residuals, ylim=c(-2,2)) 
qqline(linreg$residuals)
qqnorm(loess$residuals, ylim=c(-2,2)) 
qqline(loess$residuals)

## @knitr predict
newdata <- data.frame(a=c(10,400,900))  # named 'newdata' so it does not mask the predict() function
scatter.smooth(data)
abline(linreg, col="blue")

newdata$linreg <- predict(linreg, newdata)
newdata$loess <- predict(loess, newdata)

newdata

points(x=newdata$a, y=newdata$linreg, col="blue", pch=18, cex=2)
points(x=newdata$a, y=newdata$loess, col="red", pch=18, cex=2)


Basic ideas on aggregate, plyr and crosstables!

A common task in R is the investigation of one particular dataset. Usually we have a mixture of numerical and categorical data and are interested in some statistics (e.g. means and so on).
There are a lot of threads, blog posts etc. around on that topic. Sorry for adding another one, but this way I’ll remember it myself.

Let’s take as example the dataset diamonds that is included in the ggplot2-package.

library(ggplot2)
data(diamonds)
str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We can see that the dataset consists of a mixture of variables and types.
Let’s say we are interested in the average price of diamonds with regard to the clarity category.

Do it with aggregate
aggregate comes with the stats package. The relevant call to answer our question could look like this:

aggregate(price ~ clarity, diamonds, mean)

##   clarity price
## 1      I1  3924
## 2     SI2  5063
## 3     SI1  3996
## 4     VS2  3925
## 5     VS1  3839
## 6    VVS2  3284
## 7    VVS1  2523
## 8      IF  2865

Et voilà. Another way of writing this is to use the “list” argument, as below.
We can also extend the list argument with further categorical variables, as shown in the second call, which uses clarity and color.

aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity), 
    mean)

##   CLARITY    x
## 1      I1 3924
## 2     SI2 5063
## 3     SI1 3996
## 4     VS2 3925
## 5     VS1 3839
## 6    VVS2 3284
## 7    VVS1 2523
## 8      IF 2865

aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity, 
    COLOR = diamonds$color), mean)

##    CLARITY COLOR    x
## 1       I1     D 3863
## 2      SI2     D 3931
## 3      SI1     D 2976
## 4      VS2     D 2587
## 5      VS1     D 3030
## 6     VVS2     D 3351
## 7     VVS1     D 2948
## 8       IF     D 8307
## 9       I1     E 3488
## 10     SI2     E 4174
## 11     SI1     E 3162
## 12     VS2     E 2751
## 13     VS1     E 2856
## 14    VVS2     E 2500
## 15    VVS1     E 2220
## 16      IF     E 3669
## 17      I1     F 3342
## 18     SI2     F 4473
## 19     SI1     F 3714
## 20     VS2     F 3757
## 21     VS1     F 3797
## 22    VVS2     F 3476
## 23    VVS1     F 2804
## 24      IF     F 2751
## 25      I1     G 3546
## 26     SI2     G 5022
## 27     SI1     G 3775
## 28     VS2     G 4416
## 29     VS1     G 4131
## 30    VVS2     G 3845
## 31    VVS1     G 2867
## 32      IF     G 2558
## 33      I1     H 4453
## 34     SI2     H 6100
## 35     SI1     H 5032
## 36     VS2     H 4722
## 37     VS1     H 3781
## 38    VVS2     H 2649
## 39    VVS1     H 1846
## 40      IF     H 2288
## 41      I1     I 4302
## 42     SI2     I 7003
## 43     SI1     I 5355
## 44     VS2     I 5691
## 45     VS1     I 4633
## 46    VVS2     I 2968
## 47    VVS1     I 2035
## 48      IF     I 1995
## 49      I1     J 5254
## 50     SI2     J 6521
## 51     SI1     J 5186
## 52     VS2     J 5311
## 53     VS1     J 4884
## 54    VVS2     J 5142
## 55    VVS1     J 4034
## 56      IF     J 3364

We achieve exactly the same result using the “formula notation” from before:

aggregate(price ~ clarity + color, diamonds, mean)

##    clarity color price
## 1       I1     D  3863
## 2      SI2     D  3931
## 3      SI1     D  2976
## 4      VS2     D  2587
## 5      VS1     D  3030
## 6     VVS2     D  3351
## 7     VVS1     D  2948
## 8       IF     D  8307
## 9       I1     E  3488
## 10     SI2     E  4174
## 11     SI1     E  3162
## 12     VS2     E  2751
## 13     VS1     E  2856
## 14    VVS2     E  2500
## 15    VVS1     E  2220
## 16      IF     E  3669
## 17      I1     F  3342
## 18     SI2     F  4473
## 19     SI1     F  3714
## 20     VS2     F  3757
## 21     VS1     F  3797
## 22    VVS2     F  3476
## 23    VVS1     F  2804
## 24      IF     F  2751
## 25      I1     G  3546
## 26     SI2     G  5022
## 27     SI1     G  3775
## 28     VS2     G  4416
## 29     VS1     G  4131
## 30    VVS2     G  3845
## 31    VVS1     G  2867
## 32      IF     G  2558
## 33      I1     H  4453
## 34     SI2     H  6100
## 35     SI1     H  5032
## 36     VS2     H  4722
## 37     VS1     H  3781
## 38    VVS2     H  2649
## 39    VVS1     H  1846
## 40      IF     H  2288
## 41      I1     I  4302
## 42     SI2     I  7003
## 43     SI1     I  5355
## 44     VS2     I  5691
## 45     VS1     I  4633
## 46    VVS2     I  2968
## 47    VVS1     I  2035
## 48      IF     I  1995
## 49      I1     J  5254
## 50     SI2     J  6521
## 51     SI1     J  5186
## 52     VS2     J  5311
## 53     VS1     J  4884
## 54    VVS2     J  5142
## 55    VVS1     J  4034
## 56      IF     J  3364

Since I find the formula notation much more intuitive, I’ll continue using that one.

It also allows passing further arguments, like the probabilities for the quantile function:

aggregate(price ~ clarity + color, diamonds, FUN = quantile, probs = 0.8)

##    clarity color price
## 1       I1     D  5139
## 2      SI2     D  5257
## 3      SI1     D  5096
## 4      VS2     D  3305
## 5      VS1     D  4057
## 6     VVS2     D  4477
## 7     VVS1     D  3781
## 8       IF     D 16015
## 9       I1     E  4191
## 10     SI2     E  5150
## 11     SI1     E  5068
## 12     VS2     E  3585
## 13     VS1     E  3466
## 14    VVS2     E  2857
## 15    VVS1     E  2514
## 16      IF     E  6002
## 17      I1     F  4570
## 18     SI2     F  5163
## 19     SI1     F  5292
## 20     VS2     F  6598
## 21     VS1     F  7265
## 22    VVS2     F  8218
## 23    VVS1     F  3501
## 24      IF     F  2656
## 25      I1     G  6209
## 26     SI2     G  6785
## 27     SI1     G  5342
## 28     VS2     G  7031
## 29     VS1     G  7508
## 30    VVS2     G  7952
## 31    VVS1     G  3327
## 32      IF     G  2575
## 33      I1     H  6530
## 34     SI2     H  9257
## 35     SI1     H  7173
## 36     VS2     H  7510
## 37     VS1     H  6796
## 38    VVS2     H  4269
## 39    VVS1     H  2304
## 40      IF     H  2449
## 41      I1     I  5696
## 42     SI2     I 12918
## 43     SI1     I  8736
## 44     VS2     I  9791
## 45     VS1     I  8383
## 46    VVS2     I  3934
## 47    VVS1     I  2549
## 48      IF     I  2062
## 49      I1     J  6576
## 50     SI2     J 11865
## 51     SI1     J  7952
## 52     VS2     J  8055
## 53     VS1     J  8507
## 54    VVS2     J  8770
## 55    VVS1     J  8803
## 56      IF     J  5361

However, we face one issue here: for an “at a glance” investigation, the output is not easily readable because it is in long format. What I’d like to have is a kind of cross-table; MS Excel users might recognize this as a “pivot table”.
For this purpose we can make use of the xtabs function, which creates a contingency table.
First, we create a data frame using our aggregate call, and then we pass this data frame to xtabs, specifying that we want to see price against the other variables:

dmnds <- aggregate(price ~ clarity + color, diamonds, mean)
xtabs(price ~ ., data = dmnds)

##        color
## clarity    D    E    F    G    H    I    J
##    I1   3863 3488 3342 3546 4453 4302 5254
##    SI2  3931 4174 4473 5022 6100 7003 6521
##    SI1  2976 3162 3714 3775 5032 5355 5186
##    VS2  2587 2751 3757 4416 4722 5691 5311
##    VS1  3030 2856 3797 4131 3781 4633 4884
##    VVS2 3351 2500 3476 3845 2649 2968 5142
##    VVS1 2948 2220 2804 2867 1846 2035 4034
##    IF   8307 3669 2751 2558 2288 1995 3364

That’s nice, isn’t it?

But as announced in the title, there is another way of achieving the same result: the ddply function from the plyr package.

library(plyr)
dmnds.ddply <- ddply(.data = diamonds, .(clarity, color), summarise, 
    PRICEMEAN = mean(price))
dmnds.ddply

##    clarity color PRICEMEAN
## 1       I1     D      3863
## 2       I1     E      3488
## 3       I1     F      3342
## 4       I1     G      3546
## 5       I1     H      4453
## 6       I1     I      4302
## 7       I1     J      5254
## 8      SI2     D      3931
## 9      SI2     E      4174
## 10     SI2     F      4473
## 11     SI2     G      5022
## 12     SI2     H      6100
## 13     SI2     I      7003
## 14     SI2     J      6521
## 15     SI1     D      2976
## 16     SI1     E      3162
## 17     SI1     F      3714
## 18     SI1     G      3775
## 19     SI1     H      5032
## 20     SI1     I      5355
## 21     SI1     J      5186
## 22     VS2     D      2587
## 23     VS2     E      2751
## 24     VS2     F      3757
## 25     VS2     G      4416
## 26     VS2     H      4722
## 27     VS2     I      5691
## 28     VS2     J      5311
## 29     VS1     D      3030
## 30     VS1     E      2856
## 31     VS1     F      3797
## 32     VS1     G      4131
## 33     VS1     H      3781
## 34     VS1     I      4633
## 35     VS1     J      4884
## 36    VVS2     D      3351
## 37    VVS2     E      2500
## 38    VVS2     F      3476
## 39    VVS2     G      3845
## 40    VVS2     H      2649
## 41    VVS2     I      2968
## 42    VVS2     J      5142
## 43    VVS1     D      2948
## 44    VVS1     E      2220
## 45    VVS1     F      2804
## 46    VVS1     G      2867
## 47    VVS1     H      1846
## 48    VVS1     I      2035
## 49    VVS1     J      4034
## 50      IF     D      8307
## 51      IF     E      3669
## 52      IF     F      2751
## 53      IF     G      2558
## 54      IF     H      2288
## 55      IF     I      1995
## 56      IF     J      3364

Well, perhaps a bit verbose compared to the use of aggregate, but ddply allows great flexibility!
In order to bring this into a “pivot table”, we can either (a) use the xtabs function again, or (b) use the cast function from the reshape2 package (NB: use dcast for data frames):

library(reshape2)
dcast(dmnds.ddply, clarity ~ color, value.var = "PRICEMEAN")

##   clarity    D    E    F    G    H    I    J
## 1      I1 3863 3488 3342 3546 4453 4302 5254
## 2     SI2 3931 4174 4473 5022 6100 7003 6521
## 3     SI1 2976 3162 3714 3775 5032 5355 5186
## 4     VS2 2587 2751 3757 4416 4722 5691 5311
## 5     VS1 3030 2856 3797 4131 3781 4633 4884
## 6    VVS2 3351 2500 3476 3845 2649 2968 5142
## 7    VVS1 2948 2220 2804 2867 1846 2035 4034
## 8      IF 8307 3669 2751 2558 2288 1995 3364

Again, we have a “formula interface”, which makes the application straightforward. The parameter “value.var” specifies which column holds the values to spread into the table, which is very helpful with wide datasets.

Alright, and last but not least a one-liner that achieves the same result, combining the melt function and the cast function (the melt function is also part of the reshape2 package and converts data frames into long format):

dcast(melt(diamonds, id.vars = c("cut", "color", "clarity"), measure.vars = "price"), 
    formula = clarity ~ color, mean)

##   clarity    D    E    F    G    H    I    J
## 1      I1 3863 3488 3342 3546 4453 4302 5254
## 2     SI2 3931 4174 4473 5022 6100 7003 6521
## 3     SI1 2976 3162 3714 3775 5032 5355 5186
## 4     VS2 2587 2751 3757 4416 4722 5691 5311
## 5     VS1 3030 2856 3797 4131 3781 4633 4884
## 6    VVS2 3351 2500 3476 3845 2649 2968 5142
## 7    VVS1 2948 2220 2804 2867 1846 2035 4034
## 8      IF 8307 3669 2751 2558 2288 1995 3364

So first, the dataset is brought entirely into long format, and then cast is applied. But this is clearly the most time-consuming way of doing it (in terms of computation time); a rough check is sketched below.
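If you want to verify that claim on your own machine, a rough and unscientific timing sketch with system.time() could look like this (single runs, so take the numbers with a grain of salt):

# Rough timing comparison of the three approaches shown above
system.time(aggregate(price ~ clarity + color, diamonds, mean))
system.time(ddply(diamonds, .(clarity, color), summarise, PRICEMEAN = mean(price)))
system.time(dcast(melt(diamonds, id.vars = c("cut", "color", "clarity"),
                       measure.vars = "price"),
                  clarity ~ color, mean))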

Bottom line:

My favorite for not-too-big datasets is the combination of ddply and dcast, since I feel it allows the most flexibility. That flexibility comes at the price of syntax you’ll have to learn. aggregate and xtabs are quickly applied for small problems (maybe also for big ones, but I do not see how to apply them to the more complex tasks that ddply handles).
Additionally, I’ve heard talk of the data.table package, which seems to be the fastest; go and check! A small sketch follows below.
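I have not benchmarked it here, but a minimal sketch of the same aggregation with data.table might look like this (assuming data.table’s .(...) syntax for j and by):

# Hedged sketch with data.table (not benchmarked in this post)
library(data.table)
DT <- as.data.table(diamonds)
dt.means <- DT[, .(PRICEMEAN = mean(price)), by = .(clarity, color)]
dcast(dt.means, clarity ~ color, value.var = "PRICEMEAN")   # same "pivot table" as before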

References:

mages’ blog on by, apply and friends

ggplot2 – much easier with JGR and Deducer

At the last R user meeting in Cologne, we had a discussion about using ggplot2, and I gave a short introduction on how to use ggplot2 with JGR and Deducer.

Basically, JGR is a graphical user interface for R, and Deducer is a kind of “data analysis plugin” that also comes with a so-called “plot builder” running ggplot2.

This feature allows users to access the power of ggplot2 through a simple user interface, which makes the start with ggplot2 much easier.
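For those who want to try it themselves, a minimal way to get going might look like the sketch below; it assumes a working Java installation, and JGR() opens the GUI console in which Deducer’s menus (including the plot builder) appear.

# Sketch: install and launch JGR with Deducer (requires Java)
install.packages(c("JGR", "Deducer", "DeducerExtras"))
library(JGR)
JGR()   # opens the JGR console; Deducer adds its menus and the plot builder there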

The attached slide pack gives a short introduction (unfortunately only in German), but in order to see Deducer in action, I recommend having a look at excellent tutorials like

Introductory tour of Deducer

Deducer’s plot builder – Part 1

The slides below

Slides

As well as the source code of the slides, in .Rmd format.

Using twitteR to see what the German press secretary tweets about

Find the HTML slides here, and the .Rmd file that was used to generate them here.

For how to deal with .Rmd files, see here.

What this is about

These are my first steps to play around with the interface from R to twitter, using the twitteR-package.

We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, who is the spokesman of the German government, and run some analyses, such as:

  • What devices does he use to tweet?
  • What is the frequency of his tweets, and on which days was he most active?
  • Is there any particular pattern about when he tweets?
  • What is he tweeting about?

Load data

##   text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher

Which device does he use?

  • Mr. Seibert seems to love his iPad: The vast majority of tweets are issued by the iPad-App of twitter….

Analysis of frequency

  • The tweets are summarized by day, and the bars show the number of tweets per day
  • There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day
  • The normal “tweet rate” seems to be between 3 and 10 per day
  • There are even some “idle” days where he did not tweet at all (holidays?)

At what time of day is he tweeting?

  • The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
  • The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then comes the lunch break, and around 16:00 Mr Seibert gets up to speed again, before his tweeting activity decays. Only at 20:00 is there another small spike. But the afternoon activity is not as pronounced as the morning activity.

What is he tweeting about?

## [1] "bpa" "bundesregierung" "deu" ## [4] "fragreg" "für" "kanzlerin" ## [7] "mehr" "merkel" "neue" ## [10] "über" "uhr" 
## merkel kanzlerin ## 1.00 0.73 
  • The most frequent terms are shown above; they each occur more than 30 times.
  • Interestingly, there is no mention of “financial crisis” or other political terms – Mr Seibert seems to focus on announcing “presentations” (präs) or press conferences (bpa) or Q&A sessions (fragreg).
  • We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…
  • The dendrogram shows other associations: Terms relating to press conferences are close (“fragreg”, “uhr” etc).
  • Interestingly, the term “DEU” (=Germany) is associated with “mehr” (=more). So, whatever is said, it is also more with Germany ;-)
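If you want the term frequencies as numbers rather than eyeballing them, a small sketch based on the term-document matrix built in the code below could look like this:

# Sketch: sorted term frequencies from the term-document matrix (mydata.dtm from the code below)
term_freq <- sort(rowSums(as.matrix(mydata.dtm)), decreasing = TRUE)
head(term_freq, 10)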

——————————————————–
The Code below:

## @knitr setup
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004...
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[,c(1,4,10)]))

## @knitr device
# Code from vignette of twitteR-package
  sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
  sources <- gsub("", "", sources)
  sources <- strsplit(sources, ">")
  sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
  pie(table(sources))

## @knitr freq
  ggplot() +
 geom_bar(aes(x = created),data=regsprecher.tweets.df,binwidth = 86400.0) +
 scale_y_continuous(name = 'Frequency, # tweets/day') +
 scale_x_datetime(name = 'Date',breaks = date_breaks(),labels = date_format(format = '%Y-%b'))

## @knitr time
plot1 <- ggplot() + geom_point(aes(x=created, y=hour(created)), data=regsprecher.tweets.df, alpha=0.5) +scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
 geom_bar(aes(x = hour(created)),data=regsprecher.tweets.df,binwidth = 1.0) +
 scale_x_continuous(name = 'Hour of day', breaks = seq(0, 24, by = 2), limits = c(0,24)) +
 scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower) 
# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the document-term matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))  # as.matrix() is more robust than relying on inspect() returning the matrix
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram


Displaying German stock performance with R using ggplot2

I cannot follow stock market developments daily, so I was looking for a quick overview of what happened in the last week. What interests me is: “How did German stocks perform over the last 5 days, compared to the last 20 trading days and the last 250 trading days?”

R in combination with the right packages delivers a quick answer.

The result is a picture like this:

[Performance plot: German stock performance over the last 5, 20, and 250 trading days]

Values in the upper-right quadrant stand for shares that showed positive performance over both the last 5 and the last 20 trading days. The color displays the performance over the last 250 trading days.

We can see that the last week was rather positive: all of the shares except one show positive performance. The last 20 days were also profitable for most shares.

However, a lot of stocks suffered during the last year (i.e. 250 trading days), which can be seen from the colors: most of them are red. Most notably, CBK.DE (Commerzbank) nearly lost all of its value…
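Once the code below has been run, the worst and best 250-day performers can also be picked out programmatically from the per.df data frame it builds, for example:

# Sketch: extract the worst and best 250-day performers from per.df (built in the code below)
per.df[which.min(per.df$long), ]   # worst long-term performer (CBK.DE in the plot above)
per.df[which.max(per.df$long), ]   # best long-term performer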

You can find the R code here:

library(quantmod)
library(PerformanceAnalytics) # provides Return.calculate()
library(scales)               # provides percent() for axis labels
library(ggplot2)
library(zoo)

#get list of Symbols for DAX-values
l<- c("^GDAXI", "DB1.DE", "ADS.DE", "ALV.DE", "BAS.DE","BAYN.DE","BEI.DE","BMW.DE","CBK.DE","DAI.DE","DBK.DE","DPW.DE","DTE.DE","EOAN.DE","FME.DE","FRE.DE","HEI.DE","HEN3.DE","IFX.DE","LHA.DE","LIN.DE","MAN.DE","MEO.DE","MUV2.DE","RWE.DE","SAP.DE","SDF.DE","SIE.DE","TKA.DE","VOW3.DE")
getSymbols(l, from="2010-09-01")
l[1] <- "GDAXI"

# Function to extract "adjusted prices" and build dataframe: Thanks to Zach Mayer of moderntoolmaking.blogspot.com
symbolFrame <- function(symbolList) {
Data <- data.frame(NULL)
for (S in symbolList) {
Data <- cbind(Data,Ad(get(S)))
}
colnames(Data) <- symbolList
return(Data)

}

Data <- symbolFrame(l[-1]) # build a dataframe without the DAX itself
Data <- cbind(Ad(GDAXI), Data) # add DAX
colnames(Data)[1] <- "DAX"
tail(Data,2) #just to check - often Yahoo is not up to date and there are NAs in the last row
#Data <- window(Data, start=start(Data), end=end(Data)-1) # code to delete last row...

Return.calculate(Data, method="simple") -> Data.r #calculates the returns (simple)
Data.r[is.na(Data.r)] <- 0 

#build frames for the respective performances on short, mid and long term
mid.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,20)),1)))
short.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,5)),1)))
long.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,250)),1)))

per.df <- data.frame(cbind(t(short.perf), t(mid.perf), t(long.perf)))

colnames(per.df) <- c("short", "mid", "long")
row.names(per.df)[1] <- "DAX"
chart_title <- paste("Performance Comparison DAX values\n(latest data close of ",end(Data),")")
ggplot(data=per.df, aes(short, mid, label=rownames(per.df))) + geom_point(aes(color=long), size=4) +
  geom_text(hjust=0, vjust=0,size=4) + geom_vline(xintercept=0) + geom_hline(yintercept=0) +
  scale_colour_gradient2(low="red", high="green", name="250days\nPerformance") +
  scale_y_continuous("Mid Performance: 20days", labels=percent) +   # 'formatter=' was removed from ggplot2; use scales::percent
  scale_x_continuous("Short Performance: 5days", labels=percent) +
  ggtitle(chart_title)                                              # opts() is deprecated in ggplot2; use ggtitle()

