- Yanchang’s website with Examples and a nice reference card
- The rattle-package that introduces a nice GUI for R, and Graham William’s compendium of tools
- The caret-package that offers a unified interface to running a multitude of model builders.

And there are other tools out there for data mining, like Weka.

Weka has a GUI and can be directed via the command line with Java as well, and Weka has a large variety of algorithms included. If, for whatever reason, you do not find the algorithm you need being implemented in R, Weka might be the place to go. And the RWeka-package marries R and Weka.

I am not an expert neither in R, nor in Weka, nor in data mining. But I happen to play around with them, and I’d like to share a starter on how to work with them. There is good documentation out there (e.g. Open-Source Machine Learning: R Meets Weka or RWeka Odds and Ends), but sometimes you want to document your own steps and ways of working, and this is what I do.

So, I want to build a classification model for the iris-dataset, based on a tree classifier. Joice is the C4.5 algorithm that I did not find implemented in any standard R package.

**UPDATE**: “Z” made the comment below, that the C5.0-algorithm is implemented in the C50-package. And indeed, Max Kuhn has given a nice presentation about it at UseR! 2013, that you can find here.

We want to predict the class of a flower based on their attributes, namely sepal and petal width and length. The three species we have are “setosa”, “versicolor” and “virginica”. A short summary is given above.

We next load the RWeka package.

`summary(iris)`

`## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1 ## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3 ## Median :5.80 Median :3.00 Median :4.35 Median :1.3 ## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2 ## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8 ## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5 ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ##`

`library(RWeka)`

We now build the classifier, and this works with the J48(.)-function:

`iris_j48 <- J48(Species ~ ., data = iris) iris_j48`

`## J48 pruned tree ## ------------------ ## ## Petal.Width <= 0.6: setosa (50.0) ## Petal.Width > 0.6 ## | Petal.Width <= 1.7 ## | | Petal.Length <= 4.9: versicolor (48.0/1.0) ## | | Petal.Length > 4.9 ## | | | Petal.Width <= 1.5: virginica (3.0) ## | | | Petal.Width > 1.5: versicolor (3.0/1.0) ## | Petal.Width > 1.7: virginica (46.0/1.0) ## ## Number of Leaves : 5 ## ## Size of the tree : 9`

`summary(iris_j48)`

`## ## === Summary === ## ## Correctly Classified Instances 147 98 % ## Incorrectly Classified Instances 3 2 % ## Kappa statistic 0.97 ## Mean absolute error 0.0233 ## Root mean squared error 0.108 ## Relative absolute error 5.2482 % ## Root relative squared error 22.9089 % ## Coverage of cases (0.95 level) 98.6667 % ## Mean rel. region size (0.95 level) 34 % ## Total Number of Instances 150 ## ## === Confusion Matrix === ## ## a b c <-- classified as ## 50 0 0 | a = setosa ## 0 49 1 | b = versicolor ## 0 2 48 | c = virginica`

`# plot(iris_j48) library(partykit) plot(as.party(iris_j48)) # we use the partykit-package for nice plotting.`

We can assign the model to an object, and printing the object gives us the tree in “Weka-Output”, summary(.) gives us the Summary of the classification on the training set (again, in Weka-style), and plot(.) allows us to nicely plot it, while plot.party(.) gives us an even nicer plot (thanks to Z for his comment below).

Well, we used the whole dataset now for training, but we actually might want to perform cross-validation. This can be done like this:

`eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE, seed = 1, class = TRUE) eval_j48`

`## === 10 Fold Cross Validation === ## ## === Summary === ## ## Correctly Classified Instances 144 96 % ## Incorrectly Classified Instances 6 4 % ## Kappa statistic 0.94 ## Mean absolute error 0.035 ## Root mean squared error 0.1586 ## Relative absolute error 7.8705 % ## Root relative squared error 33.6353 % ## Coverage of cases (0.95 level) 96.6667 % ## Mean rel. region size (0.95 level) 33.7778 % ## Total Number of Instances 150 ## ## === Detailed Accuracy By Class === ## ## TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class ## 0.980 0.000 1.000 0.980 0.990 0.985 0.990 0.987 setosa ## 0.940 0.030 0.940 0.940 0.940 0.910 0.952 0.880 versicolor ## 0.960 0.030 0.941 0.960 0.950 0.925 0.961 0.905 virginica ## Weighted Avg. 0.960 0.020 0.960 0.960 0.960 0.940 0.968 0.924 ## ## === Confusion Matrix === ## ## a b c <-- classified as ## 49 1 0 | a = setosa ## 0 47 3 | b = versicolor ## 0 2 48 | c = virginica`

We see slightly worse results now, as you would suspect.

We used the standard options for th J48 classifier, but Weka allows more. You can acces these with the WOW-function:

`WOW("J48")`

`## -U Use unpruned tree. ## -O Do not collapse tree. ## -C <pruning confidence> ## Set confidence threshold for pruning. (default 0.25) ## Number of arguments: 1. ## -M <minimum number of instances> ## Set minimum number of instances per leaf. (default 2) ## Number of arguments: 1. ## -R Use reduced error pruning. ## -N <number of folds> ## Set number of folds for reduced error pruning. One fold is ## used as pruning set. (default 3) ## Number of arguments: 1. ## -B Use binary splits only. ## -S Don't perform subtree raising. ## -L Do not clean up after the tree has been built. ## -A Laplace smoothing for predicted probabilities. ## -J Do not use MDL correction for info gain on numeric ## attributes. ## -Q <seed> ## Seed for random data shuffling (default 1). ## Number of arguments: 1.`

If, for example, we want to use a tree with minimum 10 instances in each leaf, we change the command as follows:

`j48_control <- J48(Species ~ ., data = iris, control = Weka_control(M = 10)) j48_control`

`## J48 pruned tree ## ------------------ ## ## Petal.Width <= 0.6: setosa (50.0) ## Petal.Width > 0.6 ## | Petal.Width <= 1.7: versicolor (54.0/5.0) ## | Petal.Width > 1.7: virginica (46.0/1.0) ## ## Number of Leaves : 3 ## ## Size of the tree : 5`

And you see the tree is different (well, it just does not go as deep as the other one..).

You might want to include a cost matrix, i.e you want to penalize some wrong classifications, see here. If you think classifying for example a versicolor wrongly is very harmful, you want to penalize such a classification in our example, you can do that easily – you just have to choose a different classifier, namely the “Cost-sensitive classifier” in Weka:

```
csc <- CostSensitiveClassifier(Species ~ ., data = iris, control = Weka_control(`cost-matrix` = matrix(c(0,
10, 0, 0, 0, 0, 0, 10, 0), ncol = 3), W = "weka.classifiers.trees.J48",
M = TRUE))
```

But you have to tell the “cost-sensitive-classifier” that you want to use J48 as algorithm, and you have to tell him the cost matrix you want to apply, name ly the matrix of the form

`matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)`

`## [,1] [,2] [,3] ## [1,] 0 0 0 ## [2,] 1 0 1 ## [3,] 0 0 0`

where you penalize “versicolor” being falsly classified as one of the others by factor 10.

And again we evaluate on 10-fold CV:

`eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE, seed = 1, class = TRUE) eval_csc`

`## === 10 Fold Cross Validation === ## ## === Summary === ## ## Correctly Classified Instances 98 65.3333 % ## Incorrectly Classified Instances 52 34.6667 % ## Kappa statistic 0.48 ## Mean absolute error 0.2311 ## Root mean squared error 0.4807 ## Relative absolute error 52 % ## Root relative squared error 101.9804 % ## Coverage of cases (0.95 level) 65.3333 % ## Mean rel. region size (0.95 level) 33.3333 % ## Total Number of Instances 150 ## ## === Detailed Accuracy By Class === ## ## TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class ## 0.980 0.070 0.875 0.980 0.925 0.887 0.955 0.864 setosa ## 0.980 0.450 0.521 0.980 0.681 0.517 0.765 0.518 versicolor ## 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.333 virginica ## Weighted Avg. 0.653 0.173 0.465 0.653 0.535 0.468 0.740 0.572 ## ## === Confusion Matrix === ## ## a b c <-- classified as ## 49 1 0 | a = setosa ## 1 49 0 | b = versicolor ## 6 44 0 | c = virginica`

and we see that the “versicolors” are now better predicted (only one wrong, compared to 3 in the normal J48 earlier). But this happened at the expense of more fals classification on “virginica”, where we have now 6 wrongly classified instead of 2.

Alright, this is just a short starter. I suggest you check out the very good introductions I referred to earlier to explore the full wealth of RWeka… Have fun!

]]>There is a excellent tutorial on the extrafonts github site, still I will shortly demonstrate how it worked for me.

First, install the package and load it.

install.packages("extrafont"); library(extrafont)

You can now install the desired system fonts (at the moment only TrueType fonts):

font_import(pattern="[C/c]omic") font_import(pattern="[A/a]rial")

The `pattern`

argument just specifies the fonts to be installed; if you leave it out, the function will search automatically and install all fonts (see the `help`

function for `font_import`

in R.

You can now look at the fonts loaded to be used with R:

fonts() fonttable()

Alright, if you want to use the fonts to display graphs on-screen (what you normally want to do, e.g. in the plots-window in RStudio), you have to load the fonts to the Windows device:

loadfonts(device="win")

Now we can just plot a graph. I will use the ggplot2 package to do so (and this is taken nearly verbatim from the github-site mentioned earlier) :

library(ggplot2) ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() + ggtitle("Fuel Efficiency of 32 Cars") + xlab("Weight (x1000 lb)") + ylab("Miles per Gallon") + theme(text=element_text(size=16, family="Comic Sans MS"))

Let’s now draw a chart using the xkcd-style. This has been debated on stackoverflow, and the guys there have come up with great examples! I basically just take their approach and modify it slightly to draw a graph on US GDP (gross domestic product) development. I download the data from the St. Louis Fed website, using the excellent `quantmod`

package.

But first I need to get the font in xkcd style, and I’ve chosen to download it from here.

You now just have to install it, by right-clicking on the .ttf file and chose install.

font_import(pattern="[H/h]umor") library(quantmod) getSymbols("GDPC1", src="FRED")data <- data.frame(time=index(GDPC1), GDP=GDPC1)

Now data is donwloaded, the font is installed.

Next step: Create a ggplot2-theme that has xkcd “properties” – and this is what I’ve basically copied from stackoverflow. Note: Unlike on stackoverflow, I basically use only the font. I do not jitter the line, since it is real data that is “jittered” anyway.

The rest is just ggplot2 plotting:

### XKCD theme theme_xkcd <- theme(panel.background = element_rect(fill="white"), #axis.ticks = element_line(colour=NA), panel.grid = element_line(colour="white"), #axis.text.y = element_text(colour=NA), axis.text.x = element_text(colour="black"), text = element_text(size=16, family="Humor Sans")) ### Plot the chart ggplot(data=data, aes(x=time, y=GDPC1))+ geom_line(colour="red", size=1)+ ggtitle("development of US gross domestic product") + theme_xkcd()

If you want to save this chart now as pdf, you need to do another step: use `ggsave`

. But you need to embed the fonts in the PDF, and therefore you have to install Ghostscript. Tell R where Ghostscript sits on your machine, and the just embed the fonts:

ggsave("font_ggplot.pdf", plot=p, width=12, height=4) ## needed for Windows - make sure YOU have the correct path for your machine: Sys.setenv(R_GSCMD = "C:\\Program Files (x86)\\gs\\gs9.06\\bin\\gswin32c.exe" embed_fonts("font_ggplot.pdf")

There might be some warnings issued, but usually it works.

And finally: I do not take any credit for this work, it’s basically the developers of the <code>extrafont</code> package and Mark Bulling on stackoverflow who made all this possible. Any remaining errors are mine.

**UPDATE (2013-07-14): I just realized there is the xkcd-package available on CRAN, that makes all of this a lot easier!**

Consequently, I could not apply the classical linear regression.

An approach to solve this kind of problem is LOESS regression, which stands for

In the following, we will work through an artificial example to make the concept clear and show how to apply in R.

We first create a dataframe with two variables, a and y, which are non-linearily related. The plot shows this relationship.

We perform two regressions, one linear and one loess.

`## ## Call: ## lm(formula = y ~ a) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.303 -0.635 0.294 0.792 1.635 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.100335 0.136374 22.7 <2e-16 *** ## a 0.007645 0.000305 25.1 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.973 on 98 degrees of freedom ## Multiple R-squared: 0.865, Adjusted R-squared: 0.864 ## F-statistic: 630 on 1 and 98 DF, p-value: <2e-16`

`## Call: ## loess(formula = y ~ a) ## ## Number of Observations: 100 ## Equivalent Number of Parameters: 4.56 ## Residual Standard Error: 0.587 ## Trace of smoother matrix: 4.98 ## ## Control settings: ## normalize: TRUE ## span : 0.75 ## degree : 2 ## family : gaussian ## surface : interpolate cell = 0.2`

The scatterplot clearly shows the better fit from the loess regression.

The summary of the regression shows one difference between linear and loess: The loess regression does not give the parameters, since it is a local regression. The classical measure of goodness of fit is r2.

The linear model has a r2 of `0.865`

. For the LOESS we have to calculate the r2 and follow this proposal.

```
## [1] 0.9528
```

The r^{2} from the loess is `0.953`

and thus very good and better than the r^{2} from the linear regression.

Also the investigation of the plot of residuals vs fitted/predicted values indicates a much better fit of the LOSS regression compared to the linear regression (the residuals plot of the linear regression shows the structure – which we clearly do not want); the QQ-plot shows some issues on the tails.

But what to do now with it? Finally, we probably want to *predict* some values out of this exercise.

The *predict*-function helps us with this exercise.

```
## a linreg loess
## 1 10 3.177 2.265
## 2 400 6.158 7.133
## 3 900 9.981 9.365
```

Alright, this was a quick introduction on how to apply the LOESS regression.

And here is the full code:

## @knitr create_data y <- seq(from=1, to=10, length.out=100) a <- y^3 +y^2 + rnorm(100,mean=0, sd=30) data <- data.frame(a=a, y=y) plot(y=y, x=a) ## @knitr linreg linreg <- lm(y~a) summary(linreg) loess <- loess(y~a) summary(loess) scatter.smooth(data) abline(linreg, col="blue") ## @knitr loess_fit hat <- predict(loess) plot(y~a) lines(a[order(a)], hat[order(hat)], col="red") (r_sq_loess <- cor(y, hat)^2) ## @knitr fit_loess par(mfrow=c(2,2)) plot(linreg$fitted, linreg$residuals, main="classical linear regression") plot(loess$fitted, loess$residuals, main="LOESS") # normal probablility plot qqnorm(linreg$residuals, ylim=c(-2,2)) qqline(linreg$residuals) qqnorm(loess$residuals, ylim=c(-2,2)) qqline(loess$residuals) ## @knitr predict predict <- data.frame(a=c(10,400,900)) scatter.smooth(data) abline(linreg, col="blue") predict$linreg <- predict(linreg, predict) predict$loess <- predict(loess, predict) predict points(x=predict$a, y=predict(linreg, predict), col="blue", pch=18, cex=2) points(x=predict$a, y=predict(loess, predict), col="red", pch=18, cex=2)

And there are a lot of threads, blogs etc around that. Sorry for adding another one, but so I remember myself.

Let’s take as example the dataset **diamonds** that is included in the **ggplot2**-package.

`library(ggplot2) data(diamonds) str(diamonds)`

`## 'data.frame': 53940 obs. of 10 variables: ## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... ## $ cut : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ... ## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ... ## $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ... ## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ... ## $ table : num 55 61 65 58 58 57 57 55 61 61 ... ## $ price : int 326 326 327 334 335 336 336 337 337 338 ... ## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ... ## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ... ## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...`

We can see that the datsets consist of a mixture of variales and types.

Let’s say we are interest in the average *price* of diamonds with regards to the category of *clarity*.

Do it with **agggregate**

Aggregate comes with the stats package. The relevant call to answer our questions could look like

`aggregate(price ~ clarity, diamonds, mean)`

`## clarity price ## 1 I1 3924 ## 2 SI2 5063 ## 3 SI1 3996 ## 4 VS2 3925 ## 5 VS1 3839 ## 6 VVS2 3284 ## 7 VVS1 2523 ## 8 IF 2865`

Et voilà. Other way of typing would be the usage of the “list” argument as below.

We can also enlarge the list argument by other categorical variables, as shown as well, using *clarity* and *color*.

`aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity), mean)`

`## CLARITY x ## 1 I1 3924 ## 2 SI2 5063 ## 3 SI1 3996 ## 4 VS2 3925 ## 5 VS1 3839 ## 6 VVS2 3284 ## 7 VVS1 2523 ## 8 IF 2865`

`aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity, COLOR = diamonds$color), mean)`

`## CLARITY COLOR x ## 1 I1 D 3863 ## 2 SI2 D 3931 ## 3 SI1 D 2976 ## 4 VS2 D 2587 ## 5 VS1 D 3030 ## 6 VVS2 D 3351 ## 7 VVS1 D 2948 ## 8 IF D 8307 ## 9 I1 E 3488 ## 10 SI2 E 4174 ## 11 SI1 E 3162 ## 12 VS2 E 2751 ## 13 VS1 E 2856 ## 14 VVS2 E 2500 ## 15 VVS1 E 2220 ## 16 IF E 3669 ## 17 I1 F 3342 ## 18 SI2 F 4473 ## 19 SI1 F 3714 ## 20 VS2 F 3757 ## 21 VS1 F 3797 ## 22 VVS2 F 3476 ## 23 VVS1 F 2804 ## 24 IF F 2751 ## 25 I1 G 3546 ## 26 SI2 G 5022 ## 27 SI1 G 3775 ## 28 VS2 G 4416 ## 29 VS1 G 4131 ## 30 VVS2 G 3845 ## 31 VVS1 G 2867 ## 32 IF G 2558 ## 33 I1 H 4453 ## 34 SI2 H 6100 ## 35 SI1 H 5032 ## 36 VS2 H 4722 ## 37 VS1 H 3781 ## 38 VVS2 H 2649 ## 39 VVS1 H 1846 ## 40 IF H 2288 ## 41 I1 I 4302 ## 42 SI2 I 7003 ## 43 SI1 I 5355 ## 44 VS2 I 5691 ## 45 VS1 I 4633 ## 46 VVS2 I 2968 ## 47 VVS1 I 2035 ## 48 IF I 1995 ## 49 I1 J 5254 ## 50 SI2 J 6521 ## 51 SI1 J 5186 ## 52 VS2 J 5311 ## 53 VS1 J 4884 ## 54 VVS2 J 5142 ## 55 VVS1 J 4034 ## 56 IF J 3364`

We clearly achieve the same result using the above “formula notation”:

`aggregate(price ~ clarity + color, diamonds, mean)`

`## clarity color price ## 1 I1 D 3863 ## 2 SI2 D 3931 ## 3 SI1 D 2976 ## 4 VS2 D 2587 ## 5 VS1 D 3030 ## 6 VVS2 D 3351 ## 7 VVS1 D 2948 ## 8 IF D 8307 ## 9 I1 E 3488 ## 10 SI2 E 4174 ## 11 SI1 E 3162 ## 12 VS2 E 2751 ## 13 VS1 E 2856 ## 14 VVS2 E 2500 ## 15 VVS1 E 2220 ## 16 IF E 3669 ## 17 I1 F 3342 ## 18 SI2 F 4473 ## 19 SI1 F 3714 ## 20 VS2 F 3757 ## 21 VS1 F 3797 ## 22 VVS2 F 3476 ## 23 VVS1 F 2804 ## 24 IF F 2751 ## 25 I1 G 3546 ## 26 SI2 G 5022 ## 27 SI1 G 3775 ## 28 VS2 G 4416 ## 29 VS1 G 4131 ## 30 VVS2 G 3845 ## 31 VVS1 G 2867 ## 32 IF G 2558 ## 33 I1 H 4453 ## 34 SI2 H 6100 ## 35 SI1 H 5032 ## 36 VS2 H 4722 ## 37 VS1 H 3781 ## 38 VVS2 H 2649 ## 39 VVS1 H 1846 ## 40 IF H 2288 ## 41 I1 I 4302 ## 42 SI2 I 7003 ## 43 SI1 I 5355 ## 44 VS2 I 5691 ## 45 VS1 I 4633 ## 46 VVS2 I 2968 ## 47 VVS1 I 2035 ## 48 IF I 1995 ## 49 I1 J 5254 ## 50 SI2 J 6521 ## 51 SI1 J 5186 ## 52 VS2 J 5311 ## 53 VS1 J 4884 ## 54 VVS2 J 5142 ## 55 VVS1 J 4034 ## 56 IF J 3364`

Since I find the formula quotation much more intuitive I’ll continue using that one.

It allows also to pass further arguments, like the probabilities for the quantile-function:

`aggregate(price ~ clarity + color, diamonds, FUN = quantile, probs = 0.8)`

`## clarity color price ## 1 I1 D 5139 ## 2 SI2 D 5257 ## 3 SI1 D 5096 ## 4 VS2 D 3305 ## 5 VS1 D 4057 ## 6 VVS2 D 4477 ## 7 VVS1 D 3781 ## 8 IF D 16015 ## 9 I1 E 4191 ## 10 SI2 E 5150 ## 11 SI1 E 5068 ## 12 VS2 E 3585 ## 13 VS1 E 3466 ## 14 VVS2 E 2857 ## 15 VVS1 E 2514 ## 16 IF E 6002 ## 17 I1 F 4570 ## 18 SI2 F 5163 ## 19 SI1 F 5292 ## 20 VS2 F 6598 ## 21 VS1 F 7265 ## 22 VVS2 F 8218 ## 23 VVS1 F 3501 ## 24 IF F 2656 ## 25 I1 G 6209 ## 26 SI2 G 6785 ## 27 SI1 G 5342 ## 28 VS2 G 7031 ## 29 VS1 G 7508 ## 30 VVS2 G 7952 ## 31 VVS1 G 3327 ## 32 IF G 2575 ## 33 I1 H 6530 ## 34 SI2 H 9257 ## 35 SI1 H 7173 ## 36 VS2 H 7510 ## 37 VS1 H 6796 ## 38 VVS2 H 4269 ## 39 VVS1 H 2304 ## 40 IF H 2449 ## 41 I1 I 5696 ## 42 SI2 I 12918 ## 43 SI1 I 8736 ## 44 VS2 I 9791 ## 45 VS1 I 8383 ## 46 VVS2 I 3934 ## 47 VVS1 I 2549 ## 48 IF I 2062 ## 49 I1 J 6576 ## 50 SI2 J 11865 ## 51 SI1 J 7952 ## 52 VS2 J 8055 ## 53 VS1 J 8507 ## 54 VVS2 J 8770 ## 55 VVS1 J 8803 ## 56 IF J 5361`

However, we face one issue here: For “at a glance” investigation, the output is not easily readable because it is in “long-format” (is this the right expression?). What I’d like to have, is a kind of cross-table. The MS-Excel user might recognize this as “pivot tables”.

For this purpose we have to make us of the **xtabs**-function that creates a contingency table.

First, we create a dataframe using our aggregate function and then we pass this dataframe to xtabs, specifying that we want to see the price on the other variables:

`dmnds <- aggregate(price ~ clarity + color, diamonds, mean) xtabs(price ~ ., data = dmnds)`

`## color ## clarity D E F G H I J ## I1 3863 3488 3342 3546 4453 4302 5254 ## SI2 3931 4174 4473 5022 6100 7003 6521 ## SI1 2976 3162 3714 3775 5032 5355 5186 ## VS2 2587 2751 3757 4416 4722 5691 5311 ## VS1 3030 2856 3797 4131 3781 4633 4884 ## VVS2 3351 2500 3476 3845 2649 2968 5142 ## VVS1 2948 2220 2804 2867 1846 2035 4034 ## IF 8307 3669 2751 2558 2288 1995 3364`

That’s nice, isn’t it?

But as announced earlier, there is another way of achieving the same result: The **ddply**-function from the **plyr**-package.

`dmnds.ddply <- ddply(.data = diamonds, .(clarity, color), summarise, PRICEMEAN = mean(price)) dmnds.ddply`

`## clarity color PRICEMEAN ## 1 I1 D 3863 ## 2 I1 E 3488 ## 3 I1 F 3342 ## 4 I1 G 3546 ## 5 I1 H 4453 ## 6 I1 I 4302 ## 7 I1 J 5254 ## 8 SI2 D 3931 ## 9 SI2 E 4174 ## 10 SI2 F 4473 ## 11 SI2 G 5022 ## 12 SI2 H 6100 ## 13 SI2 I 7003 ## 14 SI2 J 6521 ## 15 SI1 D 2976 ## 16 SI1 E 3162 ## 17 SI1 F 3714 ## 18 SI1 G 3775 ## 19 SI1 H 5032 ## 20 SI1 I 5355 ## 21 SI1 J 5186 ## 22 VS2 D 2587 ## 23 VS2 E 2751 ## 24 VS2 F 3757 ## 25 VS2 G 4416 ## 26 VS2 H 4722 ## 27 VS2 I 5691 ## 28 VS2 J 5311 ## 29 VS1 D 3030 ## 30 VS1 E 2856 ## 31 VS1 F 3797 ## 32 VS1 G 4131 ## 33 VS1 H 3781 ## 34 VS1 I 4633 ## 35 VS1 J 4884 ## 36 VVS2 D 3351 ## 37 VVS2 E 2500 ## 38 VVS2 F 3476 ## 39 VVS2 G 3845 ## 40 VVS2 H 2649 ## 41 VVS2 I 2968 ## 42 VVS2 J 5142 ## 43 VVS1 D 2948 ## 44 VVS1 E 2220 ## 45 VVS1 F 2804 ## 46 VVS1 G 2867 ## 47 VVS1 H 1846 ## 48 VVS1 I 2035 ## 49 VVS1 J 4034 ## 50 IF D 8307 ## 51 IF E 3669 ## 52 IF F 2751 ## 53 IF G 2558 ## 54 IF H 2288 ## 55 IF I 1995 ## 56 IF J 3364`

Well, perhaps a bit verbose compared to the use of aggregate, but ddply allows great flexibility!

In order to bring it now into a “pivot table”, we can make use of (a) the xtabs-function again, or (b) use the cast-function in the reshape2-package (NB: Use **dcast** for dataframes):

`dcast(dmnds.ddply, clarity ~ color, value.var = "PRICEMEAN")`

`## clarity D E F G H I J ## 1 I1 3863 3488 3342 3546 4453 4302 5254 ## 2 SI2 3931 4174 4473 5022 6100 7003 6521 ## 3 SI1 2976 3162 3714 3775 5032 5355 5186 ## 4 VS2 2587 2751 3757 4416 4722 5691 5311 ## 5 VS1 3030 2856 3797 4131 3781 4633 4884 ## 6 VVS2 3351 2500 3476 3845 2649 2968 5142 ## 7 VVS1 2948 2220 2804 2867 1846 2035 4034 ## 8 IF 8307 3669 2751 2558 2288 1995 3364`

Again, we have a “formula interface” which makes the application straightforward. The parameter *“value.var”* specifies which column to take to apply the function to – very helpful in case of big datasets.

Alright, and last but not least a oneliner that achieves the same result, combining the **melt**-function and the **cast**-function (the **melt**-function is part of the reshape2 package and converts “datatables” in long format.)

`dcast(melt(diamonds, id.vars = c("cut", "color", "clarity"), measure.vars = "price"), formula = clarity ~ color, mean)`

`## clarity D E F G H I J ## 1 I1 3863 3488 3342 3546 4453 4302 5254 ## 2 SI2 3931 4174 4473 5022 6100 7003 6521 ## 3 SI1 2976 3162 3714 3775 5032 5355 5186 ## 4 VS2 2587 2751 3757 4416 4722 5691 5311 ## 5 VS1 3030 2856 3797 4131 3781 4633 4884 ## 6 VVS2 3351 2500 3476 3845 2649 2968 5142 ## 7 VVS1 2948 2220 2804 2867 1846 2035 4034 ## 8 IF 8307 3669 2751 2558 2288 1995 3364`

So first, the dataset is brought *entirely* in long format, and then **cast** is applied. But this clearly is the most time-consuming way of doing it (in terms of computation time)…

Bottom line:

My favorite function for not too big datasets is the combination of **ddply** and **cast**, sine it allows for most flexibility, I feel. That flexibility come with the price of syntax, which you’ll have to learn. **Aggregate** and **xtabs** are quickly applied for small problems (maybe also for big ones, but I do not understand how to apply for more complex tasks that ddply allows).

Additionally, I’ve heard talking about the **data.table** package, that seems to be the fastest. Go and check!

References:

]]>Basically, JGR is a Graphical User Interface for R, and Deducer is a kind of “data analysis plugin”, that also comes with a so-called “plot builder” running ggplot2.

This feature allows users to access the power of ggplot2 through a simple user interface, which makes the start with ggplot2 much easier.

The attached slide pack gives a short introduction (unfortunately only in german), but in order to see Deducer in action, I recommend to have a look at excellent tutorials like

Deducer’s plot builder – Part 1

The slides below

As well as the source code of the slides, in .Rmd format.

]]>How to deal with .Rmd-files, see here

These are my first steps to play around with the interface from *R* to *twitter*, using the **twitteR**-package.

We will load the latest 1500 (maximum the API allows) tweets from the user **@RegSprecher**, who is the spokesman of the German government and run some analysis, like:

- What devices does he use to tweet?
- What is the frequency of his tweets, and on which days was he most active?
- Is there any particular pattern about when he tweets?
- What is he tweeting about?

`## text ## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?`

The latest 6 tweets of @RegSprecher

- Mr. Seibert seems to love his iPad: The vast majority of tweets are issued by the iPad-App of twitter….

- The tweets are summarized by day, and the bars show the amount of tweets per day
- There were some heavy days in mid-March and beginning of May, with over 40 tweets per day
- Normal “twee-rate” seems to be between 3 and 10 per day
- There are even some “idle” days, where he did not tweet at all – holidays?

- The left schart shows on which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
- The righ chart shows this more clearly: Most tweets happen between 10:00 and 14:00. Then it is lunchbreak, and around 16:00 Mr Seibert get’s up to speed again, before his tweeting activity decays. Only at 20:00, there is another small spike. But the afternoon activity is not as pronounced as the morning activity.

`## [1] "bpa" "bundesregierung" "deu" ## [4] "fragreg" "für" "kanzlerin" ## [7] "mehr" "merkel" "neue" ## [10] "über" "uhr"`

`## merkel kanzlerin ## 1.00 0.73`

- The most frequent terms are shown above ; they occur each more than 30 times.
- Interestingly, there is no mention of “financial crisis” or other political terms – Mr Seibert seems to focus on announcing “presentations” (präs) or press conferences (bpa) or Q&A sessions (fragreg).
- We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…

- The dendrogram shows other associations: Terms relating to press conferences are close (“fragreg”, “uhr” etc).
- Interestingly, the term “DEU” (=Germany) is associated with “mehr” (=more). So, whatever is said, it is also more with Germany

——————————————————–

The Code below:

## @knitr setup library(ggplot2) library(scales) library(lubridate) library(twitteR) library(gridExtra) # set global chunk options opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE) # upload images automatically opts_knit$set(upload.fun = imgur_upload) ## @knitr load_data # load tweets and convert to dataframe regsprecher.tweets <- userTimeline("RegSprecher", n=1500) regsprecher.tweets.df <- twListToDF(regsprecher.tweets) regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004... #str(regsprecher.tweets.df) print(head(regsprecher.tweets.df[,c(1,4,10)])) ## @knitr device # Code from vignette of twitteR-package sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource()) sources <- gsub("", "", sources) sources <- strsplit(sources, ">") sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1])) pie(table(sources)) ## @knitr freq ggplot() + geom_bar(aes(x = created),data=regsprecher.tweets.df,binwidth = 86400.0) + scale_y_continuous(name = 'Frequency, # tweets/day') + scale_x_datetime(name = 'Date',breaks = date_breaks(),labels = date_format(format = '%Y-%b')) ## @knitr time plot1 <- ggplot() + geom_point(aes(x=created, y=hour(created)), data=regsprecher.tweets.df, alpha=0.5) +scale_y_continuous(name = 'Hour of day') plot2 <- ggplot() + geom_bar(aes(x = hour(created)),data=regsprecher.tweets.df,binwidth = 1.0) + scale_x_continuous(name = 'Hour of day',breaks = c(c(0,6,10,12,8,14,16,18,20, 22,2,4,24)),limits = c(0,24)) + scale_y_continuous(name = '# tweets') grid.arrange(plot1, plot2, ncol=2) ## @knitr words # this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/ require(tm) # build a corpus mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text)) # make each letter lowercase mydata.corpus <- tm_map(mydata.corpus, tolower) # remove punctuation mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE) # remove generic and custom stopwords my_stopwords <- c(stopwords('german')) mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords) # build a term-document matrix mydata.dtm <- TermDocumentMatrix(mydata.corpus) # inspect the document-term matrix #mydata.dtm # inspect most popular words findFreqTerms(mydata.dtm, lowfreq=30) findAssocs(mydata.dtm, 'merkel', 0.20) ## @knitr words2 # remove sparse terms to simplify the cluster plot # Note: tweak the sparse parameter to determine the number of words. # About 10-30 words is good. mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97) # convert the sparse term-document matrix to a standard data frame mydata.df <- as.data.frame(inspect(mydata.dtm2)) mydata.df.scale <- scale(mydata.df) d <- dist(mydata.df.scale, method = "euclidean") # distance matrix ## @knitr words3 fit <- hclust(d, method="ward") plot(fit) # display dendogram?

R in combination with the right packages delivers a quick answer.

The result is a picture like this:

Values in the “upper right” quadrant stand for shares, that did show positive performance during the last 5 and 20 trading days. The color displays the performance over the last 250 trading days.

We can see that the last week was rather positive, all of the shares except one have positive performance. Also the last 20 days were profitable for most shares.

However, it can be seen that a lot of stocks did suffer during the last year (i.e. 250 trading days), which van bee seen from the colors – most of them are red. Most notably CBK.DE (Commerzbank) did nearly loose all of its value…

You can find the R code here:

library(quantmod) library(ggplot2) library(zoo) #get list of Symbols for DAX-values l<- c("^GDAXI", "DB1.DE", "ADS.DE", "ALV.DE", "BAS.DE","BAYN.DE","BEI.DE","BMW.DE","CBK.DE","DAI.DE","DBK.DE","DPW.DE","DTE.DE","EOAN.DE","FME.DE","FRE.DE","HEI.DE","HEN3.DE","IFX.DE","LHA.DE","LIN.DE","MAN.DE","MEO.DE","MUV2.DE","RWE.DE","SAP.DE","SDF.DE","SIE.DE","TKA.DE","VOW3.DE") getSymbols(l, from="2010-09-01") l[1] <- "GDAXI" # Function to extract "adjusted prices" and build dataframe: Thanks to Zach Mayer of moderntoolmaking.blogspot.com symbolFrame <- function(symbolList) { Data <- data.frame(NULL) for (S in symbolList) { Data <- cbind(Data,Ad(get(S))) } colnames(Data) <- symbolList return(Data) } Data <- symbolFrame(l[-1]) # build a dataframe without DAX istelf Data <- cbind(Ad(GDAXI), Data) # add DAX colnames(Data)[1] <- "DAX" tail(Data,2) #just to check - often Yahoo is not up to date and there are NAs in the last row #Data <- window(Data, start=start(Data), end=end(Data)-1) # code to delete last row... Return.calculate(Data, method="simple") -> Data.r #calculates the returns (simple) Data.r[is.na(Data.r)] <- 0 #builds frames for the respective perfromances on short, mid and long term mid.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,20)),1))) short.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,5)),1))) long.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,250)),1))) per.df <- data.frame(cbind(t(short.perf), t(mid.perf), t(long.perf))) colnames(per.df) <- c("short", "mid", "long") row.names(per.df)[1] <- "DAX" chart_title <- paste("Performance Comparison DAX values\n(latest data close of ",end(Data),")") ggplot(data=per.df, aes(short, mid, label=rownames(per.df))) + geom_point(aes(color=long), size=4) + geom_text(hjust=0, vjust=0,size=4) + geom_vline(xintercept=0) + geom_hline(yintercept=0) + scale_colour_gradient2(low="red", high="green", "250days\nPerformance") + scale_y_continuous("Mid Performance: 20days", formatter="percent") + scale_x_continuous("Short Performance: 5days", formatter="percent") + opts(title = chart_title)

And there is another blog…

I’ll use this in order to jot down my thoughts on Finance, R and all other stuff, that’s on my mind. But I Think most of it will be around how to use R to analyze data, mainly from Finance. I grateful to all these people, who openly share their thoughts and codes on the web.

So let’s see where we get to…

]]>