LOESS regression with R

The other day, I came across a small problem: I was investigating a dataset, and the variables clearly showed a non-linear relationship.
Consequently, I could not apply classical linear regression.
One approach to this kind of problem is LOESS regression, which stands for locally weighted scatterplot smoothing. More information can be found here.

In the following, we will work through an artificial example to make the concept clear and show how to apply it in R.

We first create a data frame with two variables, a and y, which are non-linearly related. The plot shows this relationship.

We perform two regressions, one linear and one loess.

## 
## Call:
## lm(formula = y ~ a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.303 -0.635  0.294  0.792  1.635 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.100335   0.136374    22.7   <2e-16 ***
## a           0.007645   0.000305    25.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.973 on 98 degrees of freedom
## Multiple R-squared: 0.865,   Adjusted R-squared: 0.864 
## F-statistic:  630 on 1 and 98 DF,  p-value: <2e-16

## Call:
## loess(formula = y ~ a)
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 4.56 
## Residual Standard Error: 0.587 
## Trace of smoother matrix: 4.98 
## 
## Control settings:
##   normalize:  TRUE 
##   span       :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2

The scatterplot clearly shows the better fit of the LOESS regression.
The summary output also reveals one difference between the linear and the LOESS model: the LOESS regression does not report coefficients, since it is a local regression. The classical measure of goodness of fit is r2.
The linear model has an r2 of 0.865. For LOESS we have to calculate an r2 ourselves, following this proposal: take the squared correlation between the observed and the fitted values.

## [1] 0.9528

The r2 from the LOESS fit is 0.953, which is very good and clearly better than the r2 from the linear regression.
The plot of residuals vs fitted/predicted values also indicates a much better fit of the LOESS regression compared to the linear regression (the residuals plot of the linear regression shows a clear structure, which we do not want); the QQ-plot shows some issues in the tails.

But what do we do with it now? Ultimately, we probably want to predict some values from this exercise.
The predict function helps us here.

##     a linreg loess
## 1  10  3.177 2.265
## 2 400  6.158 7.133
## 3 900  9.981 9.365

Alright, this was a quick introduction to applying LOESS regression.

And here is the full code:

## @knitr create_data
y <- seq(from=1, to=10, length.out=100)
a <- y^3 +y^2  + rnorm(100,mean=0, sd=30)
data <- data.frame(a=a, y=y)
plot(y=y, x=a)

## @knitr linreg
linreg <- lm(y ~ a)
summary(linreg)

# store the LOESS fit as loess_fit to avoid shadowing the loess() function
loess_fit <- loess(y ~ a)
summary(loess_fit)

scatter.smooth(data)
abline(linreg, col="blue")

## @knitr loess_fit
hat <- predict(loess_fit)
plot(y ~ a)
lines(a[order(a)], hat[order(a)], col="red")  # order the fitted values by a, not by themselves
(r_sq_loess <- cor(y, hat)^2)

## @knitr fit_loess
par(mfrow=c(2,2))
plot(linreg$fitted, linreg$residuals, main="classical linear regression")
plot(loess_fit$fitted, loess_fit$residuals, main="LOESS")
# normal probability plots
qqnorm(linreg$residuals, ylim=c(-2,2))
qqline(linreg$residuals)
qqnorm(loess_fit$residuals, ylim=c(-2,2))
qqline(loess_fit$residuals)

## @knitr predict
# use a data frame called newdata rather than predict, to avoid masking predict()
newdata <- data.frame(a=c(10, 400, 900))
scatter.smooth(data)
abline(linreg, col="blue")

newdata$linreg <- predict(linreg, newdata)
newdata$loess <- predict(loess_fit, newdata)

newdata

points(x=newdata$a, y=newdata$linreg, col="blue", pch=18, cex=2)
points(x=newdata$a, y=newdata$loess, col="red", pch=18, cex=2)


Basic ideas on aggregate, plyr and crosstables!

A common task in R is the investigation of one particular dataset. Usually we have a mixture of numerical and categorical data and are interested in some summary statistics (e.g. means and so on).
There are a lot of threads, blog posts etc. around on this topic. Sorry for adding another one, but this way I will remember it myself.

Let’s take as an example the diamonds dataset that is included in the ggplot2 package.

library(ggplot2)
data(diamonds)
str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We can see that the dataset consists of a mixture of variables and types.
Let’s say we are interested in the average price of diamonds with regard to the clarity category.

Doing it with aggregate
aggregate comes with the stats package. The relevant call to answer our question could look like this:

aggregate(price ~ clarity, diamonds, mean)

##   clarity price
## 1      I1  3924
## 2     SI2  5063
## 3     SI1  3996
## 4     VS2  3925
## 5     VS1  3839
## 6    VVS2  3284
## 7    VVS1  2523
## 8      IF  2865

Et voilà. Another way of writing this is to use the “list” argument, as shown below.
We can also extend the list argument with additional categorical variables, here clarity and color.

aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity), 
    mean)

##   CLARITY    x
## 1      I1 3924
## 2     SI2 5063
## 3     SI1 3996
## 4     VS2 3925
## 5     VS1 3839
## 6    VVS2 3284
## 7    VVS1 2523
## 8      IF 2865

aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity, 
    COLOR = diamonds$color), mean)

##    CLARITY COLOR    x
## 1       I1     D 3863
## 2      SI2     D 3931
## 3      SI1     D 2976
## 4      VS2     D 2587
## 5      VS1     D 3030
## 6     VVS2     D 3351
## 7     VVS1     D 2948
## 8       IF     D 8307
## 9       I1     E 3488
## 10     SI2     E 4174
## 11     SI1     E 3162
## 12     VS2     E 2751
## 13     VS1     E 2856
## 14    VVS2     E 2500
## 15    VVS1     E 2220
## 16      IF     E 3669
## 17      I1     F 3342
## 18     SI2     F 4473
## 19     SI1     F 3714
## 20     VS2     F 3757
## 21     VS1     F 3797
## 22    VVS2     F 3476
## 23    VVS1     F 2804
## 24      IF     F 2751
## 25      I1     G 3546
## 26     SI2     G 5022
## 27     SI1     G 3775
## 28     VS2     G 4416
## 29     VS1     G 4131
## 30    VVS2     G 3845
## 31    VVS1     G 2867
## 32      IF     G 2558
## 33      I1     H 4453
## 34     SI2     H 6100
## 35     SI1     H 5032
## 36     VS2     H 4722
## 37     VS1     H 3781
## 38    VVS2     H 2649
## 39    VVS1     H 1846
## 40      IF     H 2288
## 41      I1     I 4302
## 42     SI2     I 7003
## 43     SI1     I 5355
## 44     VS2     I 5691
## 45     VS1     I 4633
## 46    VVS2     I 2968
## 47    VVS1     I 2035
## 48      IF     I 1995
## 49      I1     J 5254
## 50     SI2     J 6521
## 51     SI1     J 5186
## 52     VS2     J 5311
## 53     VS1     J 4884
## 54    VVS2     J 5142
## 55    VVS1     J 4034
## 56      IF     J 3364

We clearly achieve the same result using the above “formula notation”:

aggregate(price ~ clarity + color, diamonds, mean)

##    clarity color price
## 1       I1     D  3863
## 2      SI2     D  3931
## 3      SI1     D  2976
## 4      VS2     D  2587
## 5      VS1     D  3030
## 6     VVS2     D  3351
## 7     VVS1     D  2948
## 8       IF     D  8307
## 9       I1     E  3488
## 10     SI2     E  4174
## 11     SI1     E  3162
## 12     VS2     E  2751
## 13     VS1     E  2856
## 14    VVS2     E  2500
## 15    VVS1     E  2220
## 16      IF     E  3669
## 17      I1     F  3342
## 18     SI2     F  4473
## 19     SI1     F  3714
## 20     VS2     F  3757
## 21     VS1     F  3797
## 22    VVS2     F  3476
## 23    VVS1     F  2804
## 24      IF     F  2751
## 25      I1     G  3546
## 26     SI2     G  5022
## 27     SI1     G  3775
## 28     VS2     G  4416
## 29     VS1     G  4131
## 30    VVS2     G  3845
## 31    VVS1     G  2867
## 32      IF     G  2558
## 33      I1     H  4453
## 34     SI2     H  6100
## 35     SI1     H  5032
## 36     VS2     H  4722
## 37     VS1     H  3781
## 38    VVS2     H  2649
## 39    VVS1     H  1846
## 40      IF     H  2288
## 41      I1     I  4302
## 42     SI2     I  7003
## 43     SI1     I  5355
## 44     VS2     I  5691
## 45     VS1     I  4633
## 46    VVS2     I  2968
## 47    VVS1     I  2035
## 48      IF     I  1995
## 49      I1     J  5254
## 50     SI2     J  6521
## 51     SI1     J  5186
## 52     VS2     J  5311
## 53     VS1     J  4884
## 54    VVS2     J  5142
## 55    VVS1     J  4034
## 56      IF     J  3364

Since I find the formula notation much more intuitive, I’ll continue using that one.

It also allows passing further arguments, such as the probabilities for the quantile function:

aggregate(price ~ clarity + color, diamonds, FUN = quantile, probs = 0.8)

##    clarity color price
## 1       I1     D  5139
## 2      SI2     D  5257
## 3      SI1     D  5096
## 4      VS2     D  3305
## 5      VS1     D  4057
## 6     VVS2     D  4477
## 7     VVS1     D  3781
## 8       IF     D 16015
## 9       I1     E  4191
## 10     SI2     E  5150
## 11     SI1     E  5068
## 12     VS2     E  3585
## 13     VS1     E  3466
## 14    VVS2     E  2857
## 15    VVS1     E  2514
## 16      IF     E  6002
## 17      I1     F  4570
## 18     SI2     F  5163
## 19     SI1     F  5292
## 20     VS2     F  6598
## 21     VS1     F  7265
## 22    VVS2     F  8218
## 23    VVS1     F  3501
## 24      IF     F  2656
## 25      I1     G  6209
## 26     SI2     G  6785
## 27     SI1     G  5342
## 28     VS2     G  7031
## 29     VS1     G  7508
## 30    VVS2     G  7952
## 31    VVS1     G  3327
## 32      IF     G  2575
## 33      I1     H  6530
## 34     SI2     H  9257
## 35     SI1     H  7173
## 36     VS2     H  7510
## 37     VS1     H  6796
## 38    VVS2     H  4269
## 39    VVS1     H  2304
## 40      IF     H  2449
## 41      I1     I  5696
## 42     SI2     I 12918
## 43     SI1     I  8736
## 44     VS2     I  9791
## 45     VS1     I  8383
## 46    VVS2     I  3934
## 47    VVS1     I  2549
## 48      IF     I  2062
## 49      I1     J  6576
## 50     SI2     J 11865
## 51     SI1     J  7952
## 52     VS2     J  8055
## 53     VS1     J  8507
## 54    VVS2     J  8770
## 55    VVS1     J  8803
## 56      IF     J  5361

However, we face one issue here: for an “at a glance” investigation, the output is not easily readable because it is in long format. What I’d like to have is a kind of cross-table; MS Excel users might recognize this as a “pivot table”.
For this purpose we can make use of the xtabs function, which creates a contingency table.
First, we create a data frame using our aggregate call, and then we pass this data frame to xtabs, specifying that we want to see price against the other variables:

dmnds <- aggregate(price ~ clarity + color, diamonds, mean)
xtabs(price ~ ., data = dmnds)

##        color
## clarity    D    E    F    G    H    I    J
##    I1   3863 3488 3342 3546 4453 4302 5254
##    SI2  3931 4174 4473 5022 6100 7003 6521
##    SI1  2976 3162 3714 3775 5032 5355 5186
##    VS2  2587 2751 3757 4416 4722 5691 5311
##    VS1  3030 2856 3797 4131 3781 4633 4884
##    VVS2 3351 2500 3476 3845 2649 2968 5142
##    VVS1 2948 2220 2804 2867 1846 2035 4034
##    IF   8307 3669 2751 2558 2288 1995 3364

That’s nice, isn’t it?

But as announced earlier, there is another way of achieving the same result: the ddply function from the plyr package.

library(plyr)
dmnds.ddply <- ddply(.data = diamonds, .(clarity, color), summarise, 
    PRICEMEAN = mean(price))
dmnds.ddply

##    clarity color PRICEMEAN
## 1       I1     D      3863
## 2       I1     E      3488
## 3       I1     F      3342
## 4       I1     G      3546
## 5       I1     H      4453
## 6       I1     I      4302
## 7       I1     J      5254
## 8      SI2     D      3931
## 9      SI2     E      4174
## 10     SI2     F      4473
## 11     SI2     G      5022
## 12     SI2     H      6100
## 13     SI2     I      7003
## 14     SI2     J      6521
## 15     SI1     D      2976
## 16     SI1     E      3162
## 17     SI1     F      3714
## 18     SI1     G      3775
## 19     SI1     H      5032
## 20     SI1     I      5355
## 21     SI1     J      5186
## 22     VS2     D      2587
## 23     VS2     E      2751
## 24     VS2     F      3757
## 25     VS2     G      4416
## 26     VS2     H      4722
## 27     VS2     I      5691
## 28     VS2     J      5311
## 29     VS1     D      3030
## 30     VS1     E      2856
## 31     VS1     F      3797
## 32     VS1     G      4131
## 33     VS1     H      3781
## 34     VS1     I      4633
## 35     VS1     J      4884
## 36    VVS2     D      3351
## 37    VVS2     E      2500
## 38    VVS2     F      3476
## 39    VVS2     G      3845
## 40    VVS2     H      2649
## 41    VVS2     I      2968
## 42    VVS2     J      5142
## 43    VVS1     D      2948
## 44    VVS1     E      2220
## 45    VVS1     F      2804
## 46    VVS1     G      2867
## 47    VVS1     H      1846
## 48    VVS1     I      2035
## 49    VVS1     J      4034
## 50      IF     D      8307
## 51      IF     E      3669
## 52      IF     F      2751
## 53      IF     G      2558
## 54      IF     H      2288
## 55      IF     I      1995
## 56      IF     J      3364

Well, perhaps a bit verbose compared to the use of aggregate, but ddply allows great flexibility!
In order to bring the result into a “pivot table”, we can make use of (a) the xtabs function again (see the sketch after the dcast example below), or (b) the cast function from the reshape2 package (NB: use dcast for data frames):

library(reshape2)
dcast(dmnds.ddply, clarity ~ color, value.var = "PRICEMEAN")

##   clarity    D    E    F    G    H    I    J
## 1      I1 3863 3488 3342 3546 4453 4302 5254
## 2     SI2 3931 4174 4473 5022 6100 7003 6521
## 3     SI1 2976 3162 3714 3775 5032 5355 5186
## 4     VS2 2587 2751 3757 4416 4722 5691 5311
## 5     VS1 3030 2856 3797 4131 3781 4633 4884
## 6    VVS2 3351 2500 3476 3845 2649 2968 5142
## 7    VVS1 2948 2220 2804 2867 1846 2035 4034
## 8      IF 8307 3669 2751 2558 2288 1995 3364

Again, we have a “formula interface” which makes the application straightforward. The parameter “value.var” specifies which column holds the values that fill the table – very helpful when the data frame contains many columns.

Alright, and last but not least a one-liner that achieves the same result, combining the melt function and the dcast function (the melt function is also part of the reshape2 package and converts data frames into long format):

dcast(melt(diamonds, id.vars = c("cut", "color", "clarity"), measure.vars = "price"), 
    formula = clarity ~ color, mean)

##   clarity    D    E    F    G    H    I    J
## 1      I1 3863 3488 3342 3546 4453 4302 5254
## 2     SI2 3931 4174 4473 5022 6100 7003 6521
## 3     SI1 2976 3162 3714 3775 5032 5355 5186
## 4     VS2 2587 2751 3757 4416 4722 5691 5311
## 5     VS1 3030 2856 3797 4131 3781 4633 4884
## 6    VVS2 3351 2500 3476 3845 2649 2968 5142
## 7    VVS1 2948 2220 2804 2867 1846 2035 4034
## 8      IF 8307 3669 2751 2558 2288 1995 3364

So first, the dataset is brought entirely into long format, and then dcast is applied. But this is clearly the most time-consuming way of doing it (in terms of computation time)…

Bottom line:

My favorite approach for not-too-big datasets is the combination of ddply and dcast, since it allows for the most flexibility, I feel. That flexibility comes at the price of a syntax you’ll have to learn. aggregate and xtabs are quickly applied for small problems (maybe also for big ones, but I do not see how to apply them to the more complex tasks that ddply handles).
Additionally, I’ve heard about the data.table package, which seems to be the fastest – a minimal sketch is shown below. Go and check!

References:

mages’ blog on by, apply and friends

Using twitteR to see what the German press secretary tweets about

Find the HTML slides here, and the .Rmd file that was used to generate them here.

For how to deal with .Rmd files, see here.

What this is about

These are my first steps playing around with the interface from R to Twitter, using the twitteR package.

We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, the spokesman of the German government, and run some analyses, such as:

  • What devices does he use to tweet?
  • What is the frequency of his tweets, and on which days was he most active?
  • Is there any particular pattern about when he tweets?
  • What is he tweeting about?

Load data

##   text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher

Which device does he use?

  • Mr. Seibert seems to love his iPad: the vast majority of tweets are issued via Twitter’s iPad app…

Analysis of frequency

  • The tweets are summarized by day, and the bars show the number of tweets per day.
  • There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day.
  • The normal “tweet rate” seems to be between 3 and 10 per day.
  • There are even some “idle” days, where he did not tweet at all – holidays?

At what time of day is he tweeting?

  • The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
  • The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then comes the lunch break, and around 16:00 Mr Seibert gets up to speed again, before his tweeting activity decays. Only at 20:00 is there another small spike. But the afternoon activity is not as pronounced as the morning activity.

What is he tweeting about?

## [1] "bpa" "bundesregierung" "deu" ## [4] "fragreg" "für" "kanzlerin" ## [7] "mehr" "merkel" "neue" ## [10] "über" "uhr" 
## merkel kanzlerin ## 1.00 0.73 
  • The most frequent terms are shown above; each of them occurs more than 30 times.
  • Interestingly, there is no mention of “financial crisis” or other political terms – Mr Seibert seems to focus on announcing “presentations” (präs), press conferences (bpa) or Q&A sessions (fragreg).
  • We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…
  • The dendrogram shows other associations: terms relating to press conferences are close (“fragreg”, “uhr” etc.).
  • Interestingly, the term “DEU” (= Germany) is associated with “mehr” (= more). So whatever is said, there is apparently more of it with Germany 😉

——————————————————–
The Code below:

## @knitr setup
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
library(knitr)   # provides opts_chunk and opts_knit used below
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004...
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[,c(1,4,10)]))

## @knitr device
# Code from vignette of twitteR-package
  sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
  sources <- gsub("", "", sources)
  sources <- strsplit(sources, ">")
  sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
  pie(table(sources))

## @knitr freq
  ggplot() +
 geom_bar(aes(x = created),data=regsprecher.tweets.df,binwidth = 86400.0) +
 scale_y_continuous(name = 'Frequency, # tweets/day') +
 scale_x_datetime(name = 'Date',breaks = date_breaks(),labels = date_format(format = '%Y-%b'))

## @knitr time
plot1 <- ggplot() + geom_point(aes(x=created, y=hour(created)), data=regsprecher.tweets.df, alpha=0.5) +scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
 geom_bar(aes(x = hour(created)),data=regsprecher.tweets.df,binwidth = 1.0) +
 scale_x_continuous(name = 'Hour of day', breaks = seq(0, 24, by = 2), limits = c(0, 24)) +
 scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower) 
# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the document-term matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(inspect(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
fit <- hclust(d, method="ward")
plot(fit) # display the dendrogram


Displaying German stock performance with R using ggplot2

I cannot follow stock market developments daily, so I was looking for a quick overview of what had happened in the last week. What is of interest to me is: “How did German stocks perform over the last 5 trading days, compared to the last 20 trading days and the last 250 trading days?”

R in combination with the right packages delivers a quick answer.

The result is a picture like this:

[Figure: Plot of German stock performance]

Values in the upper-right quadrant stand for shares that showed a positive performance over both the last 5 and the last 20 trading days. The color displays the performance over the last 250 trading days.

We can see that the last week was rather positive: all of the shares except one show a positive performance. The last 20 days were also profitable for most shares.

However, a lot of stocks suffered during the last year (i.e. 250 trading days), which can be seen from the colors – most of them are red. Most notably, CBK.DE (Commerzbank) lost nearly all of its value…

You can find the R code here:

library(quantmod)
library(ggplot2)
library(zoo)
library(PerformanceAnalytics)  # provides Return.calculate() used below

#get list of Symbols for DAX-values
l<- c("^GDAXI", "DB1.DE", "ADS.DE", "ALV.DE", "BAS.DE","BAYN.DE","BEI.DE","BMW.DE","CBK.DE","DAI.DE","DBK.DE","DPW.DE","DTE.DE","EOAN.DE","FME.DE","FRE.DE","HEI.DE","HEN3.DE","IFX.DE","LHA.DE","LIN.DE","MAN.DE","MEO.DE","MUV2.DE","RWE.DE","SAP.DE","SDF.DE","SIE.DE","TKA.DE","VOW3.DE")
getSymbols(l, from="2010-09-01")
l[1] <- "GDAXI"

# Function to extract "adjusted prices" and build dataframe: Thanks to Zach Mayer of moderntoolmaking.blogspot.com
symbolFrame <- function(symbolList) {
  Data <- data.frame(NULL)
  for (S in symbolList) {
    Data <- cbind(Data, Ad(get(S)))   # Ad() extracts the adjusted close prices
  }
  colnames(Data) <- symbolList
  return(Data)
}

Data <- symbolFrame(l[-1]) # build a dataframe without the DAX itself
Data <- cbind(Ad(GDAXI), Data) # add DAX
colnames(Data)[1] <- "DAX"
tail(Data,2) #just to check - often Yahoo is not up to date and there are NAs in the last row
#Data <- window(Data, start=start(Data), end=end(Data)-1) # code to delete last row...

Data.r <- Return.calculate(Data, method="simple")  # calculates the simple returns
Data.r[is.na(Data.r)] <- 0 

# build frames for the respective performances on short, mid and long term
mid.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,20)),1)))
short.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,5)),1)))
long.perf <- as.data.frame(coredata(tail(cumsum(tail(Data.r,250)),1)))

per.df <- data.frame(cbind(t(short.perf), t(mid.perf), t(long.perf)))

colnames(per.df) <- c("short", "mid", "long")
row.names(per.df)[1] <- "DAX"
chart_title <- paste("Performance Comparison DAX values\n(latest data close of ",end(Data),")")
# note: formatter= and opts() reflect the ggplot2 version of the time;
# current ggplot2 uses labels = scales::percent and ggtitle() instead
ggplot(data=per.df, aes(short, mid, label=rownames(per.df))) + geom_point(aes(color=long), size=4) +
  geom_text(hjust=0, vjust=0,size=4) + geom_vline(xintercept=0) + geom_hline(yintercept=0) +
  scale_colour_gradient2(low="red", high="green", "250days\nPerformance") +
  scale_y_continuous("Mid Performance: 20days", formatter="percent") +
  scale_x_continuous("Short Performance: 5days", formatter="percent") +
  opts(title = chart_title)
