R talks to Weka about Data Mining

R provides us with excellent resources to mine data, and there are some good overviews out there. But there are also other tools for data mining, Weka being one of them.

Weka comes with a GUI, can also be driven from the command line via Java, and includes a large variety of algorithms. If, for whatever reason, you do not find the algorithm you need implemented in R, Weka might be the place to go. And the RWeka package marries R and Weka.
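
If you want to follow along, here is a minimal setup sketch (RWeka is on CRAN; note that it wraps Weka's Java classes, so it needs a working Java runtime):

# RWeka is on CRAN; a working Java installation is required
install.packages("RWeka")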

I am an expert neither in R, nor in Weka, nor in data mining. But I happen to play around with them, and I'd like to share a starter on how to work with them. There is good documentation out there (e.g. Open-Source Machine Learning: R Meets Weka or RWeka Odds and Ends), but sometimes you want to document your own steps and ways of working, and this is what I do here.

So, I want to build a classification model for the iris dataset, based on a tree classifier. My choice is the C4.5 algorithm, which I did not find implemented in any standard R package.

UPDATE: “Z” made the comment below that the C5.0 algorithm, Quinlan's follow-up to C4.5, is implemented in the C50 package. And indeed, Max Kuhn has given a nice presentation about it at UseR! 2013, which you can find here.
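
For completeness, this is roughly what the C50 route would look like (a sketch, assuming only the standard formula interface of C5.0()):

# C5.0, Quinlan's follow-up to C4.5, from the C50 package
library(C50)
iris_c50 <- C5.0(Species ~ ., data = iris)
summary(iris_c50)  # prints the tree and the training-set error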

We want to predict the species of a flower based on its attributes, namely sepal and petal width and length. The three species we have are “setosa”, “versicolor” and “virginica”. A short summary of the data is given below.

Prediction with J48 (aka C4.5)

Let's first take a quick look at the data:

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Next, we load the RWeka package:

library(RWeka)

We now build the classifier; this works with the J48() function:

iris_j48 <- J48(Species ~ ., data = iris)
iris_j48

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  5
## 
## Size of the tree :   9

summary(iris_j48)

## 
## === Summary ===
## 
## Correctly Classified Instances         147               98      %
## Incorrectly Classified Instances         3                2      %
## Kappa statistic                          0.97  
## Mean absolute error                      0.0233
## Root mean squared error                  0.108 
## Relative absolute error                  5.2482 %
## Root relative squared error             22.9089 %
## Coverage of cases (0.95 level)          98.6667 %
## Mean rel. region size (0.95 level)      34      %
## Total Number of Instances              150     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  50  0  0 |  a = setosa
##   0 49  1 |  b = versicolor
##   0  2 48 |  c = virginica

# plot(iris_j48) 
library(partykit)
plot(as.party(iris_j48)) # we use the partykit-package for nice plotting.

[Plot of the J48 decision tree for the iris data, drawn with partykit]

We can assign the model to an object; printing the object gives us the tree in Weka output format, summary() gives us the summary of the classification on the training set (again, Weka-style), and plot() plots the tree, while plot(as.party(.)) from the partykit package gives an even nicer plot (thanks to Z for his comment below).
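
The fitted model also works with R's usual predict() mechanism; a small sketch (RWeka's predict method takes a type argument to switch between class labels and class probabilities):

# Predicted classes for the first few observations
predict(iris_j48, newdata = head(iris))

# Class membership probabilities instead of hard labels
predict(iris_j48, newdata = head(iris), type = "probability")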

Evaluation in Weka

Well, we used the whole dataset for training, but we might actually want to perform cross-validation. This can be done with evaluate_Weka_classifier():

eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE, 
    seed = 1, class = TRUE)
eval_j48

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances         144               96      %
## Incorrectly Classified Instances         6                4      %
## Kappa statistic                          0.94  
## Mean absolute error                      0.035 
## Root mean squared error                  0.1586
## Relative absolute error                  7.8705 %
## Root relative squared error             33.6353 %
## Coverage of cases (0.95 level)          96.6667 %
## Mean rel. region size (0.95 level)      33.7778 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
##                  0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
##                  0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
## Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 47  3 |  b = versicolor
##   0  2 48 |  c = virginica

We see slightly worse results now, as you would expect.
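
If you prefer an explicit train/test split over cross-validation, here is a minimal sketch (the split of 100/50 observations is an arbitrary choice for illustration):

# Train on 100 randomly chosen flowers, evaluate on the remaining 50
set.seed(1)
train_idx <- sample(nrow(iris), 100)
fit <- J48(Species ~ ., data = iris[train_idx, ])
table(predicted = predict(fit, newdata = iris[-train_idx, ]),
      actual = iris$Species[-train_idx])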

Using Weka-controls

We used the standard options for the J48 classifier, but Weka allows more. You can access these with the WOW function (the “Weka Option Wizard”):

WOW("J48")

## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##  Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##  Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##  Number of arguments: 1.
## -B      Use binary splits only.
## -S      Don't perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##  Number of arguments: 1.

If, for example, we want a tree with a minimum of 10 instances in each leaf, we change the command as follows:

j48_control <- J48(Species ~ ., data = iris, control = Weka_control(M = 10))
j48_control

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7: versicolor (54.0/5.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  3
## 
## Size of the tree :   5

And you see the tree is different (well, it just does not go as deep as the other one).
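
Other switches from the WOW listing above are passed the same way; for example (just a sketch, combining two flags from the listing):

# An unpruned tree (-U) restricted to binary splits (-B)
j48_unpruned <- J48(Species ~ ., data = iris,
                    control = Weka_control(U = TRUE, B = TRUE))
j48_unpruned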

Building cost-sensitive classifiers

You might want to include a cost matrix, i.e. penalize some wrong classifications more heavily than others, see here. If you think, for example, that wrongly classifying a versicolor is particularly harmful, you can penalize such a classification. This is easily done; you just have to choose a different classifier, namely Weka's cost-sensitive classifier:

csc <- CostSensitiveClassifier(Species ~ ., data = iris, control = Weka_control(`cost-matrix` = matrix(c(0, 
    10, 0, 0, 0, 0, 0, 10, 0), ncol = 3), W = "weka.classifiers.trees.J48", 
    M = TRUE))

But you have to tell the CostSensitiveClassifier that you want to use J48 as the base algorithm (the W argument above), and you have to give it the cost matrix you want to apply, namely a matrix of the form

matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    0    1
## [3,]    0    0    0

where you penalize “versicolor” being falsely classified as one of the others. The matrix above shows the structure with 1s; in the actual call we used a penalty factor of 10.
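
To keep track of which entry penalizes what, it helps to label the matrix (a sketch; to my understanding of Weka's convention, rows index the true class and columns the predicted class, as in the confusion matrix):

# The cost matrix from the call above, with labels for readability
# (rows: true class, columns: predicted class)
cm <- matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3,
             dimnames = list(true = levels(iris$Species),
                             predicted = levels(iris$Species)))
cm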

And again we evaluate on 10-fold CV:

eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE, 
    seed = 1, class = TRUE)
eval_csc

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances          98               65.3333 %
## Incorrectly Classified Instances        52               34.6667 %
## Kappa statistic                          0.48  
## Mean absolute error                      0.2311
## Root mean squared error                  0.4807
## Relative absolute error                 52      %
## Root relative squared error            101.9804 %
## Coverage of cases (0.95 level)          65.3333 %
## Mean rel. region size (0.95 level)      33.3333 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.070    0.875      0.980    0.925      0.887    0.955     0.864     setosa
##                  0.980    0.450    0.521      0.980    0.681      0.517    0.765     0.518     versicolor
##                  0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     virginica
## Weighted Avg.    0.653    0.173    0.465      0.653    0.535      0.468    0.740     0.572     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   1 49  0 |  b = versicolor
##   6 44  0 |  c = virginica

and we see that the “versicolors” are now better predicted (only one wrong, compared to 3 in the plain J48 earlier). But this happened at the expense of “virginica”: looking at the confusion matrix, not a single virginica is classified correctly anymore (44 end up as versicolor and 6 as setosa), compared to only 2 misclassifications before.

Alright, this is just a short starter. I suggest you check out the very good introductions I referred to earlier to explore the full wealth of RWeka… Have fun!


7 Responses to R talks to Weka about Data Mining

  1. Nice article. I believe it is very helpful for people coming to R from WEKA and vice versa.

    May I suggest my webpage on Text Mining with WEKA as a reference? It is: http://www.esp.uem.es/jmgomez/tmweka/

    Some time ago I explained how to use cost-sensitive classification in WEKA to simulate over-sampling and down-sampling for very imbalanced datasets: http://jmgomezhidalgo.blogspot.com.es/2008/03/class-imbalanced-distribution-and-weka.html

    Regards

  2. Lou says:

    Have you used Rapidminer in your work?

  3. Z says:

    Hi, you can get an even nicer plot if you use the as.party function from the partykit package: plot(as.party(.)). Also you might be interested in the C5.0 algorithm, Quinlan’s follow-up to C4.5, that is available in the R package C50.

  4. Freddy says:

    It would be helpful to state how one can use the options of J48 while using cost-sensitive classifiers. I believe it's done using W = "weka.classifiers.trees.J48 -- -M 10", but not 100% sure. I was also wondering what complexity does when evaluating.
