This paper is about the evaluation and comparison of two classification methods as implemented by our friends, the Decision Tree and the Support Vector Machine. We will be seeking a binary classification outcome: predicting whether a patient's stay at a Melbourne hospital is three days or less, or more than three days.
As they say, it's all about the data. In this study we are looking at a data set containing observations of all Emergency Surgery admissions between February 2011 and December 2017 at a Melbourne super-specialty hospital. The data set, much like some of the patients in it, has several issues.
After loading the data set into the usual R-flavored data frame, we check its dimensions.
dim(ER_df)
## [1] 15297 24
The data set contains 15297 observations and 24 features. A quick peek under the hood of ER_df reveals the following horrors.
library(tidyverse)
glimpse(ER_df)
## Observations: 15,297
## Variables: 24
## $ MatchIPM <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1...
## $ Patient_Num <int> 2514, 3507, 1867, 1019, 1276, 3603, 5...
## $ MHAdmissionDate <fct> 19-Mar-10, 8-Feb-11, 19-Feb-11, 20-Fe...
## $ DischargeDate <fct> 18-Jun-11, 8-Mar-11, 25-Feb-11, 21-Fe...
## $ LOS <int> 456, 28, 6, 1, 1, 2, 4, 1, 0, 2, 3, 2...
## $ LOS_cat <int> 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ Gender <fct> M, M, M, M, M, F, F, F, F, F, F, M, M...
## $ AdmissionAge <int> 55, 85, 53, 47, 83, 46, 60, 71, 53, 8...
## $ Age60Plus <int> 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0...
## $ Death <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AdmissionType <fct> Emergency, Emergency, Emergency, Emer...
## $ Outcome <fct> NULL, Rehabilitation, Inpatient, Home...
## $ GenSurg_LOS <int> 0, 28, 6, 1, 1, 2, 4, 1, 0, 2, 3, 2, ...
## $ MH_LOS <int> 456, 28, 6, 1, 1, 2, 4, 1, 0, 2, 3, 2...
## $ DiagnosisText <fct> Constipation, Constipation, Rectal Fo...
## $ DiseaseProcess <fct> NULL, NULL, Trauma, Inflammatory, Inf...
## $ BodySystem <fct> NULL, NULL, Lower GI, Hepatobiliary, ...
## $ Operative <int> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1...
## $ Complications <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Endoscopic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ OperationId <fct> NULL, NULL, NULL, NULL, NULL, 43, 59,...
## $ OperationCategory <fct> NULL, NULL, NULL, NULL, NULL, ABDOMIN...
## $ OperationProcedure <fct> NULL, NULL, NULL, NULL, NULL, Major v...
## $ OperationDescriptionText <fct> NULL, NULL, NULL, NULL, NULL, Para-um...
A closer inspection of those values labelled NULL reveals that they are not true-blue NAs; they are strings. For that reason our trusty old method of detecting and counting NAs fails miserably, as illustrated by the following step:
sapply(ER_df, function(x) sum(is.na(x)))
## MatchIPM Patient_Num MHAdmissionDate
## 0 0 0
## DischargeDate LOS LOS_cat
## 0 0 0
## Gender AdmissionAge Age60Plus
## 0 0 0
## Death AdmissionType Outcome
## 0 0 0
## GenSurg_LOS MH_LOS DiagnosisText
## 0 0 0
## DiseaseProcess BodySystem Operative
## 0 0 0
## Complications Endoscopic OperationId
## 0 0 0
## OperationCategory OperationProcedure OperationDescriptionText
## 0 0 0
Armed with that bit of insight we can now deliberately search for the strings "NULL", "Null" and "", and replace them with proper NAs.
ER_df[ER_df=="NULL"] <- NA
ER_df[ER_df=="Null"] <- NA
ER_df[ER_df==""] <- NA
Running our NA detection code again identifies the true situation regarding NAs in this data set.
sapply(ER_df, function(x) sum(is.na(x)))
## MatchIPM Patient_Num MHAdmissionDate
## 0 0 0
## DischargeDate LOS LOS_cat
## 0 0 0
## Gender AdmissionAge Age60Plus
## 0 0 0
## Death AdmissionType Outcome
## 0 0 10864
## GenSurg_LOS MH_LOS DiagnosisText
## 0 0 0
## DiseaseProcess BodySystem Operative
## 12538 13377 0
## Complications Endoscopic OperationId
## 0 0 10175
## OperationCategory OperationProcedure OperationDescriptionText
## 10175 10175 10175
From the above output we observe that Outcome, OperationId, OperationDescriptionText, OperationProcedure, BodySystem, DiseaseProcess and OperationCategory all contain more than 10000 NAs. Imputing these features would be a waste of time, and they can therefore be omitted from the data set, as they do not contribute any substantial information to the variance of the data.
Without any further ado, let’s get rid of them.
ER_df <- ER_df[,-which(names(ER_df) %in% c("Outcome",
"OperationId",
"OperationDescriptionText",
"OperationProcedure",
"BodySystem",
"DiseaseProcess",
"OperationCategory"))]
Let’s have a look at the number of levels for each of the categorical variables in the remaining data set.
str(ER_df)
## 'data.frame': 15297 obs. of 17 variables:
## $ MatchIPM : int 0 1 1 1 0 1 0 1 1 1 ...
## $ Patient_Num : int 2514 3507 1867 1019 1276 3603 5866 2289 3220 4067 ...
## $ MHAdmissionDate: Factor w/ 2491 levels "1-Apr-11","1-Apr-12",..: 862 2348 835 1001 1001 1001 1001 1082 1082 1082 ...
## $ DischargeDate : Factor w/ 2492 levels "1-Apr-11","1-Apr-12",..: 775 2376 1410 1082 1082 1164 1328 1164 1082 1245 ...
## $ LOS : int 456 28 6 1 1 2 4 1 0 2 ...
## $ LOS_cat : int 1 1 1 0 0 0 1 0 0 0 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 1 1 1 ...
## $ AdmissionAge : int 55 85 53 47 83 46 60 71 53 81 ...
## $ Age60Plus : int 0 1 0 0 1 0 1 1 0 1 ...
## $ Death : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AdmissionType : Factor w/ 4 levels "Day Procedure",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ GenSurg_LOS : int 0 28 6 1 1 2 4 1 0 2 ...
## $ MH_LOS : int 456 28 6 1 1 2 4 1 0 2 ...
## $ DiagnosisText : Factor w/ 3705 levels " Acute appendicitis",..: 938 938 2832 511 690 1476 461 225 259 225 ...
## $ Operative : int 0 0 0 0 0 1 1 0 0 0 ...
## $ Complications : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Endoscopic : int 0 0 0 0 0 0 0 0 0 0 ...
From the 17 features left after the above culling, we observe that MatchIPM and Patient_Num are both indexes, which say nothing of value about the length of stay or the physical state of the patient. The factors MHAdmissionDate, DischargeDate and DiagnosisText all have extraordinarily high numbers of unique values. The number of candidate binary splits for a categorical feature grows exponentially with its number of unique levels (on the order of 2^(N-1), where N is the number of levels), so any categorical feature with more than 52 levels is removed from the data set. I feel obliged to leave in the DischargeDate feature for now, as it might tell us something about the time frame in which the patients were discharged. Therefore, we only get rid of the other two, along with MatchIPM and Patient_Num.
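As a quick sanity check before the cull (a sketch; the printed counts are omitted here), the number of levels per factor can be listed in one go:
# Count the distinct levels of every factor column in ER_df
sapply(Filter(is.factor, ER_df), nlevels)
With those counts confirmed, the actual culling follows.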
ER_df <- ER_df[,-which(names(ER_df) %in% c("MatchIPM",
"Patient_Num",
"MHAdmissionDate",
"DiagnosisText"))]
A quick inspection of GenSurg_LOS (Length of stay directly associated with surgery) reveals the following fly in the ointment.
unique(ER_df$GenSurg_LOS)
## [1] 0 28 6 1 2 4 3 11 7 8 10 5 13 -9
## [15] -11 15 12 9 16 -16 17 30 92 21 22 27 14 94
## [29] 19 18 20 43 73 25 36 23 -2 26 24 32 41 31
## [43] 48 -1 70 44 35 -323 96 -298 -10 29 -3 -5 -101 -14
## [57] -319 -280 -73 -15 -119 -325 37 57 -146 -200 40 53 -23 71
## [71] -336 39 -13 78 33 42 47 50 -30 -81 -40 -25 72 97
## [85] -54 -293 -64 -44 -87 60 -47
Those negative numbers for GenSurg_LOS are disturbing. Obviously something has gone wrong in the data and therefore GenSurg_LOS is unusable (it is impossible to stay -323 days at the hospital; many have tried, but they failed).
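Just to gauge how widespread the problem is (a sketch; the resulting count is not reproduced here):
# Number of admissions with a physically impossible negative surgical length of stay
sum(ER_df$GenSurg_LOS < 0)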
Further, we find that the features LOS and MH_LOS both describe the same thing - length of stay. Length of stay is directly correlated with our anticipated binary response variable, which will indicate a stay at the hospital of three days or less, or more than three days. Therefore, we need to cull both LOS and MH_LOS as well.
ER_df <- ER_df[,-which(names(ER_df) %in% c("GenSurg_LOS",
"LOS",
"MH_LOS"))]
From the DischargeDate we can extract three additional features: Year, Month and Weekday. These might tell us something about levels of service that affect the stay of a patient during specific time frames in a week, month or year. Who knows, let's see.
library(lubridate)
ER_df$DischargeDate <- as.Date(ER_df$DischargeDate, format = "%d-%b-%y")
ER_df$DischargeDate_Year <- as.factor(year(ymd(ER_df$DischargeDate)))
ER_df$DischargeDate_Month <- as.factor(month(ymd(ER_df$DischargeDate)))
ER_df$DischargeDate_Weekday <- as.factor(weekdays(ymd(ER_df$DischargeDate)))
Now we can also discard DischargeDate altogether.
ER_df <- ER_df[,-which(names(ER_df) %in% c("DischargeDate"))]
Upon further inspection we find that a number of the features are actually binary coded categorical variables, such as LOS_cat, Age60Plus, Death, Operative, Complications and Endoscopic. We'll convert these to proper factors.
ER_df$LOS_cat <- as.factor(ER_df$LOS_cat)
ER_df$Age60Plus <- as.factor(ER_df$Age60Plus)
ER_df$Death <- as.factor(ER_df$Death)
ER_df$Operative <- as.factor(ER_df$Operative)
ER_df$Complications <- as.factor(ER_df$Complications)
ER_df$Endoscopic <- as.factor(ER_df$Endoscopic)
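For what it's worth, the same conversion can be written in one pass with dplyr (a sketch, equivalent to the six lines above; it assumes a dplyr version recent enough to provide across()):
# Convert all the binary-coded columns to factors in a single mutate()
ER_df <- ER_df %>%
  mutate(across(c(LOS_cat, Age60Plus, Death, Operative,
                  Complications, Endoscopic), as.factor))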
Lastly, let's re-check for any NAs.
sapply(ER_df, function(x) sum(is.na(x)))
## LOS_cat Gender AdmissionAge
## 0 0 0
## Age60Plus Death AdmissionType
## 0 0 0
## Operative Complications Endoscopic
## 0 0 0
## DischargeDate_Year DischargeDate_Month DischargeDate_Weekday
## 0 0 0
Et voilà! No NAs!
Ok, let’s take a breather, step back and see what our manicured data set looks like now.
str(ER_df)
## 'data.frame': 15297 obs. of 12 variables:
## $ LOS_cat : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 1 1 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 1 1 1 ...
## $ AdmissionAge : int 55 85 53 47 83 46 60 71 53 81 ...
## $ Age60Plus : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 2 1 2 ...
## $ Death : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ AdmissionType : Factor w/ 4 levels "Day Procedure",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Operative : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 1 1 ...
## $ Complications : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Endoscopic : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ DischargeDate_Year : Factor w/ 7 levels "2011","2012",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DischargeDate_Month : Factor w/ 12 levels "1","2","3","4",..: 6 3 2 2 2 2 2 2 2 2 ...
## $ DischargeDate_Weekday: Factor w/ 7 levels "Friday","Monday",..: 3 6 1 2 2 6 5 6 2 7 ...
We now have a tight, clean and compact data set in which every element contributes towards describing the variance in the data. Our final data set consists of one numerical feature, ten categorical features and one categorical response variable. The highest number of distinct levels among the categorical variables is 12. From here we can now start applying our classification methods as promised in the Introduction.
This section reveals my ad hoc approach to applying the two classification methods. First, let's randomly split the data into a training set and a testing set at a 70:30 ratio.
set.seed(1489)
training_ind <- sample(dim(ER_df)[1], size = (0.7*dim(ER_df)[1]))
train_set <- ER_df[ training_ind,];
test_set <- ER_df[-training_ind,];
The 70:30 ratio split results in the following:
dim(train_set)
## [1] 10707 12
dim(test_set)
## [1] 4590 12
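As an aside, since only roughly a quarter of the stays exceed three days, one could also split in a stratified fashion so that both sets keep exactly the same class balance. A sketch, assuming the caret package is available (it is not used elsewhere in this paper):
library(caret)
# createDataPartition samples within each level of LOS_cat, preserving the class ratio
strat_ind <- createDataPartition(ER_df$LOS_cat, p = 0.7, list = FALSE)
train_strat <- ER_df[strat_ind, ]
test_strat  <- ER_df[-strat_ind, ]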
So, at the risk of succumbing to the urge to pander to any of the other vile lushes such as the RandomForest, XGBoost or AdaBoost methods, we turn our attention to the innocent little function called rpart and ask ourselves the question, "Is this the little engine that could?" rpart might not be one of the heavy-weight cannons in the "School of Decision Trees", but it packs a mighty wallop when understood and implemented correctly.
Trawling through the hyper parameters of rpart we realize that some of them can be omitted, such as na.action. This is due to the fact that we took really good care of those little buggers at the beginning of this paper. Another parameter, subset, is also unnecessary, as we took meticulous care to invoke the powers of randomization when splitting the data set into training and testing portions. As we'll be utilizing rpart to solve a classification problem, we need to set the method parameter to "class". In the parms parameter we can specify the splitting index to be either Gini or Information. Through the control parameter we can specify the minsplit and cp sub-parameters. The minsplit sub-parameter specifies the minimum number of observations in a node before attempting a split, and that split must decrease the overall lack of fit by a factor of the value specified by the cp sub-parameter. We specify our response variable to be LOS_cat (the binary categorized length of stay in the hospital) and we set our input data set to be the training set.
The palatability of a decision tree comes down to a combination of how accurate it is and how complex it is. Being accurate is important, but if the tree becomes too complex, then it is difficult to explain what the hell is going on. On the other hand, if a tree is simplistic and easy to explain but very inaccurate, it also leaves us wanting. Therefore, our quest is to search for the ultimate satisfying balance between accuracy and complexity.
The cp and minsplit parameters help us manipulate the complexity of the tree. Whether the resulting decision trees are more or less accurate will be determined in the following few steps. Let's create three trees with various combinations of the cp and minsplit parameters.
library(rpart)
library(rpart.plot)
### Figure 1: tree_large
set.seed(1)
tree_large <- rpart(LOS_cat~.,
data = train_set,
method = "class",
parms = list(split = "gini"),
control = rpart.control(cp = 0.001, minsplit = 2))
rpart.plot(tree_large, box.palette="RdBu", shadow.col="gray", nn=TRUE)
### Figure 2: tree_medium
set.seed(1)
tree_medium <- rpart(LOS_cat~.,
data = train_set,
method = "class",
parms = list(split = "gini"),
control = rpart.control(cp = 0.004, minsplit = 50))
rpart.plot(tree_medium, box.palette="RdBu", shadow.col="gray", nn=TRUE)
### Figure 3: tree_small
set.seed(1)
tree_small <- rpart(LOS_cat~.,
data = train_set,
method = "class",
parms = list(split = "gini"),
control = rpart.control(cp = 0.01, minsplit = 20))
rpart.plot(tree_small, box.palette="RdBu", shadow.col="gray", nn=TRUE)
From the above three trees we can clearly observe that the last decision tree created, tree_small, is by far the easiest to interpret. tree_medium ain't too bad, as it brings another layer of information to the user via the DischargeDate_Weekday splits it adds to the tree. tree_large is too complex and, frankly, printed too small to read.
The buttery goodness of tree_medium will make me side with it. Let’s look at the summary for tree_medium.
printcp(tree_medium)
##
## Classification tree:
## rpart(formula = LOS_cat ~ ., data = train_set, method = "class",
## parms = list(split = "gini"), control = rpart.control(cp = 0.004,
## minsplit = 50))
##
## Variables actually used in tree construction:
## [1] AdmissionAge Complications DischargeDate_Weekday
## [4] DischargeDate_Year Endoscopic Operative
##
## Root node error: 2512/10707 = 0.23461
##
## n= 10707
##
## CP nsplit rel error xerror xstd
## 1 0.0308519 0 1.00000 1.00000 0.017455
## 2 0.0207006 2 0.93830 0.94029 0.017080
## 3 0.0097532 3 0.91760 0.91959 0.016944
## 4 0.0055732 5 0.89809 0.90366 0.016837
## 5 0.0043790 6 0.89252 0.90764 0.016864
## 6 0.0040000 7 0.88814 0.91202 0.016893
The value n = 10707 specifies the number of observations that were used to grow the tree. This is good, as it signifies that the model utilized all of the observations in the training set.
The section labelled "Variables actually used in tree construction" does just that - it lists the variables that made it into the tree. Six variables made the list and I'm particularly pleased to see DischargeDate_Weekday among them. Together with the feature importance plot further down (Figure 5), we can infer that age plays a significant role in hospital stay duration, followed by operative procedures, endoscopic procedures and complications. From the plot of the tree_medium model we can observe that for Wednesdays, Thursdays, Fridays, Saturdays and Sundays there exists a 55% probability that the 1% of all people aged less than 70 who had complications and were not operated upon will have stayed more than 3 days in the hospital by their discharge. That is a crazy stat. All that it tells me is that the people who are discharged before their stay has reached three days are slightly more likely to be discharged on a Monday or Tuesday.
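Reading a rule like that straight off the plot is error-prone; the rpart.plot package ships a helper that prints the rules of a fitted tree as plain text, which is easier to double-check (a sketch; it assumes an rpart.plot version recent enough to include the function, and its output is not reproduced here):
# Print the decision rules of tree_medium, one row per leaf
rpart.rules(tree_medium)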
The behaviour of the cross-validated error as the complexity parameter changes is depicted as follows.
### Figure 4: CP vs. Relative Error
plotcp(tree_medium)
The above plot suggests that about six splits are sufficient to accurately predict our categorical response variable, and that the tree starts to become too complex already around cp = 0.0074.
### Figure 5: Feature Importance
par(mar=c(3, 9, 2, 1))
barplot(sort(tree_medium$variable.importance,decreasing = FALSE),
main = "Feature Importance",
las = 1,
cex.names = .8,
cex.axis = .8,
horiz = TRUE)
Figure 5 above gives us a visual idea of the relative importance of the features. It is interesting to see that all three engineered date features make it onto the plot, although they don't have a significant effect on our outcome.
The complexity table lists the CP parameter at each split of the tree (nsplit). Associated with each split the table lists the re-substitution error (rel error), the cross-validation error (xerror) and its standard error (xstd). We can observe that around split 6 (the chosen tree) a favourable combination of these errors is incurred.
Choosing the model with the lowest re-substitution error does not guarantee the best tree. Larger trees tend to have smaller re-substitution errors, but they also tend to overfit: large trees bake random variation into their predictions as they fit outliers. Nonetheless, the model under inspection does not suffer from that, as it is only a medium-sized tree and has, in any case, a fairly high re-substitution error of 0.89252. The cross-validation error is 0.90764, which indicates that the cross-validation performance is rather poor. However, the standard error, xstd = 0.016864, is fairly low.
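If one did want to let the complexity table dictate the tree size, the usual recipe is to prune back to the CP value with the lowest cross-validated error. A minimal sketch, using the tree_medium fit from above:
# Pick the CP with the smallest xerror from the complexity table and prune to it
best_cp <- tree_medium$cptable[which.min(tree_medium$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_medium, cp = best_cp)
We will stick with tree_medium as grown, but the pruned tree makes a handy benchmark.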
The e1071 library offers our stock-standard Support Vector Machine implementation in the form of the svm function. The svm function offers a dizzying wealth of hyper parameters with which one can go absolutely gang-busters. For this exercise we'll concentrate on only a few.
First off, we'll pin down kernel = "linear". The type parameter defaults to C-classification because our response variable is a binary factor; we also have the option of setting it to either nu-classification or one-classification, but we'll stick to the default.
The next important parameter is the cost parameter, which offers us the option to specify a cost of constraints violation. As the svm documentation suggests, it is the 'C'-constant of the regularization term in the Lagrange formulation. In layman's terms - the C-constant is a trade-off between the training error and the flatness of the solution. The larger the C-constant, the smaller the final training error. But if it is too large we risk losing the generalization properties of the classifier, because it will try to fit all of the training points as well as it can, including possible errors in the data set. A large C value also increases the training time. If C is small, then the classifier is flat. So we have to find a C value that keeps the error small but also generalizes well. The same reasoning applies to the epsilon parameter (epsilon controls the width of the epsilon-insensitive zone used to fit the training data, although strictly it only comes into play for the regression variants of svm). There are several automated ways to "tune" these parameters in order to find the optimal combination, including cross-validation and grid-search functionality. Unfortunately, these methods require large computational power and therefore will not be exercised in this paper. We'll have to come up with a good guesstimate for our combination of C-constant and epsilon values, so we'll set C = 10 and epsilon = 0.1.
library(e1071)
set.seed(1234)
svm_1=svm(LOS_cat~.,
data = train_set,
kernel = "linear",
cost = 10,
epsilon = 0.1,
scale = FALSE,
type = "C-classification")
Let’s have a look at the model summary.
summary(svm_1)
##
## Call:
## svm(formula = LOS_cat ~ ., data = train_set, kernel = "linear",
## cost = 10, epsilon = 0.1, type = "C-classification", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 5634
##
## ( 3221 2413 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
The above summary suggests a staggering 5634 support vectors, which assist in defining the classification hyper plane. Is that good, or is it bad? I don't know. The calculated coefficients of the svm are as follows:
t(svm_1$coefs) %*% svm_1$SV
## GenderF GenderM AdmissionAge Age60Plus1 Death1
## [1,] -8.72652e-06 8.726521e-06 -2.146914e-05 0.0002416007 0.001318514
## AdmissionTypeElective AdmissionTypeEmergency AdmissionTypeTransfer
## [1,] -1.999932 -2.000095 -2.000116
## Operative1 Complications1 Endoscopic1 DischargeDate_Year2012
## [1,] -5.713638e-05 -1.999979 -1.999997 -1.498089e-06
## DischargeDate_Year2013 DischargeDate_Year2014 DischargeDate_Year2015
## [1,] 7.700351e-06 9.798443e-06 1.28131e-06
## DischargeDate_Year2016 DischargeDate_Year2017 DischargeDate_Month2
## [1,] 7.905351e-06 2.814884e-06 1.10199e-05
## DischargeDate_Month3 DischargeDate_Month4 DischargeDate_Month5
## [1,] 2.654063e-06 -1.064811e-05 -8.577233e-07
## DischargeDate_Month6 DischargeDate_Month7 DischargeDate_Month8
## [1,] 1.4201e-05 -2.855224e-06 5.204672e-06
## DischargeDate_Month9 DischargeDate_Month10 DischargeDate_Month11
## [1,] -2.669098e-06 1.019953e-05 1.20847e-05
## DischargeDate_Month12 DischargeDate_WeekdayMonday
## [1,] 1.495901e-05 7.905904e-06
## DischargeDate_WeekdaySaturday DischargeDate_WeekdaySunday
## [1,] 2.884701e-05 3.271069e-05
## DischargeDate_WeekdayThursday DischargeDate_WeekdayTuesday
## [1,] 1.034645e-05 1.029017e-05
## DischargeDate_WeekdayWednesday
## [1,] 3.310945e-06
The listed coefficients suggest that the three admission types, Complications1 and Endoscopic1 are weighted far more heavily than the rest of the features, with Death1 next in line. This correlates reasonably well with the feature importance ranking in Figure 5. The odd features out are AdmissionAge and Age60Plus: they are listed as the most important features in Figure 5, yet carry tiny weights here.
The e1071 package comes equipped with a tuning function, which is aptly called tune. The tune function is actually an automated hyper parameter tuning function as mentioned above. But, due to computational complexity we’ll only utilize it to find the error estimation for C = 10 and epsilon = 0.1.
The following code has been commented out due to the fact that it takes an enormous amount of time to run. The author had to run it as a separate background job in RStudio in order to produce the error estimate quoted below.
#svm_tune = tune(svm,
# LOS_cat ~ . ,
# data = train_set,
# kernel = "linear",
# scale = FALSE,
# type = "C-classification",
# cost = 10,
# epsilon = 0.1)
Let’s see how the tune function performed in general.
#summary(svm_tune)
The other difference between tune and svm is that tune, by default, uses 10-fold cross validation. Nonetheless, the resulting error estimate on the training data is 0.2236845.
An error rate of 0.2236845 for cost = 10 and epsilon = 0.1 is not too bad. Decreasing the cost parameter will include more support vectors, because of the larger margin. This may or may not decrease the error, depending on the effect epsilon = 0.1 has on the support vectors in the region of the hyper plane margin.
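If one later wanted to let the data choose the cost rather than guessing it, the grid-search flavour of the same tune call would look roughly like this (a sketch only, commented out for the same run-time reason as above; the cost grid is purely illustrative):
#svm_grid = tune(svm,
#                LOS_cat ~ . ,
#                data = train_set,
#                kernel = "linear",
#                scale = FALSE,
#                type = "C-classification",
#                ranges = list(cost = c(0.1, 1, 10, 100)))
#summary(svm_grid) # reports the 10-fold cross-validation error per cost value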
Now, let's look at an alternative support vector classifier for which we'll utilize a different kernel. We'll discard the linear kernel and replace it with kernel = "radial".
set.seed(1)
svm_radial <- svm(LOS_cat ~ . ,
data = train_set,
cost = 10,
kernel ='radial',
gamma = 1,
coef0 = 1,
scale = FALSE,
type = 'C-classification',
shrinking = FALSE)
For the radial kernel we also need to specify the gamma parameter (the coef0 value passed above only affects the polynomial and sigmoid kernels, and degree only applies to the polynomial kernel).
summary(svm_radial)
##
## Call:
## svm(formula = LOS_cat ~ ., data = train_set, cost = 10, kernel = "radial",
## gamma = 1, coef0 = 1, type = "C-classification", shrinking = FALSE,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 10128
##
## ( 7638 2490 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_radial_nSV <- svm_radial$tot.nSV # total number of support vectors in the radial fit
The radial kernel produces a hyper plane which is supported by 10128 support vectors. This is a significant increase in support vectors from the linear kernel implementation. The calculated coefficients of the svm with radial kernel are as follows:
t(svm_radial$coefs) %*% svm_radial$SV
## GenderF GenderM AdmissionAge Age60Plus1 Death1
## [1,] -25.57755 25.57755 -22367.23 -432.5214 -6.796783
## AdmissionTypeElective AdmissionTypeEmergency AdmissionTypeTransfer
## [1,] 9.864269 5.380315 -16.76312
## Operative1 Complications1 Endoscopic1 DischargeDate_Year2012
## [1,] -328.7197 -194.4047 -132.3492 -10.99059
## DischargeDate_Year2013 DischargeDate_Year2014 DischargeDate_Year2015
## [1,] -14.22897 55.13576 -3.512147
## DischargeDate_Year2016 DischargeDate_Year2017 DischargeDate_Month2
## [1,] 59.5915 12.22683 22.21022
## DischargeDate_Month3 DischargeDate_Month4 DischargeDate_Month5
## [1,] -29.66518 -11.65264 3.755503
## DischargeDate_Month6 DischargeDate_Month7 DischargeDate_Month8
## [1,] -11.38685 -15.35463 6.676831
## DischargeDate_Month9 DischargeDate_Month10 DischargeDate_Month11
## [1,] -11.30463 21.88923 26.22187
## DischargeDate_Month12 DischargeDate_WeekdayMonday
## [1,] 21.90171 -22.3846
## DischargeDate_WeekdaySaturday DischargeDate_WeekdaySunday
## [1,] 71.18845 83.65958
## DischargeDate_WeekdayThursday DischargeDate_WeekdayTuesday
## [1,] 8.063863 -31.17331
## DischargeDate_WeekdayWednesday
## [1,] -27.99791
The coefficients for all the features differ significantly from the svm with kernel = "linear". With the radial kernel the importance of the features is very similar to the ranking of the features for the decision tree. The big difference in feature importance between the linear and the radial kernel implementations is the weight assigned to the AdmissionAge feature.
We, again, utilize the tune function to determine the performance error on the training data.
set.seed(1)
svm_radial_tune = tune(svm,
LOS_cat ~ . ,
data = train_set,
cost = 10,
kernel ='radial',
gamma = 1,
coef0 = 1,
scale = FALSE,
type = 'C-classification',
shrinking = FALSE)
summary(svm_radial_tune)
##
## Error estimation of 'svm' using 10-fold cross validation: 0.2486216
The error for kernel = "radial", 0.2486216, is slightly worse than for the svm with kernel = "linear". Radial kernel support vector machines are a good approach when the data is not linearly separable. The idea behind generating non-linear decision boundaries is to apply nonlinear transformations to the features, carrying them into a higher dimensional space; this is done implicitly via the kernel trick. As with the linear kernel, there are a number of parameters we can tune to optimise the radial kernel. However, manipulating the feature space into higher dimensions is outside the scope of this paper, but remains an area of interest to the author. Also, the area of research on the relationship between radial basis functions and semi-Lagrangian methods seems to be buzzing.
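For completeness, a fuller tuning pass over both cost and gamma for the radial kernel would look roughly like the following (again a sketch only, commented out because of the run time; the grid values are illustrative):
#svm_radial_grid = tune(svm,
#                       LOS_cat ~ . ,
#                       data = train_set,
#                       kernel = "radial",
#                       scale = FALSE,
#                       type = "C-classification",
#                       ranges = list(cost = c(1, 10, 100),
#                                     gamma = c(0.1, 1, 10)))
#summary(svm_radial_grid)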
First, let’s evaluate the decision tree model.
#PREDICTION
set.seed(1)
pre <- predict(tree_medium, test_set, type="class")
tab <- table(pre, test_set$LOS_cat)
accuracy <- sum(diag(tab))/sum(tab); accuracy
## [1] 0.7869281
error <- 1 - accuracy; error
## [1] 0.2130719
For the medium_tree model the error rate, against the test set, is 0.2130719.
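Since only about a quarter of the stays in the data exceed three days, plain accuracy can flatter a classifier; the class-wise rates are easy to pull from the same confusion table (a sketch, using the tab object computed above; values not reproduced here):
# Rows of tab are predictions, columns are the true LOS_cat labels
sensitivity <- tab["1", "1"] / sum(tab[, "1"]) # long stays correctly flagged
specificity <- tab["0", "0"] / sum(tab[, "0"]) # short stays correctly flagged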
Second, let’s evaluate the first support vector model (kernel = linear, cost = 10, epsilon = 0.1).
#PREDICTION
set.seed(1)
pre_svm <- predict(svm_1, test_set, type="class")
tab_svm <- table(pre_svm, test_set$LOS_cat)
accuracy_svm <- sum(diag(tab_svm))/sum(tab_svm); accuracy_svm
## [1] 0.7740741
error_svm <- 1 - accuracy_svm; error_svm
## [1] 0.2259259
For the svm_1 model the error rate, against the test set, is 0.2259259.
Third, let's evaluate the second support vector model (kernel = radial, cost = 10, gamma = 1).
#PREDICTION
set.seed(1)
pre_svm_2 <- predict(svm_radial, test_set, type="class")
tab_svm_2 <- table(pre_svm_2, test_set$LOS_cat)
accuracy_svm_2 <- sum(diag(tab_svm_2))/sum(tab_svm_2); accuracy_svm_2
## [1] 0.7488017
error_svm_2 <- 1 - accuracy_svm_2; error_svm_2
## [1] 0.2511983
For the svm_radial model the error rate, against the test set, is 0.2511983.
From the three results above we can observe that the decision tree returned the most accurate prediction, with an error of only 0.2130719. In a close second place, the support vector model with kernel = "linear" has a prediction error rate of 0.2259259. The support vector model with kernel = "radial" returned a rather disappointing result, with an error of 0.2511983.
It is a close call between the decision tree and the linear kernel support vector model, but we have to accept the decision tree as the winner. The linear kernel support vector model performed well, so we can now assume that there exists a roughly linear relationship between the response variable and the feature variables; it is presumably for this reason that the radial kernel support vector model performed comparatively poorly. Between the support vector classifier and the decision tree I pick the decision tree method. One of the main reasons, besides the fact that it performed marginally better, is that our data set has a mixture of numerical and categorical features. I feel that decision trees perform better, in general, with categorical variables, whereas support vector classifiers perform better with numerical features.
So, in conclusion, to answer our initial question - yes, rpart is the little engine that could.