1 Introduction

In this report I use the dataset from the Kaggle challenge Forest Cover Type Prediction to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables.

These independent variables were derived from data obtained from the US Geological Survey (USGS) and the US Forest Service (USFS).

Six models are covered in this report: LDA (Linear Discriminant Analysis), Naive Bayes, kNN (k-Nearest Neighbors), Decision Trees, Random Forest and Boosted Trees.

The report can also serve as a step-by-step tutorial on feature selection, cross-validation and model building with common machine learning algorithms, and as a starting point for Kaggle challenges.


2 Preparation

2.1 Loading libraries


The startup messages show that the following packages are attached in this step: dplyr, ggplot2, GGally, Rcpp, Amelia (1.7.5), carData, car, gplots, randomForest (4.6-14), lattice, texreg (1.36.23) and corrplot (0.84).

2.2 Data cleaning and exploration

'data.frame':   581012 obs. of  56 variables:
 $ Id                                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Elevation                         : int  2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
 $ Aspect                            : int  51 56 139 155 45 132 45 49 45 59 ...
 $ Slope                             : int  3 2 9 18 2 6 7 4 9 10 ...
 $ Horizontal_Distance_To_Hydrology  : int  258 212 268 242 153 300 270 234 240 247 ...
 $ Vertical_Distance_To_Hydrology    : int  0 -6 65 118 -1 -15 5 7 56 11 ...
 $ Horizontal_Distance_To_Roadways   : int  510 390 3180 3090 391 67 633 573 666 636 ...
 $ Hillshade_9am                     : int  221 220 234 238 220 230 222 222 223 228 ...
 $ Hillshade_Noon                    : int  232 235 238 238 234 237 225 230 221 219 ...
 $ Hillshade_3pm                     : int  148 151 135 122 150 140 138 144 133 124 ...
 $ Horizontal_Distance_To_Fire_Points: int  6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
 $ Wilderness_Area1                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Wilderness_Area2                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Wilderness_Area3                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Wilderness_Area4                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type1                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type2                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type3                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type4                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type5                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type6                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type7                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type8                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type9                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type10                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type11                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type12                       : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Soil_Type13                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type14                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type15                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type16                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type17                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type18                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type19                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type20                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type21                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type22                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type23                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type24                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type25                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type26                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type27                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type28                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type29                       : int  1 1 0 0 1 1 1 1 1 1 ...
 $ Soil_Type30                       : int  0 0 0 1 0 0 0 0 0 0 ...
 $ Soil_Type31                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type32                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type33                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type34                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type35                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type36                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type37                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type38                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type39                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type40                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Cover_Type                        : int  5 5 2 2 5 2 5 5 5 5 ...

2.2.1 Missing Data Imputation

                                Id                          Elevation 
                                 0                                  0 
                            Aspect                              Slope 
                                 0                                  0 
  Horizontal_Distance_To_Hydrology     Vertical_Distance_To_Hydrology 
                                 0                                  0 
   Horizontal_Distance_To_Roadways                      Hillshade_9am 
                                 0                                  0 
                    Hillshade_Noon                      Hillshade_3pm 
                                 0                                  0 
Horizontal_Distance_To_Fire_Points                   Wilderness_Area1 
                                 0                                  0 
                  Wilderness_Area2                   Wilderness_Area3 
                                 0                                  0 
                  Wilderness_Area4                         Soil_Type1 
                                 0                                  0 
                        Soil_Type2                         Soil_Type3 
                                 0                                  0 
                        Soil_Type4                         Soil_Type5 
                                 0                                  0 
                        Soil_Type6                         Soil_Type7 
                                 0                                  0 
                        Soil_Type8                         Soil_Type9 
                                 0                                  0 
                       Soil_Type10                        Soil_Type11 
                                 0                                  0 
                       Soil_Type12                        Soil_Type13 
                                 0                                  0 
                       Soil_Type14                        Soil_Type15 
                                 0                                  0 
                       Soil_Type16                        Soil_Type17 
                                 0                                  0 
                       Soil_Type18                        Soil_Type19 
                                 0                                  0 
                       Soil_Type20                        Soil_Type21 
                                 0                                  0 
                       Soil_Type22                        Soil_Type23 
                                 0                                  0 
                       Soil_Type24                        Soil_Type25 
                                 0                                  0 
                       Soil_Type26                        Soil_Type27 
                                 0                                  0 
                       Soil_Type28                        Soil_Type29 
                                 0                                  0 
                       Soil_Type30                        Soil_Type31 
                                 0                                  0 
                       Soil_Type32                        Soil_Type33 
                                 0                                  0 
                       Soil_Type34                        Soil_Type35 
                                 0                                  0 
                       Soil_Type36                        Soil_Type37 
                                 0                                  0 
                       Soil_Type38                        Soil_Type39 
                                 0                                  0 
                       Soil_Type40                         Cover_Type 
                                 0                             565892 

All 565892 missing values of Cover_Type come from the test data, whose labels are withheld, so nothing needs to be deleted or imputed and all rows can be used.

However, some columns consist almost entirely of zeros, which can leave a subset with an all-zero column after the data are split. These columns therefore need to be identified in advance.
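
The missing-value table above can be reproduced with a few lines of base R. The following is a minimal sketch on toy data: `toy_train`, `toy_test` and `toy_whole` are hypothetical stand-ins, and it assumes that the combined data frame was built by stacking the training and test sets with the unknown test labels set to NA, which this section implies but does not show.

```r
# Toy stand-ins for the Kaggle training and test sets (hypothetical values).
toy_train <- data.frame(Elevation  = c(2596, 2590, 2804),
                        Slope      = c(3, 2, 9),
                        Cover_Type = c(5, 5, 2))
toy_test <- data.frame(Elevation = c(2785, 2595),
                       Slope     = c(18, 2))

# The test set has no label, so fill it with NA before stacking.
toy_test$Cover_Type <- NA
toy_whole <- rbind(toy_train, toy_test)

# Number of missing values per column.
na_counts <- colSums(is.na(toy_whole))
print(na_counts)

# Check that every missing label belongs to a test row.
all_from_test <- sum(is.na(toy_whole$Cover_Type)) == nrow(toy_test)
print(all_from_test)
```

If the only column with missing values is the label and the count equals the number of test rows, no imputation is needed.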

2.2.2 Further cleaning and reorganization

Number of observations with each soil type indicator equal to 1 (Soil_Type1 to Soil_Type40):

 Soil_Type  Count   Soil_Type  Count   Soil_Type  Count   Soil_Type  Count
         1   3031          11  12410          21    838          31  25666
         2   7525          12  29971          22  33373          32  52519
         3   4823          13  17431          23  57752          33  45154
         4  12396          14    599          24  21278          34   1611
         5   1597          15      3          25    474          35   1891
         6   6575          16   2845          26   2589          36    119
         7    105          17   3422          27   1086          37    298
         8    179          18   1899          28    946          38  15573
         9   1147          19   4021          29 115247          39  13806
        10  32634          20   9259          30  30170          40   8750

# integrating forty soil types into a comprehensive new variable
# (0 is kept for rows where no Soil_Type indicator equals 1)
whole$Soil_Type <- 0

for (i in 1:40) {
  # rows whose i-th soil indicator is set
  numSoil <- which(whole[[paste0("Soil_Type", i)]] == 1)
  whole[numSoil, "Soil_Type"] <- i
}
[1] 1 1 1 1 1 1
Levels: 1 2 3 4
[1] 29 29 12 30 29 29
40 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 40
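
The integration of the indicator columns into a single factor can also be written without a loop. The sketch below is an alternative on toy data, not the code used in this report; it assumes that every row has exactly one indicator set to 1, and the data frame `toy` with its three soil columns is hypothetical.

```r
# Toy one-hot indicators for three hypothetical soil types.
toy <- data.frame(Soil_Type1 = c(1, 0, 0, 0),
                  Soil_Type2 = c(0, 0, 1, 0),
                  Soil_Type3 = c(0, 1, 0, 1))
soil_cols <- paste0("Soil_Type", 1:3)

# max.col() returns, for each row, the index of the column holding the
# largest value; with exactly one 1 per row this is the soil type index.
soil_index <- max.col(as.matrix(toy[soil_cols]), ties.method = "first")

# Store the result as a factor with one level per soil type.
toy$Soil_Type <- factor(soil_index, levels = 1:3)
print(toy$Soil_Type)
```

Rows without any indicator set would silently be mapped to the first column by this approach, so the one-indicator-per-row assumption should be checked first, for example with rowSums().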

3 EDA

Soil types whose index numbers are close to each other also show similar relationships with the cover types.

Warning in cor(train_num): the standard deviation is zero

This warning points to columns that contain only zeros. We will delete them later.
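
Columns of this kind can be located programmatically before the correlation matrix is computed. The following base R sketch uses a toy data frame; `toy_num`, `is_constant` and `constant_cols` are names introduced here for illustration, and the values are made up.

```r
# Toy numeric data; Soil_Type7 is all zeros (hypothetical values).
toy_num <- data.frame(Elevation  = c(2596, 2590, 2804, 2785),
                      Slope      = c(3, 2, 9, 18),
                      Soil_Type7 = c(0, 0, 0, 0))

# A column with a single distinct value has zero standard deviation,
# which is what makes cor() warn and return NA for that column.
is_constant <- vapply(toy_num,
                      function(col) length(unique(col)) == 1,
                      logical(1))
constant_cols <- names(toy_num)[is_constant]
print(constant_cols)

# Drop the constant columns, then compute the correlation matrix.
toy_clean <- toy_num[, !is_constant, drop = FALSE]
cor_matrix <- cor(toy_clean)
print(cor_matrix)
```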


4 Cross Validation

A second correlation analysis, this time including the categorical variables.

library(polycor)
#hetcor(train)
  Elevation Aspect Slope Horizontal_Distance_To_Hydrology
1      2596     51     3                              258
2      2590     56     2                              212
3      2804    139     9                              268
4      2785    155    18                              242
5      2595     45     2                              153
6      2579    132     6                              300
  Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
1                              0                             510
2                             -6                             390
3                             65                            3180
4                            118                            3090
5                             -1                             391
6                            -15                              67
  Hillshade_9am Hillshade_Noon Hillshade_3pm
1           221            232           148
2           220            235           151
3           234            238           135
4           238            238           122
5           220            234           150
6           230            237           140
  Horizontal_Distance_To_Fire_Points Soil_Type1 Soil_Type2 Soil_Type3
1                               6279          0          0          0
2                               6225          0          0          0
3                               6121          0          0          0
4                               6211          0          0          0
5                               6172          0          0          0
6                               6031          0          0          0
  Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type8 Soil_Type9 Soil_Type10
1          0          0          0          0          0           0
2          0          0          0          0          0           0
3          0          0          0          0          0           0
4          0          0          0          0          0           0
5          0          0          0          0          0           0
6          0          0          0          0          0           0
  Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type16 Soil_Type17
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           1           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21 Soil_Type22 Soil_Type23
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           0           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27 Soil_Type28 Soil_Type29
1           0           0           0           0           0           1
2           0           0           0           0           0           1
3           0           0           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           1
6           0           0           0           0           0           1
  Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           0           0           0           0           0
4           1           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39 Soil_Type40 Cover_Type
1           0           0           0           0           0          5
2           0           0           0           0           0          5
3           0           0           0           0           0          2
4           0           0           0           0           0          2
5           0           0           0           0           0          5
6           0           0           0           0           0          2
  Wilderness_Area
1               1
2               1
3               1
4               1
5               1
6               1


5 t-test


    Welch Two Sample t-test

data:  Elevation by Soil_Type1
t = 66.578, df = 409.56, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 587.9997 623.7783
sample estimates:
mean in group 0 mean in group 1 
       2764.166        2158.277 

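The mean elevation is 2764.2 where Soil_Type1 is 0 and 2158.3 where it is 1, a difference of about 606 (95% confidence interval 588.0 to 623.8) with p < 2.2e-16. Elevation therefore differs strongly between observations with and without Soil_Type1, which indicates a clear relationship between soil type and elevation.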
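
The output above comes from a Welch two-sample t-test. A minimal sketch of such a test on simulated data follows; the values are made up and only loosely mimic the group means reported above, and the exact call used in this report is not shown in this section.

```r
set.seed(42)

# Simulated elevations for observations without (0) and with (1) a soil type.
toy <- data.frame(
  Elevation  = c(rnorm(200, mean = 2760, sd = 400),
                 rnorm(50,  mean = 2160, sd = 150)),
  Soil_Type1 = factor(rep(c(0, 1), times = c(200, 50)))
)

# t.test() defaults to var.equal = FALSE, i.e. the Welch test, which does
# not assume equal variances in the two groups.
welch <- t.test(Elevation ~ Soil_Type1, data = toy)
print(welch)

# Difference in group means (group 0 minus group 1) and its 95% interval.
mean_diff <- unname(welch$estimate[1] - welch$estimate[2])
ci <- welch$conf.int
```

The Welch variant is a reasonable default here because the two groups differ greatly in size and may differ in spread.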


6 Model building

6.1 LDA

Warning in lda.default(x, grouping, ...): variables are collinear
Warning: model fit failed for Fold06.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups
Warning: model fit failed for Fold09.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups
Warning: model fit failed for Fold04.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups
Warning: model fit failed for Fold07.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups
Warning: model fit failed for Fold07.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups
Warning: model fit failed for Fold10.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
trainInfo, : There were missing values in resampled performance measures.
            Length Class      Mode     
prior         7    -none-     numeric  
counts        7    -none-     numeric  
means       357    -none-     numeric  
scaling     306    -none-     numeric  
lev           7    -none-     character
svd           6    -none-     numeric  
N             1    -none-     numeric  
call          3    -none-     call     
xNames       51    -none-     character
problemType   1    -none-     character
tuneValue     1    data.frame list     
obsLevels     7    -none-     character
param         0    -none-     list     
Linear Discriminant Analysis 

12096 samples
   49 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 10886, 10885, 10888, 10887, 10886, 10886, ... 
Resampling results:

  Accuracy   Kappa    
  0.6465129  0.5875977
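
Several folds above fail with "variable ... appears to be constant within groups". The sketch below reproduces that error with lda() from the MASS package on toy data and shows one possible workaround, namely dropping predictors that are constant in the training rows of a fold. It assumes that the failures in this report come from sparse indicator columns with no variation inside some folds, which the output suggests but does not confirm; `toy_fold` and its columns are hypothetical.

```r
library(MASS)  # provides lda(); ships with standard R installations

set.seed(1)

# Toy fold: x2 stands for a rare indicator that is all zeros in this fold.
toy_fold <- data.frame(
  y  = factor(rep(c("a", "b"), each = 10)),
  x1 = rnorm(20),
  x2 = rep(0, 20)
)

# lda() stops because x2 has zero within-group variance.
fit_bad <- tryCatch(lda(y ~ ., data = toy_fold), error = function(e) e)
print(inherits(fit_bad, "error"))

# Keep only predictors that take more than one value in this fold.
predictors <- setdiff(names(toy_fold), "y")
keep <- predictors[vapply(toy_fold[predictors],
                          function(col) length(unique(col)) > 1,
                          logical(1))]

# Refit with the remaining predictors; this time the fit succeeds.
fit_ok <- lda(reformulate(keep, response = "y"), data = toy_fold)
print(fit_ok)
```

Because the failed folds contribute no performance values, the reported accuracy is averaged over the successful resamples only.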

6.2 Naive Bayes

To keep the HTML rendering time manageable, I use smaller subsets of the original dataset or comment out some lines. I ran the complete model beforehand, however, so the comparison that follows is based on that full model.

# naive bayes model building; CV is commented out to keep knitting fast
model_nb <- train(factor(Cover_Type) ~ .,
                  data = temp_train,
                  method = "nb"
                  # , trControl = train.control  # uncomment to enable CV
                  )

6.4 Decision Tree

6.4.1 Classifier Tree 1

Warning: labs do not fit even at cex 0.15, there may be some overplotting

CART 

12096 samples
   49 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 10886, 10886, 10888, 10886, 10886, 10886, ... 
Resampling results across tuning parameters:

  cp           Accuracy   Kappa      
  0.000000000  0.7717194  0.733672563
  0.001683502  0.7130987  0.665281903
  0.003367003  0.6594155  0.602650911
  0.005050505  0.6372051  0.576738573
  0.006734007  0.6309780  0.569472556
  0.008417508  0.6283895  0.566452139
  0.010101010  0.6228220  0.559956509
  0.011784512  0.6161248  0.552142184
  0.013468013  0.6143892  0.550117813
  0.015151515  0.6143892  0.550117813
  0.016835017  0.6143892  0.550117813
  0.018518519  0.6143892  0.550117813
  0.020202020  0.6091527  0.544008185
  0.021885522  0.5893140  0.520863516
  0.023569024  0.5792819  0.509156570
  0.025252525  0.5652540  0.492793748
  0.026936027  0.5585867  0.485017890
  0.028619529  0.5585867  0.485017890
  0.030303030  0.5585867  0.485017890
  0.031986532  0.5585867  0.485017890
  0.033670034  0.5585867  0.485017890
  0.035353535  0.5585867  0.485017890
  0.037037037  0.5585867  0.485017890
  0.038720539  0.5585867  0.485017890
  0.040404040  0.5585867  0.485017890
  0.042087542  0.5585867  0.485017890
  0.043771044  0.5585867  0.485017890
  0.045454545  0.5585867  0.485017890
  0.047138047  0.5585867  0.485017890
  0.048821549  0.5585867  0.485017890
  0.050505051  0.5585867  0.485017890
  0.052188552  0.5585867  0.485017890
  0.053872054  0.5585867  0.485017890
  0.055555556  0.5585867  0.485017890
  0.057239057  0.5585867  0.485017890
  0.058922559  0.5585867  0.485017890
  0.060606061  0.5585867  0.485017890
  0.062289562  0.5585867  0.485017890
  0.063973064  0.5585867  0.485017890
  0.065656566  0.5585867  0.485017890
  0.067340067  0.5585867  0.485017890
  0.069023569  0.5585867  0.485017890
  0.070707071  0.5585867  0.485017890
  0.072390572  0.5585867  0.485017890
  0.074074074  0.5585867  0.485017890
  0.075757576  0.5585867  0.485017890
  0.077441077  0.5585867  0.485017890
  0.079124579  0.5585867  0.485017890
  0.080808081  0.5585867  0.485017890
  0.082491582  0.5536787  0.479291553
  0.084175084  0.5346592  0.457101785
  0.085858586  0.5150726  0.434253578
  0.087542088  0.4866347  0.401073020
  0.089225589  0.4839922  0.397990169
  0.090909091  0.4788365  0.391970848
  0.092592593  0.4788365  0.391970848
  0.094276094  0.4788365  0.391970848
  0.095959596  0.4701541  0.381844445
  0.097643098  0.4217501  0.325375967
  0.099326599  0.4155718  0.318163962
  0.101010101  0.4044860  0.305233095
  0.102693603  0.4044860  0.305233095
  0.104377104  0.4044860  0.305233095
  0.106060606  0.4044860  0.305233095
  0.107744108  0.4044860  0.305233095
  0.109427609  0.4044860  0.305233095
  0.111111111  0.4044860  0.305233095
  0.112794613  0.4044860  0.305233095
  0.114478114  0.4044860  0.305233095
  0.116161616  0.4044860  0.305233095
  0.117845118  0.4044860  0.305233095
  0.119528620  0.4044860  0.305233095
  0.121212121  0.4044860  0.305233095
  0.122895623  0.4044860  0.305233095
  0.124579125  0.4044860  0.305233095
  0.126262626  0.4044860  0.305233095
  0.127946128  0.4044860  0.305233095
  0.129629630  0.4044860  0.305233095
  0.131313131  0.4044860  0.305233095
  0.132996633  0.4044860  0.305233095
  0.134680135  0.4044860  0.305233095
  0.136363636  0.4005401  0.300627620
  0.138047138  0.3810510  0.277889517
  0.139730640  0.3294742  0.217720655
  0.141414141  0.3098534  0.194831618
  0.143097643  0.2939547  0.176282453
  0.144781145  0.2857143  0.166666692
  0.146464646  0.2857143  0.166666692
  0.148148148  0.2857143  0.166666692
  0.149831650  0.2857143  0.166666692
  0.151515152  0.2857143  0.166666692
  0.153198653  0.2857143  0.166666692
  0.154882155  0.2857143  0.166666692
  0.156565657  0.2857143  0.166666692
  0.158249158  0.2857143  0.166666692
  0.159932660  0.2857143  0.166666692
  0.161616162  0.2857143  0.166666692
  0.163299663  0.2857143  0.166666692
  0.164983165  0.2857143  0.166666692
  0.166666667  0.1471308  0.005528769

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
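
The table above shows accuracy decreasing as the complexity parameter cp grows, with cp = 0 selected. As a rough illustration of what cp controls, the sketch below fits two trees with rpart on the built-in iris data (a stand-in, not the data of this report) and counts their leaves; `count_leaves` is a helper name introduced here.

```r
library(rpart)  # recommended package that ships with standard R installations

# cp = 0 keeps every split that the other stopping rules allow.
tree_full <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# A larger cp discards splits that do not improve the fit by at least
# that fraction, so the resulting tree is smaller.
tree_small <- rpart(Species ~ ., data = iris, method = "class",
                    control = rpart.control(cp = 0.1))

# Leaves are the rows of the tree frame marked "<leaf>".
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
leaves_full  <- count_leaves(tree_full)
leaves_small <- count_leaves(tree_small)
print(c(full = leaves_full, small = leaves_small))
```

A tree grown with cp = 0 is the most flexible one in the grid, so its cross-validated accuracy is worth comparing against held-out data to check for overfitting.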

6.4.2 Try to improve the Classifier Tree

Warning: labs do not fit even at cex 0.15, there may be some overplotting

CART 

12096 samples
   20 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 10885, 10887, 10886, 10886, 10886, 10887, ... 
Resampling results across tuning parameters:

  cp           Accuracy   Kappa    
  0.000000000  0.7628125  0.7232804
  0.001683502  0.7115559  0.6634826
  0.003367003  0.6550894  0.5976026
  0.005050505  0.6449210  0.5857416
  0.006734007  0.6310333  0.5695378
  0.008417508  0.6292977  0.5675129
  0.010101010  0.6221046  0.5591208
  0.011784512  0.6154099  0.5513118
  0.013468013  0.6141696  0.5498629
  0.015151515  0.6141696  0.5498629
  0.016835017  0.6141696  0.5498629
  0.018518519  0.6141696  0.5498629
  0.020202020  0.6095436  0.5444658
  0.021885522  0.5929237  0.5250756
  0.023569024  0.5774631  0.5070431
  0.025252525  0.5631631  0.4903568
  0.026936027  0.5596076  0.4862085
  0.028619529  0.5596076  0.4862085
  0.030303030  0.5596076  0.4862085
  0.031986532  0.5596076  0.4862085
  0.033670034  0.5596076  0.4862085
  0.035353535  0.5596076  0.4862085
  0.037037037  0.5596076  0.4862085
  0.038720539  0.5596076  0.4862085
  0.040404040  0.5596076  0.4862085
  0.042087542  0.5596076  0.4862085
  0.043771044  0.5596076  0.4862085
  0.045454545  0.5596076  0.4862085
  0.047138047  0.5596076  0.4862085
  0.048821549  0.5596076  0.4862085
  0.050505051  0.5596076  0.4862085
  0.052188552  0.5596076  0.4862085
  0.053872054  0.5596076  0.4862085
  0.055555556  0.5596076  0.4862085
  0.057239057  0.5596076  0.4862085
  0.058922559  0.5596076  0.4862085
  0.060606061  0.5596076  0.4862085
  0.062289562  0.5596076  0.4862085
  0.063973064  0.5596076  0.4862085
  0.065656566  0.5596076  0.4862085
  0.067340067  0.5596076  0.4862085
  0.069023569  0.5596076  0.4862085
  0.070707071  0.5596076  0.4862085
  0.072390572  0.5596076  0.4862085
  0.074074074  0.5596076  0.4862085
  0.075757576  0.5596076  0.4862085
  0.077441077  0.5596076  0.4862085
  0.079124579  0.5596076  0.4862085
  0.080808081  0.5596076  0.4862085
  0.082491582  0.5526712  0.4781160
  0.084175084  0.5385272  0.4616040
  0.085858586  0.5228183  0.4432765
  0.087542088  0.4876821  0.4022962
  0.089225589  0.4876821  0.4022962
  0.090909091  0.4876821  0.4022962
  0.092592593  0.4876821  0.4022962
  0.094276094  0.4876821  0.4022962
  0.095959596  0.4702369  0.3819302
  0.097643098  0.4213806  0.3249348
  0.099326599  0.4125303  0.3146066
  0.101010101  0.4045964  0.3053619
  0.102693603  0.4045964  0.3053619
  0.104377104  0.4045964  0.3053619
  0.106060606  0.4045964  0.3053619
  0.107744108  0.4045964  0.3053619
  0.109427609  0.4045964  0.3053619
  0.111111111  0.4045964  0.3053619
  0.112794613  0.4045964  0.3053619
  0.114478114  0.4045964  0.3053619
  0.116161616  0.4045964  0.3053619
  0.117845118  0.4045964  0.3053619
  0.119528620  0.4045964  0.3053619
  0.121212121  0.4045964  0.3053619
  0.122895623  0.4045964  0.3053619
  0.124579125  0.4045964  0.3053619
  0.126262626  0.4045964  0.3053619
  0.127946128  0.4045964  0.3053619
  0.129629630  0.4045964  0.3053619
  0.131313131  0.4045964  0.3053619
  0.132996633  0.4045964  0.3053619
  0.134680135  0.4045964  0.3053619
  0.136363636  0.4045964  0.3053619
  0.138047138  0.3803710  0.2770475
  0.139730640  0.3215085  0.2083673
  0.141414141  0.2857143  0.1666670
  0.143097643  0.2857143  0.1666670
  0.144781145  0.2857143  0.1666670
  0.146464646  0.2857143  0.1666670
  0.148148148  0.2857143  0.1666670
  0.149831650  0.2857143  0.1666670
  0.151515152  0.2857143  0.1666670
  0.153198653  0.2857143  0.1666670
  0.154882155  0.2857143  0.1666670
  0.156565657  0.2857143  0.1666670
  0.158249158  0.2857143  0.1666670
  0.159932660  0.2857143  0.1666670
  0.161616162  0.2857143  0.1666670
  0.163299663  0.2857143  0.1666670
  0.164983165  0.2857143  0.1666670
  0.166666667  0.1422784  0.0000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.

6.6 Boosted Trees

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.
Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

6.7 Prediction

6.7.1 On validation set

   y_pred_lda
      1   2   3   4   5   6   7
  1 292  81   1   0  15   2  41
  2  99 219  11   0  79  21   3
  3   0   2 228  46  17 139   0
  4   0   0  53 334   0  45   0
  5  11  82  34   0 288  17   0
  6   0  16 106  25  28 257   0
  7  85   0   2   0   1   0 344
[1] "Accuracy 0.6488"
# prediction
y_pred_nb = predict(model_nb, newdata = validation_temp)
# Checking the prediction accuracy
table(validation_temp$Cover_Type, y_pred_nb) # Confusion matrix
error_nb <- mean(validation_temp$Cover_Type != y_pred_nb) # Misclassification error
paste('Accuracy',round(1-error_nb,4))

“Accuracy 0.64”
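The accuracy printed by these chunks is one minus the misclassification error, which is the same as the diagonal of the confusion matrix divided by its total. As a minimal sketch, the LDA matrix shown above can be re-entered by hand to check the reported value and to see which classes are hardest. The names `cm_lda`, `accuracy_lda` and `recall_lda` are made up for this illustration and are not objects from the report.

```r
# LDA confusion matrix copied from the validation output above
# (rows: true Cover_Type, columns: predicted class).
cm_lda <- matrix(
  c(292,  81,   1,   0,  15,   2,  41,
     99, 219,  11,   0,  79,  21,   3,
      0,   2, 228,  46,  17, 139,   0,
      0,   0,  53, 334,   0,  45,   0,
     11,  82,  34,   0, 288,  17,   0,
      0,  16, 106,  25,  28, 257,   0,
     85,   0,   2,   0,   1,   0, 344),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = 1:7, predicted = 1:7)
)

# Overall accuracy: correctly classified cases (the diagonal) over all cases.
accuracy_lda <- sum(diag(cm_lda)) / sum(cm_lda)

# Per-class recall: share of each true class that was predicted correctly.
recall_lda <- diag(cm_lda) / rowSums(cm_lda)

print(round(accuracy_lda, 4))  # 0.6488, the value reported for LDA
print(round(recall_lda, 3))
```

The per-class view makes the weak spots visible: for LDA, true class 3 is predicted as class 6 in 139 cases, which a single accuracy figure hides.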

Although the NB model looks mediocre in this demo, the complete model I ran earlier gave a much worse prediction accuracy of about 0.2. One reason may be that the NB model assumes the predictors are independent of one another, which clearly conflicts with our data.
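One rough way to see whether the independence assumption is plausible is to look for strongly correlated predictor pairs. The sketch below uses simulated data with hypothetical columns `x1`, `x2` and `x3` rather than the forest cover variables, and it checks plain pairwise correlation only. Strictly speaking, Naive Bayes assumes independence within each class, so this is a proxy and not a formal test.

```r
set.seed(1)
n <- 500
x1 <- rnorm(n)

# Simulated predictors (hypothetical, not the report's data).
toy <- data.frame(
  x1 = x1,
  x2 = -0.8 * x1 + rnorm(n, sd = 0.3),  # strongly related to x1
  x3 = rnorm(n)                          # unrelated to the others
)

# Pairwise Pearson correlations between all predictors.
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) if its |r| exceeds an arbitrary 0.5.
high <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(pairs_df)
```

Applied to the real predictors, many flagged pairs would suggest that the independence assumption is being violated.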

   y_pred_knn
      1   2   3   4   5   6   7
  1 303  67   0   0  16   1  45
  2  81 262  16   0  41  20  12
  3   0  11 348  17   6  50   0
  4   0   0   3 419   0  10   0
  5   3   9   4   0 414   2   0
  6   1   5  29  19   3 375   0
  7   7   2   0   0   0   0 423
[1] "Accuracy 0.8413"
   y_pred_tree
      1   2   3   4   5   6   7
  1 290  90   1   0   9   3  39
  2  95 238  16   0  56  23   4
  3   0   6 306  27   9  84   0
  4   0   0  15 412   0   5   0
  5   3  26   8   0 387   8   0
  6   1   1 100  29  10 291   0
  7  31   3   0   0   1   0 397
[1] "Accuracy 0.7675"
   y_pred_tree2
      1   2   3   4   5   6   7
  1 286 100   1   0  10   4  31
  2  93 233  14   0  63  22   7
  3   2   5 320  19   7  79   0
  4   0   0  21 404   0   7   0
  5   4  26   5   0 390   7   0
  6   2   2  89  30  13 296   0
  7  36   6   0   0   1   0 389
[1] "Accuracy 0.7665"
   y_pred_rf
      1   2   3   4   5   6   7
  1 320  58   0   0  11   3  40
  2  92 254  19   0  43  17   7
  3   0  10 352  14   5  51   0
  4   0   0   7 424   0   1   0
  5   1   6   6   0 417   2   0
  6   1   0  49  17   3 362   0
  7  10   1   0   0   2   0 419
[1] "Accuracy 0.8426"
   y_pred_btree
     1  2  5
  1  5 10  0
  2  5 26  1
  5  0  0  3
[1] "Accuracy 0.68"

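Random forest combines the outputs of many decision trees, each of which is somewhat “biased” when evaluated on its own. Combining them cancels out much of the individual error, so the RF model improves considerably on a single decision tree (0.8426 versus 0.7675 accuracy on the validation set).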
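The combining step can be illustrated with a toy majority vote. The predictions below are invented for three hypothetical trees and are deliberately arranged so that the trees err on different observations. A real random forest also grows each tree on a bootstrap sample with random feature subsets, which this sketch leaves out.

```r
# Hypothetical class predictions of three trees for six observations.
tree_votes <- cbind(
  tree1 = c(1, 2, 3, 3, 2, 7),
  tree2 = c(1, 2, 6, 3, 5, 1),
  tree3 = c(2, 2, 3, 6, 5, 7)
)
truth <- c(1, 2, 3, 3, 5, 7)

# Return the most frequent class among the votes for one observation.
majority_vote <- function(votes) {
  counts <- table(votes)
  as.integer(names(counts)[which.max(counts)])
}

# One combined prediction per row (observation).
forest_pred <- apply(tree_votes, 1, majority_vote)

tree_acc <- colMeans(tree_votes == truth)  # accuracy of each single tree
forest_acc <- mean(forest_pred == truth)   # accuracy of the vote

print(round(tree_acc, 3))
print(forest_acc)
```

Each tree is wrong on one or two observations, but never the same ones as the other two, so the vote recovers the correct class every time. The benefit shrinks when the trees make the same mistakes.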

6.7.2 On test set

To make the HTML output render faster, I commented out the lines below that write the CSV files.

test_original$Cover_Type = predict(model_lda, newdata = test_original)
write.csv(test_original, file = "predicted_lda.csv")

6.7.2.1 Kaggle score (LDA): 0.58346

test_original$Cover_Type = predict(model_nb, newdata = test_original)
write.csv(test_original, file = "predicted_nb.csv")
test_original$Cover_Type = predict(classifier_knn, newdata = test_original)
write.csv(test_original, file = "predicted_knn.csv")

6.7.2.2 Kaggle score (kNN): 0.68723

test_original$Cover_Type = predict(classifier_tree, newdata = test_original)
write.csv(test_original, file = "predicted_decision_tree.csv")

6.7.2.3 Kaggle score (Decision Tree): 0.63211

test_original$Cover_Type = predict(classifier_forest, newdata = test_original)
write.csv(test_original, file = "predicted_random_forest.csv")

6.7.2.4 Kaggle score (Random Forest): 0.68840
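Instead of commenting the CSV lines in and out, the writes could be guarded by a single switch. The following is only a sketch: `write_submissions` and `write_submission()` are hypothetical names, the toy ids and predictions stand in for `test_original` and a fitted model, and the two-column `Id`, `Cover_Type` layout is an assumption about the submission file.

```r
write_submissions <- FALSE  # hypothetical switch: set to TRUE to write files

# Hypothetical helper: keep only the id and the predicted class, and skip
# the row-name column that write.csv() adds by default.
write_submission <- function(ids, predictions, file,
                             enabled = write_submissions) {
  submission <- data.frame(Id = ids, Cover_Type = predictions)
  if (enabled) {
    write.csv(submission, file = file, row.names = FALSE)
  }
  invisible(submission)
}

# Toy stand-ins for the test ids and a model's predictions.
toy_ids <- 1:5
toy_pred <- factor(c(1, 2, 2, 5, 7), levels = 1:7)

out_file <- tempfile(fileext = ".csv")
write_submission(toy_ids, toy_pred, out_file, enabled = TRUE)
print(read.csv(out_file))
```

With the switch left at `FALSE`, knitting the report skips every write while the prediction code stays in place.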

There is no Kaggle score for boosted trees, because the model above is only a demo fitted on a limited subset; my laptop lacks the computing power to train it on the full data. Even so, its accuracy of 0.68 on that subset is far above a random guess among seven types (about 0.14). We can therefore expect the boosted trees model to perform well when trained on the whole training set.
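Because the boosted-trees accuracy rests on a small validation subset, it helps to put an interval around it before comparing it with the random-guess baseline. A minimal sketch using the counts from the `y_pred_btree` confusion matrix above and base R's `binom.test()`; the variable names are chosen for this illustration.

```r
# Counts read from the boosted-trees confusion matrix above.
correct <- 5 + 26 + 3                              # diagonal entries
total   <- 5 + 10 + 0 + 5 + 26 + 1 + 0 + 0 + 3     # all entries

accuracy <- correct / total  # 0.68, as reported
baseline <- 1 / 7            # uniform random guess among seven types

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

print(accuracy)
print(round(ci, 3))
print(ci[1] > baseline)  # is even the lower bound above random guessing?
```

With so few validation cases the interval is wide, so the estimate supports "clearly better than guessing" more firmly than any specific accuracy value.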

7 Brief Summary

7.1 A Comparison Table of the Quality of Models

Id  Model          Score
1   LDA            0.58346
2   Naive Bayes    not available (too slow to predict)
3   kNN            0.68723
4   Decision Tree  0.63211
5   Random Forest  0.68840
6   Boosted Trees  not available (demo on a subset only)

To sum up, the Random Forest model achieves the best Kaggle score among the six models, with kNN a close second. Although some models reach an accuracy above 0.8 on the validation set, none goes beyond 0.7 when evaluated by the Kaggle system. This may be due to the test dataset being much larger than the training dataset.
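The gap between validation accuracy and Kaggle score can be tabulated directly from the numbers reported in this section. A small sketch, assuming the Decision Tree validation figure is the one printed for `y_pred_tree` (0.7675) rather than `y_pred_tree2`; the data frame `scores` is built here for illustration only.

```r
# Validation accuracies and Kaggle scores as reported above.
scores <- data.frame(
  model      = c("LDA", "kNN", "Decision Tree", "Random Forest"),
  validation = c(0.6488, 0.8413, 0.7675, 0.8426),
  kaggle     = c(0.58346, 0.68723, 0.63211, 0.68840)
)

# How much each model loses between the validation set and Kaggle.
scores$gap <- scores$validation - scores$kaggle

# Rank by Kaggle score, best first.
scores <- scores[order(-scores$kaggle), ]
print(scores)
```

Every model loses accuracy on the Kaggle test set, and the two best models (Random Forest and kNN) lose the most, roughly 0.15 each.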