1 Introduction

In this report I use the dataset from a kaggle challenge Forest Cover Type Prediction to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables.

These independent variables were erived from data obtained from the US Geological Survey and USFS.

Six models are involved in this report. They are LDA (Linear Discriminant Analysis), Naive Bayes, kNN (k-NearestNeighbor), Decision Trees, Random Forest and Boosted Trees.

The report can also be used as a step-by-step tutorial for selecting features, cross validation, building models using primary machine learning algorithms and a beginning of kaggle challenges.

2 Preparation

2.1 Loading libraires

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2


Attaching package: 'GGally'

The following object is masked from 'package:dplyr':

    nasa

library(Amelia)

Loading required package: Rcpp

## 
## Amelia II: Multiple Imputation
## (Version 1.7.5, built: 2018-05-07)
## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##

library(scales) # Visualization
library(caTools) # Prediction: Splitting Data
library(car) # Prediction: Checking Multicollinearity

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

library(ROCR) # Prediction: ROC Curve

Loading required package: gplots


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

library(e1071) # Prediction: SVM, Naive Bayes, Parameter Tuning
library(rpart) # Prediction: Decision Tree
library(rpart.plot) # Prediction: Decision Tree
library(randomForest) # Prediction: Random Forest

randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

library(caret) # Prediction: k-Fold Cross Validatio

Loading required package: lattice

library(texreg)

Version:  1.36.23
Date:     2017-03-03
Author:   Philip Leifeld (University of Glasgow)

Please cite the JSS article in your publications -- see citation("texreg").

library(corrplot)

corrplot 0.84 loaded

2.2 Data cleaning and exploration

# loading the data
train <- read.csv(file.choose())
test <- read.csv(file.choose())

# creating a new data set including both train and test sets
whole = bind_rows(train, test)
# checking the info about the whole data
str(whole)

'data.frame':   581012 obs. of  56 variables:
 $ Id                                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Elevation                         : int  2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
 $ Aspect                            : int  51 56 139 155 45 132 45 49 45 59 ...
 $ Slope                             : int  3 2 9 18 2 6 7 4 9 10 ...
 $ Horizontal_Distance_To_Hydrology  : int  258 212 268 242 153 300 270 234 240 247 ...
 $ Vertical_Distance_To_Hydrology    : int  0 -6 65 118 -1 -15 5 7 56 11 ...
 $ Horizontal_Distance_To_Roadways   : int  510 390 3180 3090 391 67 633 573 666 636 ...
 $ Hillshade_9am                     : int  221 220 234 238 220 230 222 222 223 228 ...
 $ Hillshade_Noon                    : int  232 235 238 238 234 237 225 230 221 219 ...
 $ Hillshade_3pm                     : int  148 151 135 122 150 140 138 144 133 124 ...
 $ Horizontal_Distance_To_Fire_Points: int  6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
 $ Wilderness_Area1                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Wilderness_Area2                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Wilderness_Area3                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Wilderness_Area4                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type1                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type2                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type3                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type4                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type5                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type6                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type7                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type8                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type9                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type10                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type11                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type12                       : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Soil_Type13                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type14                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type15                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type16                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type17                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type18                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type19                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type20                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type21                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type22                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type23                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type24                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type25                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type26                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type27                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type28                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type29                       : int  1 1 0 0 1 1 1 1 1 1 ...
 $ Soil_Type30                       : int  0 0 0 1 0 0 0 0 0 0 ...
 $ Soil_Type31                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type32                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type33                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type34                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type35                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type36                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type37                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type38                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type39                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Soil_Type40                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Cover_Type                        : int  5 5 2 2 5 2 5 5 5 5 ...

2.2.1 Missing Data Imputation

# checking missing values
colSums(is.na(whole))

                                Id                          Elevation 
                                 0                                  0 
                            Aspect                              Slope 
                                 0                                  0 
  Horizontal_Distance_To_Hydrology     Vertical_Distance_To_Hydrology 
                                 0                                  0 
   Horizontal_Distance_To_Roadways                      Hillshade_9am 
                                 0                                  0 
                    Hillshade_Noon                      Hillshade_3pm 
                                 0                                  0 
Horizontal_Distance_To_Fire_Points                   Wilderness_Area1 
                                 0                                  0 
                  Wilderness_Area2                   Wilderness_Area3 
                                 0                                  0 
                  Wilderness_Area4                         Soil_Type1 
                                 0                                  0 
                        Soil_Type2                         Soil_Type3 
                                 0                                  0 
                        Soil_Type4                         Soil_Type5 
                                 0                                  0 
                        Soil_Type6                         Soil_Type7 
                                 0                                  0 
                        Soil_Type8                         Soil_Type9 
                                 0                                  0 
                       Soil_Type10                        Soil_Type11 
                                 0                                  0 
                       Soil_Type12                        Soil_Type13 
                                 0                                  0 
                       Soil_Type14                        Soil_Type15 
                                 0                                  0 
                       Soil_Type16                        Soil_Type17 
                                 0                                  0 
                       Soil_Type18                        Soil_Type19 
                                 0                                  0 
                       Soil_Type20                        Soil_Type21 
                                 0                                  0 
                       Soil_Type22                        Soil_Type23 
                                 0                                  0 
                       Soil_Type24                        Soil_Type25 
                                 0                                  0 
                       Soil_Type26                        Soil_Type27 
                                 0                                  0 
                       Soil_Type28                        Soil_Type29 
                                 0                                  0 
                       Soil_Type30                        Soil_Type31 
                                 0                                  0 
                       Soil_Type32                        Soil_Type33 
                                 0                                  0 
                       Soil_Type34                        Soil_Type35 
                                 0                                  0 
                       Soil_Type36                        Soil_Type37 
                                 0                                  0 
                       Soil_Type38                        Soil_Type39 
                                 0                                  0 
                       Soil_Type40                         Cover_Type 
                                 0                             565892

# visualization of missing values
#missmap(whole, main = "Missing values vs observed") running this line will lead to out of memory

Missing values of Covertype all come from the test data so we can just continue without deletion. Hence, it seems that all rows can be used

However, some columns have too many 0, which can result in subset full of 0 in the following split. So we still need to find these columns in advance.

2.2.2 Further cleaning and reorganization

# determing if every value in soil tyoe columns is 0
for (i in which(names(whole) == "Soil_Type1"):55) {print(sum(whole[,i]))}

[1] 3031
[1] 7525
[1] 4823
[1] 12396
[1] 1597
[1] 6575
[1] 105
[1] 179
[1] 1147
[1] 32634
[1] 12410
[1] 29971
[1] 17431
[1] 599
[1] 3
[1] 2845
[1] 3422
[1] 1899
[1] 4021
[1] 9259
[1] 838
[1] 33373
[1] 57752
[1] 21278
[1] 474
[1] 2589
[1] 1086
[1] 946
[1] 115247
[1] 30170
[1] 25666
[1] 52519
[1] 45154
[1] 1611
[1] 1891
[1] 119
[1] 298
[1] 15573
[1] 13806
[1] 8750

# integrating four Wilderness areas into a comprehensive new variable
numArea1 <- which(whole$Wilderness_Area1 == 1)
numArea2 <- which(whole$Wilderness_Area2 == 1)
numArea3 <- which(whole$Wilderness_Area3 == 1)
numArea4 <- which(whole$Wilderness_Area4 == 1)

whole$Wilderness_Area <- "None"

whole[numArea1, which(names(whole) == "Wilderness_Area")] <- 1
whole[numArea2, which(names(whole) == "Wilderness_Area")] <- 2
whole[numArea3, which(names(whole) == "Wilderness_Area")] <- 3
whole[numArea4, which(names(whole) == "Wilderness_Area")] <- 4

# integrating forty soil types into a comprehensive new variable
numSoil1 <- which(whole$Soil_Type1 == 1)
numSoil2 <- which(whole$Soil_Type2 == 1)
numSoil3 <- which(whole$Soil_Type3 == 1)
numSoil4 <- which(whole$Soil_Type4 == 1)
numSoil5 <- which(whole$Soil_Type5 == 1)
numSoil6 <- which(whole$Soil_Type6 == 1)
numSoil7 <- which(whole$Soil_Type7 == 1)
numSoil8 <- which(whole$Soil_Type8 == 1)
numSoil9 <- which(whole$Soil_Type9 == 1)
numSoil10 <- which(whole$Soil_Type10 == 1)
numSoil11 <- which(whole$Soil_Type11 == 1)
numSoil12 <- which(whole$Soil_Type12 == 1)
numSoil13 <- which(whole$Soil_Type13 == 1)
numSoil14 <- which(whole$Soil_Type14 == 1)
numSoil15 <- which(whole$Soil_Type15 == 1)
numSoil16 <- which(whole$Soil_Type16 == 1)
numSoil17 <- which(whole$Soil_Type17 == 1)
numSoil18 <- which(whole$Soil_Type18 == 1)
numSoil19 <- which(whole$Soil_Type19 == 1)
numSoil20 <- which(whole$Soil_Type20 == 1)
numSoil21 <- which(whole$Soil_Type21 == 1)
numSoil22 <- which(whole$Soil_Type22 == 1)
numSoil23 <- which(whole$Soil_Type23 == 1)
numSoil24 <- which(whole$Soil_Type24 == 1)
numSoil25 <- which(whole$Soil_Type25 == 1)
numSoil26 <- which(whole$Soil_Type26 == 1)
numSoil27 <- which(whole$Soil_Type27 == 1)
numSoil28 <- which(whole$Soil_Type28 == 1)
numSoil29 <- which(whole$Soil_Type29 == 1)
numSoil30 <- which(whole$Soil_Type30 == 1)
numSoil31 <- which(whole$Soil_Type31 == 1)
numSoil32 <- which(whole$Soil_Type32 == 1)
numSoil33 <- which(whole$Soil_Type33 == 1)
numSoil34 <- which(whole$Soil_Type34 == 1)
numSoil35 <- which(whole$Soil_Type35 == 1)
numSoil36 <- which(whole$Soil_Type36 == 1)
numSoil37 <- which(whole$Soil_Type37 == 1)
numSoil38 <- which(whole$Soil_Type38 == 1)
numSoil39 <- which(whole$Soil_Type39 == 1)
numSoil40 <- which(whole$Soil_Type40 == 1)

whole$Soil_Type <- 0

whole[numSoil1, which(names(whole) == "Soil_Type")] <- 1
whole[numSoil2, which(names(whole) == "Soil_Type")] <- 2
whole[numSoil3, which(names(whole) == "Soil_Type")] <- 3
whole[numSoil4, which(names(whole) == "Soil_Type")] <- 4
whole[numSoil5, which(names(whole) == "Soil_Type")] <- 5
whole[numSoil6, which(names(whole) == "Soil_Type")] <- 6
whole[numSoil7, which(names(whole) == "Soil_Type")] <- 7
whole[numSoil8, which(names(whole) == "Soil_Type")] <- 8
whole[numSoil9, which(names(whole) == "Soil_Type")] <- 9
whole[numSoil10, which(names(whole) == "Soil_Type")] <- 10
whole[numSoil11, which(names(whole) == "Soil_Type")] <- 11
whole[numSoil12, which(names(whole) == "Soil_Type")] <- 12
whole[numSoil13, which(names(whole) == "Soil_Type")] <- 13
whole[numSoil14, which(names(whole) == "Soil_Type")] <- 14
whole[numSoil15, which(names(whole) == "Soil_Type")] <- 15
whole[numSoil16, which(names(whole) == "Soil_Type")] <- 16
whole[numSoil17, which(names(whole) == "Soil_Type")] <- 17
whole[numSoil18, which(names(whole) == "Soil_Type")] <- 18
whole[numSoil19, which(names(whole) == "Soil_Type")] <- 19
whole[numSoil20, which(names(whole) == "Soil_Type")] <- 20
whole[numSoil21, which(names(whole) == "Soil_Type")] <- 21
whole[numSoil22, which(names(whole) == "Soil_Type")] <- 22
whole[numSoil23, which(names(whole) == "Soil_Type")] <- 23
whole[numSoil24, which(names(whole) == "Soil_Type")] <- 24
whole[numSoil25, which(names(whole) == "Soil_Type")] <- 25
whole[numSoil26, which(names(whole) == "Soil_Type")] <- 26
whole[numSoil27, which(names(whole) == "Soil_Type")] <- 27
whole[numSoil28, which(names(whole) == "Soil_Type")] <- 28
whole[numSoil29, which(names(whole) == "Soil_Type")] <- 29
whole[numSoil30, which(names(whole) == "Soil_Type")] <- 30
whole[numSoil31, which(names(whole) == "Soil_Type")] <- 31
whole[numSoil32, which(names(whole) == "Soil_Type")] <- 32
whole[numSoil33, which(names(whole) == "Soil_Type")] <- 33
whole[numSoil34, which(names(whole) == "Soil_Type")] <- 34
whole[numSoil35, which(names(whole) == "Soil_Type")] <- 35
whole[numSoil36, which(names(whole) == "Soil_Type")] <- 36
whole[numSoil37, which(names(whole) == "Soil_Type")] <- 37
whole[numSoil38, which(names(whole) == "Soil_Type")] <- 38
whole[numSoil39, which(names(whole) == "Soil_Type")] <- 39
whole[numSoil40, which(names(whole) == "Soil_Type")] <- 40

# deleting columns of ID、Wilderness_Area1-4、Soil_Type
train_original = whole[1:15120,
                       c(-which(names(whole) =="Id"),
                         -which(names(whole) =="Wilderness_Area1"),
                         -which(names(whole) =="Wilderness_Area2"),
                         -which(names(whole) =="Wilderness_Area3"),
                         -which(names(whole) =="Wilderness_Area4"),
                         -which(names(whole) =="Soil_Type")
                         )]

test_original = whole[15121:581012,
                      c(-which(names(whole) =="Id"),
                        -which(names(whole) =="Wilderness_Area1"),
                        -which(names(whole) =="Wilderness_Area2"),
                        -which(names(whole) =="Wilderness_Area3"),
                        -which(names(whole) =="Wilderness_Area4"),
                        -which(names(whole) =="Soil_Type")
                        )]

# encoding the categorical features as factors
whole$Wilderness_Area <- factor(whole$Wilderness_Area)
head(whole$Wilderness_Area)

[1] 1 1 1 1 1 1
Levels: 1 2 3 4

whole$Soil_Type <- factor(whole$Soil_Type)
head(whole$Soil_Type)

[1] 29 29 12 30 29 29
40 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 40

3 EDA

str(whole)

ggplot(filter(whole, is.na(Cover_Type)==FALSE), aes(x=Aspect)) + 
    geom_point(aes(y=Hillshade_9am, color="Hillshade_9am"), alpha=.1) +
    geom_point(aes(y=Hillshade_3pm, color="Hillshade_3pm"), alpha=.1)

# Exploratory data analysis on relationship be Cover_Type and Wilderness_Area
ggplot(filter(whole, is.na(Cover_Type)==FALSE),
       aes(Wilderness_Area, Cover_Type)
       ) +
  geom_boxplot(aes(col = Wilderness_Area)
               ) +
  theme_bw() +
  ggtitle("Cover_type based on Wilderness_Area")

#  Exploratory Data Analysis on Cover_Type and Soil_Type
ggplot(filter(whole, is.na(Cover_Type)==FALSE),
       aes(Soil_Type, Cover_Type)
       ) +
  geom_boxplot(aes(col = Soil_Type)) +
  theme_bw() +
  ggtitle("Cover_type based on Soil_Type")

It can be seen that soil type whose index numbers are closing also have similar relationship with cover types.

# checking correlation of numeric variables
train_num = select_if(train, is.numeric)
#correlation matric & shrinking the size of labels
corrplot(cor(train_num),tl.cex=0.5)

Warning in cor(train_num): 标准差为零

Here we find the columns full of 0. We will delete them later.

4 Cross Validation

# splitting the training set into the training set and validation set
set.seed(789)
split = sample.split(train_original$Cover_Type, SplitRatio = 0.8)
train = subset(train_original, split == TRUE)
validation = subset(train_original, split == FALSE)

Another correlation to see the correlation involving categorical variables.

library(polycor)
#hetcor(train)

#cross validation
set.seed(123)
train.control = trainControl(method = "repeatedcv", number =10, repeats=3)

# checking if every value in soil tyoe columns is 0 after split
for (i in which(names(train) == "Soil_Type1"):51) {print(sum(train[,i]))}

# deleting columns full of value 0
train <- train[,c(-which(names(train) == "Soil_Type7"),
                  -which(names(train) == "Soil_Type15")
                  )
               ]
head(train)

  Elevation Aspect Slope Horizontal_Distance_To_Hydrology
1      2596     51     3                              258
2      2590     56     2                              212
3      2804    139     9                              268
4      2785    155    18                              242
5      2595     45     2                              153
6      2579    132     6                              300
  Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
1                              0                             510
2                             -6                             390
3                             65                            3180
4                            118                            3090
5                             -1                             391
6                            -15                              67
  Hillshade_9am Hillshade_Noon Hillshade_3pm
1           221            232           148
2           220            235           151
3           234            238           135
4           238            238           122
5           220            234           150
6           230            237           140
  Horizontal_Distance_To_Fire_Points Soil_Type1 Soil_Type2 Soil_Type3
1                               6279          0          0          0
2                               6225          0          0          0
3                               6121          0          0          0
4                               6211          0          0          0
5                               6172          0          0          0
6                               6031          0          0          0
  Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type8 Soil_Type9 Soil_Type10
1          0          0          0          0          0           0
2          0          0          0          0          0           0
3          0          0          0          0          0           0
4          0          0          0          0          0           0
5          0          0          0          0          0           0
6          0          0          0          0          0           0
  Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type16 Soil_Type17
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           1           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21 Soil_Type22 Soil_Type23
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           0           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27 Soil_Type28 Soil_Type29
1           0           0           0           0           0           1
2           0           0           0           0           0           1
3           0           0           0           0           0           0
4           0           0           0           0           0           0
5           0           0           0           0           0           1
6           0           0           0           0           0           1
  Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35
1           0           0           0           0           0           0
2           0           0           0           0           0           0
3           0           0           0           0           0           0
4           1           0           0           0           0           0
5           0           0           0           0           0           0
6           0           0           0           0           0           0
  Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39 Soil_Type40 Cover_Type
1           0           0           0           0           0          5
2           0           0           0           0           0          5
3           0           0           0           0           0          2
4           0           0           0           0           0          2
5           0           0           0           0           0          5
6           0           0           0           0           0          2
  Wilderness_Area
1               1
2               1
3               1
4               1
5               1
6               1

# checking the correlation
train_num = select_if(train, is.numeric)
corrplot(cor(train_num),tl.cex=0.5)

5 t-test

t.test(Elevation ~ Soil_Type1,
       mu = 0,
       alt = "two.sided",
       conf = 0.95,
       data=train)


    Welch Two Sample t-test

data:  Elevation by Soil_Type1
t = 66.578, df = 409.56, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 587.9997 623.7783
sample estimates:
mean in group 0 mean in group 1 
       2764.166        2158.277

It is self-evident that there is a strong relationship between soil type and elevation.

6 Model building

6.1 LDA

# lda model building and CV
model_lda = train(factor(Cover_Type) ~ .,
                  data=train,
                  method="lda",
                  trControl = train.control
                  )

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold06.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold09.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold04.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold07.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold07.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 17 appears to be constant within groups

Warning in lda.default(x, grouping, ...): variables are collinear

Warning in lda.default(x, grouping, ...): variables are collinear

Warning: model fit failed for Fold10.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
  variable 33 appears to be constant within groups

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
trainInfo, : There were missing values in resampled performance measures.

Warning in lda.default(x, grouping, ...): variables are collinear

summary(model_lda)

            Length Class      Mode     
prior         7    -none-     numeric  
counts        7    -none-     numeric  
means       357    -none-     numeric  
scaling     306    -none-     numeric  
lev           7    -none-     character
svd           6    -none-     numeric  
N             1    -none-     numeric  
call          3    -none-     call     
xNames       51    -none-     character
problemType   1    -none-     character
tuneValue     1    data.frame list     
obsLevels     7    -none-     character
param         0    -none-     list

print(model_lda)

Linear Discriminant Analysis 

12096 samples
   49 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 10886, 10885, 10888, 10887, 10886, 10886, ... 
Resampling results:

  Accuracy   Kappa    
  0.6465129  0.5875977

6.2 Naive Bayes

I use smaller subsets of the original dataset or comment out some lines as the output of html is too too too slow. However, I’ve run a complete model before, so the following comparison is based on that big model.

# make a small subset of train and validation because of the limited computing power
temp_train <- train[1:200,]
validation_temp <- validation[1:50,]

# naive bayes model building and CV
model_nb = train(factor(Cover_Type) ~ .,
                  data=temp_train,
                  method="nb",
                  #trControl = train.control
                  )

6.3 k-NearestNeighbor

# Fitting kNN Model to the Training set
classifier_knn <- train(factor(Cover_Type) ~.,
                        data=train,
                        method="knn",
                        trControl=train.control,
                        tuneGrid = expand.grid(k = 1:20) 
                        # can also use tunelength
                        )

# visualization of kNN model
plot(classifier_knn)

6.4 Decision Tree

6.4.1 Classifier Tree 1

# Fitting Decision Tree Classification Model to the Training set
classifier_tree <- train(factor(Cover_Type) ~.,
                        data = train,
                        method = "rpart",
                        trControl=train.control,
                        tuneLength = 100
                        )

# Tree Visualization
plot(classifier_tree$finalModel)
text(classifier_tree$finalModel)

prp(classifier_tree$finalModel, box.palette = "Reds", tweak = 1.2)

Warning: labs do not fit even at cex 0.15, there may be some overplotting

print(classifier_tree)

CART 

12096 samples
   49 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 10886, 10886, 10888, 10886, 10886, 10886, ... 
Resampling results across tuning parameters:

  cp           Accuracy   Kappa      
  0.000000000  0.7717194  0.733672563
  0.001683502  0.7130987  0.665281903
  0.003367003  0.6594155  0.602650911
  0.005050505  0.6372051  0.576738573
  0.006734007  0.6309780  0.569472556
  0.008417508  0.6283895  0.566452139
  0.010101010  0.6228220  0.559956509
  0.011784512  0.6161248  0.552142184
  0.013468013  0.6143892  0.550117813
  0.015151515  0.6143892  0.550117813
  0.016835017  0.6143892  0.550117813
  0.018518519  0.6143892  0.550117813
  0.020202020  0.6091527  0.544008185
  0.021885522  0.5893140  0.520863516
  0.023569024  0.5792819  0.509156570
  0.025252525  0.5652540  0.492793748
  0.026936027  0.5585867  0.485017890
  0.028619529  0.5585867  0.485017890
  0.030303030  0.5585867  0.485017890
  0.031986532  0.5585867  0.485017890
  0.033670034  0.5585867  0.485017890
  0.035353535  0.5585867  0.485017890
  0.037037037  0.5585867  0.485017890
  0.038720539  0.5585867  0.485017890
  0.040404040  0.5585867  0.485017890
  0.042087542  0.5585867  0.485017890
  0.043771044  0.5585867  0.485017890
  0.045454545  0.5585867  0.485017890
  0.047138047  0.5585867  0.485017890
  0.048821549  0.5585867  0.485017890
  0.050505051  0.5585867  0.485017890
  0.052188552  0.5585867  0.485017890
  0.053872054  0.5585867  0.485017890
  0.055555556  0.5585867  0.485017890
  0.057239057  0.5585867  0.485017890
  0.058922559  0.5585867  0.485017890
  0.060606061  0.5585867  0.485017890
  0.062289562  0.5585867  0.485017890
  0.063973064  0.5585867  0.485017890
  0.065656566  0.5585867  0.485017890
  0.067340067  0.5585867  0.485017890
  0.069023569  0.5585867  0.485017890
  0.070707071  0.5585867  0.485017890
  0.072390572  0.5585867  0.485017890
  0.074074074  0.5585867  0.485017890
  0.075757576  0.5585867  0.485017890
  0.077441077  0.5585867  0.485017890
  0.079124579  0.5585867  0.485017890
  0.080808081  0.5585867  0.485017890
  0.082491582  0.5536787  0.479291553
  0.084175084  0.5346592  0.457101785
  0.085858586  0.5150726  0.434253578
  0.087542088  0.4866347  0.401073020
  0.089225589  0.4839922  0.397990169
  0.090909091  0.4788365  0.391970848
  0.092592593  0.4788365  0.391970848
  0.094276094  0.4788365  0.391970848
  0.095959596  0.4701541  0.381844445
  0.097643098  0.4217501  0.325375967
  0.099326599  0.4155718  0.318163962
  0.101010101  0.4044860  0.305233095
  0.102693603  0.4044860  0.305233095
  0.104377104  0.4044860  0.305233095
  0.106060606  0.4044860  0.305233095
  0.107744108  0.4044860  0.305233095
  0.109427609  0.4044860  0.305233095
  0.111111111  0.4044860  0.305233095
  0.112794613  0.4044860  0.305233095
  0.114478114  0.4044860  0.305233095
  0.116161616  0.4044860  0.305233095
  0.117845118  0.4044860  0.305233095
  0.119528620  0.4044860  0.305233095
  0.121212121  0.4044860  0.305233095
  0.122895623  0.4044860  0.305233095
  0.124579125  0.4044860  0.305233095
  0.126262626  0.4044860  0.305233095
  0.127946128  0.4044860  0.305233095
  0.129629630  0.4044860  0.305233095
  0.131313131  0.4044860  0.305233095
  0.132996633  0.4044860  0.305233095
  0.134680135  0.4044860  0.305233095
  0.136363636  0.4005401  0.300627620
  0.138047138  0.3810510  0.277889517
  0.139730640  0.3294742  0.217720655
  0.141414141  0.3098534  0.194831618
  0.143097643  0.2939547  0.176282453
  0.144781145  0.2857143  0.166666692
  0.146464646  0.2857143  0.166666692
  0.148148148  0.2857143  0.166666692
  0.149831650  0.2857143  0.166666692
  0.151515152  0.2857143  0.166666692
  0.153198653  0.2857143  0.166666692
  0.154882155  0.2857143  0.166666692
  0.156565657  0.2857143  0.166666692
  0.158249158  0.2857143  0.166666692
  0.159932660  0.2857143  0.166666692
  0.161616162  0.2857143  0.166666692
  0.163299663  0.2857143  0.166666692
  0.164983165  0.2857143  0.166666692
  0.166666667  0.1471308  0.005528769

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.

# checking the contribution of each variable
plot(varImp(classifier_tree))

6.4.2 Try to improve the Classifier Tree

train.control = trainControl(method = "repeatedcv",
                             number =10,
                             repeats=1
                             )
classifier_tree2 = train(factor(Cover_Type) ~ Elevation *
                            Horizontal_Distance_To_Roadways +
                            Horizontal_Distance_To_Fire_Points +
                            Horizontal_Distance_To_Hydrology +
                            Vertical_Distance_To_Hydrology +
                            Hillshade_9am +
                            Soil_Type3 +
                            Soil_Type10 +
                            Hillshade_Noon +
                            Hillshade_3pm +
                            Soil_Type38 +
                            Soil_Type39 +
                            Soil_Type4 +
                            Soil_Type40 +
                            Soil_Type12 +
                            Soil_Type32 +
                            Soil_Type30 +
                            Soil_Type29 +
                            Wilderness_Area +
                            Horizontal_Distance_To_Hydrology +
                            Aspect,
                        data = train, method = "rpart",
                        trControl=train.control,
                        tuneLength = 100
                        )

prp(classifier_tree2$finalModel,
    box.palette = "Reds",
    tweak = 1.2
    )

Warning: labs do not fit even at cex 0.15, there may be some overplotting

print(classifier_tree2)

CART 

12096 samples
   20 predictor
    7 classes: '1', '2', '3', '4', '5', '6', '7' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 10885, 10887, 10886, 10886, 10886, 10887, ... 
Resampling results across tuning parameters:

  cp           Accuracy   Kappa    
  0.000000000  0.7628125  0.7232804
  0.001683502  0.7115559  0.6634826
  0.003367003  0.6550894  0.5976026
  0.005050505  0.6449210  0.5857416
  0.006734007  0.6310333  0.5695378
  0.008417508  0.6292977  0.5675129
  0.010101010  0.6221046  0.5591208
  0.011784512  0.6154099  0.5513118
  0.013468013  0.6141696  0.5498629
  0.015151515  0.6141696  0.5498629
  0.016835017  0.6141696  0.5498629
  0.018518519  0.6141696  0.5498629
  0.020202020  0.6095436  0.5444658
  0.021885522  0.5929237  0.5250756
  0.023569024  0.5774631  0.5070431
  0.025252525  0.5631631  0.4903568
  0.026936027  0.5596076  0.4862085
  0.028619529  0.5596076  0.4862085
  0.030303030  0.5596076  0.4862085
  0.031986532  0.5596076  0.4862085
  0.033670034  0.5596076  0.4862085
  0.035353535  0.5596076  0.4862085
  0.037037037  0.5596076  0.4862085
  0.038720539  0.5596076  0.4862085
  0.040404040  0.5596076  0.4862085
  0.042087542  0.5596076  0.4862085
  0.043771044  0.5596076  0.4862085
  0.045454545  0.5596076  0.4862085
  0.047138047  0.5596076  0.4862085
  0.048821549  0.5596076  0.4862085
  0.050505051  0.5596076  0.4862085
  0.052188552  0.5596076  0.4862085
  0.053872054  0.5596076  0.4862085
  0.055555556  0.5596076  0.4862085
  0.057239057  0.5596076  0.4862085
  0.058922559  0.5596076  0.4862085
  0.060606061  0.5596076  0.4862085
  0.062289562  0.5596076  0.4862085
  0.063973064  0.5596076  0.4862085
  0.065656566  0.5596076  0.4862085
  0.067340067  0.5596076  0.4862085
  0.069023569  0.5596076  0.4862085
  0.070707071  0.5596076  0.4862085
  0.072390572  0.5596076  0.4862085
  0.074074074  0.5596076  0.4862085
  0.075757576  0.5596076  0.4862085
  0.077441077  0.5596076  0.4862085
  0.079124579  0.5596076  0.4862085
  0.080808081  0.5596076  0.4862085
  0.082491582  0.5526712  0.4781160
  0.084175084  0.5385272  0.4616040
  0.085858586  0.5228183  0.4432765
  0.087542088  0.4876821  0.4022962
  0.089225589  0.4876821  0.4022962
  0.090909091  0.4876821  0.4022962
  0.092592593  0.4876821  0.4022962
  0.094276094  0.4876821  0.4022962
  0.095959596  0.4702369  0.3819302
  0.097643098  0.4213806  0.3249348
  0.099326599  0.4125303  0.3146066
  0.101010101  0.4045964  0.3053619
  0.102693603  0.4045964  0.3053619
  0.104377104  0.4045964  0.3053619
  0.106060606  0.4045964  0.3053619
  0.107744108  0.4045964  0.3053619
  0.109427609  0.4045964  0.3053619
  0.111111111  0.4045964  0.3053619
  0.112794613  0.4045964  0.3053619
  0.114478114  0.4045964  0.3053619
  0.116161616  0.4045964  0.3053619
  0.117845118  0.4045964  0.3053619
  0.119528620  0.4045964  0.3053619
  0.121212121  0.4045964  0.3053619
  0.122895623  0.4045964  0.3053619
  0.124579125  0.4045964  0.3053619
  0.126262626  0.4045964  0.3053619
  0.127946128  0.4045964  0.3053619
  0.129629630  0.4045964  0.3053619
  0.131313131  0.4045964  0.3053619
  0.132996633  0.4045964  0.3053619
  0.134680135  0.4045964  0.3053619
  0.136363636  0.4045964  0.3053619
  0.138047138  0.3803710  0.2770475
  0.139730640  0.3215085  0.2083673
  0.141414141  0.2857143  0.1666670
  0.143097643  0.2857143  0.1666670
  0.144781145  0.2857143  0.1666670
  0.146464646  0.2857143  0.1666670
  0.148148148  0.2857143  0.1666670
  0.149831650  0.2857143  0.1666670
  0.151515152  0.2857143  0.1666670
  0.153198653  0.2857143  0.1666670
  0.154882155  0.2857143  0.1666670
  0.156565657  0.2857143  0.1666670
  0.158249158  0.2857143  0.1666670
  0.159932660  0.2857143  0.1666670
  0.161616162  0.2857143  0.1666670
  0.163299663  0.2857143  0.1666670
  0.164983165  0.2857143  0.1666670
  0.166666667  0.1422784  0.0000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.

6.5 Random forest

# Fitting Random Forest Classification Model to the Training set
classifier_forest = train(factor(Cover_Type) ~ Elevation +
                            Horizontal_Distance_To_Roadways +
                            Horizontal_Distance_To_Fire_Points +
                            Horizontal_Distance_To_Hydrology +
                            Hillshade_9am +
                            Soil_Type3 +
                            Soil_Type10,
                        data = train,
                        method = "rf")
                        #trControl=train.control)

# Forest Visualization
plot(classifier_forest)

set.seed(567)
classifier = randomForest(factor(Cover_Type) ~ Elevation +
                            Horizontal_Distance_To_Roadways +
                            Horizontal_Distance_To_Fire_Points +
                            Horizontal_Distance_To_Hydrology +
                            Hillshade_9am +
                            Soil_Type3 +
                            Soil_Type10,
                          data = train)

# Choosing the number of trees
plot(classifier)

The Random Forest model still has potential as I only choose 7 important features among about 50 variables because of the limited power of CPU. But it still wins the Decision Tree model.

6.6 Boosted Trees

# Fitting Boosted Trees Classification Model to the Training set
classifier_btree <- train(factor(Cover_Type) ~ Elevation +
                              Horizontal_Distance_To_Roadways +
                              Horizontal_Distance_To_Fire_Points +
                              Horizontal_Distance_To_Hydrology +
                              Hillshade_9am +
                              Soil_Type3 +
                              Soil_Type10,
                        data = temp_train,
                        method = "gbm",
                        trControl=train.control#,
                        #tuneLength = 100
                        )

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.

Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.

6.7 Prediction

6.7.1 On validation set

# prediction
y_pred_lda = predict(model_lda, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_lda) # Confusion matrix

   y_pred_lda
      1   2   3   4   5   6   7
  1 292  81   1   0  15   2  41
  2  99 219  11   0  79  21   3
  3   0   2 228  46  17 139   0
  4   0   0  53 334   0  45   0
  5  11  82  34   0 288  17   0
  6   0  16 106  25  28 257   0
  7  85   0   2   0   1   0 344

error_lda <- mean(validation$Cover_Type != y_pred_lda) # Misclassification error
paste('Accuracy',round(1-error_lda,4))

[1] "Accuracy 0.6488"

# prediction
y_pred_nb = predict(model_nb, newdata = validation_temp)
# Checking the prediction accuracy
table(validation_temp$Cover_Type, y_pred_nb) # Confusion matrix
error_nb <- mean(validation_temp$Cover_Type != y_pred_nb) # Misclassification error
paste('Accuracy',round(1-error_nb,4))

“Accuracy 0.64”

Althou in this demo, the NB model seems so-so, the complete model I run before actually gave a very bad prediction about 0.2. One of the reason may be the NB model assumes independence among variables. This apparently conflicts with our data.

# prediction
y_pred_knn = predict(classifier_knn, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_knn) # Confusion matrix

   y_pred_knn
      1   2   3   4   5   6   7
  1 303  67   0   0  16   1  45
  2  81 262  16   0  41  20  12
  3   0  11 348  17   6  50   0
  4   0   0   3 419   0  10   0
  5   3   9   4   0 414   2   0
  6   1   5  29  19   3 375   0
  7   7   2   0   0   0   0 423

error_knn <- mean(validation$Cover_Type != y_pred_knn) # Misclassification error
paste('Accuracy',round(1-error_knn,4))

[1] "Accuracy 0.8413"

# prediction
y_pred_tree = predict(classifier_tree, newdata = validation)

# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_tree) # Confusion matrix

   y_pred_tree
      1   2   3   4   5   6   7
  1 290  90   1   0   9   3  39
  2  95 238  16   0  56  23   4
  3   0   6 306  27   9  84   0
  4   0   0  15 412   0   5   0
  5   3  26   8   0 387   8   0
  6   1   1 100  29  10 291   0
  7  31   3   0   0   1   0 397

error_tree <- mean(validation$Cover_Type != y_pred_tree) # Misclassification error
paste('Accuracy',round(1-error_tree,4))

[1] "Accuracy 0.7675"

# prediction of tree2
y_pred_tree2 = predict(classifier_tree2, newdata = validation)

# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_tree2) # Confusion matrix

   y_pred_tree2
      1   2   3   4   5   6   7
  1 286 100   1   0  10   4  31
  2  93 233  14   0  63  22   7
  3   2   5 320  19   7  79   0
  4   0   0  21 404   0   7   0
  5   4  26   5   0 390   7   0
  6   2   2  89  30  13 296   0
  7  36   6   0   0   1   0 389

error_tree2 <- mean(validation$Cover_Type != y_pred_tree2) # Misclassification error
paste('Accuracy',round(1-error_tree2,4))

[1] "Accuracy 0.7665"

# prediction on random forest
y_pred_rf = predict(classifier_forest, newdata = validation)

# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_rf) # Confusion matrix

   y_pred_rf
      1   2   3   4   5   6   7
  1 320  58   0   0  11   3  40
  2  92 254  19   0  43  17   7
  3   0  10 352  14   5  51   0
  4   0   0   7 424   0   1   0
  5   1   6   6   0 417   2   0
  6   1   0  49  17   3 362   0
  7  10   1   0   0   2   0 419

error_rf <- mean(validation$Cover_Type != y_pred_rf) # Misclassification error
paste('Accuracy',round(1-error_rf,4))

[1] "Accuracy 0.8426"

y_pred_btree = predict(classifier_btree, newdata = validation_temp)

# Checking the prediction accuracy
table(validation_temp$Cover_Type, y_pred_btree) # Confusion matrix

   y_pred_btree
     1  2  5
  1  5 10  0
  2  5 26  1
  5  0  0  3

error_btree <- mean(validation_temp$Cover_Type != y_pred_btree) # Misclassification error
paste('Accuracy',round(1-error_btree,4))

[1] "Accuracy 0.68"

Random forest combines the output of multiple decision trees which are somehow “biased” in evaluation. Thus the RF model improves a lot on the foundation of Decision Trees.

6.7.2 On test set

To make the output of html faster, I comment out these lines about writing csv files.

test_original$Cover_Type = predict(model_lda, newdata = test_original)
write.csv(test_original, file = "predicted_lda.csv")

6.7.2.1 kaggle score: 0.58346

test_original$Cover_Type = predict(model_nb, newdata = test_original)
write.csv(test_original, file = "predicted_nb.csv")

test_original$Cover_Type = predict(classifier_knn, newdata = test_original)
write.csv(test_original, file = "predicted_knn.csv")

6.7.2.2 kaggle score 0.68723

test_original$Cover_Type = predict(classifier_tree, newdata = test_original)
write.csv(test_original, file = "predicted_decision_tree.csv")

6.7.2.3 kaggle Score 0.63211

test_original$Cover_Type = predict(classifier_forest, newdata = test_original)
write.csv(test_original, file = "predicted_random_forest.csv")

6.7.2.4 kaggle score 0.6884

# boosted trees

There is no kaggle score of boosted trees as the above model is just a demo on a limited dataset because of the lack of computing power on the laptop. However, the performance has already been really good on the subset, which is far beyond a random guess among seven types. Thus, we can confidently estimate that the boosted trees model will also give a good performance with the whole training data.

7 Brief Summary

7.1 A Comparison Table of the Quality of Models

Id	Model	Score
1	LDA	0.58346
2	Naive Bayes	too slow to predict
3	kNN	0.68723
4	Decision Tree	0.63211
5	Random Forest	0.68840
6	Bossted Trees	lack

To sum up, the Random Forest model is the optimal model among all six models. Although some models can get an accuracy around 0.8 on validation set, they cannot go beyond 0.7 when evaluated by kaggle system. This may due to the much larger size of test dataset than the training dataset.

Report about Classification of Forest Cover Type

August 7, 2019