In this report I use the dataset from the Kaggle challenge Forest Cover Type Prediction to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables.
These independent variables were derived from data obtained from the US Geological Survey and the USFS.
Six models are compared in this report: LDA (Linear Discriminant Analysis), Naive Bayes, kNN (k-Nearest Neighbor), Decision Trees, Random Forest, and Boosted Trees.
The report can also serve as a step-by-step tutorial on feature selection, cross validation, and model building with the main machine learning algorithms, as well as a starting point for Kaggle challenges.
library(dplyr) # Data Manipulation
library(ggplot2) # Visualization
library(GGally) # Visualization: Pairs Plots
library(Amelia) # Missing Data: Missing Map
library(scales) # Visualization
library(caTools) # Prediction: Splitting Data
library(car) # Prediction: Checking Multicollinearity
library(e1071) # Prediction: SVM, Naive Bayes, Parameter Tuning
library(rpart) # Prediction: Decision Tree
library(rpart.plot) # Prediction: Decision Tree
library(randomForest) # Prediction: Random Forest
library(caret) # Prediction: Cross Validation, Model Training
library(texreg) # Reporting: Regression Tables
library(corrplot) # Visualization: Correlation Matrix
# creating a new data set including both train and test sets
whole = bind_rows(train, test)
# checking the info about the whole data
str(whole)
'data.frame': 581012 obs. of 56 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Elevation : int 2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
$ Aspect : int 51 56 139 155 45 132 45 49 45 59 ...
$ Slope : int 3 2 9 18 2 6 7 4 9 10 ...
$ Horizontal_Distance_To_Hydrology : int 258 212 268 242 153 300 270 234 240 247 ...
$ Vertical_Distance_To_Hydrology : int 0 -6 65 118 -1 -15 5 7 56 11 ...
$ Horizontal_Distance_To_Roadways : int 510 390 3180 3090 391 67 633 573 666 636 ...
$ Hillshade_9am : int 221 220 234 238 220 230 222 222 223 228 ...
$ Hillshade_Noon : int 232 235 238 238 234 237 225 230 221 219 ...
$ Hillshade_3pm : int 148 151 135 122 150 140 138 144 133 124 ...
$ Horizontal_Distance_To_Fire_Points: int 6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
$ Wilderness_Area1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ Wilderness_Area2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wilderness_Area3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wilderness_Area4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type7 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type8 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type9 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type10 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type11 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type12 : int 0 0 1 0 0 0 0 0 0 0 ...
$ Soil_Type13 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type14 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type15 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type16 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type17 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type18 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type19 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type20 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type21 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type22 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type23 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type24 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type25 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type26 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type27 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type28 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type29 : int 1 1 0 0 1 1 1 1 1 1 ...
$ Soil_Type30 : int 0 0 0 1 0 0 0 0 0 0 ...
$ Soil_Type31 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type32 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type33 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type34 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type35 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type36 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type37 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type38 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type39 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Soil_Type40 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Cover_Type : int 5 5 2 2 5 2 5 5 5 5 ...
Counting missing values per column shows that every variable has 0 NAs except Cover_Type, which is missing for 565892 rows (the unlabeled test set).
# visualization of missing values
# missmap(whole, main = "Missing values vs observed") # running this on the full data runs out of memory
All missing values of Cover_Type come from the test data, so we can continue without deleting any rows.
However, some dummy columns contain very few 1s, which can leave a subset entirely 0 after the split below. We therefore need to identify these columns in advance.
# determining whether every value in a Soil_Type column is 0
for (i in which(names(whole) == "Soil_Type1"):55) {print(sum(whole[,i]))}
[1] 3031
[1] 7525
[1] 4823
[1] 12396
[1] 1597
[1] 6575
[1] 105
[1] 179
[1] 1147
[1] 32634
[1] 12410
[1] 29971
[1] 17431
[1] 599
[1] 3
[1] 2845
[1] 3422
[1] 1899
[1] 4021
[1] 9259
[1] 838
[1] 33373
[1] 57752
[1] 21278
[1] 474
[1] 2589
[1] 1086
[1] 946
[1] 115247
[1] 30170
[1] 25666
[1] 52519
[1] 45154
[1] 1611
[1] 1891
[1] 119
[1] 298
[1] 15573
[1] 13806
[1] 8750
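A vectorized alternative to the loop above is `colSums()`, which totals every column in one call; columns summing to 0 contain no observations. A minimal sketch on a toy data frame (the column names mirror the real data, but the values are made up):

```r
# toy stand-in for a block of dummy columns
soil <- data.frame(
  Soil_Type1 = c(1, 0, 0, 1),
  Soil_Type2 = c(0, 1, 0, 0),
  Soil_Type3 = c(0, 0, 0, 0)  # an all-zero column
)

totals <- colSums(soil)               # one pass over all columns
empty  <- names(totals)[totals == 0]  # names of columns that are entirely 0
print(totals)
print(empty)                          # "Soil_Type3"
```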
# integrating the four Wilderness_Area dummies into a comprehensive new variable
whole$Wilderness_Area <- "None"
for (i in 1:4) {
  whole$Wilderness_Area[whole[[paste0("Wilderness_Area", i)]] == 1] <- i
}
# integrating the forty Soil_Type dummies into a comprehensive new variable
whole$Soil_Type <- 0
for (i in 1:40) {
  whole$Soil_Type[whole[[paste0("Soil_Type", i)]] == 1] <- i
}
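The one-hot integration above can also be done in a single call with `max.col()`, which returns the column index of the maximum in each row — for a one-hot block, the index of the 1. A sketch on toy data (column names are illustrative; rows must contain exactly one 1, otherwise ties are broken at random):

```r
# toy one-hot block: each row flags exactly one area
dummies <- data.frame(
  Wilderness_Area1 = c(1, 0, 0),
  Wilderness_Area2 = c(0, 0, 1),
  Wilderness_Area3 = c(0, 1, 0),
  Wilderness_Area4 = c(0, 0, 0)
)

# column index of the 1 in each row, i.e. the area number
area <- max.col(dummies)
print(area)  # 1 3 2
```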
# dropping Id, the Wilderness_Area1-4 dummies, and the combined Soil_Type column (the Soil_Type1-40 dummies are kept)
train_original = whole[1:15120,
c(-which(names(whole) =="Id"),
-which(names(whole) =="Wilderness_Area1"),
-which(names(whole) =="Wilderness_Area2"),
-which(names(whole) =="Wilderness_Area3"),
-which(names(whole) =="Wilderness_Area4"),
-which(names(whole) =="Soil_Type")
)]
test_original = whole[15121:581012,
c(-which(names(whole) =="Id"),
-which(names(whole) =="Wilderness_Area1"),
-which(names(whole) =="Wilderness_Area2"),
-which(names(whole) =="Wilderness_Area3"),
-which(names(whole) =="Wilderness_Area4"),
-which(names(whole) =="Soil_Type")
)]
# encoding the categorical features as factors
whole$Wilderness_Area <- factor(whole$Wilderness_Area)
head(whole$Wilderness_Area)
[1] 1 1 1 1 1 1
Levels: 1 2 3 4
whole$Soil_Type <- factor(whole$Soil_Type)
head(whole$Soil_Type)
[1] 29 29 12 30 29 29
40 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 40
ggplot(filter(whole, is.na(Cover_Type)==FALSE), aes(x=Aspect)) +
geom_point(aes(y=Hillshade_9am, color="Hillshade_9am"), alpha=.1) +
geom_point(aes(y=Hillshade_3pm, color="Hillshade_3pm"), alpha=.1)
# Exploratory data analysis on the relationship between Cover_Type and Wilderness_Area
ggplot(filter(whole, is.na(Cover_Type)==FALSE),
aes(Wilderness_Area, Cover_Type)
) +
geom_boxplot(aes(col = Wilderness_Area)
) +
theme_bw() +
ggtitle("Cover_type based on Wilderness_Area")
# Exploratory Data Analysis on Cover_Type and Soil_Type
ggplot(filter(whole, is.na(Cover_Type)==FALSE),
aes(Soil_Type, Cover_Type)
) +
geom_boxplot(aes(col = Soil_Type)) +
theme_bw() +
ggtitle("Cover_type based on Soil_Type")
It can be seen that soil types with adjacent index numbers also tend to have similar relationships with the cover types.
# checking correlation of numeric variables
train_num = select_if(train, is.numeric)
# correlation matrix, shrinking the label size
corrplot(cor(train_num),tl.cex=0.5)
Warning in cor(train_num): the standard deviation is zero
This warning points to the columns that are entirely 0 (zero variance). We will delete them later.
# splitting the training set into the training set and validation set
set.seed(789)
split = sample.split(train_original$Cover_Type, SplitRatio = 0.8)
train = subset(train_original, split == TRUE)
validation = subset(train_original, split == FALSE)
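`sample.split()` from caTools keeps the class proportions of `Cover_Type` equal across the two subsets. The same stratified behaviour can be sketched in base R (a sketch of the idea, not the caTools implementation):

```r
set.seed(789)
# toy outcome with unbalanced classes: 80 "A"s and 20 "B"s
y <- rep(c("A", "B"), times = c(80, 20))

# sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(idx) sample(idx, size = round(0.8 * length(idx)))))

train_y <- y[train_idx]
valid_y <- y[-train_idx]
print(table(train_y))  # 64 "A"s, 16 "B"s: the 80/20 ratio is preserved
print(table(valid_y))  # 16 "A"s,  4 "B"s
```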
A different correlation function is needed to include the categorical variables; hetcor() from the polycor package computes such a heterogeneous correlation matrix.
library(polycor)
#hetcor(train)
#cross validation
set.seed(123)
train.control = trainControl(method = "repeatedcv", number =10, repeats=3)
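`trainControl(method = "repeatedcv", number = 10, repeats = 3)` asks caret to run 10-fold cross validation three times with freshly randomized folds and average the accuracies over all 30 resamples. The resampling mechanics can be sketched in base R with a deliberately trivial majority-class "model" (a sketch of the loop only, not caret's implementation):

```r
set.seed(123)
y <- iris$Species  # any classification outcome; iris ships with base R
k <- 10; repeats <- 3
accs <- numeric(0)

for (r in seq_len(repeats)) {
  # assign each observation a random fold label 1..k
  folds <- sample(rep(seq_len(k), length.out = length(y)))
  for (f in seq_len(k)) {
    train_y <- y[folds != f]
    test_y  <- y[folds == f]
    # stand-in "model": always predict the modal class of the training part
    majority <- names(which.max(table(train_y)))
    accs <- c(accs, mean(test_y == majority))
  }
}
mean(accs)  # accuracy averaged over the 30 resamples
```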
# checking whether any Soil_Type column is all 0 after the split
for (i in which(names(train) == "Soil_Type1"):51) {print(sum(train[,i]))}
# deleting columns full of value 0
train <- train[,c(-which(names(train) == "Soil_Type7"),
-which(names(train) == "Soil_Type15")
)
]
head(train)
Elevation Aspect Slope Horizontal_Distance_To_Hydrology
1 2596 51 3 258
2 2590 56 2 212
3 2804 139 9 268
4 2785 155 18 242
5 2595 45 2 153
6 2579 132 6 300
Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
1 0 510
2 -6 390
3 65 3180
4 118 3090
5 -1 391
6 -15 67
Hillshade_9am Hillshade_Noon Hillshade_3pm
1 221 232 148
2 220 235 151
3 234 238 135
4 238 238 122
5 220 234 150
6 230 237 140
Horizontal_Distance_To_Fire_Points Soil_Type1 Soil_Type2 Soil_Type3
1 6279 0 0 0
2 6225 0 0 0
3 6121 0 0 0
4 6211 0 0 0
5 6172 0 0 0
6 6031 0 0 0
Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type8 Soil_Type9 Soil_Type10
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type16 Soil_Type17
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 1 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21 Soil_Type22 Soil_Type23
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27 Soil_Type28 Soil_Type29
1 0 0 0 0 0 1
2 0 0 0 0 0 1
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 1
6 0 0 0 0 0 1
Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39 Soil_Type40 Cover_Type
1 0 0 0 0 0 5
2 0 0 0 0 0 5
3 0 0 0 0 0 2
4 0 0 0 0 0 2
5 0 0 0 0 0 5
6 0 0 0 0 0 2
Wilderness_Area
1 1
2 1
3 1
4 1
5 1
6 1
# checking the correlation
train_num = select_if(train, is.numeric)
corrplot(cor(train_num),tl.cex=0.5)
# testing whether Elevation differs between the Soil_Type1 groups
t.test(Elevation ~ Soil_Type1, data = train)
Welch Two Sample t-test
data: Elevation by Soil_Type1
t = 66.578, df = 409.56, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
587.9997 623.7783
sample estimates:
mean in group 0 mean in group 1
2764.166 2158.277
The test makes the strong relationship between soil type and elevation self-evident: cells with Soil_Type1 lie roughly 600 m lower on average than the rest.
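The same Welch test can be reproduced on synthetic data: two groups with different means and unequal variances, exactly the situation `t.test()`'s default (`var.equal = FALSE`) handles. The numbers below are made up for illustration:

```r
set.seed(42)
# synthetic elevations: cells outside vs. inside a soil type,
# with different means and different spreads (values are invented)
elev_out <- rnorm(400, mean = 2760, sd = 200)
elev_in  <- rnorm(150, mean = 2160, sd = 120)

res <- t.test(elev_out, elev_in)  # Welch test is the default (var.equal = FALSE)
print(res$p.value < 0.05)         # TRUE: the mean difference is significant
```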
# lda model building and CV
model_lda = train(factor(Cover_Type) ~ .,
data=train,
method="lda",
trControl = train.control
)
Warning in lda.default(x, grouping, ...): variables are collinear
Warning: model fit failed for Fold06.Rep1: parameter=none Error in lda.default(x, grouping, ...) :
variable 33 appears to be constant within groups
Warning: model fit failed for Fold09.Rep1: parameter=none Error in lda.default(x, grouping, ...) :
variable 17 appears to be constant within groups
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
trainInfo, : There were missing values in resampled performance measures.
Length Class Mode
prior 7 -none- numeric
counts 7 -none- numeric
means 357 -none- numeric
scaling 306 -none- numeric
lev 7 -none- character
svd 6 -none- numeric
N 1 -none- numeric
call 3 -none- call
xNames 51 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 7 -none- character
param 0 -none- list
Linear Discriminant Analysis
12096 samples
49 predictor
7 classes: '1', '2', '3', '4', '5', '6', '7'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 10886, 10885, 10888, 10887, 10886, 10886, ...
Resampling results:
Accuracy Kappa
0.6465129 0.5875977
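caret's `method = "lda"` wraps `lda()` from MASS (a package shipped with R). The underlying call can be sketched directly on the built-in iris data:

```r
library(MASS)  # provides lda()

fit  <- lda(Species ~ ., data = iris)  # fit LDA on the four iris measurements
pred <- predict(fit, iris)$class       # in-sample class predictions
mean(pred == iris$Species)             # training accuracy
```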
I use smaller subsets of the original dataset, or comment out some lines, because rendering the HTML output is otherwise far too slow. However, I have run the complete models before, so the following comparison is based on those full models.
# make a small subset of train and validation because of the limited computing power
temp_train <- train[1:200,]
validation_temp <- validation[1:50,]
# naive bayes model building and CV
model_nb = train(factor(Cover_Type) ~ .,
data=temp_train,
method="nb",
#trControl = train.control
)
# Fitting Decision Tree Classification Model to the Training set
classifier_tree <- train(factor(Cover_Type) ~.,
data = train,
method = "rpart",
trControl=train.control,
tuneLength = 100
)
Warning: labs do not fit even at cex 0.15, there may be some overplotting
CART
12096 samples
49 predictor
7 classes: '1', '2', '3', '4', '5', '6', '7'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 10886, 10886, 10888, 10886, 10886, 10886, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.000000000 0.7717194 0.733672563
0.001683502 0.7130987 0.665281903
0.003367003 0.6594155 0.602650911
0.005050505 0.6372051 0.576738573
0.006734007 0.6309780 0.569472556
0.008417508 0.6283895 0.566452139
0.010101010 0.6228220 0.559956509
0.011784512 0.6161248 0.552142184
0.013468013 0.6143892 0.550117813
0.015151515 0.6143892 0.550117813
0.016835017 0.6143892 0.550117813
0.018518519 0.6143892 0.550117813
0.020202020 0.6091527 0.544008185
0.021885522 0.5893140 0.520863516
0.023569024 0.5792819 0.509156570
0.025252525 0.5652540 0.492793748
0.026936027 0.5585867 0.485017890
0.028619529 0.5585867 0.485017890
0.030303030 0.5585867 0.485017890
0.031986532 0.5585867 0.485017890
0.033670034 0.5585867 0.485017890
0.035353535 0.5585867 0.485017890
0.037037037 0.5585867 0.485017890
0.038720539 0.5585867 0.485017890
0.040404040 0.5585867 0.485017890
0.042087542 0.5585867 0.485017890
0.043771044 0.5585867 0.485017890
0.045454545 0.5585867 0.485017890
0.047138047 0.5585867 0.485017890
0.048821549 0.5585867 0.485017890
0.050505051 0.5585867 0.485017890
0.052188552 0.5585867 0.485017890
0.053872054 0.5585867 0.485017890
0.055555556 0.5585867 0.485017890
0.057239057 0.5585867 0.485017890
0.058922559 0.5585867 0.485017890
0.060606061 0.5585867 0.485017890
0.062289562 0.5585867 0.485017890
0.063973064 0.5585867 0.485017890
0.065656566 0.5585867 0.485017890
0.067340067 0.5585867 0.485017890
0.069023569 0.5585867 0.485017890
0.070707071 0.5585867 0.485017890
0.072390572 0.5585867 0.485017890
0.074074074 0.5585867 0.485017890
0.075757576 0.5585867 0.485017890
0.077441077 0.5585867 0.485017890
0.079124579 0.5585867 0.485017890
0.080808081 0.5585867 0.485017890
0.082491582 0.5536787 0.479291553
0.084175084 0.5346592 0.457101785
0.085858586 0.5150726 0.434253578
0.087542088 0.4866347 0.401073020
0.089225589 0.4839922 0.397990169
0.090909091 0.4788365 0.391970848
0.092592593 0.4788365 0.391970848
0.094276094 0.4788365 0.391970848
0.095959596 0.4701541 0.381844445
0.097643098 0.4217501 0.325375967
0.099326599 0.4155718 0.318163962
0.101010101 0.4044860 0.305233095
0.102693603 0.4044860 0.305233095
0.104377104 0.4044860 0.305233095
0.106060606 0.4044860 0.305233095
0.107744108 0.4044860 0.305233095
0.109427609 0.4044860 0.305233095
0.111111111 0.4044860 0.305233095
0.112794613 0.4044860 0.305233095
0.114478114 0.4044860 0.305233095
0.116161616 0.4044860 0.305233095
0.117845118 0.4044860 0.305233095
0.119528620 0.4044860 0.305233095
0.121212121 0.4044860 0.305233095
0.122895623 0.4044860 0.305233095
0.124579125 0.4044860 0.305233095
0.126262626 0.4044860 0.305233095
0.127946128 0.4044860 0.305233095
0.129629630 0.4044860 0.305233095
0.131313131 0.4044860 0.305233095
0.132996633 0.4044860 0.305233095
0.134680135 0.4044860 0.305233095
0.136363636 0.4005401 0.300627620
0.138047138 0.3810510 0.277889517
0.139730640 0.3294742 0.217720655
0.141414141 0.3098534 0.194831618
0.143097643 0.2939547 0.176282453
0.144781145 0.2857143 0.166666692
0.146464646 0.2857143 0.166666692
0.148148148 0.2857143 0.166666692
0.149831650 0.2857143 0.166666692
0.151515152 0.2857143 0.166666692
0.153198653 0.2857143 0.166666692
0.154882155 0.2857143 0.166666692
0.156565657 0.2857143 0.166666692
0.158249158 0.2857143 0.166666692
0.159932660 0.2857143 0.166666692
0.161616162 0.2857143 0.166666692
0.163299663 0.2857143 0.166666692
0.164983165 0.2857143 0.166666692
0.166666667 0.1471308 0.005528769
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
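The `cp` (complexity parameter) grid above controls how much a split must improve the relative fit before rpart keeps it: `cp = 0` grows the deepest possible tree, and large values prune back toward the root. A minimal sketch on the built-in iris data (assuming the rpart package, a recommended package shipped with R, is available):

```r
library(rpart)

deep    <- rpart(Species ~ ., data = iris, cp = 0)    # keep every split
shallow <- rpart(Species ~ ., data = iris, cp = 0.6)  # prune back hard

nrow(deep$frame)     # many nodes in the tree
nrow(shallow$frame)  # root only: no split improves the fit by >= 0.6
```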
train.control = trainControl(method = "repeatedcv",
number =10,
repeats=1
)
classifier_tree2 = train(factor(Cover_Type) ~ Elevation *
Horizontal_Distance_To_Roadways +
Horizontal_Distance_To_Fire_Points +
Horizontal_Distance_To_Hydrology +
Vertical_Distance_To_Hydrology +
Hillshade_9am +
Soil_Type3 +
Soil_Type10 +
Hillshade_Noon +
Hillshade_3pm +
Soil_Type38 +
Soil_Type39 +
Soil_Type4 +
Soil_Type40 +
Soil_Type12 +
Soil_Type32 +
Soil_Type30 +
Soil_Type29 +
Wilderness_Area +
Aspect,
data = train, method = "rpart",
trControl=train.control,
tuneLength = 100
)
Warning: labs do not fit even at cex 0.15, there may be some overplotting
CART
12096 samples
20 predictor
7 classes: '1', '2', '3', '4', '5', '6', '7'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 10885, 10887, 10886, 10886, 10886, 10887, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.000000000 0.7628125 0.7232804
0.001683502 0.7115559 0.6634826
0.003367003 0.6550894 0.5976026
0.005050505 0.6449210 0.5857416
0.006734007 0.6310333 0.5695378
0.008417508 0.6292977 0.5675129
0.010101010 0.6221046 0.5591208
0.011784512 0.6154099 0.5513118
0.013468013 0.6141696 0.5498629
0.015151515 0.6141696 0.5498629
0.016835017 0.6141696 0.5498629
0.018518519 0.6141696 0.5498629
0.020202020 0.6095436 0.5444658
0.021885522 0.5929237 0.5250756
0.023569024 0.5774631 0.5070431
0.025252525 0.5631631 0.4903568
0.026936027 0.5596076 0.4862085
0.028619529 0.5596076 0.4862085
0.030303030 0.5596076 0.4862085
0.031986532 0.5596076 0.4862085
0.033670034 0.5596076 0.4862085
0.035353535 0.5596076 0.4862085
0.037037037 0.5596076 0.4862085
0.038720539 0.5596076 0.4862085
0.040404040 0.5596076 0.4862085
0.042087542 0.5596076 0.4862085
0.043771044 0.5596076 0.4862085
0.045454545 0.5596076 0.4862085
0.047138047 0.5596076 0.4862085
0.048821549 0.5596076 0.4862085
0.050505051 0.5596076 0.4862085
0.052188552 0.5596076 0.4862085
0.053872054 0.5596076 0.4862085
0.055555556 0.5596076 0.4862085
0.057239057 0.5596076 0.4862085
0.058922559 0.5596076 0.4862085
0.060606061 0.5596076 0.4862085
0.062289562 0.5596076 0.4862085
0.063973064 0.5596076 0.4862085
0.065656566 0.5596076 0.4862085
0.067340067 0.5596076 0.4862085
0.069023569 0.5596076 0.4862085
0.070707071 0.5596076 0.4862085
0.072390572 0.5596076 0.4862085
0.074074074 0.5596076 0.4862085
0.075757576 0.5596076 0.4862085
0.077441077 0.5596076 0.4862085
0.079124579 0.5596076 0.4862085
0.080808081 0.5596076 0.4862085
0.082491582 0.5526712 0.4781160
0.084175084 0.5385272 0.4616040
0.085858586 0.5228183 0.4432765
0.087542088 0.4876821 0.4022962
0.089225589 0.4876821 0.4022962
0.090909091 0.4876821 0.4022962
0.092592593 0.4876821 0.4022962
0.094276094 0.4876821 0.4022962
0.095959596 0.4702369 0.3819302
0.097643098 0.4213806 0.3249348
0.099326599 0.4125303 0.3146066
0.101010101 0.4045964 0.3053619
0.102693603 0.4045964 0.3053619
0.104377104 0.4045964 0.3053619
0.106060606 0.4045964 0.3053619
0.107744108 0.4045964 0.3053619
0.109427609 0.4045964 0.3053619
0.111111111 0.4045964 0.3053619
0.112794613 0.4045964 0.3053619
0.114478114 0.4045964 0.3053619
0.116161616 0.4045964 0.3053619
0.117845118 0.4045964 0.3053619
0.119528620 0.4045964 0.3053619
0.121212121 0.4045964 0.3053619
0.122895623 0.4045964 0.3053619
0.124579125 0.4045964 0.3053619
0.126262626 0.4045964 0.3053619
0.127946128 0.4045964 0.3053619
0.129629630 0.4045964 0.3053619
0.131313131 0.4045964 0.3053619
0.132996633 0.4045964 0.3053619
0.134680135 0.4045964 0.3053619
0.136363636 0.4045964 0.3053619
0.138047138 0.3803710 0.2770475
0.139730640 0.3215085 0.2083673
0.141414141 0.2857143 0.1666670
0.143097643 0.2857143 0.1666670
0.144781145 0.2857143 0.1666670
0.146464646 0.2857143 0.1666670
0.148148148 0.2857143 0.1666670
0.149831650 0.2857143 0.1666670
0.151515152 0.2857143 0.1666670
0.153198653 0.2857143 0.1666670
0.154882155 0.2857143 0.1666670
0.156565657 0.2857143 0.1666670
0.158249158 0.2857143 0.1666670
0.159932660 0.2857143 0.1666670
0.161616162 0.2857143 0.1666670
0.163299663 0.2857143 0.1666670
0.164983165 0.2857143 0.1666670
0.166666667 0.1422784 0.0000000
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
# Fitting Random Forest Classification Model to the Training set
classifier_forest = train(factor(Cover_Type) ~ Elevation +
Horizontal_Distance_To_Roadways +
Horizontal_Distance_To_Fire_Points +
Horizontal_Distance_To_Hydrology +
Hillshade_9am +
Soil_Type3 +
Soil_Type10,
data = train,
method = "rf")
#trControl=train.control)
set.seed(567)
classifier = randomForest(factor(Cover_Type) ~ Elevation +
Horizontal_Distance_To_Roadways +
Horizontal_Distance_To_Fire_Points +
Horizontal_Distance_To_Hydrology +
Hillshade_9am +
Soil_Type3 +
Soil_Type10,
data = train)
# Choosing the number of trees
plot(classifier)
The Random Forest model still has untapped potential, as limited CPU power forced me to use only 7 important features out of roughly 50 variables. Even so, it beats the Decision Tree model.
# Fitting Boosted Trees Classification Model to the Training set
classifier_btree <- train(factor(Cover_Type) ~ Elevation +
Horizontal_Distance_To_Roadways +
Horizontal_Distance_To_Fire_Points +
Horizontal_Distance_To_Hydrology +
Hillshade_9am +
Soil_Type3 +
Soil_Type10,
data = temp_train,
method = "gbm",
trControl=train.control#,
#tuneLength = 100
)
Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.
Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.
Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 6: Soil_Type3 has no variation.
Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
"bernoulli", : variable 7: Soil_Type10 has no variation.
# prediction
y_pred_lda = predict(model_lda, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_lda) # Confusion matrix
y_pred_lda
1 2 3 4 5 6 7
1 292 81 1 0 15 2 41
2 99 219 11 0 79 21 3
3 0 2 228 46 17 139 0
4 0 0 53 334 0 45 0
5 11 82 34 0 288 17 0
6 0 16 106 25 28 257 0
7 85 0 2 0 1 0 344
error_lda <- mean(validation$Cover_Type != y_pred_lda) # Misclassification error
paste('Accuracy',round(1-error_lda,4))
[1] "Accuracy 0.6488"
# prediction
y_pred_nb = predict(model_nb, newdata = validation_temp)
# Checking the prediction accuracy
table(validation_temp$Cover_Type, y_pred_nb) # Confusion matrix
error_nb <- mean(validation_temp$Cover_Type != y_pred_nb) # Misclassification error
paste('Accuracy',round(1-error_nb,4))
[1] "Accuracy 0.64"
Although the NB model looks passable in this demo, the complete model I ran earlier actually gave a very poor prediction accuracy of about 0.2. One reason may be that Naive Bayes assumes conditional independence among the predictors, which clearly conflicts with our data.
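We can check this independence assumption directly. The sketch below, which assumes the `training` data frame built earlier is still in scope, computes pairwise correlations among a few of the continuous predictors; the three hillshade variables are derived from the same terrain and sun geometry, so strong correlations are expected.

```r
# Sketch: inspecting correlations among continuous predictors
# (assumes the `training` data frame from earlier sections is in scope).
# Large off-diagonal values indicate the NB independence assumption fails.
round(cor(training[, c("Elevation", "Slope", "Hillshade_9am",
                       "Hillshade_Noon", "Hillshade_3pm")]), 2)
```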
# prediction
y_pred_knn = predict(classifier_knn, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_knn) # Confusion matrix
y_pred_knn
1 2 3 4 5 6 7
1 303 67 0 0 16 1 45
2 81 262 16 0 41 20 12
3 0 11 348 17 6 50 0
4 0 0 3 419 0 10 0
5 3 9 4 0 414 2 0
6 1 5 29 19 3 375 0
7 7 2 0 0 0 0 423
error_knn <- mean(validation$Cover_Type != y_pred_knn) # Misclassification error
paste('Accuracy',round(1-error_knn,4))
[1] "Accuracy 0.8413"
# prediction
y_pred_tree = predict(classifier_tree, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_tree) # Confusion matrix
y_pred_tree
1 2 3 4 5 6 7
1 290 90 1 0 9 3 39
2 95 238 16 0 56 23 4
3 0 6 306 27 9 84 0
4 0 0 15 412 0 5 0
5 3 26 8 0 387 8 0
6 1 1 100 29 10 291 0
7 31 3 0 0 1 0 397
error_tree <- mean(validation$Cover_Type != y_pred_tree) # Misclassification error
paste('Accuracy',round(1-error_tree,4))
[1] "Accuracy 0.7675"
# prediction of tree2
y_pred_tree2 = predict(classifier_tree2, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_tree2) # Confusion matrix
y_pred_tree2
1 2 3 4 5 6 7
1 286 100 1 0 10 4 31
2 93 233 14 0 63 22 7
3 2 5 320 19 7 79 0
4 0 0 21 404 0 7 0
5 4 26 5 0 390 7 0
6 2 2 89 30 13 296 0
7 36 6 0 0 1 0 389
error_tree2 <- mean(validation$Cover_Type != y_pred_tree2) # Misclassification error
paste('Accuracy',round(1-error_tree2,4))
[1] "Accuracy 0.7665"
# prediction on random forest
y_pred_rf = predict(classifier_forest, newdata = validation)
# Checking the prediction accuracy
table(validation$Cover_Type, y_pred_rf) # Confusion matrix
y_pred_rf
1 2 3 4 5 6 7
1 320 58 0 0 11 3 40
2 92 254 19 0 43 17 7
3 0 10 352 14 5 51 0
4 0 0 7 424 0 1 0
5 1 6 6 0 417 2 0
6 1 0 49 17 3 362 0
7 10 1 0 0 2 0 419
error_rf <- mean(validation$Cover_Type != y_pred_rf) # Misclassification error
paste('Accuracy',round(1-error_rf,4))
[1] "Accuracy 0.8426"
# prediction on boosted trees
y_pred_btree = predict(classifier_btree, newdata = validation_temp)
# Checking the prediction accuracy
table(validation_temp$Cover_Type, y_pred_btree) # Confusion matrix
y_pred_btree
1 2 5
1 5 10 0
2 5 26 1
5 0 0 3
error_btree <- mean(validation_temp$Cover_Type != y_pred_btree) # Misclassification error
paste('Accuracy',round(1-error_btree,4))
[1] "Accuracy 0.68"
Random forest aggregates the outputs of many decision trees, each of which is individually "biased" toward the bootstrap sample and random feature subset it was grown on. Averaging over these trees reduces variance, so the RF model improves substantially on a single decision tree.
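One way to see what the ensemble relies on is the variable-importance measures that `randomForest` records during training. This sketch assumes `classifier_forest` was fitted earlier with the `randomForest()` call (ideally with `importance = TRUE` so the permutation-based measure is available):

```r
# Sketch: variable importance from the fitted ensemble
# (assumes `classifier_forest` from the model-building section is in scope).
library(randomForest)
importance(classifier_forest)   # importance score(s) per predictor
varImpPlot(classifier_forest)   # same information as a dot chart
```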
To make the HTML render faster, I have commented out the lines that write the CSV files.
test_original$Cover_Type = predict(model_lda, newdata = test_original)
write.csv(test_original, file = "predicted_lda.csv")
test_original$Cover_Type = predict(model_nb, newdata = test_original)
write.csv(test_original, file = "predicted_nb.csv")
test_original$Cover_Type = predict(classifier_knn, newdata = test_original)
write.csv(test_original, file = "predicted_knn.csv")
test_original$Cover_Type = predict(classifier_tree, newdata = test_original)
write.csv(test_original, file = "predicted_decision_tree.csv")
test_original$Cover_Type = predict(classifier_forest, newdata = test_original)
write.csv(test_original, file = "predicted_random_forest.csv")
There is no Kaggle score for boosted trees because the model above is only a demo on a small subset of the data, owing to the limited computing power of my laptop. However, its performance on the subset is already far beyond a random guess among seven types (about 1/7, or 0.14), so we can reasonably expect the boosted-trees model to perform well on the whole training data as well.
Id | Model | Score |
---|---|---|
1 | LDA | 0.58346 |
2 | Naive Bayes | too slow to predict |
3 | kNN | 0.68723 |
4 | Decision Tree | 0.63211 |
5 | Random Forest | 0.68840 |
6 | Boosted Trees | not submitted (demo only) |
To sum up, the Random Forest model is the best of the six models. Although some models reach an accuracy of around 0.84 on the validation set, none exceeds 0.7 when evaluated by the Kaggle system. This may be because the test dataset is much larger than the training dataset.
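For a side-by-side view, the validation accuracies reported above can be collected and ranked in a small data frame:

```r
# Validation accuracies taken from the confusion-matrix sections above
acc <- data.frame(
  model    = c("LDA", "Naive Bayes", "kNN",
               "Decision Tree", "Random Forest", "Boosted Trees"),
  accuracy = c(0.6488, 0.64, 0.8413, 0.7675, 0.8426, 0.68)
)
acc[order(-acc$accuracy), ]  # Random Forest and kNN lead on validation data
```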