Random Forest Classification of Mushrooms

There is a plethora of classification algorithms available to anyone with a bit of coding experience and a data set. The random forest is a common machine learning method and a good place to start.

This is a use case in R of the randomForest package, applied to a data set from the UCI Machine Learning Repository.

Are These Mushrooms Edible?

If someone gave you thousands of rows of data with dozens of columns about mushrooms, could you identify which characteristics make a mushroom edible or poisonous? How much would you trust your model? Would it be enough for you to make a decision on whether or not to eat a mushroom you find? (That's a bad decision roughly 100% of the time.)

The randomForest package does all of the heavy lifting behind the scenes. While this "magic" is incredibly nice for the end user, it's important to understand what you're actually doing. Keep this in mind for any package you use, in R or any other language.

"To know how to run these programs is impressive, but to truly understand how and why they work is what makes you an expert!" -Haley Stoltzman (my wife is a genius)

Here is an article that explains things in layman's terms: A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System.

I created a function to grab and clean up the data. This happened to be a fairly manual process, so I borrowed a lot of the code from others. Later on, I found that someone else had already cleaned the data set and published it as a .csv file, but I decided to use my function anyway.

source('helper_functions.R')
library(randomForest)
library(e1071)
library(caret)
library(ggplot2)
set.seed(123)
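
For reference, here is a minimal sketch of what fetchAndCleanData() might look like. The real helper lives in helper_functions.R; the URL below is the standard UCI location for this data set, but the code itself is illustrative rather than a copy of the original function.

# Hypothetical sketch of the sourced helper - not the original function
fetchAndCleanData = function() {
  url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
  data = read.csv(url, header = FALSE, stringsAsFactors = TRUE)
  # The raw file has 23 single-letter coded columns; name them
  colnames(data) = c('Edible', 'CapShape', 'CapSurface', 'CapColor', 'Bruises',
                     'Odor', 'GillAttachment', 'GillSpacing', 'GillSize',
                     'GillColor', 'StalkShape', 'StalkRoot',
                     'StalkSurfaceAboveRing', 'StalkSurfaceBelowRing',
                     'StalkColorAboveRing', 'StalkColorBelowRing', 'VeilType',
                     'VeilColor', 'RingNumber', 'RingType', 'SporePrintColor',
                     'Population', 'Habitat')
  # Recode the letter levels into readable labels, e.g. for the class column
  levels(data$Edible) = c('Edible', 'Poisonous') # e -> Edible, p -> Poisonous
  data
}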

I brought the data in as a data frame. The first column is "Edible," which could just as well be labeled "Class," since it is the target of the classification. It takes only two values here, "Edible" and "Poisonous" (keep in mind that random forests handle more than two classes just as easily).

I printed the first few rows, and the output shows there are 23 columns (including "Edible"). I am not a mushroom expert, but most of these variables look sensible to try to use.

#Import Data via Custom Function
data = fetchAndCleanData()
head(data)
##      Edible CapShape CapSurface CapColor Bruises    Odor GillAttachment
## 1 Poisonous   Convex     Smooth    Brown    True Pungent           Free
## 2    Edible   Convex     Smooth   Yellow    True  Almond           Free
## 3    Edible     Bell     Smooth    White    True   Anise           Free
## 4 Poisonous   Convex      Scaly    White    True Pungent           Free
## 5    Edible   Convex     Smooth     Gray   False    None           Free
## 6    Edible   Convex      Scaly   Yellow    True  Almond           Free
##   GillSpacing GillSize GillColor StalkShape StalkRoot
## 1       Close   Narrow     Black  Enlarging     Equal
## 2       Close    Broad     Black  Enlarging      Club
## 3       Close    Broad     Brown  Enlarging      Club
## 4       Close   Narrow     Brown  Enlarging     Equal
## 5     Crowded    Broad     Black   Tapering     Equal
## 6       Close    Broad     Brown  Enlarging      Club
##   StalkSurfaceAboveRing StalkSurfaceBelowRing StalkColorAboveRing
## 1                Smooth                Smooth               White
## 2                Smooth                Smooth               White
## 3                Smooth                Smooth               White
## 4                Smooth                Smooth               White
## 5                Smooth                Smooth               White
## 6                Smooth                Smooth               White
##   StalkColorBelowRing VeilType VeilColor RingNumber   RingType
## 1               White  Partial     White        One    Pendant
## 2               White  Partial     White        One    Pendant
## 3               White  Partial     White        One    Pendant
## 4               White  Partial     White        One    Pendant
## 5               White  Partial     White        One Evanescent
## 6               White  Partial     White        One    Pendant
##   SporePrintColor Population Habitat
## 1           Black  Scattered   Urban
## 2           Brown   Numerous Grasses
## 3           Brown   Numerous Meadows
## 4           Black  Scattered   Urban
## 5           Brown  Abundnant Grasses
## 6           Black   Numerous Grasses

It's important to know that R's randomForest package cannot use rows with missing data. The summary() function can help identify issues; this data set has no missing values.
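
A quick programmatic check (my addition, not in the original post) confirms this before reading the summary:

# Check for missing values - both should report zero NAs
any(is.na(data))     # FALSE if no missing values anywhere
colSums(is.na(data)) # per-column NA counts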

summary(data) #no missing data appears
##        Edible        CapShape      CapSurface      CapColor
##  Edible   :4208   Convex :3656   Scaly  :3244   Brown  :2284
##  Poisonous:3916   Flat   :3152   Smooth :2556   Gray   :1840
##                   Knobbed: 828   Fibrous:2320   Red    :1500
##                   Bell   : 452   Grooves:   4   Yellow :1072
##                   Sunken :  32   f      :   0   White  :1040
##                   Conical:   4   g      :   0   Buff   : 168
##                   (Other):   0   (Other):   0   (Other): 220
##   Bruises          Odor         GillAttachment  GillSpacing
##  f    :   0   None   :3528   a         :   0   c      :   0
##  t    :   0   Foul   :2160   f         :   0   w      :   0
##  True :3376   Fishy  : 576   Attached  : 210   Close  :6812
##  False:4748   Spicy  : 576   Descending:   0   Crowded:1312
##               Almond : 400   Free      :7914   Distant:   0
##               Anise  : 400   Notched   :   0
##               (Other): 484
##    GillSize        GillColor        StalkShape     StalkRoot
##  b     :   0   Buff     :1728   e        :   0   Bulbous:3776
##  n     :   0   Pink     :1492   t        :   0   Missing:2480
##  Broad :5612   White    :1202   Enlarging:3516   Equal  :1120
##  Narrow:2512   Brown    :1048   Tapering :4608   Club   : 556
##                Gray     : 752                    Rooted : 192
##                Chocolate: 732                    ?      :   0
##                (Other)  :1170                    (Other):   0
##  StalkSurfaceAboveRing StalkSurfaceBelowRing StalkColorAboveRing
##  Smooth :5176          Smooth :4936          White  :4464
##  Silky  :2372          Silky  :2304          Pink   :1872
##  Fibrous: 552          Fibrous: 600          Gray   : 576
##  Scaly  :  24          Scaly  : 284          Brown  : 448
##  f      :   0          f      :   0          Buff   : 432
##  k      :   0          k      :   0          Orange : 192
##  (Other):   0          (Other):   0          (Other): 140
##  StalkColorBelowRing      VeilType      VeilColor    RingNumber
##  White  :4384        p        :   0   White  :7924   n   :   0
##  Pink   :1872        Partial  :8124   Brown  :  96   o   :   0
##  Gray   : 576        Universal:   0   Orange :  96   t   :   0
##  Brown  : 512                         Yellow :   8   None:  36
##  Buff   : 432                         n      :   0   One :7488
##  Orange : 192                         o      :   0   Two : 600
##  (Other): 156                         (Other):   0
##        RingType     SporePrintColor     Population      Habitat
##  Pendant   :3968   White    :2388   Several  :4040   Woods  :3148
##  Evanescent:2776   Brown    :1968   Solitary :1712   Grasses:2148
##  Large     :1296   Black    :1872   Scattered:1248   Paths  :1144
##  Flaring   :  48   Chocolate:1632   Numerous : 400   Leaves : 832
##  None      :  36   Green    :  72   Abundnant: 384   Urban  : 368
##  e         :   0   Buff     :  48   Clustered: 340   Meadows: 292
##  (Other)   :   0   (Other)  : 144   (Other)  :   0   (Other): 192

I want to explore the data before fitting a model to get an idea of what to expect. I plot pairs of variables against each other and use color to show whether each mushroom is edible or poisonous.

In these plots, edible is shown in green and poisonous in red. I'm looking for spots where one color holds an overwhelming majority.

A comparison of "CapSurface" to "CapShape" shows us:

  • CapShape Bell is more likely to be edible
  • CapShape Convex or Flat have a mix of edible and poisonous and make up the majority of the data
  • CapSurface alone does not tell us much
  • CapSurface Fibrous + CapShape Bell, Knobbed, or Sunken are likely to be edible
  • These variables will likely increase information gain but may not be incredibly strong

p = ggplot(data,aes(x=CapShape,
                    y=CapSurface,
                    color=Edible))

p + geom_jitter(alpha=0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))
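
A quick cross-tabulation (my addition) backs up the plot by showing the edible share within each CapShape:

# Proportion of Edible vs. Poisonous within each CapShape level
round(prop.table(table(data$CapShape, data$Edible), margin = 1), 2)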

A comparison of "StalkColorBelowRing" to "StalkColorAboveRing" shows us:

  • StalkColorAboveRing Gray is almost always going to be edible
  • StalkColorBelowRing Gray is almost always going to be edible
  • StalkColorBelowRing Buff is almost always going to be poisonous
  • This list could go on...
  • These variables are likely to increase information gain by a fair amount

p = ggplot(data,aes(x=StalkColorBelowRing,
                    y=StalkColorAboveRing,
                    color=Edible))

p + geom_jitter(alpha=0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

A comparison of "Odor" to "SporePrintColor" shows us:

  • Odor Foul, Fishy, Pungent, Creosote, and Spicy are highly likely to be poisonous
  • Odor Almond and Anise are highly likely to be edible.
  • Odor None appears to be primarily edible
    • However, if it has SporePrintColor Green it is highly likely to be poisonous!
  • These variables are likely going to lead to a lot of information gain

p = ggplot(data,aes(x=Odor,
                    y=SporePrintColor,
                    color=Edible))

p + geom_jitter(alpha=0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

Because those variables looked so strong, I plotted each of them directly against the class and found:

  • Odor is an excellent indicator of edible versus poisonous
  • Odor None is the only tricky one: mushrooms with no odor appear in both classes
  • SporePrintColor is not as strong as Odor on its own; there is a lot of overlap between the two classes

p = ggplot(data,aes(x=Edible,
                    y=Odor,
                    color = Edible))

p + geom_jitter(alpha=0.2) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))
p = ggplot(data,aes(x=Edible,
                    y=SporePrintColor,
                    color = Edible))

p + geom_jitter(alpha=0.2) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))
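
For readers who prefer counts to jitter plots, a simple cross-tab (added here for illustration) tells the same story:

# Counts of Edible/Poisonous for each Odor level
table(data$Odor, data$Edible)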

Before fitting a model it's important to split the data into two parts: a training set and a test set. There's no perfect rule for exactly how much data to use for training. In this example I used 5% for training and 95% for testing. This is not typical; splits of around 60%/40% or 70%/30% train/test are far more common.

Training on too large a share of the data leaves little held-out data to catch overfitting. Overfitting is a classic mistake people make when first entering the field of machine learning; I won't go into the details here, but entire courses are dedicated to the subject (see the Wikipedia article).

Initially, I ran this with more training data and it predicted perfectly, with zero false positives or negatives. That's not as interesting to look at as an example, so I scaled the training data down, which produced more bad predictions.

# Randomly assign each row to group 1 (train, ~5%) or group 2 (test, ~95%)
sample.ind = sample(2,
                    nrow(data),
                    replace = T,
                    prob = c(0.05, 0.95))
data.dev = data[sample.ind == 1, ] # training set
data.val = data[sample.ind == 2, ] # test set
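
As an aside, caret provides a stratified alternative that preserves the class ratio exactly rather than only in expectation. A sketch, using hypothetical object names:

# Stratified 5% training split with caret (alternative to the sampling above)
train.idx = createDataPartition(data$Edible, p = 0.05, list = FALSE)
data.dev.strat = data[train.idx, ]
data.val.strat = data[-train.idx, ]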

I wanted to know the split of edible to poisonous mushrooms in the full data set and compare it to the training and test sets. The random sample preserved roughly the same edible-to-poisonous ratio in both.

Edible % / Poisonous % :

  • Data: 52 / 48
  • Train: 50 / 50
  • Test: 52 / 48
# Original Data
table(data$Edible)/nrow(data)
##    Edible Poisonous
## 0.5179714 0.4820286
# Training Data
table(data.dev$Edible)/nrow(data.dev)
##    Edible Poisonous
## 0.4962779 0.5037221
# Testing Data
table(data.val$Edible)/nrow(data.val)
##    Edible Poisonous
## 0.5191037 0.4808963

I finally fit the random forest model to the training data. Plotting the model shows that the error rate settles down after about 20 trees; it fluctuates a bit, but not by much.

#Fit Random Forest Model
rf = randomForest(Edible ~ .,
                   ntree = 100,
                   data = data.dev)
plot(rf)
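
One note on plot(rf): it draws the OOB error and the two per-class error curves against the number of trees, but adds no legend by default. A common companion snippet (my addition) labels the lines:

# Label the error curves drawn by plot(rf)
legend('topright', colnames(rf$err.rate), col = 1:3, lty = 1:3)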

Printing the model shows that 4 variables were tried at each split and that the OOB estimate of the error rate is 0.25%. The model fit the training data almost perfectly; only one mushroom was misclassified. In randomForest's confusion matrix the rows are the actual classes and the columns are the predictions, so the single error is an actually poisonous mushroom that the model called edible. If we consider edible to be "positive," that is one false positive, which is exactly the kind of mistake a mushroom eater should worry about.

print(rf)
## Call:
##  randomForest(formula = Edible ~ ., data = data.dev, ntree = 100)
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 0.25%
## Confusion matrix:
##           Edible Poisonous class.error
## Edible       200         0 0.000000000
## Poisonous      1       202 0.004926108
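
The 4 variables tried at each split is simply randomForest's default mtry for classification, floor(sqrt(p)) for p predictors:

# Default mtry for classification: floor(sqrt(number of predictors))
floor(sqrt(ncol(data.dev) - 1)) # 22 predictors -> 4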

It's always worth examining variable importance: this plot shows which variables had the greatest impact on the classification model. I limited the plot to the top 10.

# Variable Importance
varImpPlot(rf,
           sort = T,
           n.var=10,
           main="Top 10 - Variable Importance")

Odor is by far the most important variable in terms of mean decrease in Gini impurity (MeanDecreaseGini), which plays a role here similar to information gain. The full results are listed below. Interestingly, VeilType produced no information gain at all, so I looked back at the initial data. The reason is clear: every mushroom in the data has the same VeilType, so the variable offers no differentiation and cannot affect the results.

# Variable importance as a data frame (type = 2: mean decrease in Gini)
var.imp = data.frame(importance(rf,
                                type = 2))
# Turn the row names into a proper column
var.imp$Variables = row.names(var.imp)
print(var.imp[order(var.imp$MeanDecreaseGini, decreasing = T), ])

##                       MeanDecreaseGini             Variables
## Odor                        69.3536782                  Odor
## SporePrintColor             27.3837625       SporePrintColor
## GillColor                   18.1981987             GillColor
## StalkSurfaceAboveRing       12.3172400 StalkSurfaceAboveRing
## RingType                    11.3114967              RingType
## GillSize                    11.1085947              GillSize
## Population                   7.2591707            Population
## Bruises                      7.2212660               Bruises
## CapColor                     5.6746095              CapColor
## Habitat                      5.4768013               Habitat
## StalkRoot                    5.3053036             StalkRoot
## StalkSurfaceBelowRing        4.6080070 StalkSurfaceBelowRing
## GillSpacing                  4.1186021           GillSpacing
## StalkShape                   2.6858568            StalkShape
## StalkColorBelowRing          2.5570551   StalkColorBelowRing
## RingNumber                   2.0463027            RingNumber
## StalkColorAboveRing          1.9823127   StalkColorAboveRing
## CapSurface                   1.0200298            CapSurface
## CapShape                     0.5779989              CapShape
## VeilColor                    0.1522645             VeilColor
## GillAttachment               0.0275000        GillAttachment
## VeilType                     0.0000000              VeilType
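
A one-line check (added here) confirms the VeilType observation:

# Every observed mushroom shares one veil type, so no split is possible
table(data$VeilType) # Partial: 8124; all other levels: 0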

I then used the model to predict the response on the training data itself. It predicted the response variable perfectly, with zero false positives and zero false negatives.

# Predicting response variable
data.dev$predicted.response = predict(rf , data.dev)

# Create Confusion Matrix
print(
confusionMatrix(data = data.dev$predicted.response,
                reference = data.dev$Edible,
                positive = 'Edible'))

## Confusion Matrix and Statistics
##
##            Reference
## Prediction  Edible Poisonous
##   Edible       200         0
##   Poisonous      0       203
##
##                Accuracy : 1
##                  95% CI : (0.9909, 1)
##     No Information Rate : 0.5037
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 1
##  Mcnemar's Test P-Value : NA
##
##             Sensitivity : 1.0000
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 1.0000
##              Prevalence : 0.4963
##          Detection Rate : 0.4963
##    Detection Prevalence : 0.4963
##       Balanced Accuracy : 1.0000
##
##        'Positive' Class : Edible

Now it was time to see how the model did with data it had not seen before: making predictions on the test set.

It did a decent job: 99.3% accuracy with a very narrow confidence interval. It did, however, produce 48 false negatives and 8 false positives. The false positives are poisonous mushrooms labeled edible, which could be deadly if you were actually choosing what to eat based on this model.

# Predicting response variable
data.val$predicted.response <- predict(rf ,data.val)

# Create Confusion Matrix
print(
confusionMatrix(data=data.val$predicted.response,
                reference=data.val$Edible,
                positive = 'Edible'))

## Confusion Matrix and Statistics
##
##            Reference
## Prediction  Edible Poisonous
##   Edible      3960         8
##   Poisonous     48      3705
##
##                Accuracy : 0.9927
##                  95% CI : (0.9906, 0.9945)
##     No Information Rate : 0.5191
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9855
##  Mcnemar's Test P-Value : 1.872e-07
##
##             Sensitivity : 0.9880
##             Specificity : 0.9978
##          Pos Pred Value : 0.9980
##          Neg Pred Value : 0.9872
##              Prevalence : 0.5191
##          Detection Rate : 0.5129
##    Detection Prevalence : 0.5139
##       Balanced Accuracy : 0.9929
##
##        'Positive' Class : Edible
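
If you want individual metrics programmatically rather than from the printed summary, the caret object exposes them. A sketch using a hypothetical name, cm:

# Store the confusion matrix object instead of printing it
cm = confusionMatrix(data = data.val$predicted.response,
                     reference = data.val$Edible,
                     positive = 'Edible')
cm$table                                    # the raw confusion matrix
cm$overall['Accuracy']                      # 0.9927
cm$byClass[c('Sensitivity', 'Specificity')] # 0.9880, 0.9978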

Unfortunately, I have no idea how reliable this data is or how it was collected. There is likely relevant background information I am missing, and I would never decide whether or not to eat an unknown mushroom based on this model (and neither should you).

The code used in this post is on my GitHub.

Source: https://stoltzmaniac.com/random-forest-classification-of-mushrooms/
