Poisonous Mushroom Detection Using Decision Classifiers
This project demonstrates the application of decision tree algorithms to classifying mushrooms as poisonous or edible based on their characteristics. The dataset comes from the University of California Irvine (UCI) Machine Learning Repository.
The guide accompanying the original dataset states that there is no simple rule for determining the edibility of a mushroom; this makes a model that can accurately distinguish poisonous mushrooms from edible ones very valuable.
The data, in CSV format (zipped), is available at this link.
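Since the file comes zipped, pandas can read the CSV straight out of the archive; a minimal sketch, assuming the archive has been downloaded locally as mushrooms.zip (hypothetical file name):
import pandas as pd
# read the single CSV contained in the downloaded archive (hypothetical local file name)
mushroom_df = pd.read_csv('mushrooms.zip', compression='zip')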
Data Metadata
The following describes the values that each column in the dataframe represents:
- cap-shape: bell, conical, convex, flat, knobbed, sunken
- cap-surface: fibrous, grooves, scaly, smooth
- cap-color: brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow
- bruises?: bruises, no
- odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
- gill-attachment: attached, descending, free, notched
- gill-spacing: close, crowded, distant
- gill-size: broad, narrow
- gill-color: black, brown, buff, chocolate, gray, green, orange, pink, purple, red, white, yellow
- stalk-shape: enlarging, tapering
- stalk-root: bulbous, club, cup, equal, rhizomorphs, rooted, missing
- stalk-surface-above-ring: fibrous, scaly, silky, smooth
- stalk-surface-below-ring: fibrous, scaly, silky, smooth
- stalk-color-above-ring: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
- stalk-color-below-ring: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
- veil-type: partial, universal
- veil-color: brown, orange, white, yellow
- ring-number: none, one, two
- ring-type: cobwebby, evanescent, flaring, large, none, pendant, sheathing, zone
- spore-print-color: black, brown, buff, chocolate, green, orange, purple, white, yellow
- population: abundant, clustered, numerous, scattered, several, solitary
- habitat: grasses, leaves, meadows, paths, urban, waste, woods
The target column is the outcome (edible | poisonous) that we would like our model to predict.
The image below is provided as a visual aid for the reader to recognize different features of a mushroom.
Loading the libraries
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)
import matplotlib.pyplot as plt
%matplotlib inline
Loading the Data
mushroom_df = pd.read_csv('mushrooms.csv')
Data Exploration
mushroom_df.iloc[:5,:13]
 | target | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | edible | convex | smooth | white | bruises | almond | free | crowded | narrow | white | tapering | bulbous | smooth |
1 | edible | convex | smooth | white | bruises | almond | free | crowded | narrow | white | tapering | bulbous | smooth |
2 | edible | convex | smooth | white | bruises | almond | free | crowded | narrow | pink | tapering | bulbous | smooth |
3 | edible | convex | smooth | white | bruises | almond | free | crowded | narrow | pink | tapering | bulbous | smooth |
4 | edible | convex | smooth | white | bruises | almond | free | crowded | narrow | brown | tapering | bulbous | smooth |
mushroom_df.iloc[:5,13:]
 | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|
0 | smooth | white | white | partial | white | one | pendant | purple | several | woods |
1 | smooth | white | white | partial | white | one | pendant | brown | several | woods |
2 | smooth | white | white | partial | white | one | pendant | purple | several | woods |
3 | smooth | white | white | partial | white | one | pendant | brown | several | woods |
4 | smooth | white | white | partial | white | one | pendant | purple | several | woods |
mushroom_df.info()
RangeIndex: 8416 entries, 0 to 8415
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   target                    8416 non-null   category
 1   cap-shape                 8416 non-null   category
 2   cap-surface               8416 non-null   category
 3   cap-color                 8416 non-null   category
 4   bruises                   8416 non-null   category
 5   odor                      8416 non-null   category
 6   gill-attachment           8416 non-null   category
 7   gill-spacing              8416 non-null   category
 8   gill-size                 8416 non-null   category
 9   gill-color                8416 non-null   category
 10  stalk-shape               8416 non-null   category
 11  stalk-root                8416 non-null   category
 12  stalk-surface-above-ring  8416 non-null   category
 13  stalk-surface-below-ring  8416 non-null   category
 14  stalk-color-above-ring    8416 non-null   category
 15  stalk-color-below-ring    8416 non-null   category
 16  veil-type                 8416 non-null   category
 17  veil-color                8416 non-null   category
 18  ring-number               8416 non-null   category
 19  ring-type                 8416 non-null   category
 20  spore-print-color         8416 non-null   category
 21  population                8416 non-null   category
 22  habitat                   8416 non-null   category
dtypes: category(23)
memory usage: 194.1 KB
We start by looking at the datatypes of the feature and output variables.
mushroom_df.dtypes
target                      object
cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object
We first change the type of all columns from object to category.
from pandas.api.types import CategoricalDtype
mushroom_df = mushroom_df.astype("category")
mushroom_df.dtypes
target                      category
cap-shape                   category
cap-surface                 category
cap-color                   category
bruises                     category
odor                        category
gill-attachment             category
gill-spacing                category
gill-size                   category
gill-color                  category
stalk-shape                 category
stalk-root                  category
stalk-surface-above-ring    category
stalk-surface-below-ring    category
stalk-color-above-ring      category
stalk-color-below-ring      category
veil-type                   category
veil-color                  category
ring-number                 category
ring-type                   category
spore-print-color           category
population                  category
habitat                     category
dtype: object
target is the output column and the rest are the features of the mushroom.
features = list(mushroom_df.columns[mushroom_df.columns!='target'])
print(features)
['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
Next, we divide the data into the feature matrix $X$ and outcome vector $y$.
X = mushroom_df[features].values #features
y = mushroom_df['target'].values #target class
Encoding the categorical features
As we noticed, all the features in this dataset are categorical, but machine learning models require all input and output variables to be numeric. There are multiple approaches to converting categorical variables into numeric form, such as ordinal encoding, one-hot encoding, and dummy variable encoding. For machine-learning applications, it is almost always safest to handle categorical data with One-Hot encoding. We use sklearn's OneHotEncoder in this project.
Let's briefly overview what happens when we One-Hot encode a categorical feature. Let's assume we have a feature column color with colors yellow, blue, and purple.
index | color |
---|---|
0 | blue |
1 | purple |
2 | blue |
3 | blue |
4 | blue |
5 | yellow |
6 | blue |
7 | blue |
8 | yellow |
9 | blue |
10 | purple |
11 | purple |
The One-Hot encoder encodes each color using a unique vector of length $n$ with $n-1$ $0$s and a single $1$, where $n$ is the number of unique colors. The position of the $1$ in the vector depends on the color. For example, the One-Hot encoded representation of the above example looks like:
index | color_blue | color_purple | color_yellow |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
5 | 0 | 0 | 1 |
6 | 1 | 0 | 0 |
7 | 1 | 0 | 0 |
8 | 0 | 0 | 1 |
9 | 1 | 0 | 0 |
10 | 0 | 1 | 0 |
11 | 0 | 1 | 0 |
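As a quick check, the toy table above can be reproduced with a short sketch (the color column is the hypothetical example from this section, not part of the mushroom data):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# hypothetical toy column matching the table above
toy = pd.DataFrame({'color': ['blue', 'purple', 'blue', 'blue', 'blue', 'yellow',
                              'blue', 'blue', 'yellow', 'blue', 'purple', 'purple']})
oh = OneHotEncoder()
toy_enc = oh.fit_transform(toy[['color']]).toarray()   # dense array for easy inspection
print(oh.get_feature_names(['color']))                 # one column per unique color
print(toy_enc[:3])                                     # first three one-hot rows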
Now that we have a basic understanding about encoding the categorical variables into numerics, we can define a function to transform categorical features to One-Hot encoded features.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
# encode the input data
def prepare_inputs(X_train, X_test):
    # Set the "handle_unknown" argument to "ignore". This is useful in case the encoder meets a
    # new feature level at prediction time. For example, you train a model with the colors
    # "blue", "purple", and "yellow", and the color "red" appears in the test data.
    oh_encoder = OneHotEncoder(handle_unknown="ignore")
    oh_encoder.fit(X_train)
    X_train_enc = oh_encoder.transform(X_train)
    X_test_enc = oh_encoder.transform(X_test)
    return X_train_enc, X_test_enc

# encode the target
def prepare_targets(y_train, y_test):
    # LabelEncoder maps each class label to an integer; it is the counterpart of OneHotEncoder
    # intended for the target variable (labels)
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc
Splitting the dataset into training (70%) and test (30%) data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
The One-Hot encoded feature names can be obtained using the get_feature_names method of the encoder:
oh_encoder_all = OneHotEncoder()
oh_encoder_all.fit(X)
encoded_features = oh_encoder_all.get_feature_names(features)
print(encoded_features)
['cap-shape_bell' 'cap-shape_conical' 'cap-shape_convex' 'cap-shape_flat' 'cap-shape_knobbed' 'cap-shape_sunken' 'cap-surface_fibrous' 'cap-surface_grooves' 'cap-surface_scaly' 'cap-surface_smooth' 'cap-color_brown' 'cap-color_buff' 'cap-color_cinnamon' 'cap-color_gray' 'cap-color_green' 'cap-color_pink' 'cap-color_purple' 'cap-color_red' 'cap-color_white' 'cap-color_yellow' 'bruises_bruises' 'bruises_no' 'odor_almond' 'odor_anise' 'odor_creosote' 'odor_fishy' 'odor_foul' 'odor_musty' 'odor_none' 'odor_pungent' 'odor_spicy' 'gill-attachment_attached' 'gill-attachment_free' 'gill-spacing_close' 'gill-spacing_crowded' 'gill-size_broad' 'gill-size_narrow' 'gill-color_black' 'gill-color_brown' 'gill-color_buff' 'gill-color_chocolate' 'gill-color_gray' 'gill-color_green' 'gill-color_orange' 'gill-color_pink' 'gill-color_purple' 'gill-color_red' 'gill-color_white' 'gill-color_yellow' 'stalk-shape_enlarging' 'stalk-shape_tapering' 'stalk-root_?' 'stalk-root_bulbous' 'stalk-root_club' 'stalk-root_equal' 'stalk-root_rooted' 'stalk-surface-above-ring_fibrous' 'stalk-surface-above-ring_scaly' 'stalk-surface-above-ring_silky' 'stalk-surface-above-ring_smooth' 'stalk-surface-below-ring_fibrous' 'stalk-surface-below-ring_scaly' 'stalk-surface-below-ring_silky' 'stalk-surface-below-ring_smooth' 'stalk-color-above-ring_brown' 'stalk-color-above-ring_buff' 'stalk-color-above-ring_cinnamon' 'stalk-color-above-ring_gray' 'stalk-color-above-ring_orange' 'stalk-color-above-ring_pink' 'stalk-color-above-ring_red' 'stalk-color-above-ring_white' 'stalk-color-above-ring_yellow' 'stalk-color-below-ring_brown' 'stalk-color-below-ring_buff' 'stalk-color-below-ring_cinnamon' 'stalk-color-below-ring_gray' 'stalk-color-below-ring_orange' 'stalk-color-below-ring_pink' 'stalk-color-below-ring_red' 'stalk-color-below-ring_white' 'stalk-color-below-ring_yellow' 'veil-type_partial' 'veil-color_brown' 'veil-color_orange' 'veil-color_white' 'veil-color_yellow' 'ring-number_none' 'ring-number_one' 'ring-number_two' 'ring-type_evanescent' 'ring-type_flaring' 'ring-type_large' 'ring-type_none' 'ring-type_pendant' 'spore-print-color_black' 'spore-print-color_brown' 'spore-print-color_buff' 'spore-print-color_chocolate' 'spore-print-color_green' 'spore-print-color_orange' 'spore-print-color_purple' 'spore-print-color_white' 'spore-print-color_yellow' 'population_abundant' 'population_clustered' 'population_numerous' 'population_scattered' 'population_several' 'population_solitary' 'habitat_grasses' 'habitat_leaves' 'habitat_meadows' 'habitat_paths' 'habitat_urban' 'habitat_waste' 'habitat_woods']
print(f'The dataset has {X.shape[1]} features and {len(encoded_features)} One-Hot encoded features')
The dataset has 22 features and 117 One-Hot encoded features.
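As a quick sanity check, the same count can be recovered from the fitted encoder's categories_ attribute:
# each entry of categories_ holds the unique levels of one original feature
n_encoded = sum(len(cats) for cats in oh_encoder_all.categories_)
print(n_encoded)   # expected to match the 117 encoded feature names listed above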
Baseline Classifier
We start off with a baseline solution and train a logistic regression classifier. Later, we compare the performance of the decision tree model against this baseline and see whether we can come up with a decision tree that surpasses it.
from sklearn.linear_model import LogisticRegression
clf_logreg = LogisticRegression(solver='lbfgs')
clf_logreg.fit(X_train_enc, y_train_enc)
y_pred = clf_logreg.predict(X_test_enc)
We can use different metrics to assess the performance of the logistic regression classifier.
from sklearn.metrics import accuracy_score, classification_report
print(classification_report(y_test_enc, y_pred, digits=3))
              precision    recall  f1-score   support

           0      1.000     1.000     1.000      1317
           1      1.000     1.000     1.000      1208

    accuracy                          1.000      2525
   macro avg      1.000     1.000     1.000      2525
weighted avg      1.000     1.000     1.000      2525
accuracy = accuracy_score(y_test_enc, y_pred)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 100.00
The baseline classifier reaches a perfect 100% accuracy! This makes a more complex solution pointless, since the baseline is already flawless. Let's reduce the number of training samples to make the problem a bit harder for our baseline classifier.
Using a smaller training set: splitting the dataset into training (10%) and test (90%) data
# We are going to use indices later!
indices = np.arange(len(mushroom_df))
X_train, X_test, y_train, y_test, tr_ids, test_ids = train_test_split(X, y, indices, test_size=0.9, random_state=10)
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
clf_logreg = LogisticRegression(solver='lbfgs')
clf_logreg.fit(X_train_enc, y_train_enc)
y_pred = clf_logreg.predict(X_test_enc)
print(classification_report(y_test_enc, y_pred, digits=3))
              precision    recall  f1-score   support

           0      0.996     1.000     0.998      4052
           1      1.000     0.995     0.998      3523

    accuracy                          0.998      7575
   macro avg      0.998     0.998     0.998      7575
weighted avg      0.998     0.998     0.998      7575
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test_enc, y_pred)
cm
array([[4052,    0],
       [  16, 3507]])
fig, ax = plt.subplots(1, 1, dpi=120)
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['edible', 'poisonous'])
ax.set_title('Logistic regression confusion matrix')
cm_plot.plot(ax=ax);
Even though we really pushed the limits of the logistic regression classifier by shrinking the training set, it still does a great job of predicting the correct class. There are, however, 16 instances where the classifier makes a type II error by labeling poisonous mushrooms as edible. This is concerning, because in reality such mistakes can have harsh consequences. We can now start building the decision tree classifier and see if we can improve on the baseline model's ability to identify poisonous mushrooms.
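The count of these dangerous mistakes can be read directly off the confusion matrix; a small sketch, assuming the cm array computed above (rows are true labels, columns are predictions, in the order edible, poisonous):
tn, fp = cm[0]   # edible mushrooms: correctly kept / wrongly flagged as poisonous
fn, tp = cm[1]   # poisonous mushrooms: wrongly labeled edible (type II errors) / correctly flagged
print(f'poisonous mushrooms labeled edible: {fn}')
print(f'false negative rate: {fn / (fn + tp):.4f}')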
Decision Tree Classifier
In this section we train two decision tree classifiers with exactly the same settings except for the criterion used to assess the quality of a split: one model uses the Gini impurity index, whereas the other uses entropy (also known as information gain).
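For reference, for a node with class proportions $p_k$ the two criteria are $\mathrm{Gini} = 1 - \sum_k p_k^2$ and $H = -\sum_k p_k \log_2 p_k$; a minimal sketch (the example proportions are hypothetical):
import numpy as np

def gini(p):
    # Gini impurity for a vector of class proportions
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy (in bits) for a vector of class proportions
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # perfectly mixed node: 0.5 and 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node: both zero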
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
# Create Decision Tree classifer object
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=10, max_depth=5, max_leaf_nodes=10)
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=10, max_depth=5, max_leaf_nodes=10)
# Train Decision Tree Classifer
clf_gini = clf_gini.fit(X_train_enc,y_train_enc)
clf_entropy = clf_entropy.fit(X_train_enc,y_train_enc)
# Predict the response for test dataset
y_pred_gini = clf_gini.predict(X_test_enc)
y_pred_entropy = clf_entropy.predict(X_test_enc)
print(classification_report(y_test_enc, y_pred_gini, digits=3))
              precision    recall  f1-score   support

           0      0.996     0.994     0.995      4052
           1      0.993     0.995     0.994      3523

    accuracy                          0.995      7575
   macro avg      0.995     0.995     0.995      7575
weighted avg      0.995     0.995     0.995      7575
print(classification_report(y_test_enc, y_pred_entropy, digits=3))
              precision    recall  f1-score   support

           0      0.998     1.000     0.999      4052
           1      1.000     0.998     0.999      3523

    accuracy                          0.999      7575
   macro avg      0.999     0.999     0.999      7575
weighted avg      0.999     0.999     0.999      7575
fig, ax = plt.subplots(1, 1, dpi=120)
cm = confusion_matrix(y_test_enc, y_pred_gini)
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['edible', 'poisonous'])
ax.set_title('Decision tree (gini) confusion matrix')
cm_plot.plot(ax=ax);
fig, ax = plt.subplots(1, 1, dpi=120)
cm = confusion_matrix(y_test_enc, y_pred_entropy)
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['edible', 'poisonous'])
ax.set_title('Decision tree (entropy) confusion matrix')
cm_plot.plot(ax=ax);
We note that the decision tree with the entropy criterion reaches a higher accuracy than the Gini tree, as well as the baseline classifier. We can generate a plot of the decision tree and look at the rules the tree uses to define each split.
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data
import pydotplus
dot_data = StringIO()
# Training feature names
oh_encoder_tr = OneHotEncoder(handle_unknown="ignore")
oh_encoder_tr.fit(X_train)
encoded_tr_features = oh_encoder_tr.get_feature_names(features)
# Export the tree
export_graphviz(clf_entropy, out_file=dot_data,
filled=True, rounded=True,
special_characters=False,
feature_names = encoded_tr_features,
class_names=['edible', 'poisonous'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png(), width=800, height=400)
Another simple yet useful way to present the decision tree is the tree.export_text function of sklearn:
from sklearn.tree import export_text
tree_text = export_text(clf_entropy, feature_names = list(encoded_tr_features))
print(tree_text)
|--- odor_none <= 0.50
|   |--- bruises_bruises <= 0.50
|   |   |--- class: 1
|   |--- bruises_bruises > 0.50
|   |   |--- stalk-root_club <= 0.50
|   |   |   |--- stalk-root_rooted <= 0.50
|   |   |   |   |--- gill-spacing_crowded <= 0.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- gill-spacing_crowded > 0.50
|   |   |   |   |   |--- class: 0
|   |   |   |--- stalk-root_rooted > 0.50
|   |   |   |   |--- class: 0
|   |   |--- stalk-root_club > 0.50
|   |   |   |--- class: 0
|--- odor_none > 0.50
|   |--- spore-print-color_green <= 0.50
|   |   |--- stalk-surface-below-ring_scaly <= 0.50
|   |   |   |--- class: 0
|   |   |--- stalk-surface-below-ring_scaly > 0.50
|   |   |   |--- ring-type_evanescent <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- ring-type_evanescent > 0.50
|   |   |   |   |--- class: 1
|   |--- spore-print-color_green > 0.50
|   |   |--- class: 1
Model diagnosis
One critical step before attempting to improve the accuracy of any model is to figure out what is going wrong. In our case, we would like to see what causes the model to give wrong predictions. First, we look at the feature levels (one-hot columns) that were used to train the model and check whether the test set contains any levels the model has not seen before. This is a common situation when dealing with decision tree classifiers and categorical features.
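With handle_unknown='ignore', a level the encoder never saw during fitting is encoded as an all-zero block for that column, so the corresponding record looks as if it had none of the known levels. A tiny sketch with hypothetical toy values:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder(handle_unknown='ignore')
oh.fit(np.array([['blue'], ['purple'], ['yellow']]))     # training levels
print(oh.transform(np.array([['red']])).toarray())       # unseen level -> [[0. 0. 0.]]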
oh_encoder = OneHotEncoder()
oh_encoder.fit(X_test)
encoded_test_features = oh_encoder.get_feature_names(features)
len(encoded_test_features), len(encoded_tr_features)
(117, 113)
We can take a look at the difference between the two lists:
sorted(set(encoded_test_features) - set(encoded_tr_features))
['cap-shape_conical', 'cap-surface_grooves', 'stalk-color-above-ring_yellow', 'veil-color_yellow']
The model has not seen these features in the training phase. Next, we take a look at the records that have at least one of the above features.
unseen_data = mushroom_df[
(mushroom_df['cap-shape']=='conical')
| (mushroom_df['cap-surface']=='grooves')
| (mushroom_df['veil-color']=='yellow')
| (mushroom_df['stalk-color-above-ring']=='yellow')
]
unseen_data
 | target | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6064 | poisonous | bell | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6066 | poisonous | conical | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6067 | poisonous | conical | scaly | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6068 | poisonous | flat | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6070 | poisonous | knobbed | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
7880 | poisonous | bell | scaly | yellow | no | none | free | crowded | narrow | yellow | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7881 | poisonous | bell | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7882 | poisonous | conical | scaly | yellow | no | none | free | crowded | narrow | yellow | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7883 | poisonous | conical | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7884 | poisonous | flat | scaly | yellow | no | none | free | crowded | narrow | yellow | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7885 | poisonous | flat | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7886 | poisonous | knobbed | scaly | yellow | no | none | free | crowded | narrow | yellow | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
7887 | poisonous | knobbed | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
13 rows × 23 columns
X_test_unseen = unseen_data[features].values
__, X_test_unseen_enc = prepare_inputs(X_train, X_test_unseen)
y_pred_unseen = clf_entropy.predict(X_test_unseen_enc)
y_pred_unseen
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
We notice that 5 out of the 8 wrong predictions belong to instances with feature levels that the model has not seen during the training phase. All the wrong predictions can be found using:
false_predictions = np.where(y_test_enc!=y_pred_entropy)
false_predictions
(array([3535, 4237, 4628, 4734, 5692, 6063, 7159, 7283]),)
np.sort(test_ids[false_predictions])
array([6064, 6065, 6066, 6067, 6068, 6069, 6070, 6071])
Now we can take a closer look at the records that were misclassified by the tree:
mushroom_df.iloc[np.sort(test_ids[false_predictions]), :12]
 | target | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6064 | poisonous | bell | grooves | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6065 | poisonous | bell | scaly | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6066 | poisonous | conical | grooves | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6067 | poisonous | conical | scaly | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6068 | poisonous | flat | grooves | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6069 | poisonous | flat | scaly | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6070 | poisonous | knobbed | grooves | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
6071 | poisonous | knobbed | scaly | white | bruises | none | free | crowded | narrow | white | enlarging | bulbous |
Interestingly, a little bit of investigation shows that the misclassified instances can be found using the following rules:
mushroom_df[
(mushroom_df['cap-color']=='white')
& (mushroom_df['bruises']=='bruises')
& (mushroom_df['odor']=='none')
& (mushroom_df['gill-spacing']=='crowded')
]
 | target | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6064 | poisonous | bell | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6065 | poisonous | bell | scaly | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6066 | poisonous | conical | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6067 | poisonous | conical | scaly | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6068 | poisonous | flat | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6069 | poisonous | flat | scaly | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6070 | poisonous | knobbed | grooves | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
6071 | poisonous | knobbed | scaly | white | bruises | none | free | crowded | narrow | white | ... | smooth | white | white | partial | white | one | pendant | white | clustered | leaves |
8 rows × 23 columns
But the model never encountered this combination during training, as seen below:
mushroom_df_tr = mushroom_df.iloc[np.sort(tr_ids), :]
mushroom_df_tr[
(mushroom_df_tr['cap-color']=='white')
& (mushroom_df_tr['bruises']=='bruises')
& (mushroom_df_tr['odor']=='none')
& (mushroom_df_tr['gill-spacing']=='crowded')
]
Empty DataFrame: 0 rows × 23 columns
We can investigate what caused these wrong predictions by considering the features that the misclassified instances have in common and walking down the tree to see at which split the misclassification occurs:
- odor=none, so odor_none $\leq$ 0.5 is False (odor_none = 1) and at the first split we move to the right.
- spore-print-color=white, so spore-print-color_green $\leq$ 0.5 is True and at the next split we move to the left.
- stalk-surface-below-ring=smooth, so stalk-surface-below-ring_scaly $\leq$ 0.5 is True and at the next split we move to the left.
At this point we have reached a leaf that, according to the trained model, labels the misclassified instances as edible. The process is shown with a purple arrow in the figure below.
Note that this means that no matter how much we change the tree structure (increase the tree depth, the number of leaf nodes, etc.), these instances will remain misclassified; in other words, increasing the complexity of the tree is not a solution. We can, however, look for a tree with the same structure as the current one by trying different training-set samplings. We also test the baseline estimator on these samplings for the sake of comparison.
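The walk described above can also be reproduced programmatically with the tree's decision_path method; a short sketch, assuming the X_test_unseen_enc and encoded_tr_features variables defined earlier:
# follow the first unseen record (index 6064 in the table above) from the root down to its leaf
sample = X_test_unseen_enc[0]
node_indicator = clf_entropy.decision_path(sample)
leaf_id = clf_entropy.apply(sample)[0]

tree = clf_entropy.tree_
for node_id in node_indicator.indices:
    if node_id == leaf_id:
        print(f'leaf {node_id}: predicted class {np.argmax(tree.value[node_id])}')
    else:
        name = encoded_tr_features[tree.feature[node_id]]
        print(f'node {node_id}: split on {name} <= {tree.threshold[node_id]:.2f}')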
Cross-validation
Here we use a form of cross-validation, re-sampling the training data 20 times while keeping the training-set size fixed at 10%.
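The same experiment can also be expressed with scikit-learn's ShuffleSplit and a pipeline that keeps the encoding inside each re-sample; a sketch for comparison (it reports the scores but does not keep the best fitted model the way the loop below does):
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# 10% training / 90% test splits, re-sampled 20 times
pipe = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    DecisionTreeClassifier(criterion='entropy', random_state=10, max_depth=5, max_leaf_nodes=10),
)
splitter = ShuffleSplit(n_splits=20, train_size=0.1, test_size=0.9, random_state=10)
scores = cross_val_score(pipe, X, y, cv=splitter)
print(scores.mean(), scores.max())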
num_val_tests = 20
best_tree_model = clf_entropy
best_tree_score = best_tree_model.score(X_test_enc, y_test_enc)
best_logreg_model = clf_logreg
best_logreg_score = best_logreg_model.score(X_test_enc, y_test_enc)
for i in range(num_val_tests):
    seed = np.random.randint(1000)
    # train/test split
    X_train, X_test, y_train, y_test, tr_ids, test_ids = train_test_split(
        X, y, indices, test_size=0.9, random_state=seed
    )
    # instantiate the DecisionTreeClassifier and the LogisticRegression
    clf_tree_tmp = DecisionTreeClassifier(
        criterion="entropy", random_state=10, max_depth=5, max_leaf_nodes=10
    )
    clf_logreg_tmp = LogisticRegression(solver='lbfgs')
    # encode train/test
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # fit
    clf_tree_tmp = clf_tree_tmp.fit(X_train_enc, y_train_enc)
    clf_logreg_tmp = clf_logreg_tmp.fit(X_train_enc, y_train_enc)
    # evaluate
    tree_score = clf_tree_tmp.score(X_test_enc, y_test_enc)
    logreg_score = clf_logreg_tmp.score(X_test_enc, y_test_enc)
    # compare with the best tree model so far
    if tree_score > best_tree_score:
        best_tree_score = tree_score
        best_tree_model = clf_tree_tmp
        X_enc_best_tree = X_test_enc
        y_enc_best_tree = y_test_enc
        seed_best_tree = seed
    # repeat for the logistic regression model
    if logreg_score > best_logreg_score:
        best_logreg_score = logreg_score
        best_logreg_model = clf_logreg_tmp
        X_enc_best_logreg = X_test_enc
        y_enc_best_logreg = y_test_enc
        seed_best_logreg = seed
print(f"best decision tree model reaches the accuracy {np.round(best_tree_score*100, 3)}%")
print(f"best logistic regression model reaches the accuracy {np.round(best_logreg_score*100, 3)}%")
best decision tree model reaches the accuracy 100.0%
best logistic regression model reaches the accuracy 99.908%
We note that, without having to increase the complexity of the model, we were able to find a tree that classifies its test set perfectly.
y_pred_tree = best_tree_model.predict(X_enc_best_tree)
y_pred_logreg = best_logreg_model.predict(X_enc_best_logreg)
print(classification_report(y_enc_best_tree, y_pred_tree, digits=3))
              precision    recall  f1-score   support

           0      1.000     1.000     1.000      4015
           1      1.000     1.000     1.000      3560

    accuracy                          1.000      7575
   macro avg      1.000     1.000     1.000      7575
weighted avg      1.000     1.000     1.000      7575
fig, ax = plt.subplots(1, 1, dpi=120)
cm = confusion_matrix(y_enc_best_tree, y_pred_tree)
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['edible', 'poisonous'])
ax.set_title('Decision tree confusion matrix')
cm_plot.plot(ax=ax);
print(classification_report(y_enc_best_logreg, y_pred_logreg, digits=3))
              precision    recall  f1-score   support

           0      0.999     1.000     0.999      4084
           1      1.000     0.998     0.999      3491

    accuracy                          0.999      7575
   macro avg      0.999     0.999     0.999      7575
weighted avg      0.999     0.999     0.999      7575
fig, ax = plt.subplots(1, 1, dpi=120)
cm = confusion_matrix(y_enc_best_logreg, y_pred_logreg)
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['edible', 'poisonous'])
ax.set_title('Logistic regression confusion matrix')
cm_plot.plot(ax=ax);
Visualizing the best tree
dot_data = StringIO()
# Training feature names
oh_encoder_tr = OneHotEncoder(handle_unknown="ignore")
X_train, X_test, y_train, y_test, tr_ids, test_ids = train_test_split(
X, y, indices, test_size=0.9, random_state=seed_best_tree
)
oh_encoder_tr.fit(X_train)
encoded_tr_features = oh_encoder_tr.get_feature_names(features)
# Export the tree
export_graphviz(best_tree_model, out_file=dot_data,
filled=True, rounded=True,
special_characters=False,
feature_names = encoded_tr_features,
class_names=['edible', 'poisonous'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png(), width=800, height=400)
Note that the features the best tree splits on are quite different from the ones used previously, since they depend on the training set used to fit the tree.
Conclusion
- In this post we applied a decision tree classifier to detect poisonous mushrooms in a mushroom dataset with 8416 records.
- We learned that decision tree classifiers (with one-hot encoded inputs and handle_unknown='ignore') can cope with feature levels in the test set that were never seen during training, but the instances carrying those levels are likely to be misclassified.
- We observed that in certain situations the error is due to a lack of training data, and the only way to fix the misclassified instances is to train the model on different data; we achieved this through the re-sampling scheme above.
- Cross-validation is a great asset for keeping the model simple and preventing overfitting.
- We noticed that diagnosing the model before attempting to increase its complexity can be beneficial and lead to better accuracy without necessarily making the model more complex.