Lesson 3#

So far we have only used one classification algorithm, logistic regression. Now let’s see if we can do better with a different algorithm called a “random forest”.

First, let’s download the data and import the libraries we’ll need.

import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
if not os.path.exists("utils.py"):
    !wget https://raw.githubusercontent.com/drivendataorg/tutorial-flu-shot-learning/main/utils.py
from utils import download_data_files

download_data_files()

And we’ll read the data into pandas DataFrames.

labels_df = pd.read_csv("training_set_labels.csv", index_col="respondent_id")
features_df = pd.read_csv("training_set_features.csv", index_col="respondent_id")
test_features_df = pd.read_csv("test_set_features.csv", index_col="respondent_id")

Now we’re ready.

Random forest classification#

A random forest is an example of an “ensemble algorithm”, which means that instead of fitting one large model, it fits many small models and aggregates their results. The small models tend to be simple, and each one is not very accurate by itself. But combining many bad models turns out to be much more effective than you might expect.

In a random forest, the small models are decision trees (so a forest is a collection of trees – get it?). The simplest decision tree has a single branch, which is like an if statement based on a single feature. For example, if we choose opinion_h1n1_risk as a feature, we could look at the fraction of people who got the H1N1 vaccine at each level of perceived risk:

from utils import crosstab

crosstab(labels_df["h1n1_vaccine"], features_df["opinion_h1n1_risk"])
opinion_h1n1_risk 1.0 2.0 3.0 4.0 5.0
h1n1_vaccine
0 91.165991 83.203952 82.63205 60.789766 48.914286
1 8.834009 16.796048 17.36795 39.210234 51.085714

Based on these results, we might create a branch that checks whether opinion_h1n1_risk is greater than 4: if so, it predicts that the person was vaccinated; otherwise it predicts they were not. Decision trees can contain multiple branches, based on additional features.
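To make that concrete, here is a minimal sketch (not part of our pipeline) of that one-branch rule as a plain Python function, using the threshold suggested by the table above.

def single_branch_predict(opinion_h1n1_risk):
    # Toy one-branch "tree": predict vaccinated (1) only at the highest risk level.
    return np.where(opinion_h1n1_risk > 4, 1, 0)

single_branch_predict(np.array([5.0, 2.0]))  # array([1, 0])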

In a random forest, each tree is based on a random subset of the training data and a random subset of the features. So we can quickly generate a large number of different trees, each of which learns different relationships between the features and the labels.

Now suppose we have 100 trees and we want to generate a prediction. For a given person, we use each tree to make a prediction, then aggregate the predictions. A simple way to aggregate is voting: if 70 out of 100 trees predict that a particular person was vaccinated, the predicted probability would be 70%.
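To make the aggregation step concrete, here is a tiny sketch with hypothetical votes (no real trees involved).

# Hypothetical example: 70 of 100 trees vote "vaccinated" (1) and 30 vote "not" (0).
tree_votes = np.array([1] * 70 + [0] * 30)

# Aggregating by voting: the predicted probability is the fraction of "yes" votes.
tree_votes.mean()  # 0.7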

Scikit-learn provides RandomForestClassifier, which implements a random forest algorithm. We can use it in a MultiOutputClassifier, just as we did with LogisticRegression.

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

classifier = MultiOutputClassifier(estimator=RandomForestClassifier())

The rest of the pipeline is the same as in the previous lesson.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline(
    [
        ("standard_scaler", StandardScaler()),
        ("simple_imputer", SimpleImputer(strategy="mean")),
    ]
)
numeric_features_df = features_df.select_dtypes(include=np.number)
numeric_cols = numeric_features_df.columns
categoric_features_df = features_df.select_dtypes(include=np.object_)
categoric_cols = categoric_features_df.columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, numeric_cols),
        ("categoric", OneHotEncoder(), categoric_cols),
    ]
)
model6 = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", classifier),
    ]
)

Now we can fit the model with the training data.

model6.fit(features_df, labels_df)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('standard_scaler',
                                                                   StandardScaler()),
                                                                  ('simple_imputer',
                                                                   SimpleImputer())]),
                                                  Index(['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outsid...
       'household_children'],
      dtype='object')),
                                                 ('categoric', OneHotEncoder(),
                                                  Index(['age_group', 'education', 'race', 'sex', 'income_poverty',
       'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region',
       'census_msa', 'employment_industry', 'employment_occupation'],
      dtype='object'))])),
                ('classifier',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

And let’s see what happens when we test it with the training data (and the scoring function from the previous lesson).

from utils import score_model

score_model(model6, features_df, labels_df)
1.0

The model gets a perfect score!

At first this sounds like good news, but it is telling us more about the algorithm than the data. Random forest models are extremely flexible: by default the trees keep growing until they have, in effect, memorized the training data. So if we train and test on the same data, the score looks very high. Then if we test on a new dataset, it is usually lower.

To see how much lower, let’s generate predictions on the test data and submit them on the competition site.

from utils import make_submission

make_submission(model6, test_features_df).to_csv("submission3.csv")

The AUC score on the test data is 0.8503. On one hand, that’s better than the logistic regression model in the previous lesson, which scored 0.8355. So we’re moving up the leaderboard!

But the score on the test data is a lot less than 1. That means that the model is overfit: it has learned details of the training data that don’t apply to other datasets.

Evaluating without peeking#

At this point we have tried two algorithms and experimented with a few variations, like different kinds of regularization. There are more algorithms we could try, and many more variations. But as we’ve just seen, if we train and test a model with the same data, the score we get does not indicate how well it will do with new data.

As we test new algorithms and variations, we could generate predictions, submit them on the competition site, and compute the score on the test data. But that would be a bad idea for several reasons, two practical and one philosophical. The practical reasons are:

  • That process is slow and hard to automate, so it would take a long time to try more than a few experiments.

  • We are only allowed to submit three predictions to the competition site per day.

The philosophical reason is that every time we submit a prediction and get a score, we get some information about the true labels of the test data. As an extreme example, we could start with a random set of predictions, get a baseline score, and then change one prediction at a time. If the score goes up, that prediction was right; if the score goes down, it was wrong. With enough submissions, we could infer all of the true labels and get a perfect score.

Even if we don’t abuse the system as explicitly as that, it is nevertheless true that every submission leaks information about the test data. If we make too many submissions, we are likely to overfit the test data, and then the model will be less accurate with a new, unseen dataset.

So, how can we compare algorithms and tune their parameters without overfitting? One option is to divide the training data into two splits, use one split to fit the model, and use the other to evaluate it.

In this framework, instead of two datasets, training and test, we have three:

  • The training split, used to fit a model and find the best coefficients;

  • The validation split, used to tune a model and find the best parameters;

  • The test set, used to evaluate the model on unseen data.

This framework is such a common part of machine learning practice that Scikit-learn provides functions to support it, including train_test_split. The following example shows how it works, dividing the features into X_train and X_val, and dividing the labels into y_train and y_val.

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    features_df,
    labels_df,
    test_size=0.33,
    shuffle=True,
    stratify=labels_df,
)

test_size specifies that the second split (which we are using as the validation split) should have 33% of the data, which means that the training split gets the other 67%.

shuffle indicates that the rows should be shuffled before splitting, which means that the resulting split is random.

stratify indicates that the splits should be stratified; that is, the distribution of labels in each split should be the same as in labels_df. For example, we’ve seen that 21% of the respondents in the training data got the H1N1 vaccine.

labels_df["h1n1_vaccine"].value_counts(normalize=True)
0    0.787546
1    0.212454
Name: h1n1_vaccine, dtype: float64

If stratification was successful, the proportions should be approximately the same in the splits. And they are.

y_val["h1n1_vaccine"].value_counts(normalize=True)
0    0.787611
1    0.212389
Name: h1n1_vaccine, dtype: float64
y_train["h1n1_vaccine"].value_counts(normalize=True)
0    0.787515
1    0.212485
Name: h1n1_vaccine, dtype: float64

Stratification is generally a good idea, especially if some labels are substantially less common than others.

Now we can fit the model with just the training split, excluding the validation split.

model6.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('standard_scaler',
                                                                   StandardScaler()),
                                                                  ('simple_imputer',
                                                                   SimpleImputer())]),
                                                  Index(['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outsid...
       'household_children'],
      dtype='object')),
                                                 ('categoric', OneHotEncoder(),
                                                  Index(['age_group', 'education', 'race', 'sex', 'income_poverty',
       'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region',
       'census_msa', 'employment_industry', 'employment_occupation'],
      dtype='object'))])),
                ('classifier',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

Then we’ll compute a score with just the validation split.

score_model(model6, X_val, y_val)
0.8538234988169062

If you run the split-train-validate sequence a few times, you will get different scores because the split is random and the random forest is (wait for it) also random. But you should get scores close to 0.85, which is close to the score we got from the test data on the competition site.
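For example, here is a quick sketch that repeats the split-train-validate sequence with a few different seeds (it refits the model each time, so it takes a little while); random_state makes each split reproducible.

for seed in range(3):
    X_tr, X_va, y_tr, y_va = train_test_split(
        features_df, labels_df, test_size=0.33, shuffle=True,
        stratify=labels_df, random_state=seed,
    )
    model6.fit(X_tr, y_tr)
    print(seed, score_model(model6, X_va, y_va))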

So it works! We can get an accurate evaluation of the model without looking at the test data.

Tuning parameters#

Now let’s think about tuning the parameters of the random forest algorithm and see if we can improve the results. Here are the parameters of RandomForestClassifier.

RandomForestClassifier().get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

As an example, let’s experiment with max_features, which limits the number of features each tree considers when choosing a split. In the output above, the default value is 'auto' (called 'sqrt' in newer versions of Scikit-learn), which means that max_features is chosen automatically by computing the square root of the number of features.

In the input dataset, there are 35 features.

model6.n_features_in_
35

So the square root, and therefore the default value of max_features, is about 6.

With that in mind, let’s try a range of values from 4 to 14.

In order to set the value of a parameter, we have to select classifier from the pipeline, then select its estimator, and assign a value to one of its parameters. It’s a little clumsy, and we’ll see a better way soon, but here’s what it looks like.

model6["classifier"].estimator.max_features = 10

Now we can loop through a sequence of values, train the model with each value of max_features, score the models, and save the results.

res = []

for max_features in range(4, 15):
    model6["classifier"].estimator.max_features = max_features
    model6.fit(X_train, y_train)
    score = score_model(model6, X_val, y_val)
    res.append((max_features, score))

Here’s what the results look like.

from utils import decorate

x, y = np.transpose(res)
plt.plot(x, y)

decorate(xlabel="max_features", ylabel="AUC score")
[Figure: AUC score vs. max_features]

It looks like larger values of max_features are better, but the results are random, so if you run this test a few times, you’ll see that the results vary and the best value might be different from one run to the next.

We can get more reliable results using cross validation, which is the topic of the next section.

Cross-validation#

Instead of using one random split, an alternative is to use multiple non-random splits. One way to do that is \(k\)-fold cross-validation, where \(k\) is the number of splits. With k=5, which is a common choice, it works like this:

  1. Divide the training data into 5 equal-sized parts, called folds.

  2. Train the model using all folds except the first, then use the first fold to validate.

  3. Train the model using all folds except the second, then use the second fold to validate.

  4. Repeat for the other three folds.

  5. Compute the mean of the 5 validation scores.
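To make the procedure concrete, here is a by-hand sketch using KFold; the helper function in the next cell does all of this, including the scoring, in a single call.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(features_df):
    # Fit on four folds, then validate on the held-out fold.
    model6.fit(features_df.iloc[train_idx], labels_df.iloc[train_idx])
    score = score_model(model6, features_df.iloc[val_idx], labels_df.iloc[val_idx])
    fold_scores.append(score)

np.mean(fold_scores)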

As you might expect by now, Scikit-learn provides a function that automates this process. Here’s how it works.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model6, features_df, labels_df, cv=5, scoring=score_model)

The arguments are the model, the training features, and the training labels. The parameter cv=5 indicates 5-fold cross-validation. And scoring indicates the function that should be used to compute the scores.

scores
array([0.85485707, 0.851138  , 0.85685131, 0.86273168, 0.85069799])

Now let’s try again to find the best value of max_features. The following loop tests a range of values and uses 5-fold cross-validation for each. It might take a little while to run!

res_cv = []

for max_features in range(4, 15):
    print(max_features)
    model6["classifier"].estimator.max_features = max_features
    scores = cross_val_score(model6, features_df, labels_df, cv=5, scoring=score_model)
    res_cv.append((max_features, scores))
4
5
6
7
8
9
10
11
12
13
14

Here’s what the results look like. The x markers show the results from each of the five folds. The o markers show the average score for each value of max_features.

for x, ys in res_cv:
    plt.plot([x] * 5, ys, "x", color="gray")
    plt.plot(x, ys.mean(), "o", color="C0")

decorate(xlabel="max_features", ylabel="AUC score")
[Figure: AUC score vs. max_features, with x markers for individual folds and o markers for the 5-fold mean]

It looks like larger values of max_features are better than smaller values, but above 8 or 9 the differences are small and the results are highly variable. Still, since the default value based on the square root is about 6, it seems like we could do better by increasing it.

Gradient boosting#

Random forest classifiers are generally pretty good, but in the last decade they have been overshadowed by the success of gradient boosting. Compared to random forests, algorithms using gradient boosting tend to be faster and more accurate.

Gradient boosting is based on two ideas:

  • Instead of fitting many simple models to the same data, we fit a sequence of models where each model tries to correct the errors of the previous model, which is called boosting, and

  • At each stage, we choose the best model by minimizing a “cost” that quantifies the magnitude of the errors. This minimization can be done efficiently using gradient descent algorithms.

So gradient boosting might be more explicitly called “boosting using gradient descent”.
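As a toy illustration, here is a sketch of boosting on made-up regression data (not our dataset), where each small tree is fit to the errors of the ensemble so far; with squared error, the negative gradient of the cost is exactly the residual, which is why fitting the residuals counts as a gradient descent step.

from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)
for _ in range(50):
    residual = y - prediction                       # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1).fit(x, residual)
    prediction += learning_rate * stump.predict(x)  # take a small corrective step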

Scikit-learn provides several classification algorithms based on gradient boosting, including AdaBoostClassifier and GradientBoostingClassifier, but the one that’s recommended for datasets with more than 10,000 rows (like ours) is HistGradientBoostingClassifier.

One feature of this classifier is that it can handle missing data, so the preprocessing pipeline can be simple. For the numerical variables, we don’t have to do anything! For the categorical variables, we’ll use the ordinal encoder.

from sklearn.preprocessing import OrdinalEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", "passthrough", numeric_cols),
        ("categoric", OrdinalEncoder(), categoric_cols),
    ]
)

Now, when we make the HistGradientBoostingClassifier, we have to tell it which variables are categoric, which we’ll do by providing their indices in the list of features.

categoric_indices = [features_df.columns.get_loc(col) for col in categoric_cols]
categoric_indices
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 34]
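One detail to be careful about: ColumnTransformer arranges its output columns in the order of its transformers (here, the numeric columns first, then the categoric columns), so positions in the transformed array are not guaranteed to match positions in features_df. A hedged alternative, assuming the preprocessor defined above, is to compute the indices relative to the transformer output.

# Assuming the numeric passthrough comes first, the categoric columns occupy
# the last len(categoric_cols) positions of the preprocessor output.
transformed_categoric_indices = list(
    range(len(numeric_cols), len(numeric_cols) + len(categoric_cols))
)

For the rest of this lesson we stick with categoric_indices as computed above.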
from sklearn.ensemble import HistGradientBoostingClassifier

classifier = MultiOutputClassifier(
    estimator=HistGradientBoostingClassifier(
        categorical_features=categoric_indices,
    )
)

Here’s the complete model.

model7 = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", classifier),
    ]
)

And we can fit it as usual.

model7.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric', 'passthrough',
                                                  Index(['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'c...
                                                 ('categoric', OrdinalEncoder(),
                                                  Index(['age_group', 'education', 'race', 'sex', 'income_poverty',
       'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region',
       'census_msa', 'employment_industry', 'employment_occupation'],
      dtype='object'))])),
                ('classifier',
                 MultiOutputClassifier(estimator=HistGradientBoostingClassifier(categorical_features=[21,
                                                                                                      22,
                                                                                                      23,
                                                                                                      24,
                                                                                                      25,
                                                                                                      26,
                                                                                                      27,
                                                                                                      28,
                                                                                                      29,
                                                                                                      30,
                                                                                                      33,
                                                                                                      34])))])

It takes less time to train than the previous classifiers. And it gets a better score, too.

score_model(model7, X_val, y_val)
0.8619246127098983

Exercise: Use cross-validation to get a better sense of how well this classifier works.

# Solution

scores = cross_val_score(model7, features_df, labels_df, cv=5, scoring=score_model)
scores.mean()
0.8645562533003464

Exercise: Choose one of the parameters and use a grid search to tune it. This article might give you some ideas for which parameters would be good choices.

Use the best model you find to generate predictions and submit them to the platform.

model7["classifier"].estimator.get_params()
{'categorical_features': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 34],
 'early_stopping': 'auto',
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'auto',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 100,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': None,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
# Solution

# As an example, I chose to search the learning rate, but you should try other parameters,
# maybe more than one.

params = {
    "classifier__estimator__learning_rate": [0.01, 0.03, 0.05, 0.07, 0.1],
}
# Solution

from sklearn.model_selection import GridSearchCV

searcher = GridSearchCV(
    estimator=model7,
    param_grid=params,
    scoring=score_model,
    cv=5,
    n_jobs=-1,
    verbose=1,
)
# Solution

%time searcher.fit(X_train, y_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
CPU times: user 10.7 s, sys: 59.9 ms, total: 10.7 s
Wall time: 19.7 s
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         'passthrough',
                                                                         Index(['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n...
       'census_msa', 'employment_industry', 'employment_occupation'],
      dtype='object'))])),
                                       ('classifier',
                                        MultiOutputClassifier(estimator=HistGradientBoostingClassifier(categorical_features=[21,
                                                                                                                             22,
                                                                                                                             23,
                                                                                                                             24,
                                                                                                                             25,
                                                                                                                             26,
                                                                                                                             27,
                                                                                                                             28,
                                                                                                                             29,
                                                                                                                             30,
                                                                                                                             33,
                                                                                                                             34])))]),
             n_jobs=-1,
             param_grid={'classifier__estimator__learning_rate': [0.01, 0.03,
                                                                  0.05, 0.07,
                                                                  0.1]},
             scoring=<function score_model at 0x7ff6715fb560>, verbose=1)
# Solution

searcher.best_params_
{'classifier__estimator__learning_rate': 0.05}
# Solution

searcher.best_score_
0.8620050128347211
# Solution

best_model = searcher.best_estimator_
best_model["classifier"].estimator
HistGradientBoostingClassifier(categorical_features=[21, 22, 23, 24, 25, 26, 27,
                                                     28, 29, 30, 33, 34],
                               learning_rate=0.05)
# Solution

make_submission(best_model, test_features_df).to_csv("submission5.csv")

Vocabulary#

The vocabulary in this tutorial is meant to be consistent with the Scikit-learn documentation, but some of the terms we’ve defined are used differently in other contexts.

For example, we presented the train-validate-test framework like this:

  • The training split is used to fit a model and find the best coefficients;

  • The validation split is used to tune a model and find the best parameters; and

  • The test data is used to evaluate the model on unseen data.

But the things we called coefficients are sometimes called parameters; when they are, the things we called parameters are called hyperparameters.

Also, the names we used for the splits are not universal. What we called the test set is sometimes called the holdout set because it is held out from the model during training. And you might see “validation” and “test” used differently as well.

The inconsistency of these terms can be a source of confusion when you use the Scikit-learn function called train_test_split, because it is not only used to split the training and test data; it is also used to split the training and validation data. In this tutorial, the train-test split was done for us by the people who set up the competition; then we further divided the training data we were given into a training split and a validation split.

In an attempt to avoid confusion, we used “training data” for all of the data we got and “training split” for the part that remains after setting aside the validation split. But you are likely to see these terms used interchangeably.

Summary#

In this lesson we explored a new kind of classification algorithm based on random decision trees. We started with random forests and ended with gradient-boosted trees.

This lesson started with the train-test framework from the first two lessons and extended it to a train-validate-test framework. In the context of an online competition, this framework has a practical benefit, because it makes it faster and easier to test models, and a theoretical benefit, because it avoids overfitting due to information leaking from the test data.

And we used tools from Scikit-learn that automate common practices in this framework, including k-fold cross-validation and grid search for the best model parameters.

We hope you have found this tutorial useful. At this point, you have a small but powerful set of tools for working with tabular data and generating probabilistic predictions. In future tutorials, we will explore tools for working with other kinds of data – like text, images, and videos – and generating other kinds of predictions.