Introduction to Data Competitions

This repository contains Jupyter notebooks and other material for an introduction to machine learning competitions using Scikit-learn. As a motivating exercise, we will participate in a competition hosted by DrivenData, called Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines.

The goal of this competition is to use survey data to predict which people chose to get vaccinated against H1N1, the influenza virus that caused the 2009 swine flu pandemic.

The dataset we will use is from the National 2009 H1N1 Flu Survey, which ran from October 2009 through June 2010. As we explore the data, we will get a better sense of the information it contains and how it can be useful. The primary tools we will use are:

  • Pandas for reading and exploring data, and

  • Scikit-learn for processing data, making models, and generating predictions.
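To make the division of labor between the two libraries concrete, here is a minimal sketch. It uses a tiny made-up table rather than the real survey data, and the column names (`concern`, `knowledge`, `vaccinated`) are hypothetical placeholders, not necessarily the names used in the competition files.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up survey rows; column names are hypothetical placeholders.
csv_text = """respondent_id,concern,knowledge,vaccinated
0,1,0,0
1,3,2,1
2,2,,0
3,3,1,1
4,0,1,0
5,2,2,1
"""

# Pandas: read the data and take a first look.
df = pd.read_csv(io.StringIO(csv_text), index_col="respondent_id")
print(df.shape)                         # number of rows and columns
print(df.isna().sum())                  # missing values per column
print(df["vaccinated"].value_counts())  # balance of the target

# Scikit-learn: process the data, here by filling missing values
# with the median of each column.
features = df[["concern", "knowledge"]]
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(features)
print(filled)
```

With the real data, `pd.read_csv` would be given the path of a downloaded file instead of an in-memory string; the rest of the pattern stays the same.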

In the first lesson, we will use a relatively simple algorithm to generate predictions, submit the predictions to the competition site, and get a score. In the subsequent lessons we will gradually add new features and improve the model. The third lesson ends with an open-ended exercise where you can explore on your own and see if you can move up the leaderboard.
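The first-lesson workflow (fit a simple model, generate predictions, prepare a submission) might look roughly like the sketch below. Everything here is an assumption for illustration: the data is synthetic, the column names are hypothetical, logistic regression stands in for the "relatively simple algorithm", and ROC AUC is used only as a local validation score. The competition page defines the actual metric and submission format.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "concern": rng.integers(0, 4, n),
    "knowledge": rng.integers(0, 3, n),
})
# Synthetic label that depends loosely on the features.
logit = 0.8 * X["concern"] + 0.5 * X["knowledge"] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out part of the data to estimate a score locally.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple model and predict the probability of the positive class.
model = LogisticRegression()
model.fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

# Local validation score (assumed metric, see the note above).
score = roc_auc_score(y_valid, proba)
print(f"validation ROC AUC: {score:.3f}")

# Assemble predictions in a submission-like table (hypothetical layout).
submission = pd.DataFrame({
    "respondent_id": X_valid.index,
    "vaccinated": proba,
})
# With no path, to_csv returns the CSV text; pass a filename to write a file.
submission_csv = submission.to_csv(index=False)
```

Scoring on a held-out split before submitting gives a rough preview of the leaderboard result, which is useful when comparing the model changes made in later lessons.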

Prerequisites

This tutorial should be accessible to anyone familiar with intermediate-level Python, including NumPy and Pandas. No prior experience with Scikit-learn or machine learning is required.

Notebooks

Run Notebook 1 on Colab

Run Notebook 2 on Colab

Run Notebook 3 on Colab