Kaggle: Click-through Rate Prediction

Data Wrangling, Spark, Python, Pandas, EDA, Logistic Regression

April 18, 2019

In this project, I...

  • Conduct an exploratory data analysis (EDA) of a 45-million-record dataset of ad displays.
  • Enrich and optimize the dataset using feature engineering and selection.
  • Train a distributed classifier on Spark DataFrames using logistic regression.

The goal of this project was to demonstrate an understanding of distributed machine learning at scale using Spark. It was based on a 2014 Kaggle competition to predict click-through rates on advertisements displayed on Criteo's ad network. The competition data was a week's worth of anonymized display advertisement data, approximately 45 million rows.

This was a group project, and my team elected to use logistic regression, implemented via SparkML and Spark DataFrames. Although we leveraged the SparkML package, we also demonstrated a detailed understanding of the mechanics of logistic regression and gradient descent. Within the group, my responsibilities were to conduct a detailed EDA of the data, to engineer a 'daypart' feature that captures temporal signal, and to implement an optimizer for feature selection.
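To make the mechanics of logistic regression and gradient descent concrete, here is a minimal single-machine sketch in plain Python. It is not the project's SparkML code: the function names, the toy data, the learning rate, and the iteration count are all assumptions chosen for illustration. The per-row gradient contributions are simply summed, and that summation is the kind of step a distributed framework such as Spark could compute per partition and then aggregate.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, x):
    # Predicted click probability for one feature vector.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def log_loss(weights, rows, labels):
    # Mean negative log-likelihood over the dataset.
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(rows)


def gradient_step(weights, rows, labels, learning_rate):
    # One batch gradient descent update: w <- w - lr * mean((p - y) * x).
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        error = predict(weights, x) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    n = len(rows)
    return [w - learning_rate * g / n for w, g in zip(weights, grad)]


def train(rows, labels, learning_rate=0.5, n_iter=200):
    # Start from zero weights and apply repeated gradient steps.
    weights = [0.0] * len(rows[0])
    for _ in range(n_iter):
        weights = gradient_step(weights, rows, labels, learning_rate)
    return weights


# Toy data (hypothetical): a bias column of 1.0 plus one numeric feature.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
labels = [0, 0, 1, 1]

weights = train(rows, labels)
initial_loss = log_loss([0.0, 0.0], rows, labels)
final_loss = log_loss(weights, rows, labels)
```

On this toy data the loss after training should be lower than the loss at the zero-weight starting point, and the predicted probability should rise with the feature value.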

See the full project notebooks here.