Data Science for Good: PASSNYC

Machine Learning, Scikit-Learn, Logistic Regression, Visualization, Python, Jupyter, Kaggle

August 14, 2018

In this project, I...

  • Collaborated with teammates to combine a Kaggle dataset with public data on NYC schools and conduct exploratory data analysis (EDA).
  • Built and tuned a model to predict minority admissions using Logistic Regression in Scikit-Learn.
  • Conducted a post-hoc analysis of features with data visualizations in Matplotlib and Seaborn.

PASSNYC is an organization dedicated to increasing minority enrollment at New York's elite specialized high schools. They partnered with Kaggle on a Data Science for Good competition to see whether they could improve how they target and engage underrepresented middle schools across the five boroughs. We were ineligible to enter because we had a Google employee on our team (Google owns Kaggle), but we elected to attempt the competition regardless and simply passed our results along to PASSNYC in case they found anything useful.

It was an ambitious group project, covering substantial data manipulation (in both Pandas and NumPy) as well as building and analyzing four classifiers: K-Nearest Neighbors, Random Forests, Logistic Regression and Neural Networks. My specific focus was Logistic Regression, and I'm particularly proud of the post-hoc analysis it involved (especially the heatmaps). We all worked together and reviewed and approved every part of the project, so I wasn't working in isolation.
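
For readers curious what tuning a Logistic Regression model and inspecting its features might look like in Scikit-Learn, here is a minimal sketch. It is not the project's actual code: the data is synthetic (generated with `make_classification`), the feature names are hypothetical placeholders, and the grid of `C` values is an assumption chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a school-level feature table (NOT the PASSNYC data).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=1, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a regularized logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune the inverse regularization strength C with 5-fold cross-validation.
# The candidate values are an assumption made for this sketch.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

# Post-hoc look at the features: rank standardized coefficients by magnitude.
coefficients = best_model.named_steps["logisticregression"].coef_[0]
ranked = sorted(
    zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True
)

# Pairwise feature correlations, the kind of matrix a heatmap would display.
correlations = np.corrcoef(X_train, rowvar=False)

print("Best C:", search.best_params_["logisticregression__C"])
print("Held-out accuracy:", round(test_accuracy, 3))
for name, coef in ranked:
    print(f"{name}: {coef:+.3f}")
```

Because the features are standardized before fitting, the coefficient magnitudes are roughly comparable across features, which is what makes the ranking meaningful. The `correlations` array could be passed to Seaborn's `heatmap` function to produce a plot in the spirit of the heatmaps mentioned above.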