Data Analysis: NYC Bikeshare and Parking Tickets
Data Wrangling, EDA, Python, Pandas, Plotly, Gmaps, Geocoding
August 14, 2018
In this project, I...
- Gather, filter, clean and merge two datasets of over 30m records with Pandas.
- Build a pipeline to geocode each record to a latitude and longitude and assign it to a particular NYC neighborhood.
- Analyze the results with visualizations using Plotly and Gmaps.
This was a project to demonstrate understanding of data management, specifically using Pandas. Although this was a group project, I have included below only the portion of the project that I worked on. Separate analyses from my teammates of parking tickets and Citibike in isolation are omitted for the sake of focus and brevity.
The purpose of this project is to offer an analysis of available public data on CitiBike (NYC) bikeshare usage and New York parking violations. These two datasets will be analyzed separately for information particular to each of them, and then in tandem for a specific neighborhood in Brooklyn that installed CitiBike docks in August 2016 to see whether and how the installation of docks had any effect on parking violations.
New York is a famously difficult place to find parking. The city writes about 10 million parking tickets a year (~6 million to passenger cars), despite only 45% of households having a vehicle (that’s an average of 4 tickets per vehicle-owning household per year). A third of these tickets are issued in Manhattan. Total annual income to the city from parking tickets alone is over $400M, which sounds like a lot but is small compared to the subway’s annual farebox revenue of $6.2B or the MTA toll revenue of $2B.
The parking nightmare is a function of the traffic. New York is the third-worst city for congestion in the world (tied with Moscow at 91 hours per year stuck in peak-hour traffic for the average driver, beaten only by Los Angeles). Nobody beats New Yorkers for economic cost, though, as the value of fuel, time, freight and business fees adds up to a $16.9B cost to the city.
In an effort to address congestion, in 2013 New York partnered with private company Motivate and sponsor Citigroup to introduce CitiBike, one of the first and largest citywide public bikeshare programs in the world. Starting with 332 stations and 6,000 bikes, the program was far more popular than expected (70k members in the first month alone), and CitiBike scrambled to keep up with demand. Annual expansions have increased the footprint to 706 stations carrying 12,000 bikes, and currently CitiBike is working on implementing a dockless bikeshare system (like Spin in San Francisco) that would have no set geographic boundaries.
In August 2016, CitiBike expanded into Community District 306 in Brooklyn, a diverse set of neighborhoods including posh Park Slope and industrial Gowanus. Notably, it also includes Red Hook, one of the rare backwaters in New York with no subway access.
Our hypothesis is that the introduction of CitiBike into a neighborhood should reduce parking violations, as it provides an alternative to driving, particularly in transit-starved areas like Red Hook.
Our data comes from several sources:
- CitiBike System Data
- NYC Parking Tickets (FY2013-FY2017, FY2018)
- NYC Political and Administrative Districts Metadata
- NYC Parking Violation Codes and Fines
- Geocoding API by the State of New York
Getting the subset of data to run an analysis on a particular neighborhood (Brooklyn Community District 6--coded as CD306) required several steps.
- Parking data for FY2016, FY2017, and FY2018 was imported into three separate dataframes, with dates parsed by a special function and NaN Issue Dates imputed to a date that would later be filtered out. At over 30M records, this took a while.
- The three dataframes were concatenated and deduped, then filtered in stages by dates one year on either side of CitiBike dock installation (21.5M records), by Kings County (i.e., Brooklyn; 4.4M records), and then by a hand-constructed dataframe of street names in CD306. This gave us a set of 750k records that at least shared a county and street with the target district, but since a number of those streets are very long, and not necessarily unique in Brooklyn, we couldn’t say definitively they belonged to the target district.
- A new Address column combining House Number and Street Name with city and state was added and then used to dedupe a new dataframe, making a new subset of ~40k unique addresses. This was fragmented into 8 CSV files of 5000 records each.
- The files were run in batches through a geocoding services API provided by the state of New York, which returned a latitude and longitude as a JSON response to an HTTP query. These coordinates were added to each CSV, and then all the fragments were loaded and rejoined to recreate the dataframe of unique addresses, this time with ‘lat’ and ‘lon’ columns. The hit rate was quite good: the API failed to return a coordinate pair for only a little over a thousand addresses.
- Using Geopandas and Shapely, these records’ coordinates were compared against the Shapefile of CD306 and flagged either ‘1’ or ‘0’ in a new ‘target’ column to set apart the ones that were definitively in the target zone. (This same function was used on the CitiBike data to pull out the target district docks).
- A left merge of the unique addresses back into the original dataframe of parking tickets on ‘Address’, followed by filtering out all the ‘0’ targets, yielded a final dataset of 253k parking tickets in CD306 from September 1, 2015 to August 31, 2017.
- As an additional step, Pandas’ read_html function was used to scrape the tables of parking violation codes and fines from the web. After some cleanup, this was merged into the final dataset to create a “fine” column that contained the putative value of each ticket.
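The address-deduplication, flagging and merge-back steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows, the hard-coded coordinates (standing in for the geocoder's response), the rectangular stand-in for the CD306 boundary and the helper names `build_address` and `point_in_polygon` are all assumptions. A plain ray-casting test is swapped in for the Geopandas/Shapely comparison against the real shapefile so the sketch needs only Pandas.

```python
import pandas as pd

# Made-up tickets; column names mirror those mentioned in the text.
tickets = pd.DataFrame({
    "House Number": ["100", "100", "250", "9"],
    "Street Name": ["5th Ave", "5th Ave", "Van Brunt St", "Atlantic Ave"],
    "Violation Code": [21, 38, 21, 14],
})

def build_address(df):
    """Combine house number and street name with city and state."""
    return df["House Number"] + " " + df["Street Name"] + ", Brooklyn, NY"

tickets["Address"] = build_address(tickets)

# One row per unique address, so each address is geocoded only once.
addresses = tickets[["Address"]].drop_duplicates().reset_index(drop=True)

# Placeholder coordinates standing in for the geocoding API's output.
coords = {
    "100 5th Ave, Brooklyn, NY": (40.6795, -73.9780),
    "250 Van Brunt St, Brooklyn, NY": (40.6800, -74.0100),
    "9 Atlantic Ave, Brooklyn, NY": (40.6920, -74.0000),
}
addresses["lat"] = addresses["Address"].map(lambda a: coords[a][0])
addresses["lon"] = addresses["Address"].map(lambda a: coords[a][1])

# Hypothetical rectangle standing in for the CD306 shapefile, as (lon, lat).
district = [(-74.02, 40.66), (-73.97, 40.66), (-73.97, 40.69), (-74.02, 40.69)]

def point_in_polygon(lon, lat, polygon):
    """Ray casting: toggle 'inside' for each edge crossed by a ray going east."""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Flag each unique address 1 (in district) or 0 (outside).
addresses["target"] = [
    int(point_in_polygon(lon, lat, district))
    for lon, lat in zip(addresses["lon"], addresses["lat"])
]

# Left merge back onto every ticket, then keep only in-district rows.
merged = tickets.merge(addresses, on="Address", how="left")
in_district = merged[merged["target"] == 1]
print(in_district[["Address", "Violation Code", "target"]])
```

Deduplicating before geocoding is what keeps the API workload manageable: the lookup runs once per unique address, and the merge fans the result back out to every ticket issued there.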
CitiBike/Parking Ticket Analysis
Did the installation of CitiBike docks in Brooklyn Community District 306 have an effect on parking violations? In the year preceding installation, New York issued 128,209 citations in the district, whereas in the year following, it issued 124,705. The 3,504 difference represents a drop of only 3%, so if there was an effect, it was not dramatic. A look at the weekly volume of parking tickets sheds a little more light on the story.
Apart from the relatively quiet period in the summer of 2016, it looks like the pace of tickets in 2016 was generally higher than in 2017. You can also see the effect of a harsher winter in 2017: snowstorms caused the pace of ticketing to drop for one week in 2016 but several times in 2017 (generally the police are more lenient about residential street parking violations when the cars are buried under snow). It is curious that ticketing dropped in the summer of 2016, while that doesn't appear to have happened in 2015 or 2017. Having lived in that neighborhood that summer, I can’t think of a reason why parking ticket issuance would have been depressed in those months.
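As a quick check on the arithmetic, and as a sketch of how weekly volume could be tallied, here is a small example. The two citation counts come from the text; the `Issue Date` column name and the handful of dates are made up for illustration.

```python
import pandas as pd

# Citation counts quoted in the text.
before, after = 128_209, 124_705
drop = before - after
pct_drop = 100 * drop / before
print(f"{drop:,} fewer tickets ({pct_drop:.1f}% drop)")

# Made-up issue dates, for illustration only.
tickets = pd.DataFrame({
    "Issue Date": pd.to_datetime([
        "2016-01-04", "2016-01-05", "2016-01-12", "2016-01-13", "2016-01-14",
    ]),
    "Violation Code": [21, 37, 21, 21, 37],
})

# Count tickets per calendar week (weeks end on Sunday by default).
weekly = tickets.resample("W", on="Issue Date").size()
print(weekly)
```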
What about the type of parking violations, before vs. after? Let's look at the violation codes for each period.
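One way to line up violation codes by period is a cross-tabulation. The sketch below uses made-up rows and assumes the column names `Issue Date` and `Violation Code`; the September 1, 2016 cutoff is an assumption chosen only to separate the two years.

```python
import pandas as pd

# Made-up tickets, for illustration only.
tickets = pd.DataFrame({
    "Issue Date": pd.to_datetime([
        "2016-03-01", "2016-05-10", "2016-07-04",
        "2016-10-01", "2017-02-14", "2017-06-30",
    ]),
    "Violation Code": [21, 37, 37, 21, 21, 37],
})

# Assumed cutoff separating the pre- and post-installation years.
cutoff = pd.Timestamp("2016-09-01")
tickets["period"] = (
    tickets["Issue Date"].lt(cutoff).map({True: "before", False: "after"})
)

# Rows are violation codes, columns are periods, cells are ticket counts.
by_code = pd.crosstab(tickets["Violation Code"], tickets["period"])
by_code = by_code[["before", "after"]]
by_code["change"] = by_code["after"] - by_code["before"]
print(by_code)
```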
A few more street cleaning and sticker violations, but remarkably fewer meter violations, likely a result of fewer metered parking spots where the docks displaced street parking. This raises the question: did installing these docks change the city's income from parking violations in this area? A summary analysis of fines indicates that 2016 revenue in this district was $7,516,395, while 2017 revenue was $7,551,640. So revenue went up by $35,245, or ~0.5%, despite slightly fewer tickets. Not a big change, and overlaying weekly revenue on top of weekly volume confirms this: on a revenue basis, it appears the CitiBike installation may have lowered meter violations, but parking enforcement made up for it elsewhere.
The graphs look nearly identical, indicating that the mix of tickets (as defined by fine amount) remained largely consistent over the two years, or at least didn't shift between high tickets and low tickets. Given that the bulk of tickets go to minor offenses like street cleaning, that's not surprising.
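A comparison like this could be put together by attaching a fine to each ticket and aggregating both series by week. The sketch below is illustrative only: the rows are made up, the fine amounts are placeholders rather than the official schedule, and the column names are assumptions.

```python
import pandas as pd

# Made-up tickets, for illustration only.
tickets = pd.DataFrame({
    "Issue Date": pd.to_datetime([
        "2016-01-04", "2016-01-06", "2016-01-11", "2016-01-12", "2016-01-15",
    ]),
    "Violation Code": [21, 37, 21, 21, 37],
})

# Placeholder fine amounts, not the official schedule.
fines = pd.DataFrame({"Violation Code": [21, 37], "fine": [45, 35]})

# Attach the fine for each ticket's violation code.
tickets = tickets.merge(fines, on="Violation Code", how="left")

# Weekly ticket count and weekly fine total, side by side.
weekly = tickets.groupby(pd.Grouper(key="Issue Date", freq="W")).agg(
    volume=("fine", "count"),  # counts tickets with a known fine
    revenue=("fine", "sum"),
)

# Scale each series by its own maximum so the two curves can share an axis.
scaled = weekly / weekly.max()
print(weekly)
```

If the scaled curves track each other closely, the average fine per ticket is roughly constant from week to week, which is the pattern described above.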
Just out of curiosity, what about incidental characteristics like vehicle color, make and model? We wouldn't expect to see a meaningful change in these, but since we have the data, might as well look:
As expected, no dramatic differences in these variables from one year to the next.
What about geographic distribution of the parking violations? For this, we need a heatmap comparing where the most parking violations occurred before and after the bike dock installation (black dots are the location of Citibike docks).
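Before drawing such a heatmap, ticket locations can be reduced to weighted grid cells. The sketch below is an assumption about how that might be done, not the project's code: the coordinates are made up, and rounding to three decimal places (cells on the order of a hundred meters) is an arbitrary choice.

```python
import pandas as pd

# Made-up geocoded tickets, for illustration only.
tickets = pd.DataFrame({
    "lat": [40.6741, 40.6743, 40.6801, 40.6742, 40.6799],
    "lon": [-73.9901, -73.9899, -74.0102, -73.9902, -74.0101],
    "period": ["before", "before", "before", "after", "after"],
})

# Snap each point to a grid cell by rounding its coordinates.
tickets["lat_bin"] = tickets["lat"].round(3)
tickets["lon_bin"] = tickets["lon"].round(3)

# Count tickets per cell and period; the count serves as the heatmap weight.
grid = (
    tickets.groupby(["period", "lat_bin", "lon_bin"])
    .size()
    .rename("weight")
    .reset_index()
)
print(grid)
```

The resulting cell centers and weights, split by period, could then be handed to whichever heatmap layer is in use.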
If CitiBike installation had an effect on parking tickets in New York’s Community District 306, it was too small an effect for us to measure. While the number of meter violations appears to have dropped noticeably, that did not have an effect on parking ticket income. Incidental characteristics of vehicle color, make and body type did not change, and geographic distribution looks very similar--if anything, parking violations in Red Hook increased in 2017. There is insufficient evidence to support our hypothesis that the introduction of CitiBike reduces parking violations.