San Francisco Bikeshare Analysis

Data Wrangling, SQL, Python, Pandas, BigQuery, Jupyter, Matplotlib, Gmaps

June 10, 2018

In this project, I...

  • Use a dataset of 2m San Francisco Bikeshare rides from Google BigQuery.
  • Conduct a series of increasingly complex SQL queries, culminating in a single query 113 lines long.
  • Analyze the results with data visualizations using Matplotlib and Gmaps.

The ostensible goal of this project was an analysis of SF Bikeshare usage patterns for a fictional promotional campaign to drive memberships. In reality, this was primarily a project to play with SQL, with the goal of crafting the most complex single SQL query I could think of for this dataset: For bikes that start the day in commuter docks, what percentage of rides for the rest of that day are between two commuter stations?

In order to accomplish this, I would need to...

  • Identify which docks were commuter docks
  • Plot commuter stations and routes on a map
  • Identify bikes that start the day as commuter bikes, and trace their location through the day

This is an edited version of the final Jupyter Notebook. For the sake of brevity, I have trimmed some utility code, redundant morning/evening queries and cut the Tourist portion of the analysis, which largely reiterates the Commuter analysis. You can see the full original here.

Objective

For this analysis, I want to look at the bikeshare data to determine usage patterns in order to craft promotional campaigns to drive usage. Specifically, I will examine the feasibility of using the bikes themselves as a means of delivering differentiated promotions to targeted user groups.


    SELECT
      EXTRACT(HOUR
      FROM
        start_date) AS hour,
      COUNT(*)
    FROM
      `bigquery-public-data.san_francisco.bikeshare_trips`
    WHERE
      CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
      AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
    GROUP BY
      hour
    ORDER BY
      hour ASC

    SELECT
      EXTRACT(HOUR
      FROM
        start_date) AS hour,
      COUNT(*)
    FROM
      `bigquery-public-data.san_francisco.bikeshare_trips`
    WHERE
      CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) = 0
      OR CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) = 6
    GROUP BY
      hour
    ORDER BY
      hour ASC
            

    weekday_trips=pd.read_csv('weekday_trips.csv', header=0, names=['hour', 'weekday'], index_col=0)
    weekend_trips=pd.read_csv('weekend_trips.csv', header=0, names=['hour', 'weekend'], index_col=0)
    weekday_avg, weekend_avg = weekday_trips/5, weekend_trips/2
    trips = pd.concat([weekday_avg, weekend_avg], axis=1, join_axes=[weekday_avg.index])
    trips.plot.bar(figsize=(10,6), title="Avg. Trips per hour on Weekday/Weekend", width=1.5)
            
Hourly Usage

We can see that on weekdays, usage is clearly bimodal with peaks corresponding to commute times between 7-10am and 4-7pm. Weekend usage appears largely normal, with peak at 1pm but a broad standard deviation on either side. It is also far smaller than weekend usage, with peak usage approximately equal to the lowest lull on weekday afternoons, and less than 1/5th of peak weekday usage.

Commuters

Clearly the predominate use case for GoBikes is commuting, so we'll try to understand more about about these riders. If we want to target these riders, we'll need to know where they are. To that end, we'll determine What are the most popular departure docks for commuters on weekday morning and evenings, and are they subscribers or single-use riders? (here defined as 6-9am and 4-7pm, respectively). Are there just a few heavily-used docks, or are they more evenly-distributed?


    SELECT
      subs.start_station_name as station,
      subs.subcount,
      custs.custcount,
      subs.subcount+custs.custcount as total_rides,
      subs.subcount/(subs.subcount+custs.custcount) as subratio
    FROM
      (SELECT
        start_station_name,
        COUNT(subscriber_type) as subcount
      FROM
        `bigquery-public-data.san_francisco.bikeshare_trips`
      WHERE
        TIME(start_date) BETWEEN '06:00:00'AND '09:00:00'
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
        AND subscriber_type = 'Subscriber'
      GROUP BY
        start_station_name) as subs,
      (SELECT
        start_station_name,
        COUNT(subscriber_type) as custcount
      FROM
        `bigquery-public-data.san_francisco.bikeshare_trips`
      WHERE
        TIME(start_date) BETWEEN '06:00:00'AND '09:00:00'
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
        AND subscriber_type = 'Customer'
      GROUP BY
        start_station_name) as custs
    WHERE
      subs.start_station_name = custs.start_station_name
    ORDER BY
      total_rides DESC
    LIMIT 20
            
station subcount custcount total_rides subratio
0 San Francisco Caltrain (Townsend at 4th) 33017 715 33732 0.978804
1 San Francisco Caltrain 2 (330 Townsend) 21629 405 22034 0.981619
2 Harry Bridges Plaza (Ferry Building) 16708 543 17251 0.968524
3 Temporary Transbay Terminal (Howard at Beale) 15952 169 16121 0.989517
4 Steuart at Market 10175 279 10454 0.973312
5 Grant Avenue at Columbus Avenue 8435 186 8621 0.978425
6 2nd at Townsend 7090 148 7238 0.979552
7 Embarcadero at Bryant 5849 101 5950 0.983025
8 Market at 10th 5671 174 5845 0.970231
9 Townsend at 7th 5467 77 5544 0.986111
Most Popular Morning Stations Most Popular Evening Stations

Answer: We can clearly see that in the morning, the bulk of rides originate from 4 or 5 stations, all of which are debarkation points for trains or ferries. In the evening, however, the distribution is considerably more even. If we view the station activity as a heatmap, we can see the departure hotspots, as well as an appreciation that there is more activity in the morning than in the evening.


    SELECT latitude, longitude FROM `bigquery-public-data.san_francisco.bikeshare_stations`

    SELECT
      trips.start_station_id,
      stations.latitude,
      stations.longitude,
      COUNT(*) AS weight
    FROM
      `bigquery-public-data.san_francisco.bikeshare_trips` AS trips
    LEFT JOIN
      `bigquery-public-data.san_francisco.bikeshare_stations` AS stations
    ON
      trips.start_station_id = stations.station_id
    WHERE
      TIME(trips.start_date) BETWEEN '06:00:00'AND '09:00:00'
      AND CAST(FORMAT_DATE('%w', DATE(trips.start_date)) AS int64) > 0
      AND CAST(FORMAT_DATE('%w', DATE(trips.start_date)) AS int64) < 6
    GROUP BY
      trips.start_station_id,
      stations.latitude,
      stations.longitude
            

    fig = gmaps.figure(center=(37.785908,-122.39547), zoom_level=13)
    commute_heat_layer = gmaps.heatmap_layer(csm[['latitude','longitude']], weights=csm['weight'])
    all_stations_layer = gmaps.symbol_layer(stations[['latitude','longitude']], fill_color="black", scale=1)
    commute_heat_layer.max_intensity = 5000
    commute_heat_layer.point_radius = 25
    fig.add_layer(commute_heat_layer)
    fig.add_layer(all_stations_layer)
    fig
            
Morning Commute Stations

    fig = gmaps.figure(center=(37.785908,-122.39547), zoom_level=13)
    commute_heat_layer = gmaps.heatmap_layer(cse[['latitude','longitude']], weights=csm['weight'])
    all_stations_layer = gmaps.symbol_layer(stations[['latitude','longitude']], fill_color="black", scale=1)
    commute_heat_layer.max_intensity = 5000
    commute_heat_layer.point_radius = 25
    fig.add_layer(commute_heat_layer)
    fig.add_layer(all_stations_layer)
    fig
            
Evening Commute Stations

To consider the bikes as advertising platforms, however, we need to know where they are when they aren't in a dock. For that, we should look at routes, rather than departure points. What are the 20 most popular commute routes? Defined here as a matched pair of stations used both morning and afternoon.


    SELECT
      morning.start_station_id AS morn_start,
      morning.end_station_id AS morn_end,
      evening.start_station_id AS eve_start,
      evening.end_station_id AS eve_end,
      CONCAT(morning.start_station_name, ' / ', morning.end_station_name)  AS route,
      mornstations.longitude AS morning_long,
      mornstations.latitude AS morning_lat,
      evestations.longitude AS evening_long,
      evestations.latitude AS evening_lat,
      morning.trips
    FROM (
      SELECT
        start_station_id,
        end_station_id,
        start_station_name,
        end_station_name,
        COUNT(*) as trips
      FROM
        `bigquery-public-data.san_francisco.bikeshare_trips`
      WHERE
        TIME(start_date) BETWEEN '07:00:00'AND '09:00:00'
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
      GROUP BY
        start_station_id,
        end_station_id,
        start_station_name,
        end_station_name) AS morning,
      (
      SELECT
        start_station_id,
        end_station_id
      FROM
        `bigquery-public-data.san_francisco.bikeshare_trips`
      WHERE
        TIME(start_date) BETWEEN '16:00:00' AND '18:00:00'
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
        AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
      GROUP BY
        start_station_id,
        end_station_id) AS evening
    LEFT JOIN
      `bigquery-public-data.san_francisco.bikeshare_stations` AS mornstations
    ON
      morning.start_station_id = mornstations.station_id
    LEFT JOIN
      `bigquery-public-data.san_francisco.bikeshare_stations` AS evestations
    ON
      evening.start_station_id = evestations.station_id
    WHERE
      morning.start_station_id = evening.end_station_id
      AND morning.end_station_id = evening.start_station_id
    ORDER BY
      morning.trips DESC
    LIMIT 20
            
Most Popular Commute Routes

    figure_layout = {'width': '1000px', 'height': '600px'}
    fig = gmaps.figure(center=(37.786978,-122.39547), zoom_level=14, layout=figure_layout)

    for index, row in routes.head(10).iterrows():
        start = (row['morning_lat'], row['morning_long'])
        end = (row['evening_lat'], row['evening_long'])
        route = gmaps.directions_layer(start, end, show_markers=False, stroke_color='red')
        all_stations_layer = gmaps.symbol_layer(stations[['latitude','longitude']], fill_color="black", scale=2)
        fig.add_layer(route)
        fig.add_layer(all_stations_layer)
    fig
            
Commuter Routes

We see that by far the most common route is from Montgomery BART down 2nd to South Park. The other routes follow a similar pattern - from transit hubs to outlying business areas which are otherwise inaccessible. Most if not all of these routes are in heavy traffic corridors both for cars and pedestrians, so a promotional display on bikes along these routes that was tailored to commuters who are otherwise sitting in traffic or hoofing it to work could be effective. However, there is no guarantee that a bike placed at one of these docks at the start of the day would STAY in commuter territory--it could easily wander off with other riders. Which begs the question: For bikes that start the day in commuter docks, what percentage of rides for the rest of that day are between two commuter stations?


    SELECT
        Matching_table.bike_number AS Bike_num,
        DATE(Matching_table.start_date) AS Ride_date,
        COUNT(Matching_table.Startmatch) AS Matches,
        Matching_table.Totalrides,
        COUNT(Matching_table.Startmatch)/Matching_table.Totalrides AS Ratio
      FROM (
        SELECT
          bike_journeys.*,
          commuter_stations_1.start_station_name AS ref_start,
          commuter_stations_2.start_station_name AS ref_end,
          (CASE
              WHEN (commuter_stations_1.start_station_name IS NOT NULL
              AND commuter_stations_2.start_station_name IS NOT NULL)
                THEN COUNT(commuter_stations_1.start_station_name) OVER (PARTITION BY DATE(start_date),
              bike_number) END) AS Startmatch,
          COUNT(trip_id) OVER (PARTITION BY DATE(start_date),
            bike_number) AS Totalrides
        FROM (
          SELECT
            DISTINCT(all_trips.trip_id),
            all_trips.bike_number,
            all_trips.start_date,
            all_trips.start_station_name,
            all_trips.end_station_name
          FROM
            `bigquery-public-data.san_francisco.bikeshare_trips` AS all_trips,
            (
            SELECT
              DISTINCT(bike_number),
              start_date
            FROM (
              SELECT
                start_station_name,
                COUNT(*) AS trips
              FROM
                `bigquery-public-data.san_francisco.bikeshare_trips`
              WHERE
                (TIME(start_date) BETWEEN '07:00:00'AND '10:00:00' OR TIME(start_date) BETWEEN '16:00:00'AND '19:00:00')
                AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
                AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
              GROUP BY
                start_station_name
              ORDER BY
                trips DESC
              LIMIT
                20) AS commuter_stations,
              (
              SELECT
                bike_number,
                start_date,
                start_station_name
              FROM
                `bigquery-public-data.san_francisco.bikeshare_trips`) AS all_bikes
            WHERE
              TIME(start_date) BETWEEN '07:00:00'AND '08:00:00'
              AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
              AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
              AND commuter_stations.start_station_name = all_bikes.start_station_name
            ORDER BY
              start_date ASC) AS start_bikes
          WHERE
            all_trips.bike_number = start_bikes.bike_number
              AND CAST(FORMAT_DATE('%w', DATE(all_trips.start_date)) AS int64) > 0
              AND CAST(FORMAT_DATE('%w', DATE(all_trips.start_date)) AS int64) < 6
              AND DATE(start_bikes.start_date) = DATE(all_trips.start_date)
          ORDER BY
            all_trips.start_date,
            all_trips.bike_number) AS bike_journeys
        LEFT JOIN (
          SELECT
            start_station_name,
            COUNT(*) AS trips
          FROM
            `bigquery-public-data.san_francisco.bikeshare_trips`
          WHERE
            (TIME(start_date) BETWEEN '07:00:00'AND '10:00:00' OR TIME(start_date) BETWEEN '16:00:00'AND '19:00:00')
            AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
            AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
          GROUP BY
            start_station_name
          ORDER BY
            trips DESC
          LIMIT
            20) AS commuter_stations_1
        ON
          bike_journeys.start_station_name = commuter_stations_1.start_station_name
        LEFT JOIN (
          SELECT
            start_station_name,
            COUNT(*) AS trips
          FROM
            `bigquery-public-data.san_francisco.bikeshare_trips`
          WHERE
            (TIME(start_date) BETWEEN '07:00:00'AND '10:00:00' OR TIME(start_date) BETWEEN '16:00:00'AND '19:00:00')
            AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) > 0
            AND CAST(FORMAT_DATE('%w', DATE(start_date)) AS int64) < 6
          GROUP BY
            start_station_name
          ORDER BY
            trips DESC
          LIMIT
            20) AS commuter_stations_2
        ON
          bike_journeys.end_station_name = commuter_stations_2.start_station_name
        ORDER BY
          DATE(bike_journeys.start_date),
          bike_journeys.bike_number ) AS Matching_table
      GROUP BY
        DATE(Matching_table.start_date),
        Matching_table.bike_number,
        Matching_table.Totalrides
            
Time in Commuter Territory

    commute_territory_avg = sum(commute_territory['Matches'])/sum(commute_territory['Totalrides'])
    print("For bikes that begin the day at a commuter dock, {:.1f}% of their subsequent routes are between commuter docks."
          .format(commute_territory_avg*100))
    commute_legs_day = sum(commute_territory['Matches'])/785 # There are 785 weekdays in the database
    commute_legs_day_bike = commute_legs_day/54 # Average number of unique bikes in service during commute hours on weekdays
    print("On average, there are {:.0f} commute legs completed per weekday.  That works out to about {:.2f} routes per bike per day."
         .format(commute_legs_day, commute_legs_day_bike))
            

For bikes that begin the day at a commuter dock, 62.1% of their subsequent routes are between commuter docks.
On average, there are 158 commute legs completed per weekday. That works out to about 2.93 routes per bike per day.

Commuter Analysis and Recommendation

For a bike that starts the day at 7am in a commuter dock, 3 out of 5 rides that day will be between commuter waypoints, making it a fairly effective vehicle (as it were) for a promotion. However, not many bikes will make it to 5 rides - the average bike only makes about 4 trips per day, and ~2.5 of those are between commuter stops.

A promotion aimed at getting commuters to sign up for membership could be displayed on the side of bikes at the most popular commuter docks. Something along the lines of "If you lived here, you'd be home by now" but for inspiring envy of those getting to work/home faster.