This analytics case study is related to the Capstone project requirements for Google Data Analytics Professional Certificate. The case study contains data for a bike sharing company oon customer trip details over a 12-month period (July 2021 to June 2022). Data provided by Motivate International Inc. under this license. I admit i took a lot of inspiration from Anantharaman
Marketing team needs to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ.
Hence, my objective for this analysis is to show and give insights about the two types of customers: annual members and casual riders, use Cyclistic bikeshare differently, based on few parameters that was given from existing data.
A total of 12 datasets have been made available for each month starting from July 2021 - June 2022. Each dataset captures the details of every ride logged by the customers of Cyclistic. Unfortunately to made this data publicly available, some details included personal information about the customer have been omitted.
The total size of all 12 records exceeds 1GB. Due to the size of the data, the spreadsheet cleaning process can be time consuming and slower than SQL or R. I chose R because it’s open source and already has notebook functionality built in. This is also a good time to practice and learn the R programming language. The application used is Rstudio Desktop.
tidyverse for data import and wrangling, lubridate for date functions, ggplot for visualization, dplyr for grammar of manipulating data, readr for read rectangular text data, janitor for examining and cleaning dirty data, data.table extension of data.frame, tidyr for tidy messy data and mapview for interactive viewing of spatial data in R
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(dplyr)
library(readr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(tidyr)
library(mapview)
#===================== # STEP 1: COLLECT DATA #===================== #
Load The data Time to input the data we need. To input the data, i use:
jul2021 <- read_csv("202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug2021 <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep2021 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct2021 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov2021 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec2021 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan2022 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb2022 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar2022 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr2022 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may2022 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun2022 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#==================================================== #
STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE
#==================================================== #
Compare column names each of the files While the names don’t have to be in the same order, they do need to match perfectly before we can use a command to join them into one file
colnames(jul2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(feb2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Before transforming and cleaning the data we need to combine our trip data into one single data frame. To do that, I must first check the attributes of the data to find name differences and inconsistencies
str(jul2021)
## spec_tbl_df [822,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
## $ ended_at : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
## $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:822410] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(aug2021)
## spec_tbl_df [804,352 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
## $ ended_at : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
## $ start_station_name: chr [1:804352] NA NA NA NA ...
## $ start_station_id : chr [1:804352] NA NA NA NA ...
## $ end_station_name : chr [1:804352] NA NA NA NA ...
## $ end_station_id : chr [1:804352] NA NA NA NA ...
## $ start_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:804352] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(sep2021)
## spec_tbl_df [756,147 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
## $ ended_at : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
## $ start_station_name: chr [1:756147] NA NA NA NA ...
## $ start_station_id : chr [1:756147] NA NA NA NA ...
## $ end_station_name : chr [1:756147] NA NA NA NA ...
## $ end_station_id : chr [1:756147] NA NA NA NA ...
## $ start_lat : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:756147] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(oct2021)
## spec_tbl_df [631,226 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
## $ ended_at : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
## $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:631226] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:631226] NA NA NA NA ...
## $ end_station_id : chr [1:631226] NA NA NA NA ...
## $ start_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:631226] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(nov2021)
## spec_tbl_df [359,978 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:359978], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
## $ ended_at : POSIXct[1:359978], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
## $ start_station_name: chr [1:359978] NA NA NA NA ...
## $ start_station_id : chr [1:359978] NA NA NA NA ...
## $ end_station_name : chr [1:359978] NA NA NA NA ...
## $ end_station_id : chr [1:359978] NA NA NA NA ...
## $ start_lat : num [1:359978] 41.9 42 42 41.9 41.9 ...
## $ start_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num [1:359978] 42 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr [1:359978] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(dec2021)
## spec_tbl_df [247,540 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:247540] "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr [1:247540] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:247540], format: "2021-12-07 15:06:07" "2021-12-11 03:43:29" ...
## $ ended_at : POSIXct[1:247540], format: "2021-12-07 15:13:42" "2021-12-11 04:10:23" ...
## $ start_station_name: chr [1:247540] "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr [1:247540] "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr [1:247540] "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr [1:247540] "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num [1:247540] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:247540] -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:247540] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:247540] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:247540] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(jan2022)
## spec_tbl_df [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:103770] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:103770] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:103770] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(feb2022)
## spec_tbl_df [115,609 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:115609] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:115609] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:115609], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:115609], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:115609] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:115609] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:115609] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:115609] "13179" "TA1307000113" "13011" "13323" ...
## $ start_lat : num [1:115609] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:115609] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:115609] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:115609] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:115609] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(mar2022)
## spec_tbl_df [284,042 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:284042] "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr [1:284042] "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:284042], format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" ...
## $ ended_at : POSIXct[1:284042], format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" ...
## $ start_station_name: chr [1:284042] "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr [1:284042] "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr [1:284042] "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr [1:284042] "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr [1:284042] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(apr2022)
## spec_tbl_df [371,249 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:371249] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
## $ rideable_type : chr [1:371249] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:371249], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
## $ ended_at : POSIXct[1:371249], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
## $ start_station_name: chr [1:371249] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
## $ start_station_id : chr [1:371249] "515" "13075" "TA1307000121" "13075" ...
## $ end_station_name : chr [1:371249] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
## $ end_station_id : chr [1:371249] "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
## $ start_lat : num [1:371249] 42 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:371249] 42.1 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:371249] "member" "member" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(may2022)
## spec_tbl_df [634,858 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:634858] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
## $ rideable_type : chr [1:634858] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:634858], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" ...
## $ ended_at : POSIXct[1:634858], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" ...
## $ start_station_name: chr [1:634858] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
## $ start_station_id : chr [1:634858] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
## $ end_station_name : chr [1:634858] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
## $ end_station_id : chr [1:634858] "TA1309000025" "15534" "13221" "TA1305000030" ...
## $ start_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:634858] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:634858] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr [1:634858] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(jun2022)
## spec_tbl_df [769,204 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:769204] "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ...
## $ rideable_type : chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:769204], format: "2022-06-30 17:27:53" "2022-06-30 18:39:52" ...
## $ ended_at : POSIXct[1:769204], format: "2022-06-30 17:35:15" "2022-06-30 18:47:28" ...
## $ start_station_name: chr [1:769204] NA NA NA NA ...
## $ start_station_id : chr [1:769204] NA NA NA NA ...
## $ end_station_name : chr [1:769204] NA NA NA NA ...
## $ end_station_id : chr [1:769204] NA NA NA NA ...
## $ start_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
## $ start_lng : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
## $ end_lng : num [1:769204] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:769204] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
all_trips <- bind_rows(jul2021, aug2021, sep2021, oct2021, nov2021, dec2021, jan2022, feb2022, mar2022, apr2022, may2022, jun2022)
str(all_trips)
## spec_tbl_df [5,900,385 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5900385] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:5900385] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5900385], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
## $ ended_at : POSIXct[1:5900385], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
## $ start_station_name: chr [1:5900385] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:5900385] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:5900385] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:5900385] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:5900385] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Then To improve readability, I change some column names.
all_trips <- all_trips %>%
rename(ride_type = rideable_type,
start_time = started_at,
end_time = ended_at,
customer_type = member_casual)
glimpse(all_trips)
## Rows: 5,900,385
## Columns: 13
## $ ride_id <chr> "0A1B623926EF4E16", "B2D5583A5A5E76EE", "6F264597DD…
## $ ride_type <chr> "docked_bike", "classic_bike", "classic_bike", "cla…
## $ start_time <dttm> 2021-07-02 14:44:36, 2021-07-07 16:57:42, 2021-07-…
## $ end_time <dttm> 2021-07-02 15:19:58, 2021-07-07 17:16:09, 2021-07-…
## $ start_station_name <chr> "Michigan Ave & Washington St", "California Ave & C…
## $ start_station_id <chr> "13001", "17660", "SL-012", "17660", "17660", "1766…
## $ end_station_name <chr> "Halsted St & North Branch St", "Wood St & Hubbard …
## $ end_station_id <chr> "KA1504000117", "13432", "KA1503000044", "13196", "…
## $ start_lat <dbl> 41.88398, 41.90036, 41.86038, 41.90036, 41.90035, 4…
## $ start_lng <dbl> -87.62468, -87.69670, -87.62581, -87.69670, -87.696…
## $ end_lat <dbl> 41.89937, 41.88990, 41.89017, 41.89456, 41.88659, 4…
## $ end_lng <dbl> -87.64848, -87.67147, -87.62619, -87.65345, -87.658…
## $ customer_type <chr> "casual", "casual", "member", "member", "casual", "…
Next I am adding some new column to the data.frame
all_trips$day_of_the_week <- format(as.Date(all_trips$start_time),'%a')
all_trips$month <- format(as.Date(all_trips$start_time), "%b_%y")
all_trips$time <- format(all_trips$start_time, format = "%H:%M")
all_trips$time <- as.POSIXct(all_trips$time, format = "%H:%M")
all_trips$trip_duration <- (as.double(difftime(all_trips$end_time, all_trips$start_time)))/60
glimpse(all_trips)
## Rows: 5,900,385
## Columns: 17
## $ ride_id <chr> "0A1B623926EF4E16", "B2D5583A5A5E76EE", "6F264597DD…
## $ ride_type <chr> "docked_bike", "classic_bike", "classic_bike", "cla…
## $ start_time <dttm> 2021-07-02 14:44:36, 2021-07-07 16:57:42, 2021-07-…
## $ end_time <dttm> 2021-07-02 15:19:58, 2021-07-07 17:16:09, 2021-07-…
## $ start_station_name <chr> "Michigan Ave & Washington St", "California Ave & C…
## $ start_station_id <chr> "13001", "17660", "SL-012", "17660", "17660", "1766…
## $ end_station_name <chr> "Halsted St & North Branch St", "Wood St & Hubbard …
## $ end_station_id <chr> "KA1504000117", "13432", "KA1503000044", "13196", "…
## $ start_lat <dbl> 41.88398, 41.90036, 41.86038, 41.90036, 41.90035, 4…
## $ start_lng <dbl> -87.62468, -87.69670, -87.62581, -87.69670, -87.696…
## $ end_lat <dbl> 41.89937, 41.88990, 41.89017, 41.89456, 41.88659, 4…
## $ end_lng <dbl> -87.64848, -87.67147, -87.62619, -87.65345, -87.658…
## $ customer_type <chr> "casual", "casual", "member", "member", "casual", "…
## $ day_of_the_week <chr> "Fri", "Wed", "Sun", "Thu", "Wed", "Thu", "Wed", "S…
## $ month <chr> "Jul_21", "Jul_21", "Jul_21", "Jul_21", "Jul_21", "…
## $ time <dttm> 2023-08-21 14:44:00, 2023-08-21 16:57:00, 2023-08-…
## $ trip_duration <dbl> 35.366667, 18.450000, 17.833333, 15.033333, 19.0500…
The columns I added are day of the week, month, time, and trip duration. i also convert the trip duration to minutes.
The categories of data that needed to be cleaned are negative and zero trip duration, and ride that was categorized as test ride. To do that, i use this R code to find number of row of that categories.
nrow(subset(all_trips,trip_duration <= 0))
## [1] 646
nrow(subset(all_trips, start_station_name %like% "TEST"))
## [1] 0
nrow(subset(all_trips, start_station_name %like% "test"))
## [1] 0
nrow(subset(all_trips, start_station_name %like% "Test"))
## [1] 1
The result was there were 646 negative or 0 trip duration and a ride categorized as Test ride. For the last touch i also make new column describing the duration of the trip. I named it duration description. From my research and experience, I assume that the trip longer than 10 minutes can be considered long trip and vice versa. So let’s begin the cleaning session
all_trips_v2 <- all_trips[!(all_trips$trip_duration <= 0),]
all_trips_v2<- all_trips_v2[!(all_trips_v2$start_station_name %like% "Test"),]
all_trips_v2 <- within(all_trips_v2, duration_desc <- if_else(trip_duration<10, "short", "long"))
all_trips_v2$day_of_the_week <- ordered(all_trips_v2$day_of_the_week, levels=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
all_trips_v2$month <- ordered(all_trips_v2$month, levels=c("Jul_21", "Aug_21", "Sep_21", "Oct_21", "Nov_21", "Dec_21", "Jan_22", "Feb_22", "Mar_22", "Apr_22", "May_22", "Jun_22"))
glimpse(all_trips_v2)
## Rows: 5,899,738
## Columns: 18
## $ ride_id <chr> "0A1B623926EF4E16", "B2D5583A5A5E76EE", "6F264597DD…
## $ ride_type <chr> "docked_bike", "classic_bike", "classic_bike", "cla…
## $ start_time <dttm> 2021-07-02 14:44:36, 2021-07-07 16:57:42, 2021-07-…
## $ end_time <dttm> 2021-07-02 15:19:58, 2021-07-07 17:16:09, 2021-07-…
## $ start_station_name <chr> "Michigan Ave & Washington St", "California Ave & C…
## $ start_station_id <chr> "13001", "17660", "SL-012", "17660", "17660", "1766…
## $ end_station_name <chr> "Halsted St & North Branch St", "Wood St & Hubbard …
## $ end_station_id <chr> "KA1504000117", "13432", "KA1503000044", "13196", "…
## $ start_lat <dbl> 41.88398, 41.90036, 41.86038, 41.90036, 41.90035, 4…
## $ start_lng <dbl> -87.62468, -87.69670, -87.62581, -87.69670, -87.696…
## $ end_lat <dbl> 41.89937, 41.88990, 41.89017, 41.89456, 41.88659, 4…
## $ end_lng <dbl> -87.64848, -87.67147, -87.62619, -87.65345, -87.658…
## $ customer_type <chr> "casual", "casual", "member", "member", "casual", "…
## $ day_of_the_week <ord> Fri, Wed, Sun, Thu, Wed, Thu, Wed, Sat, Fri, Thu, M…
## $ month <ord> Jul_21, Jul_21, Jul_21, Jul_21, Jul_21, Jul_21, Jul…
## $ time <dttm> 2023-08-21 14:44:00, 2023-08-21 16:57:00, 2023-08-…
## $ trip_duration <dbl> 35.366667, 18.450000, 17.833333, 15.033333, 19.0500…
## $ duration_desc <chr> "long", "long", "long", "long", "long", "short", "l…
Notice that I also leave the uncleaned data in record so when i messed up, i can go back to the original data.
I also made new data.frames for the needs of geosphasical visualization
start_coordinate <- data.frame(start_lng = all_trips_v2$start_lng, start_lat = all_trips_v2$start_lat, customer_type = all_trips_v2$customer_type) %>%
drop_na() %>%
filter(customer_type == "casual")
end_coordinate <- data.frame(end_lng = all_trips_v2$end_lng, end_lat = all_trips_v2$end_lat, customer_type = all_trips_v2$customer_type) %>%
drop_na() %>%
filter(customer_type == "casual")
clean_start_coordinate <- distinct(start_coordinate) %>% drop_na()
clean_end_coordinate <- distinct(end_coordinate) %>% drop_na()
all_trips_v2 %>%
group_by(customer_type, month) %>%
summarise(number_of_rides = n()) %>%
drop_na() %>%
arrange(customer_type, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = customer_type)) +
labs(title ="Total trips by customer type Vs. Month") +
theme(axis.text.x = element_text(angle = 90)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
The lowest number of bike ride happened on January 2022 . This coincides with winter season in Chicago. The highest occur on July 2021 which coincides with summer season. Member user uses the bikes most often per month compared to the casual.
# average trip duration per day
all_trips_v2 %>%
group_by(customer_type, day_of_the_week) %>%
summarise(average_trip_duration = mean(trip_duration)) %>%
ggplot(aes(x = day_of_the_week, y = average_trip_duration, fill = customer_type)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
labs(title ="Average trip duration by customer type Vs. Day of the week")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
From this graph, we can conclude that casual users are fewer in number than members, but stay with the service longer than members. A full study needs to be conducted to assess the impact of converting casual users to members, especially in terms of average travel time.
# average trip duration per month
all_trips_v2 %>%
group_by(customer_type, month) %>%
summarise(average_trip_duration = mean(trip_duration)) %>%
ggplot(aes(x = month, y = average_trip_duration, fill = customer_type)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
labs(title ="Average trip duration by customer type Vs. Month") + theme(axis.text.x = element_text(angle = 90))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Same thing here, casual travel longer than members. One conclusion is that the average monthly travel time of member users tends to be about the same as general users.
# demand for bike riding in one day
all_trips_v2 %>%
group_by(customer_type, time) %>%
summarise(number_of_trips = n()) %>%
ggplot(aes(x = time, y = number_of_trips, color = customer_type, group = customer_type)) +
geom_line() +
scale_x_datetime(date_breaks = "1 hour", minor_breaks = NULL,
date_labels = "%H:%M", expand = c(0,0)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title ="Demand over 24 hours of a day", x = "Time of the day")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
In this graph, we can say that the peak demand for regular users and member users peaked at 5:00 pm. You can see that there were two peaks in demand for membership types, not just one. This can be explained if a member her user uses a bicycle to commute from work to home or vice versa.
#ride type vs number of trips
all_trips_v2 %>%
group_by(ride_type, customer_type) %>%
summarise(number_of_trips = n()) %>%
ggplot(aes(x= ride_type, y=number_of_trips, fill= customer_type))+
geom_bar(stat='identity', width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
labs(title ="Ride type Vs. Number of trips")
## `summarise()` has grouped output by 'ride_type'. You can override using the
## `.groups` argument.
The docking bike has very few rides compared to other types of bikes. Assume all variants have a similar number of bikes. This means that docked bikes have fallen out of favor. Unfortunately, I am no longer able to answer the question why the analysis is done beforehand.
all_trips_v2 %>%
group_by(duration_desc, customer_type) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = duration_desc, y = number_of_rides, fill = customer_type)) +
labs(title ="Number of rides grouped by Customer type and Duration description") +
geom_col(width = 0.5) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'duration_desc'. You can override using the
## `.groups` argument.
This graph shows that there are many casual users who are using vehicles for a long time. In fact, nearly all casual users used they ride for 10 minutes or more, compared to member users who tended to spend less time on their bikes.
all_trips_v2_casual <- all_trips_v2[!(all_trips_v2$customer_type == "member"),]
all_trips_v2_casual %>%
group_by(day_of_the_week, customer_type, duration_desc) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = day_of_the_week, y = number_of_rides, fill = duration_desc)) +
labs(title ="Number of rides by casual customer grouped by day") +
geom_col(width = 0.5) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'day_of_the_week', 'customer_type'. You can
## override using the `.groups` argument.
The longest ride by casual rider took place over the weekend.
all_trips_v2_member <- all_trips_v2[!(all_trips_v2$customer_type == "casual"),]
all_trips_v2_member %>%
group_by(day_of_the_week, customer_type, duration_desc) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = day_of_the_week, y = number_of_rides, fill = duration_desc)) +
labs(title ="Number of rides by member customer grouped by day") +
geom_col(width = 0.5) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'day_of_the_week', 'customer_type'. You can
## override using the `.groups` argument.
Most trips by member users were short notice. Over 50% of this user’s trips are short trips. This applies every day except weekends.
mean(start_coordinate$start_lng)
## [1] -87.64712
mean(start_coordinate$start_lat)
## [1] 41.90202
mean(end_coordinate$end_lng)
## [1] -87.64738
mean(end_coordinate$end_lat)
## [1] 41.90238
longitude <- c(mean(start_coordinate$start_lng), mean(end_coordinate$end_lng))
latitude <- c(mean(start_coordinate$start_lat), mean(end_coordinate$end_lat))
coordinates <- data.frame(longitude, latitude)
mapview(coordinates, xcol = "longitude" , ycol = "latitude", crs = 4269, grid = FALSE)
Next, visualize the geospatial data starts and stops to see where they start and stop most often. Unfortunately, I don’t have the computational resources to run 5 million coordinate points, so I decided to plot the most common start and end points at their average. I found that the average start and stop coordinates happened to be near North Kingsbury Street.
This capstone project provides a comprehensive analysis of bike sharing. I found:
Casual types tend to travel longer than members. However, members are taking more trips per month.
Demand peaked around 5:00 pm and was for casual users.
Casual users tend to use it on weekends, but members tend to use it on weekdays.
casual users, although fewer in number, use the service more often and longer than member users.
July was a popular month for bike touring. This coincides with the summer months when people are looking for the fastest way to get from one point to another. Instead of waiting for the bus and feeling the scorching heat, you can always grab your bike and start your journey.
The data should be further visualized using a geospatial visualization tool with sufficient computational resources to perform further data analysis of the geopoints.
Monthly cycle usage trends are similar for casual and members.
The docked bike was ridden less than the other bikes and mysteriously the members stopped using it.
Recommendations
Provide attractive promotions to casual riders on weekdays so that her one-time-use member per week can use the bikeshare service consistently throughout her week To do.
Membership fee will be discounted for renewal after the first year. Casual drivers may participate.
Offer discounts on off-peak fares to help casual riders ride their bikes more often and even out demand throughout the day.
A new type of bike equipped with electric seat heaters and sprockets to attract more customers in winter.