The goal of rcohort
is to perform cohort analysis with R. This package build on top of data.table
hence it should bring good speed!
Installation
rcohort
use data.table
version 1.14.3, which is not available on CRAN yet. Therefore, you need to update data.table
to the development version first to be able to use rcohort
.
You can install the development version of rcohort
from GitHub with:
# install.packages("devtools")
data.table::update.dev.pkg()
devtools::install_github("vohai611/rcohort")
Example
This is a basic example which shows you how to execute cohort analysis with rcohort
This is a typical dataset of an online retailer (Source)
head(df) |>
knitr::kable()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850 | United Kingdom |
cohort_count()
perform cohort analysis by counting the appearance of customer in each time period. In particular, Customer are grouped by the first time they appear in the dataset. We can choose the rounded time to every “month”, “week”, “quarter” or “year”.
library(rcohort)
df1 = df |>
cohort_count(.id_col = CustomerID,
.time_col = InvoiceDate,
time_unit = "month",
percent_form = TRUE)
head(df1, 5)
## cohort month_1 month_2 month_3 month_4 month_5 month_6 month_7
## <fctr> <num> <num> <num> <num> <num> <num> <num>
## 1: 2010-12 100 36.61017 32.31638 38.41808 36.27119 39.77401 36.27119
## 2: 2011-1 100 22.06235 26.61871 23.02158 32.13429 28.77698 24.70024
## 3: 2011-2 100 18.68421 18.68421 28.42105 27.10526 24.73684 25.26316
## 4: 2011-3 100 15.04425 25.22124 19.91150 22.34513 16.81416 26.76991
## 5: 2011-4 100 21.33333 20.33333 21.00000 19.66667 22.66667 21.66667
## 6 variable(s) not shown: [month_8 <num>, month_9 <num>, month_10 <num>, month_11 <num>, month_12 <num>, month_13 <num>]
To include group of all customer, use all_group = TRUE
df2 = df |>
cohort_count(.id_col = CustomerID,
.time_col = InvoiceDate,
time_unit = "month",
all_group = TRUE)
head(df2, 5)
## cohort month_1 month_2 month_3 month_4 month_5 month_6 month_7
## <fctr> <num> <num> <num> <num> <num> <num> <num>
## 1: All 100 22.49885 21.71508 21.36929 20.86215 20.23974 18.53389
## 2: 2010-12 100 36.61017 32.31638 38.41808 36.27119 39.77401 36.27119
## 3: 2011-1 100 22.06235 26.61871 23.02158 32.13429 28.77698 24.70024
## 4: 2011-2 100 18.68421 18.68421 28.42105 27.10526 24.73684 25.26316
## 5: 2011-3 100 15.04425 25.22124 19.91150 22.34513 16.81416 26.76991
## 6 variable(s) not shown: [month_8 <num>, month_9 <num>, month_10 <num>, month_11 <num>, month_12 <num>, month_13 <num>]
Beside counting the appearance of customers, you can summarise cohort performance over time. For example, we can observe the change of the revenue over time.
df$Revenue = df$Quantity * df$UnitPrice
cohort_df = df |>
cohort_summarise(.id_col = CustomerID,
.time_col = InvoiceDate,
.summarise_col = Revenue,.fn = "sum", time_unit = "month",
percent_form = TRUE)
head(cohort_df,10) |>
knitr::kable()
cohort | month_1 | month_2 | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | month_13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-12 | 100 | 48.23310 | 40.83110 | 52.926845 | 35.69106 | 58.777604 | 54.855071 | 54.26513 | 57.92582 | 82.55669 | 79.634983 | 89.713834 | 32.44257 |
2011-1 | 100 | 18.80918 | 21.57796 | 24.415281 | 27.66441 | 28.847769 | 23.910519 | 24.79123 | 24.56407 | 38.13134 | 42.250680 | 9.025974 | NA |
2011-2 | 100 | 18.36528 | 26.01103 | 30.503496 | 25.38724 | 21.640813 | 31.431630 | 39.49033 | 35.01918 | 40.98179 | 6.702973 | NA | NA |
2011-3 | 100 | 15.02446 | 29.51843 | 21.400819 | 25.81963 | 20.003408 | 32.448395 | 35.49956 | 35.62045 | 6.43176 | NA | NA | NA |
2011-4 | 100 | 24.13938 | 20.55304 | 19.925558 | 21.56698 | 24.756930 | 23.496951 | 28.05360 | 5.20451 | NA | NA | NA | NA |
2011-5 | 100 | 15.05414 | 16.27167 | 15.440440 | 22.45046 | 26.571793 | 26.842784 | 144.43680 | NA | NA | NA | NA | NA |
2011-6 | 100 | 10.90675 | 10.42809 | 22.808986 | 19.70971 | 31.536260 | 6.060978 | NA | NA | NA | NA | NA | NA |
2011-7 | 100 | 15.93010 | 20.98965 | 23.715398 | 26.36460 | 8.204651 | NA | NA | NA | NA | NA | NA | NA |
2011-8 | 100 | 26.28380 | 44.51559 | 55.847465 | 19.16388 | NA | NA | NA | NA | NA | NA | NA | NA |
2011-9 | 100 | 18.60238 | 23.87688 | 7.933991 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
We can also quickly visualize cohort data by calling auto_plot()
cohort_df |> auto_plot("tile")