Automate cohort analysis with R

Posted by Hai Vo on Thursday, April 21, 2022

The goal of rcohort is to perform cohort analysis with R. This package build on top of data.table hence it should bring good speed!

Installation

rcohort use data.table version 1.14.3, which is not available on CRAN yet. Therefore, you need to update data.table to the development version first to be able to use rcohort. You can install the development version of rcohort from GitHub with:

# install.packages("devtools")
data.table::update.dev.pkg()
devtools::install_github("vohai611/rcohort")

Example

This is a basic example which shows you how to execute cohort analysis with rcohort

This is a typical dataset of an online retailer (Source)

head(df) |>
  knitr::kable()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2010-12-01 08:26:00 7.65 17850 United Kingdom

cohort_count() perform cohort analysis by counting the appearance of customer in each time period. In particular, Customer are grouped by the first time they appear in the dataset. We can choose the rounded time to every “month”, “week”, “quarter” or “year”.

library(rcohort)

df1 = df |>
  cohort_count(.id_col = CustomerID,
               .time_col = InvoiceDate,
               time_unit = "month",
               percent_form = TRUE)
head(df1, 5)
##     cohort month_1  month_2  month_3  month_4  month_5  month_6  month_7
##     <fctr>   <num>    <num>    <num>    <num>    <num>    <num>    <num>
## 1: 2010-12     100 36.61017 32.31638 38.41808 36.27119 39.77401 36.27119
## 2:  2011-1     100 22.06235 26.61871 23.02158 32.13429 28.77698 24.70024
## 3:  2011-2     100 18.68421 18.68421 28.42105 27.10526 24.73684 25.26316
## 4:  2011-3     100 15.04425 25.22124 19.91150 22.34513 16.81416 26.76991
## 5:  2011-4     100 21.33333 20.33333 21.00000 19.66667 22.66667 21.66667
## 6 variable(s) not shown: [month_8 <num>, month_9 <num>, month_10 <num>, month_11 <num>, month_12 <num>, month_13 <num>]

To include group of all customer, use all_group = TRUE

df2 = df |>
  cohort_count(.id_col = CustomerID,
               .time_col = InvoiceDate,
               time_unit = "month",
               all_group = TRUE)
head(df2, 5)
##     cohort month_1  month_2  month_3  month_4  month_5  month_6  month_7
##     <fctr>   <num>    <num>    <num>    <num>    <num>    <num>    <num>
## 1:     All     100 22.49885 21.71508 21.36929 20.86215 20.23974 18.53389
## 2: 2010-12     100 36.61017 32.31638 38.41808 36.27119 39.77401 36.27119
## 3:  2011-1     100 22.06235 26.61871 23.02158 32.13429 28.77698 24.70024
## 4:  2011-2     100 18.68421 18.68421 28.42105 27.10526 24.73684 25.26316
## 5:  2011-3     100 15.04425 25.22124 19.91150 22.34513 16.81416 26.76991
## 6 variable(s) not shown: [month_8 <num>, month_9 <num>, month_10 <num>, month_11 <num>, month_12 <num>, month_13 <num>]

Beside counting the appearance of customers, you can summarise cohort performance over time. For example, we can observe the change of the revenue over time.

df$Revenue = df$Quantity * df$UnitPrice

cohort_df = df |> 
  cohort_summarise(.id_col = CustomerID,
                   .time_col = InvoiceDate,
                  .summarise_col = Revenue,.fn = "sum", time_unit = "month",
                  percent_form = TRUE) 

head(cohort_df,10) |>
  knitr::kable()
cohort month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 month_13
2010-12 100 48.23310 40.83110 52.926845 35.69106 58.777604 54.855071 54.26513 57.92582 82.55669 79.634983 89.713834 32.44257
2011-1 100 18.80918 21.57796 24.415281 27.66441 28.847769 23.910519 24.79123 24.56407 38.13134 42.250680 9.025974 NA
2011-2 100 18.36528 26.01103 30.503496 25.38724 21.640813 31.431630 39.49033 35.01918 40.98179 6.702973 NA NA
2011-3 100 15.02446 29.51843 21.400819 25.81963 20.003408 32.448395 35.49956 35.62045 6.43176 NA NA NA
2011-4 100 24.13938 20.55304 19.925558 21.56698 24.756930 23.496951 28.05360 5.20451 NA NA NA NA
2011-5 100 15.05414 16.27167 15.440440 22.45046 26.571793 26.842784 144.43680 NA NA NA NA NA
2011-6 100 10.90675 10.42809 22.808986 19.70971 31.536260 6.060978 NA NA NA NA NA NA
2011-7 100 15.93010 20.98965 23.715398 26.36460 8.204651 NA NA NA NA NA NA NA
2011-8 100 26.28380 44.51559 55.847465 19.16388 NA NA NA NA NA NA NA NA
2011-9 100 18.60238 23.87688 7.933991 NA NA NA NA NA NA NA NA NA

We can also quickly visualize cohort data by calling auto_plot()

cohort_df |> auto_plot("tile")