Title: | Dropout Analysis by Condition |
---|---|
Description: | Analysis and visualization of dropout between conditions in surveys and (online) experiments. Features include computation of dropout statistics, comparing dropout between conditions (e.g. Chi square), analyzing survival (e.g. Kaplan-Meier estimation), comparing conditions with the most different rates of dropout (Kolmogorov-Smirnov) and visualizing the result of each in designated plotting functions. Sources: Andrea Frick, Marie-Terese Baechtiger & Ulf-Dietrich Reips (2001) <https://www.researchgate.net/publication/223956222_Financial_incentives_personal_information_and_drop-out_in_online_studies>; Ulf-Dietrich Reips (2002) "Standards for Internet-Based Experimenting" <doi:10.1027//1618-3169.49.4.243>. |
Authors: | Annika Tave Overlander [aut, cre] |
Maintainer: | Annika Tave Overlander <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.3 |
Built: | 2025-03-05 05:48:32 UTC |
Source: | https://github.com/iscience-kn/dropr |
Find dropout positions in a data.frame that contains multiple
questions that were asked sequentially.
This function adds the dropout index variable do_idx to the data.frame,
which is necessary for further analyses of dropout.
Use this function first to prepare your dropout analysis. Then continue by creating
the dropout statistics using compute_stats().
add_dropout_idx(df, q_pos)
df | data.frame containing |
q_pos | numeric range of columns that contain question items |
Importantly, this function starts counting missing data at the end of the
data frame. Missing data somewhere in between, i.e. a single item that was
skipped or forgotten, is not counted as dropout.
The function identifies sequences of missing data that run until the end of the
data frame and stores the number of the last answered question in do_idx.
Therefore, the variables must be in the order in which they were asked; otherwise the analyses will not be valid.
Returns the original data frame with the column do_idx added.
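The trailing-missing rule can be illustrated with a minimal base R sketch. The helper last_answered_idx() and the toy data are hypothetical and only mimic the documented behaviour; this is not dropR's implementation, and the values dropR stores in do_idx may be coded differently.

```r
# Hypothetical helper (not part of dropR): position of the last answered item
# per row, so that only missing values at the END of the item block count.
last_answered_idx <- function(df, q_pos) {
  answered <- !is.na(as.matrix(df[, q_pos, drop = FALSE]))
  apply(answered, 1, function(row) {
    if (any(row)) max(which(row)) else 0L  # 0 = no item answered at all
  })
}

# Toy data: participant 2 skips q2 but continues, participant 4 stops after q1
toy <- data.frame(
  id = 1:4,
  q1 = c(1, 1, NA, 1),
  q2 = c(2, NA, NA, NA),
  q3 = c(3, 2, NA, NA),
  q4 = c(4, NA, NA, NA)
)
toy$last_answered <- last_answered_idx(toy, 2:5)
print(toy)
```

Participant 2 skipped q2 but answered q3, so only the gap after q3 is treated as dropout in this sketch.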
R/add_dropout_idx.R
See also compute_stats(), which is usually the next step for dropout analysis.
dropout <- add_dropout_idx(dropRdemo, 3:54)
This is the second step in conducting dropout analysis with dropR.
It outputs all statistics necessary to analyze and visualize dropout, such as
the sample size N of the data (and of each condition, if selected) as well as cumulative
dropout and remaining participants in absolute numbers and percent.
If no experimental condition is given, the statistics are calculated only for the
data as a whole.
compute_stats(df, by_cond = "None", no_of_vars)
df | data.frame containing variable |
by_cond | character name of condition variable in the data, defaults to 'None' to output total statistics. |
no_of_vars | numeric number of variables that contain questions |
Returns a data frame with 6 columns (q_idx, condition, cs, N, remain, pct_remain). It has one row per question in the original data for the overall data and, if a condition variable is selected, one further row per question for each condition.
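To make the meaning of these columns more concrete, here is a small base R sketch that derives cumulative dropout and remaining participants from a vector of last answered questions. The toy vector, the rule that a participant "remains" at question q if they answered at least up to q, and the choice to store pct_remain as a proportion are all assumptions for illustration; dropR's own conventions may differ.

```r
# Toy input (assumption): index of the last answered question per participant
last_answered <- c(5, 5, 3, 5, 2, 5, 4, 5, 5, 1)
n_questions <- 5
N <- length(last_answered)

# A participant still "remains" at question q if they answered at least up to q
remain <- vapply(seq_len(n_questions),
                 function(q) sum(last_answered >= q),
                 integer(1))

stats_toy <- data.frame(
  q_idx = seq_len(n_questions),
  condition = "total",
  cs = N - remain,          # cumulative dropout
  N = N,
  remain = remain,
  pct_remain = remain / N   # stored as a proportion here (assumption)
)
print(stats_toy)
```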
do_stats <- compute_stats(df = add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
This function performs a chi-squared contingency table test on dropout for
a given question in the data. Note that the input data should be in the format
computed by compute_stats().
The test can be performed on either all conditions (excluding total) or on selected conditions.
do_chisq(do_stats, chisq_question, sel_cond_chisq, p_sim = TRUE)
do_stats | data.frame of dropout statistics as computed by |
chisq_question | numeric Which question to compare dropout at. |
sel_cond_chisq | vector (same class as in conditions variable in original data set) selected conditions. |
p_sim | boolean Simulate p value parameter (by Monte Carlo simulation)? Defaults to |
Returns the test results from chisq.test() comparing the experimental conditions at the defined question.
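As a rough illustration of the underlying test, the following base R sketch runs stats::chisq.test() with a Monte Carlo p value on a made-up 2 x 2 table of dropped versus remaining participants. The counts and condition labels are invented, and how dropR builds the table internally is not shown here.

```r
# Invented counts at one question: rows are conditions, columns are status
counts <- matrix(c(30, 70,
                   45, 55),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(condition = c("A", "B"),
                                 status = c("dropped", "remaining")))

set.seed(1)  # the simulated p value depends on the random seed
res <- chisq.test(counts, simulate.p.value = TRUE, B = 2000)
print(res)
```

With simulate.p.value = TRUE, chisq.test() applies no continuity correction and reports no degrees of freedom, since the p value comes from resampled tables rather than the chi-squared distribution.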
See also add_dropout_idx() and compute_stats(), which are necessary for the proper data structure.
do_stats <- compute_stats(add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
do_chisq(do_stats, 47, c(12, 22), TRUE)
This function needs a data set with a dropout index added by add_dropout_idx().
The do_kpm function performs survival analysis with Kaplan-Meier estimation
and returns a list containing survival steps, the original data frame, and the model fit type.
The function can fit the survival model either for the entire data set or separately by a specified condition column.
do_kpm(df, condition_col = "experimental_condition", model_fit = "total")
df | data set with |
condition_col | character denoting the experimental conditions to model |
model_fit | character Should be either "total" for a total model or "conditions" |
Returns a list containing steps (survival steps extracted from the fitted models),
d (the original data frame), and model_fit (the model fit type).
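For readers unfamiliar with Kaplan-Meier estimation, the product-limit estimator can be computed by hand in a few lines of base R. This is only a sketch: the toy data and the coding of dropout as the event and survey completion as censoring are assumptions, and dropR itself relies on the survival package rather than on code like this.

```r
# Toy data (assumption): last answered question and whether dropout occurred
time  <- c(1, 2, 2, 3, 4, 5, 5, 5)
event <- c(1, 1, 1, 1, 1, 0, 0, 0)  # 1 = dropped out, 0 = completed (censored)

event_times <- sort(unique(time[event == 1]))

# Number at risk and number of dropouts at each distinct event time
n_risk  <- vapply(event_times, function(tt) sum(time >= tt), integer(1))
n_event <- vapply(event_times, function(tt) sum(time == tt & event == 1), integer(1))

# Kaplan-Meier estimate: running product of (1 - d_j / n_j)
surv <- cumprod(1 - n_event / n_risk)

km <- data.frame(time = event_times, n_risk, n_event, surv)
print(km)
```

Because no participant is censored before the last question in this toy example, the estimate at each step equals the plain proportion of participants remaining.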
See also survival::Surv(), which is used to fit the survival object.
demo_kpm <- do_kpm(df = add_dropout_idx(dropRdemo, 3:54),
                   condition_col = "experimental_condition",
                   model_fit = "total")
head(demo_kpm$steps)
This test is used for survival analysis between the most extreme conditions,
i.e. the ones with the most different rates of dropout.
This function automatically prepares your data and runs stats::ks.test() on it.
do_ks(do_stats, question)
do_stats | A data frame made from |
question | Index of question to be included in analysis, commonly the last question of the survey. |
Returns the result of the Kolmogorov-Smirnov test, including which conditions have the most different dropout rates.
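How dropR selects the two conditions and which values it passes to stats::ks.test() is not spelled out in this documentation, so the following is only a rough sketch under assumptions: the conditions with the lowest and highest proportion remaining at the chosen question are treated as the most extreme, and their remaining-proportion curves are compared. All data are invented.

```r
# Invented remaining-proportion curves for three conditions over five questions
curves <- list(
  A = c(0.96, 0.94, 0.89, 0.85, 0.80),
  B = c(0.98, 0.90, 0.75, 0.60, 0.50),
  C = c(0.99, 0.97, 0.93, 0.91, 0.88)
)
question <- 5

# Assumption: "most extreme" = lowest vs. highest proportion remaining at `question`
at_q <- vapply(curves, function(v) v[question], numeric(1))
extremes <- c(names(which.min(at_q)), names(which.max(at_q)))

# Two-sample Kolmogorov-Smirnov test on the two curves up to `question`
res <- ks.test(curves[[extremes[1]]][seq_len(question)],
               curves[[extremes[2]]][seq_len(question)])
print(extremes)
print(res)
```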
do_stats <- compute_stats(df = add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
do_ks(do_stats, 52)
This function calculates an odds ratio table at a given question for selected experimental
conditions. It needs data in the format created by compute_stats() as input.
do_or_table(do_stats, chisq_question, sel_cond_chisq)
do_stats | data.frame statistics table as computed by |
chisq_question | numeric Which question to calculate the OR table for |
sel_cond_chisq | character vector naming the experimental conditions to compare |
Returns a matrix containing the odds ratios of dropout between all selected conditions.
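A pairwise odds ratio table of this kind can be sketched in base R with outer(). The proportions below are invented, and both the use of proportions remaining (rather than proportions dropped) and the row/column orientation are assumptions, not a statement about dropR's output.

```r
# Invented proportions of remaining participants per condition at one question
pct_remain <- c("11" = 0.80, "12" = 0.70, "21" = 0.60, "22" = 0.50)

# Odds of remaining, then all pairwise ratios (row condition / column condition)
odds <- pct_remain / (1 - pct_remain)
or_table <- outer(odds, odds, "/")
print(round(or_table, 2))
```

The diagonal is 1 by construction, and each cell below the diagonal is the reciprocal of its mirror cell above it.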
do_stats <- compute_stats(df = add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
do_or_table(do_stats, chisq_question = 51, sel_cond_chisq = c("11", "12", "21", "22"))
The do_steps function calculates steps for data points represented by the question numbers from the original
experimental or survey data in x and the remaining percent of participants in y.
do_steps(x, y, return_df = TRUE)
x | Numeric vector representing the question numbers |
y | Numeric vector representing the remaining percent of participants |
return_df | Logical. If TRUE, the function returns a data frame; otherwise, it returns a list. |
Due to the nature of dropout/survival data, step functions are necessary to accurately depict the participants remaining. Dropout data records the time until the event (i.e. dropout at a certain question or time), so changes in the number of remaining participants are discrete rather than continuous. Changes in survival probability therefore occur at specific points and are better represented as steps than as a continuum.
Returns a data frame or a list containing the modified x and y values.
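One common way to turn such points into a step curve is to duplicate coordinates so that the line first runs horizontally and then drops vertically. The helper make_steps() below is hypothetical and shows that idea only; the exact layout of the values returned by do_steps() may differ.

```r
# Hypothetical helper (not dropR's do_steps): hold y constant until the next x,
# then drop vertically to the new y value.
make_steps <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 1)
  n <- length(x)
  data.frame(
    x = c(x[1], rep(x[-1], each = 2)),
    y = c(rep(y[-n], each = 2), y[n])
  )
}

steps_toy <- make_steps(x = c(1, 2, 3), y = c(100, 95, 90))
print(steps_toy)
```

Connecting the rows of steps_toy in order with straight lines yields the staircase shape, since consecutive rows share either their x or their y value.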
x <- c(1, 2, 3, 4, 5)
y <- c(100, 100, 95, 90, 85)
do_steps(x, y)
# Using the example dataset dropRdemo
do_stats <- compute_stats(df = add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
tot_stats <- do_stats[do_stats$condition == "total", ]
do_steps(tot_stats$q_idx, tot_stats$pct_remain)
Simulated demo data set for dropout in a survey.
A data frame with 246 rows and 54 variables (in the order they were presented in the fictional survey).
Observation ID
experimental condition
item 1
item 2
item 3
item 4
item 5
item 6
item 7
item 8
item 9
item 10
item 11
item 12
item 13
item 14
item 15
item 16
item 17
item 18
item 19
item 20
item 21
item 22
item 23
item 24
item 25
item 26
item 27
item 28
item 29
item 30
item 31
item 32
item 33
item 34
item 35
item 36
item 37
item 38
item 39
item 40
item 41
item 42
item 43
item 44
item 45
item 46
item 47
item 48
item 49
item 50
item 51
item 52
dropRdemo Demo data for dropout.
Compute odds from probabilities. The function is vectorized and
can handle a vector of probabilities, e.g. remaining percent of participants
as calculated by compute_stats().
get_odds(p)
p | vector of probabilities. May not be larger than 1 or smaller than zero. |
Returns numerical vector of the same length as original input reflecting the odds.
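Assuming the standard definition of odds, p / (1 - p), the computation can be sketched as follows. The function name odds_from_prob() is hypothetical and is not the package's own code.

```r
# Hypothetical re-implementation of the standard odds formula (not dropR's code)
odds_from_prob <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  p / (1 - p)  # vectorized; p = 1 yields Inf
}

odds_toy <- odds_from_prob(c(0.7, 0.2))
print(odds_toy)
```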
get_odds(0.7)
get_odds(c(0.7, 0.2))
Computes odds ratio given two probabilities. In this package, the function can be used to compare the percentages of remaining participants between two conditions at a time.
get_odds_ratio(a, b)
a | numeric probability value between 0 and 1. |
b | numeric probability value between 0 and 1. |
Returns numerical vector of the same length as original input reflecting the Odds Ratio (OR).
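Assuming the standard definition of the odds ratio as the ratio of the two odds, the calculation for two probabilities a and b, with the values 0.7 and 0.6 as a worked example, reads:

```latex
\mathrm{OR}(a, b) = \frac{a / (1 - a)}{b / (1 - b)},
\qquad
\mathrm{OR}(0.7,\, 0.6) = \frac{0.7 / 0.3}{0.6 / 0.4} = \frac{7/3}{3/2} = \frac{14}{9} \approx 1.56
```

An odds ratio above 1 means the odds under a are higher than under b; a value of 1 means the odds are equal.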
See also get_odds(), as this is the basis for the calculation.
get_odds_ratio(0.7, 0.6)
The get_steps_by_cond function calculates steps data based on survival model results.
This utility function is used inside the do_kpm() function of dropR.
get_steps_by_cond(sfit, condition = NULL)
sfit | An object representing survival model results (e.g., from a Kaplan-Meier model). |
condition | Optional. An experimental condition to include in the output data frame, defaults to |
Returns a data frame containing the steps data, including time, survival estimates, upper confidence bounds, and lower confidence bounds.
This function compares survival curves as modeled with do_kpm().
It outputs a contingency table and a chi-squared measure of difference.
get_survdiff(kds, cond, test_type)
kds | data set of a survival model such as |
cond | character of experimental condition variable in the data |
test_type | numeric (0 or 1) parameter that controls the type of test (0 means rho = 0, the log-rank test; 1 means rho = 1, the Peto & Peto modification of the Gehan-Wilcoxon test) |
Returns survival test results as called from survival::survdiff().
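To show what the rho = 0 case computes, here is a hand-written two-group log-rank statistic in base R. It is a teaching sketch only: logrank_chisq() is a hypothetical helper, the data are invented, dropout is assumed to be the event, and survival::survdiff() remains the function the package actually calls.

```r
# Hypothetical helper: two-group log-rank chi-squared statistic (rho = 0)
logrank_chisq <- function(time, event, group) {
  groups <- unique(group)
  stopifnot(length(groups) == 2)
  g1 <- group == groups[1]
  o_minus_e <- 0
  v <- 0
  for (tt in sort(unique(time[event == 1]))) {
    at_risk <- time >= tt
    n  <- sum(at_risk)                       # at risk overall
    n1 <- sum(at_risk & g1)                  # at risk in group 1
    d  <- sum(time == tt & event == 1)       # events overall
    d1 <- sum(time == tt & event == 1 & g1)  # events in group 1
    o_minus_e <- o_minus_e + (d1 - d * n1 / n)
    if (n > 1) {
      # hypergeometric variance of d1 at this event time
      v <- v + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
  }
  o_minus_e^2 / v
}

# Invented data: condition A loses participants earlier than condition B
time  <- c(1, 2, 3, 4, 5, 5,  3, 4, 5, 5, 5, 5)
event <- c(1, 1, 1, 1, 1, 0,  1, 1, 0, 0, 0, 0)
group <- rep(c("A", "B"), each = 6)

chisq <- logrank_chisq(time, event, group)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(c(chisq = chisq, p = p_value))
```

The rho = 1 variant additionally weights each event time by an estimate of the survival function, which gives earlier dropout more influence; that weighting is omitted here.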
kpm_est <- do_kpm(add_dropout_idx(dropRdemo, 3:54))
get_survdiff(kpm_est$d, "experimental_condition", 0)
get_survdiff(kpm_est$d, "experimental_condition", 1)
This function uses ggplot2 to create dropout curves.
Please note that you should use add_dropout_idx() and compute_stats() on your
data before running this function, as it needs a certain data structure and variables to
work properly.
plot_do_curve(
  do_stats,
  linetypes = TRUE,
  stroke_width = 1,
  full_scale = TRUE,
  show_points = FALSE,
  show_confbands = FALSE,
  color_palette = "color_blind"
)
do_stats | data.frame containing dropout statistics table computed by |
linetypes | boolean Should different line types be used? Defaults to TRUE. |
stroke_width | numeric stroke width, defaults to 1. |
full_scale | boolean Should y axis range from 0 to 100? Defaults to TRUE, FALSE cuts off at min percent remaining (>0). |
show_points | boolean Should dropout curves show individual data points? Defaults to FALSE. |
show_confbands | boolean Should there be confidence bands added to the plot? Defaults to FALSE. |
color_palette | character indicating which color palette to use. Defaults to 'color_blind', alternatively choose 'gray' or 'default' for the ggplot2 default colors. |
Returns a ggplot object containing the dropout curve plot. Using the Shiny App version of
dropR, this plot can easily be downloaded in different formats.
See also add_dropout_idx() and compute_stats(), which are necessary for the proper data structure.
do_stats <- compute_stats(add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
plot_do_curve(do_stats)
The plot_do_kpm function generates a Kaplan-Meier survival plot based on the
output from the do_kpm() function. It allows for customization of conditions
to display, confidence intervals, color palettes, and y-axis scaling.
plot_do_kpm(
  kds,
  sel_conds = c("11", "12", "21", "22"),
  kpm_ci = TRUE,
  full_scale_kpm = FALSE,
  color_palette_kp = "color_blind"
)
kds | list object as modeled by |
sel_conds | character Which experimental conditions to plot. |
kpm_ci | boolean Should there be confidence bands in the plot? Defaults to TRUE. |
full_scale_kpm | boolean Should the Y axis show the full range from 0 to 100? Defaults to FALSE. |
color_palette_kp | character indicating which color palette to use. Defaults to 'color_blind', alternatively choose 'gray' for gray scale values or 'default' for the ggplot2 default colors. |
Returns a ggplot object containing the Kaplan-Meier survival plot. Using the Shiny App version of
dropR, this plot can easily be downloaded in different formats.
# The argument is named df in the do_kpm() usage, so it is spelled out in full here
plot_do_kpm(do_kpm(df = add_dropout_idx(dropRdemo, 3:54),
                   condition_col = "experimental_condition",
                   model_fit = "total"))
plot_do_kpm(do_kpm(df = add_dropout_idx(dropRdemo, 3:54),
                   condition_col = "experimental_condition",
                   model_fit = "conditions"),
            sel_conds = c("11", "12", "21", "22"))
With this function, you can easily plot the most extreme conditions, i.e. those with the most
different dropout rates at a certain question. You need to define that question in the call to
do_ks() beforehand, or call that function directly inside the plot function.
plot_do_ks(
  do_stats,
  ks,
  linetypes = FALSE,
  show_confbands = FALSE,
  color_palette = c("#E69F00", "#CC79A7")
)
do_stats | data.frame containing dropout statistics table computed by |
ks | List of results from the |
linetypes | boolean Should different line types be used? Defaults to FALSE. |
show_confbands | boolean Should there be confidence bands added to the plot? Defaults to FALSE. |
color_palette | character indicating which color palette to use. Defaults to color blind friendly values, alternatively choose 'gray' or create your own palette with two colors, e.g. using R |
Returns a ggplot object containing the survival curve plot of the most extreme
dropout conditions. Using the Shiny App version of dropR, this plot can easily be downloaded in different formats.
do_stats <- compute_stats(add_dropout_idx(dropRdemo, 3:54),
                          by_cond = "experimental_condition",
                          no_of_vars = 52)
ks <- do_ks(do_stats, 52)
plot_do_ks(do_stats, ks, color_palette = "gray")
# ... or call the do_ks() function directly inside the plotting function
plot_do_ks(do_stats, do_ks(do_stats, 30))
plot_do_ks(do_stats, ks, linetypes = TRUE, show_confbands = TRUE, color_palette = c("red", "violet"))
Starts the interactive web application to use dropR in your web browser. Use Google Chrome or Firefox for the best experience.
start_app()
The app gives less experienced R users or statisticians a good overview of how to conduct dropout analysis. For more experienced analysts, it can still be helpful as a guide to the package, since some steps must be taken in order; this order is outlined in the app (as well as in the function documentation).
No return value; starts the shiny app as a helper to get started with dropout analysis. All app procedures are available as functions.