| Title: | Automatic Codebook and Tracking for 'Spark' and 'dplyr' Pipelines |
|---|---|
| Description: | Wraps 'dplyr' verbs (mutate, summarise, filter) to automatically capture variable metadata (type, source columns, categories, and source code), producing a codebook and eligibility tracking table with zero manual documentation. Works with both 'sparklyr' (tbl_spark) and local data frames. Adds big-data optimizations (caching, assume-unique counting, checkpointing) and a standardized report module with an eligibility flowchart, editable codebook export (HTML, DOCX, XLSX), and cross-sectional or longitudinal variable inspection. |
| Authors: | Patricia Fortes C. de Macedo [aut, cre] |
| Maintainer: | Patricia Fortes C. de Macedo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-09 09:02:13 UTC |
| Source: | https://github.com/patriciafortesm/autocodebook |
Works exactly like dplyr::filter(), but also logs a tracking step
recording how many unique IDs remain after the filter.
auto_filter(.data, step = "", description = "", ..., cache = NULL, assume_unique = FALSE)
.data |
A Spark DataFrame or local data frame. |
step |
Character label for this filtering step. |
description |
Character description of the filter. |
... |
Filter conditions, same syntax as |
cache |
Logical or NULL (named-only). If TRUE, materializes the
result with |
assume_unique |
Logical (named-only). Passed to |
The signature mirrors v0.1.0 for full backward compatibility: step
and description come first (so existing positional calls keep working),
then ... for the filter conditions, and finally the new big-data
options (cache, assume_unique) which must be passed by name.
The filtered data frame.
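A minimal usage sketch on a local data frame, assuming the package is installed and that local data frames are handled as described above. The `patients` data frame and its columns are hypothetical. Because `step` and `description` come before `...` in the signature, naming them keeps an unnamed filter condition from being bound to `step` under R's positional argument matching.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data; `id` is the unique identifier column.
patients <- data.frame(
  id  = 1:5,
  age = c(12, 25, 40, 17, 63)
)

# Reset the logs and declare which column identifies individuals.
cb_init(id_col = "id")

adults <- patients %>%
  auto_filter(
    step        = "Adults only",
    description = "Keep individuals aged 18 or older",
    age >= 18
  )

# Inspect the logged tracking steps as a tibble.
track_get()
```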
Works exactly like dplyr::mutate(), but also captures each expression
and registers the resulting variable in the codebook. Type, source columns,
categories, and source code are inferred automatically - you only need to
provide human-readable labels.
auto_mutate(.data, labels = list(), block = "", ...)
.data |
A Spark DataFrame (tbl_spark) or local data frame. |
labels |
Named list mapping variable names to labels (descriptions). Variables not in this list get their own name as label. |
block |
Optional character label for the pipeline block/section (e.g. "Demographic variables"). Groups variables in the codebook. |
... |
Named expressions, same syntax as |
The transformed data frame (same class as input).
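A short sketch of a labelled mutate step, assuming the package is installed. The data and variable names are hypothetical. `is_adult` is deliberately left out of `labels` to illustrate the documented fallback of using the variable name as its label.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data.
patients <- data.frame(id = 1:4, age = c(12, 25, 40, 70))

cb_init(id_col = "id")

patients2 <- patients %>%
  auto_mutate(
    labels = list(
      age_group = "Age group (child, adult, senior)"
    ),
    block = "Demographic variables",
    age_group = case_when(
      age < 18 ~ "child",
      age < 65 ~ "adult",
      TRUE     ~ "senior"
    ),
    is_adult = as.integer(age >= 18)
  )

# The codebook should now hold entries for the new variables.
cb_get()
```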
Works exactly like dplyr::summarise(), but also captures each expression
and registers the resulting variable in the codebook.
auto_summarise(.data, labels = list(), block = "", ..., .groups = "drop")
.data |
A Spark DataFrame (tbl_spark) or local data frame. |
labels |
Named list mapping variable names to labels (descriptions). Variables not in this list get their own name as label. |
block |
Optional character label for the pipeline block/section (e.g. "Demographic variables"). Groups variables in the codebook. |
... |
Named expressions, same syntax as |
.groups |
Grouping behavior after summarise. Default: "drop". |
The summarised data frame.
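A sketch of a grouped summary, assuming the package is installed and that grouping set by `dplyr::group_by()` is respected in the same way as with `dplyr::summarise()`. The `visits` data and the variable names are hypothetical.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data: several visits per individual.
visits <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  cost = c(10, 20, 5, 7, 8, 9)
)

cb_init(id_col = "id")

per_person <- visits %>%
  group_by(id) %>%
  auto_summarise(
    labels = list(
      n_visits   = "Number of visits",
      total_cost = "Total cost across all visits"
    ),
    block = "Utilization",
    n_visits   = n(),
    total_cost = sum(cost, na.rm = TRUE)
  )

cb_get()
```

With the default `.groups = "drop"`, the result is expected to come back ungrouped.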
Forces materialization of a lazy Spark plan. Useful in long pipelines where query plans get too deep and the optimizer starts re-computing upstream steps. For local data frames, this is a no-op.
cb_checkpoint(sdf, name = NULL, mode = c("memory", "disk", "register"))
sdf |
A Spark DataFrame (tbl_spark) or local data frame. |
name |
Optional. Name to register the checkpoint under (Spark only). If NULL, a temporary name is generated. |
mode |
Character. One of |
The (possibly materialized) data frame.
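A sketch of where a checkpoint might sit in a pipeline, assuming the package is installed. It runs on a hypothetical local data frame, where the call is documented as a no-op; with a `tbl_spark` the same line would force materialization. The checkpoint name `after_bmi` is illustrative only.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data.
measurements <- data.frame(
  id        = 1:3,
  weight_kg = c(60, 72, 85),
  height_m  = c(1.60, 1.75, 1.80)
)

cb_init(id_col = "id")

with_bmi <- measurements %>%
  auto_mutate(
    labels = list(bmi = "Body mass index (kg/m^2)"),
    bmi = weight_kg / height_m^2
  ) %>%
  # Cut the query plan here before further steps (no-op for local data).
  cb_checkpoint(name = "after_bmi", mode = "memory")
```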
Exports the codebook. Supports multiple formats based on file extension:
.html - rendered gt table (presentation)
.csv - raw tibble (programmatic reuse)
.docx - editable Word table (paper supplements, presentations)
.xlsx - editable spreadsheet with filters
cb_export(path = "codebook.html", variables = NULL, ...)
path |
File path. Extension determines format. |
variables |
Optional character vector. If provided, exports only these variables (in the given order). Default: all. |
... |
Additional arguments passed to cb_render() for HTML (e.g. show_code). |
For .docx and .xlsx, you can pass variables = c(...) to export only
a subset (useful for paper supplements / presentations).
Invisible path.
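A sketch of exporting the codebook, assuming the package is installed. It writes a `.csv` to a temporary directory so nothing is left behind; the data and variable names are hypothetical. Changing the extension to `.html`, `.docx`, or `.xlsx` would select the other formats listed above.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data.
patients <- data.frame(id = 1:3, age = c(30, 45, 70))

cb_init(id_col = "id")

patients <- patients %>%
  auto_mutate(
    labels = list(senior = "Aged 65 or older (1 = yes, 0 = no)"),
    block  = "Demographic variables",
    senior = as.integer(age >= 65)
  )

# The file extension determines the output format.
out_path <- file.path(tempdir(), "codebook.csv")
cb_export(path = out_path)
```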
Get the current codebook as a tibble
cb_get()
A tibble with all registered variables.
Resets the codebook and tracking logs and sets the ID column used for counting unique individuals in track_step().
cb_init(id_col = "id", verbose = FALSE, default_cache = FALSE)
id_col |
Character. Name of the unique identifier column. Default: "id". |
verbose |
Logical. If TRUE, prints diagnostic messages from track_step(), auto_filter(), and cb_checkpoint(). Default: FALSE (matches v0.1.0 behavior - silent). |
default_cache |
Logical. If TRUE, big-data verbs cache intermediate results in Spark by default. Can be overridden per-call. Default: FALSE. |
Invisible NULL.
cb_init(id_col = "id_cidacs_pop100_v2")
cb_init(id_col = "id", verbose = TRUE, default_cache = TRUE)
Registers a variable in the codebook manually. Use when auto_mutate()/auto_summarise() do not apply, e.g. for variables created via window_order + row_number in a separate pipeline step.
cb_register(var, label, type = NULL, source = "", categories = "", code = "", block = "")
var |
Variable name (character). |
label |
Human-readable description. |
type |
Optional. If NULL, defaults to "character". |
source |
Optional. Source column(s). |
categories |
Optional. Category descriptions. |
code |
Optional. Code that generated the variable. |
block |
Optional. Pipeline block label. |
Invisible NULL.
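A sketch of manual registration for a variable built outside the auto_* verbs, assuming the package is installed. The data, the `visit_order` variable, and the value passed to `type` are hypothetical; which type strings the codebook expects is an assumption here.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data: visits per individual.
visits <- data.frame(
  id         = c(1, 1, 2),
  visit_date = as.Date(c("2020-03-01", "2020-01-15", "2020-02-10"))
)

cb_init(id_col = "id")

# Variable created with plain dplyr, so nothing is captured automatically.
visits <- visits %>%
  group_by(id) %>%
  arrange(visit_date, .by_group = TRUE) %>%
  mutate(visit_order = row_number()) %>%
  ungroup()

# Document it by hand.
cb_register(
  var    = "visit_order",
  label  = "Order of the visit within each individual",
  type   = "integer",  # assumed type string
  source = "visit_date",
  code   = "row_number() within id, ordered by visit_date",
  block  = "Visit history"
)

cb_get()
```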
Render the codebook as a gt table
cb_render(group_by_block = TRUE, show_code = TRUE)
group_by_block |
Logical. If TRUE and blocks are defined, groups rows by block. Default: TRUE. |
show_code |
Logical. Show the "code" column? Default: TRUE. |
A gt object.
Reset the codebook (clear all entries)
cb_reset()
Invisible NULL.
Toggle default caching for big-data verbs
cb_set_default_cache(default_cache = TRUE)
default_cache |
Logical. |
Invisible previous value.
Controls whether track_step(), auto_filter() and cb_checkpoint() print diagnostic messages (n removed, elapsed time, etc.).
cb_set_verbose(verbose = TRUE)
verbose |
Logical. |
Invisible previous value.
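Because the previous value is returned, verbosity can be switched on for part of a pipeline and then restored. A minimal sketch, assuming the package is installed; the `cohort` data frame is hypothetical.

```r
library(autocodebook)

# Hypothetical toy data.
cohort <- data.frame(id = 1:3)

cb_init(id_col = "id", verbose = FALSE)

# Turn diagnostics on and remember the previous setting.
old_verbose <- cb_set_verbose(TRUE)

track_step(cohort, "Raw data", "All records before exclusions")

# Restore whatever was set before.
cb_set_verbose(old_verbose)
```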
Returns the internal flow representation. Mostly for debugging /
programmatic access. For a tidy table, use flow_table().
flow_get()
A list describing the flow tree.
Clears the CONSORT flow tree. Called automatically by cb_init().
flow_reset()
Invisible NULL.
Flattens the CONSORT flow tree into a publication-friendly data frame. One row per leaf x outcome. Split levels become named columns (using their labels), values are mapped through value_labels, and percentages are formatted as readable strings.
flow_table()
A tibble.
Produces an HTML report combining the eligibility flowchart, the codebook, and a per-variable inspection panel. Supports two inspection modes:
generate_report(
  data,
  type = c("cross_sectional", "longitudinal"),
  id_var = NULL,
  time_var = NULL,
  variables = NULL,
  labels = NULL,
  treat_as_categorical = NULL,
  output_html = "autocodebook_report.html",
  output_dir = NULL,
  export_codebook_editable = TRUE,
  cache_data = TRUE,
  title = NULL,
  n_bins = 30,
  top_n_cat = 20
)
data |
A Spark DataFrame (tbl_spark) or local data frame. |
type |
One of |
id_var |
Character. Name of the ID column. For |
time_var |
Character or NULL. Name of the time/wave column.
Used in |
variables |
Optional character vector. If provided, inspects only these variables. Default: NULL (all except id_var/time_var). |
labels |
Optional named list (variable -> label). If NULL, uses labels from the codebook when available. |
treat_as_categorical |
Character vector of variable names to treat
as categorical even when their R class is numeric or integer. Useful
for coded variables (e.g. |
output_html |
File path for the HTML output. Default: "autocodebook_report.html". |
output_dir |
Optional directory for ancillary files (codebook.xlsx, codebook.docx, etc.). If NULL, derived from output_html. |
export_codebook_editable |
Logical. Also export codebook as
.docx and .xlsx in |
cache_data |
Logical. If TRUE and |
title |
Optional title for the report. |
n_bins |
Number of bins for numeric histograms. Default: 30. |
top_n_cat |
Max categories shown in categorical plots. Default: 20. |
cross_sectional: one plot per variable (histogram / bar / time).
longitudinal: three plots per variable (global distribution, intra-ID
variation, missingness by time) plus a meta plot of observations per ID.
All aggregations happen in Spark/dplyr; only small summaries are collected.
Invisible list with paths to all generated files.
## Not run:
# Cross-sectional
generate_report(df_baseline,
  type = "cross_sectional",
  id_var = "id_indiv",
  treat_as_categorical = c("cod_sexo", "cod_raca"),
  output_html = "report_baseline.html")
# Longitudinal
generate_report(df_long,
  type = "longitudinal",
  id_var = "id_indiv",
  time_var = "ano_referencia",
  treat_as_categorical = c("cod_sexo"),
  output_html = "report_longitudinal.html")
## End(Not run)
Exports the tracking table. Supports .html, .csv, .docx, and .xlsx.
track_export(path = "tracking_table.html", show_elapsed = FALSE)
path |
File path. |
show_elapsed |
Logical. Include elapsed_s column? Default: FALSE. |
Invisible path.
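A sketch of exporting the tracking log after a filter step, assuming the package is installed. The data are hypothetical, and the `.csv` target in a temporary directory is chosen only to keep the example free of side effects.

```r
library(dplyr)
library(autocodebook)

# Hypothetical toy data.
cohort <- data.frame(id = 1:6, age = c(10, 22, 35, 16, 58, 41))

cb_init(id_col = "id")

adults <- cohort %>%
  auto_filter(
    step        = "Adults only",
    description = "Keep individuals aged 18 or older",
    age >= 18
  )

# Write the tracking table, including the elapsed_s column.
out_path <- file.path(tempdir(), "tracking_table.csv")
track_export(path = out_path, show_elapsed = TRUE)
```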
Get the current tracking log as a tibble
track_get()
A tibble with all tracking steps.
Adds one or more outcome variables, counted within each current leaf (combination of all splits so far). Outcomes are stacked, not branched. Each outcome is treated as binary: counts how many individuals have value == 1 (or TRUE), plus the percentage within the leaf.
track_outcomes(sdf, vars, labels = NULL)
sdf |
A Spark DataFrame or local data frame. |
vars |
Character vector of outcome column names (binary 0/1 or logical). |
labels |
Optional named list (var -> label). |
sdf unchanged (for piping).
Render the tracking log as a gt table
track_render(show_elapsed = FALSE)
show_elapsed |
Logical. Show the elapsed_s column if present? Default: FALSE. |
A gt object.
Reset the tracking log
track_reset()
Invisible NULL.
Adds one branching level to the flow tree. The cohort is divided by the
distinct values of by. Chain multiple track_split() calls to create
nested branches (e.g. exposure then mediator). Passes the data through
unchanged, so it fits in a %>% pipeline.
track_split(sdf, by, label = NULL, value_labels = NULL, max_levels = 3L)
sdf |
A Spark DataFrame or local data frame. |
by |
Character. Column name to split by. Its distinct values become the branches. NA values are grouped as "(NA)". |
label |
Optional character. A human-readable name for this split
level (e.g. "Exposure: drought"). Defaults to |
value_labels |
Optional named character vector mapping raw values to
readable labels, e.g. |
max_levels |
Integer. Safety cap on nesting depth. Default 3. |
sdf unchanged (for piping).
## Not run:
df %>%
  track_split(by = "exposto_seca",
    label = "Exposure: drought",
    value_labels = c("0" = "No drought", "1" = "Drought")) %>%
  track_split(by = "migrou",
    label = "Mediator: migration",
    value_labels = c("0" = "Did not migrate", "1" = "Migrated")) %>%
  track_outcomes(c("obito_dcv", "obito_infec"))
## End(Not run)
Counts unique individuals in the current data and logs the step. Works with both tbl_spark and local data frames.
track_step(sdf, step_label, description = "", assume_unique = FALSE)
sdf |
A Spark DataFrame or local data frame. |
step_label |
Short label for the step. |
description |
Longer description. |
assume_unique |
Logical. If TRUE, skips the |
Invisible integer: number of unique IDs.
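A sketch of logging a baseline count before any exclusions, assuming the package is installed. The `cohort` data are hypothetical and contain a repeated ID on purpose, so the logged count reflects unique individuals rather than rows. `assume_unique` is left at its default.

```r
library(autocodebook)

# Hypothetical toy data: individual 2 appears twice.
cohort <- data.frame(
  id  = c(1, 2, 2, 3, 4),
  age = c(30, 41, 41, 25, 67)
)

cb_init(id_col = "id")

# The return value is invisible, so assign it to use it.
n_start <- track_step(
  cohort,
  step_label  = "Raw data",
  description = "All records before exclusions"
)

# Full tracking log so far.
track_get()
```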