Package 'autocodebook' reference manual

Title:	Automatic Codebook and Tracking for 'Spark' and 'dplyr' Pipelines
Description:	Wraps 'dplyr' verbs (mutate, summarise, filter) to automatically capture variable metadata (type, source columns, categories, and source code), producing a codebook and eligibility tracking table with zero manual documentation. Works with both 'sparklyr' (tbl_spark) and local data frames. Adds big-data optimizations (caching, assume-unique counting, checkpointing) and a standardized report module with an eligibility flowchart, editable codebook export (HTML, DOCX, XLSX), and cross-sectional or longitudinal variable inspection.
Authors:	Patricia Fortes C. de Macedo [aut, cre]
Maintainer:	Patricia Fortes C. de Macedo <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.0
Built:	2026-06-09 09:02:13 UTC
Source:	https://github.com/patriciafortesm/autocodebook

Filter with automatic tracking

Description

Works exactly like dplyr::filter(), but also logs a tracking step recording how many unique IDs remain after the filter.

Usage

auto_filter(
  .data,
  step = "",
  description = "",
  ...,
  cache = NULL,
  assume_unique = FALSE
)

Arguments

.data

A Spark DataFrame or local data frame.

step

Character label for this filtering step.

description

Character description of the filter.

...

Filter conditions, same syntax as dplyr::filter().

cache

Logical or NULL (named-only). If TRUE, materializes the result with cb_checkpoint() after filtering, which is useful in long Spark pipelines. If NULL, falls back to the session default (set via cb_init() or cb_set_default_cache()). Default: NULL.

assume_unique

Logical (named-only). Passed to track_step(). Set TRUE only when you are certain the ID column has no duplicates at this stage. Default: FALSE.

Details

The signature mirrors v0.1.0 for full backward compatibility: step and description come first (so existing positional calls keep working), then ... for the filter conditions, and finally the new big-data options (cache, assume_unique) which must be passed by name.

Value

The filtered data frame.
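
The argument order described under Details follows from how R matches arguments around ...: anything that comes after ... in a signature can only be matched by name. The sketch below uses a hypothetical stand-in, demo_filter(), which has the same argument order as auto_filter() but only reports how its arguments were matched. It is an illustration of the calling convention, not the package's implementation.

```r
# Hypothetical stand-in: same argument order as auto_filter(), but it only
# reports how R matched the arguments instead of filtering anything.
demo_filter <- function(.data, step = "", description = "", ...,
                        cache = NULL, assume_unique = FALSE) {
  list(
    step          = step,
    description   = description,
    # Count the expressions captured by ... without evaluating them.
    n_conditions  = length(substitute(list(...))) - 1L,
    assume_unique = assume_unique
  )
}

df <- data.frame(id = 1:3, age = c(15, 30, 42))

# step and description are matched by position; the option is passed by name.
by_name <- demo_filter(df, "Adults", "Keep age >= 18", age >= 18,
                       assume_unique = TRUE)

# An unnamed TRUE after the condition is absorbed by ... as a second
# "condition", and assume_unique silently keeps its default.
by_position <- demo_filter(df, "Adults", "Keep age >= 18", age >= 18, TRUE)
```

In the second call the option is lost without any error, which is why cache and assume_unique are documented as named-only.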

Mutate with automatic codebook registration

Description

Works exactly like dplyr::mutate(), but also captures each expression and registers the resulting variable in the codebook. Type, source columns, categories, and source code are inferred automatically, so you only need to provide human-readable labels.

Usage

auto_mutate(.data, labels = list(), block = "", ...)

Arguments

.data

A Spark DataFrame (tbl_spark) or local data frame.

labels

Named list mapping variable names to labels (descriptions). Variables not in this list get their own name as label.

block

Optional character label for the pipeline block/section (e.g. "Demographic variables"). Groups variables in the codebook.

...

Named expressions, same syntax as dplyr::mutate().

Value

The transformed data frame (same class as input).
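
The fallback rule for labels (a variable missing from the list is labelled with its own name) can be sketched in base R. resolve_labels() is a hypothetical helper written for this illustration; how the package resolves labels internally is not documented here.

```r
# Hypothetical helper: look each variable up in `labels`, falling back to the
# variable's own name when no label was supplied.
resolve_labels <- function(vars, labels = list()) {
  vapply(
    vars,
    function(v) if (!is.null(labels[[v]])) labels[[v]] else v,
    character(1)
  )
}

resolved <- resolve_labels(
  vars   = c("age_group", "bmi"),
  labels = list(age_group = "Age group (years)")
)
```

Here age_group receives the supplied label, while bmi, which has no entry in labels, is labelled "bmi".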

Summarise with automatic codebook registration

Description

Works exactly like dplyr::summarise(), but also captures each expression and registers the resulting variable in the codebook.

Usage

auto_summarise(.data, labels = list(), block = "", ..., .groups = "drop")

Arguments

.data

A Spark DataFrame (tbl_spark) or local data frame.

labels

Named list mapping variable names to labels (descriptions). Variables not in this list get their own name as label.

block

Optional character label for the pipeline block/section (e.g. "Demographic variables"). Groups variables in the codebook.

...

Named expressions, same syntax as dplyr::summarise().

.groups

Grouping behavior after summarise. Default: "drop".

Value

The summarised data frame.

Checkpoint a Spark DataFrame

Description

Forces materialization of a lazy Spark plan. Useful in long pipelines where query plans get too deep and the optimizer starts re-computing upstream steps. For local data frames, this is a no-op.

Usage

cb_checkpoint(sdf, name = NULL, mode = c("memory", "disk", "register"))

Arguments

sdf

A Spark DataFrame (tbl_spark) or local data frame.

name

Optional. Name to register the checkpoint under (Spark only). If NULL, a temporary name is generated.

mode

Character. One of "memory" (cache in memory, fastest), "disk" (sdf_checkpoint via disk, more durable), or "register" (just register as temp table without caching). Default: "memory".

Value

The (possibly materialized) data frame.
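
The Usage line lists all three modes as the default value of mode, which is the usual R idiom for a choice argument resolved with match.arg(): the first element is the default and unknown values are rejected. Whether cb_checkpoint() uses match.arg() internally is an assumption; pick_mode() below is a hypothetical function that only demonstrates the idiom.

```r
# Hypothetical function showing the choice-argument idiom; it does not touch
# Spark and is not part of the package.
pick_mode <- function(mode = c("memory", "disk", "register")) {
  match.arg(mode)
}

default_mode <- pick_mode()        # no argument: first choice is used
disk_mode    <- pick_mode("disk")  # explicit choice

# A value outside the allowed set raises an error.
bad_mode <- tryCatch(pick_mode("tape"), error = function(e) "rejected")
```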

Export codebook to file

Description

Supports multiple formats based on file extension:

.html - rendered gt table (presentation)
.csv - raw tibble (programmatic reuse)
.docx - editable Word table (paper supplements, presentations)
.xlsx - editable spreadsheet with filters

Usage

cb_export(path = "codebook.html", variables = NULL, ...)

Arguments

path

File path. Extension determines format.

variables

Optional character vector. If provided, exports only these variables (in the given order). Default: all.

...

Additional arguments passed to cb_render() for HTML (e.g. show_code).

Details

For .docx and .xlsx, you can pass variables = c(...) to export only a subset (useful for paper supplements / presentations).

Value

Invisible path.
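
A minimal sketch of extension-based dispatch, using the four formats listed in the Description. export_format() is hypothetical and only returns a description of the format; the error for an unsupported extension is an assumption, since this manual does not say how cb_export() handles that case.

```r
# Hypothetical helper: map a file path to the export format implied by its
# extension. It describes the format only and writes no file.
export_format <- function(path) {
  ext <- tools::file_ext(path)
  switch(ext,
    html = "rendered gt table",
    csv  = "raw tibble",
    docx = "editable Word table",
    xlsx = "editable spreadsheet with filters",
    stop("unsupported extension: '", ext, "'")  # assumed behavior
  )
}

html_format <- export_format("codebook.html")
xlsx_format <- export_format("supplement/codebook.xlsx")
pdf_format  <- tryCatch(export_format("codebook.pdf"),
                        error = function(e) "error")
```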

Get the current codebook as a tibble

Description

Get the current codebook as a tibble

Usage

cb_get()

Value

A tibble with all registered variables.

Initialize autocodebook session

Description

Resets the codebook and tracking logs and sets the ID column used for counting unique individuals in track_step().

Usage

cb_init(id_col = "id", verbose = FALSE, default_cache = FALSE)

Arguments

id_col

Character. Name of the unique identifier column. Default: "id".

verbose

Logical. If TRUE, prints diagnostic messages from track_step(), auto_filter(), and cb_checkpoint(). Default: FALSE (silent, matching v0.1.0 behavior).

default_cache

Logical. If TRUE, big-data verbs cache intermediate results in Spark by default. Can be overridden per-call. Default: FALSE.

Value

Invisible NULL.

Examples

cb_init(id_col = "id_cidacs_pop100_v2")
cb_init(id_col = "id", verbose = TRUE, default_cache = TRUE)

Manually register a variable in the codebook

Description

Use when neither auto_mutate() nor auto_summarise() applies, e.g. for variables created via window_order + row_number in a separate pipeline step.

Usage

cb_register(
  var,
  label,
  type = NULL,
  source = "",
  categories = "",
  code = "",
  block = ""
)

Arguments

var

Variable name (character).

label

Human-readable description.

type

Optional. If NULL, defaults to "character".

source

Optional. Source column(s).

categories

Optional. Category descriptions.

code

Optional. Code that generated the variable.

block

Optional. Pipeline block label.

Value

Invisible NULL.

Render the codebook as a gt table

Description

Render the codebook as a gt table

Usage

cb_render(group_by_block = TRUE, show_code = TRUE)

Arguments

group_by_block

Logical. If TRUE and blocks are defined, groups rows by block. Default: TRUE.

show_code

Logical. Show the "code" column? Default: TRUE.

Value

A gt object.

Reset the codebook (clear all entries)

Description

Reset the codebook (clear all entries)

Usage

cb_reset()

Value

Invisible NULL.

Toggle default caching for big-data verbs

Description

Toggle default caching for big-data verbs

Usage

cb_set_default_cache(default_cache = TRUE)

Arguments

default_cache

Logical.

Value

Invisible previous value.
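
Per-call cache arguments, such as the one in auto_filter(), are documented as falling back to this session default when left as NULL. That precedence can be sketched as follows; resolve_cache() and its session_default argument are hypothetical names used for illustration only.

```r
# Hypothetical helper: an explicit per-call value wins; NULL means
# "use the session default".
resolve_cache <- function(cache = NULL, session_default = FALSE) {
  if (is.null(cache)) session_default else isTRUE(cache)
}

uses_default  <- resolve_cache(NULL,  session_default = TRUE)
overrides_off <- resolve_cache(FALSE, session_default = TRUE)
overrides_on  <- resolve_cache(TRUE,  session_default = FALSE)
```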

Toggle verbose diagnostic messages

Description

Controls whether track_step(), auto_filter() and cb_checkpoint() print diagnostic messages (n removed, elapsed time, etc.).

Usage

cb_set_verbose(verbose = TRUE)

Arguments

verbose

Logical.

Value

Invisible previous value.
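
Because the previous value is returned invisibly, a setting can be changed for one noisy step and then restored, in the same way as with base R's options(). The sketch below reproduces that pattern with a hypothetical stand-in, set_verbose_demo(), backed by a plain environment; it does not call the package.

```r
# Hypothetical stand-in for a setter that returns the previous value.
state <- new.env()
state$verbose <- FALSE

set_verbose_demo <- function(verbose = TRUE) {
  old <- state$verbose
  state$verbose <- verbose
  invisible(old)
}

old <- set_verbose_demo(TRUE)     # switch on, remember the previous setting
during <- state$verbose           # TRUE while the noisy step would run
set_verbose_demo(old)             # restore whatever was set before
after <- state$verbose
```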

Get the current flow tree (raw structure)

Description

Returns the internal flow representation. Mostly for debugging / programmatic access. For a tidy table, use flow_table().

Usage

flow_get()

Value

A list describing the flow tree.

Reset the flow tree

Description

Clears the CONSORT flow tree. Called automatically by cb_init().

Usage

flow_reset()

Value

Invisible NULL.

Flow tree as a tidy table

Description

Flattens the CONSORT flow tree into a publication-friendly data frame. One row per leaf x outcome. Split levels become named columns (using their labels), values are mapped through value_labels, and percentages are formatted as readable strings.

Usage

flow_table()

Value

A tibble.

Generate a standardized report from the current session

Description

Produces an HTML report combining the eligibility flowchart, the codebook, and a per-variable inspection panel. Supports two inspection modes, "cross_sectional" and "longitudinal", selected with type; see Details.

Usage

generate_report(
  data,
  type = c("cross_sectional", "longitudinal"),
  id_var = NULL,
  time_var = NULL,
  variables = NULL,
  labels = NULL,
  treat_as_categorical = NULL,
  output_html = "autocodebook_report.html",
  output_dir = NULL,
  export_codebook_editable = TRUE,
  cache_data = TRUE,
  title = NULL,
  n_bins = 30,
  top_n_cat = 20
)

Arguments

data

A Spark DataFrame (tbl_spark) or local data frame.

type

One of "cross_sectional" or "longitudinal".

id_var

Character. Name of the ID column. Required when type = "longitudinal". When type = "cross_sectional", it is only used to exclude the ID column from inspection.

time_var

Character or NULL. Name of the time/wave column. Used in longitudinal to compute missingness-over-time. Default: NULL.

variables

Optional character vector. If provided, inspects only these variables. Default: NULL (all except id_var/time_var).

labels

Optional named list (variable -> label). If NULL, uses labels from the codebook when available.

treat_as_categorical

Character vector of variable names to treat as categorical even when their R class is numeric or integer. Useful for coded variables (e.g. cod_sexo stored as 1L/2L, cod_raca stored as integer). For these variables, the report uses bar charts and proportion-by-time stacked plots instead of histograms / median+IQR. Default: NULL.

output_html

File path for the HTML output. Default: "autocodebook_report.html".

output_dir

Optional directory for ancillary files (codebook.xlsx, codebook.docx, etc.). If NULL, derived from output_html.

export_codebook_editable

Logical. Also export codebook as .docx and .xlsx in output_dir. Default: TRUE.

cache_data

Logical. If TRUE and data is a tbl_spark, persists the dataset once before the report aggregations, then releases it on exit. No-op for local data frames. Default: TRUE.

title

Optional title for the report.

n_bins

Number of bins for numeric histograms. Default: 30.

top_n_cat

Max categories shown in categorical plots. Default: 20.

Details

cross_sectional: one plot per variable (histogram / bar / time).
longitudinal: three plots per variable (global distribution, intra-ID variation, missingness by time) plus a meta plot of observations per ID.

All aggregations happen in Spark/dplyr; only small summaries are collected.

Value

Invisible list with paths to all generated files.

Examples

## Not run: 
# Cross-sectional
generate_report(df_baseline, type = "cross_sectional",
                id_var = "id_indiv",
                treat_as_categorical = c("cod_sexo", "cod_raca"),
                output_html = "report_baseline.html")

# Longitudinal
generate_report(df_long, type = "longitudinal",
                id_var   = "id_indiv",
                time_var = "ano_referencia",
                treat_as_categorical = c("cod_sexo"),
                output_html = "report_longitudinal.html")

## End(Not run)

Export tracking table to file

Description

Supports .html, .csv, .docx, and .xlsx.

Usage

track_export(path = "tracking_table.html", show_elapsed = FALSE)

Arguments

path

File path.

show_elapsed

Logical. Include elapsed_s column? Default: FALSE.

Value

Invisible path.

Get the current tracking log as a tibble

Description

Get the current tracking log as a tibble

Usage

track_get()

Value

A tibble with all tracking steps.

Attach outcome counts to the current leaves (CONSORT flowchart)

Description

Adds one or more outcome variables, counted within each current leaf (combination of all splits so far). Outcomes are stacked, not branched. Each outcome is treated as binary: counts how many individuals have value == 1 (or TRUE), plus the percentage within the leaf.

Usage

track_outcomes(sdf, vars, labels = NULL)

Arguments

sdf

A Spark DataFrame or local data frame.

vars

Character vector of outcome column names (binary 0/1 or logical).

labels

Optional named list (var -> label).

Value

sdf unchanged (for piping).
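
The per-leaf counting described above can be illustrated on a local data frame. outcome_by_leaf() is a hypothetical base R helper, not the package's code; it assumes one row per individual and takes the split columns explicitly rather than reading them from the flow tree.

```r
# Hypothetical helper: one row per leaf, with the leaf size, the number of
# individuals whose outcome equals 1 (or TRUE), and the within-leaf percentage.
outcome_by_leaf <- function(df, split_cols, outcome) {
  # A leaf is one combination of the split columns' values.
  leaf <- do.call(paste, c(unname(as.list(df[split_cols])), sep = " / "))
  # Missing outcomes are not counted as events.
  hit  <- !is.na(df[[outcome]]) & df[[outcome]] == 1
  n      <- tapply(hit, leaf, length)
  events <- tapply(hit, leaf, sum)
  data.frame(
    leaf   = names(n),
    n      = as.integer(n),
    events = as.integer(events),
    pct    = round(100 * as.integer(events) / as.integer(n), 1),
    row.names = NULL
  )
}

cohort <- data.frame(
  exposed  = c(0, 0, 0, 1, 1, 1, 1, 1),
  migrated = c(0, 0, 1, 0, 1, 1, 1, 1),
  death    = c(0, 1, 0, 1, 0, 1, 1, 0)
)

res <- outcome_by_leaf(cohort, c("exposed", "migrated"), "death")
```

Each additional outcome would add another set of count and percentage values for the same leaves, which matches the "stacked, not branched" wording: outcomes never create new leaves.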

Render the tracking log as a gt table

Description

Render the tracking log as a gt table

Usage

track_render(show_elapsed = FALSE)

Arguments

show_elapsed

Logical. Show the elapsed_s column if present? Default: FALSE.

Value

A gt object.

Reset the tracking log

Description

Reset the tracking log

Usage

track_reset()

Value

Invisible NULL.

Split the cohort into branches by a column (CONSORT flowchart)

Description

Adds one branching level to the flow tree. The cohort is divided by the distinct values of by. Chain multiple track_split() calls to create nested branches (e.g. exposure then mediator). Passes the data through unchanged, so it fits in a %>% pipeline.

Usage

track_split(sdf, by, label = NULL, value_labels = NULL, max_levels = 3L)

Arguments

sdf

A Spark DataFrame or local data frame.

by

Character. Column name to split by. Its distinct values become the branches. NA values are grouped as "(NA)".

label

Optional character. A human-readable name for this split level (e.g. "Exposure: drought"). Defaults to by.

value_labels

Optional named character vector mapping raw values to readable labels, e.g. c("0" = "No drought", "1" = "Drought"). If not given, the function tries factor levels / labelled attributes on the column; failing that, uses the raw value.

max_levels

Integer. Safety cap on nesting depth. Default 3.

Value

sdf unchanged (for piping).
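
How branch names are derived from by and value_labels can be sketched as below. label_branches() is a hypothetical helper covering only two of the documented rules: NA values are grouped as "(NA)", and a value with no entry in value_labels keeps its raw value. The factor-level and labelled-attribute lookups are left out.

```r
# Hypothetical helper: turn raw split values into branch names.
label_branches <- function(x, value_labels = NULL) {
  # Missing values form their own branch.
  raw <- ifelse(is.na(x), "(NA)", as.character(x))
  if (is.null(value_labels)) return(raw)
  # Look up each raw value by name; unmatched values come back as NA.
  mapped <- unname(value_labels[raw])
  ifelse(is.na(mapped), raw, mapped)
}

exposure <- c(0, 1, NA, 1, 2)
branches <- label_branches(
  exposure,
  value_labels = c("0" = "No drought", "1" = "Drought")
)
branch_sizes <- table(branches)
```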

Examples

## Not run: 
df %>%
  track_split(by = "exposto_seca", label = "Exposure: drought",
              value_labels = c("0" = "No drought", "1" = "Drought")) %>%
  track_split(by = "migrou", label = "Mediator: migration",
              value_labels = c("0" = "Did not migrate", "1" = "Migrated")) %>%
  track_outcomes(c("obito_dcv", "obito_infec"))

## End(Not run)

Record a tracking step

Description

Counts unique individuals in the current data and logs the step. Works with both tbl_spark and local data frames.

Usage

track_step(sdf, step_label, description = "", assume_unique = FALSE)

Arguments

sdf

A Spark DataFrame or local data frame.

step_label

Short label for the step.

description

Longer description.

assume_unique

Logical. If TRUE, skips the distinct() call when counting (use only when you are certain the ID column has no duplicates at this stage, e.g. after a deduplication step). Default: FALSE.

Value

Invisible integer: number of unique IDs.
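
The effect of assume_unique on the count can be seen on a small local data frame. count_ids() is a hypothetical base R helper that mirrors the documented behavior (distinct IDs by default, plain row count when the distinct step is skipped); it is not the package's implementation.

```r
# Hypothetical helper: count individuals, optionally skipping deduplication.
count_ids <- function(df, id_col = "id", assume_unique = FALSE) {
  if (assume_unique) {
    nrow(df)                        # cheap, but only correct without duplicates
  } else {
    length(unique(df[[id_col]]))    # always correct, costs a distinct pass
  }
}

# Three individuals observed over several waves (IDs repeat).
visits <- data.frame(id = c(1, 1, 2, 3, 3, 3), wave = c(1, 2, 1, 1, 2, 3))

n_distinct_ids <- count_ids(visits)
n_assumed      <- count_ids(visits, assume_unique = TRUE)

# After keeping one row per ID, both counts agree.
first_rows     <- visits[!duplicated(visits$id), ]
n_after_dedup  <- count_ids(first_rows, assume_unique = TRUE)
```

On the long-format data the shortcut overcounts (6 rows for 3 individuals), so assume_unique = TRUE is only safe once the data holds one row per ID.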
