Can you over-engineer data cleaning? Let’s find out!

Ben Harrap

2025-04-17

Intro

About me

Doing data things since 2017:

Randomised trials
Observational studies
Longitudinal surveys
Administrative data
Data linkage
REDCap (and similar)

Currently work at Yardhura Walani on the Mayi Kuwayu Study

About the Mayi Kuwayu Study

The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.

~12,000 responses to Wave 1
Currently at the tail-end of Wave 2
Questionnaire provided in paper and online formats
Indigenous Data Sovereignty and Governance
Won the NHMRC David Cooper Clinical Trials and Cohort Studies Award

About the data cleaning rework

Guiding principles:

Write code that I would like to inherit
Talk to people interacting with data
- Data entry, data analysis, participant admin, survey design
Bring people along
- Exposure to new ways of doing stuff
- Professional development/upskilling
Be open to being wrong

About the data cleaning rework

Packages & stuff:

tidyverse
targets
testthat
pointblank
REDCapR
janitor
conflicted
renv

OCR tools
REDCap
Quarto
- _brand.yml
git/GitHub

Data setup

flowchart TD
    paper[Paper response] -->|digitsed| scan[Compiled in csv]
    scan -->|uploaded| redcap[REDCap]
    online[Online response] -->|direct| redcap
    dictionary[Dictionary] --> redcap
    audit[Audits] --> redcap

Data dictionary

Exported from REDCap using the API

dictionary_raw <- redcap_metadata_read(
  redcap_uri = credential$redcap_uri, 
  token = credential$token)$data

The ‘raw’ dictionary might look like:

field_name	field_type	select_response_option_values
id	text
age	text
children	text
online_employment	radio	1, Full-time \| 2, Part-time \| 3, Casual \| 4, Retired
paper_employment	text	1, Full-time \| 2, Part-time \| 3, Casual \| 4, Retired

Need to have online and paper data separately

Minor differences between questionnaire formats
Can’t force field validation on paper format

Data dictionary

Raw dictionary from REDCap needs cleaning:

Dual purpose output
- Reference for end-users
- Used in data cleaning
Lots of regex
Coding it makes the dictionary reproducible

variable	type	label	values
id	character	ID
age	numeric	Age in years
children	numeric	Number of children
employment	factor	Employment category	1, Full-time \| 2, Part-time \| 3, Casual \| 4, Retired

Data dictionary

From the cleaned dictionary, also create a value-label correspondence

variable	type	value	value_label
employment	factor	1	Full-time
employment	factor	2	Part-time
employment	factor	3	Casual
employment	factor	4	Retired

Use it to apply labels to data once it’s been cleaned

Data cleaning

Overview

flowchart TD
    online[Raw online] -->|pre-cleaning| combined[Harmonised data]
    paper[Raw paper] -->|pre-cleaning| combined
    combined --> preclean[Ready for cleaning]
    audit[Audit data] --> preclean
    preclean -->|cleaning| int[Integer]
    preclean -->|cleaning| num[Numeric]
    preclean -->|cleaning| dte[Date]
    preclean -->|cleaning| fct[Factor]
    preclean -->|cleaning| str[String]
    int -->|re-combine| clean[Clean data]
    num -->|re-combine| clean
    dte -->|re-combine| clean
    fct -->|re-combine| clean
    str -->|re-combine| clean
    ext[External data] -->|integrate| clean

Pre-cleaning

Apply consistent coding

data_prepared_paper <- data_raw_paper |>
   mutate(
    # The paper data is coded as 1/2 (yes/no) whereas the online is coded as 0/1 (no/yes)
    consent = case_when(
      consent == "2" ~ "0",
      consent == "1" ~ "1",
      .default = consent
      ))

Rename/remove variables

data_prepared_online <- data_raw_online |>
  # Remove `__1` from variable names which REDCap creates for checkbox questions
  rename_with(~ str_replace(., "___1", ""), matches("___1")) |>
  # remove the leading "online_" so we can combine with the paper data
  rename_with(~ str_replace(., "online_", ""), starts_with("online_")) |> 
  select(-starts_with("honeypot"))

Check with compare_df_cols()

Example data

For the purposes of illustration:

id	source	age	children	employment
1	online	25	1	1
2	online	55.2	NA	1
3	paper	^2	one	4

Validation

First step, reshape the data to run validation on one column:

data_validate_integer <- data_raw |> 
  select(id, source, matches(dictionary_clean$variable[dictionary_clean$type == "numeric"])) |>
  pivot_longer(!c(id, source),
               names_to = "variable",
               values_to = "response")

data_validate_integer

# A tibble: 6 × 4
  id    source variable response
  <chr> <chr>  <chr>    <chr>   
1 1     online age      25      
2 1     online children 1       
3 2     online age      55.2    
4 2     online children <NA>    
5 3     paper  age      ^2      
6 3     paper  children one

Validation

Cleaning & validation occurring at the same time, following the pattern of rule_* and validate_*:

data_validate_integer <- data_validate_integer |> 
  mutate(
    rule_only_digits = if_else(str_detect(response,"[^-|\\d+|\\.]+"),FALSE,TRUE,FALSE),
    validate_only_digits = case_when(
      rule_only_digits == FALSE ~ str_remove(response,"\\D+"),
      .default = response
    ))

data_validate_integer

# A tibble: 6 × 6
  id    source variable response rule_only_digits validate_only_digits
  <chr> <chr>  <chr>    <chr>    <lgl>            <chr>               
1 1     online age      25       TRUE             "25"                
2 1     online children 1        TRUE             "1"                 
3 2     online age      55.2     TRUE             "55.2"              
4 2     online children <NA>     FALSE             <NA>               
5 3     paper  age      ^2       FALSE            "2"                 
6 3     paper  children one      FALSE            ""

Auditing

flowchart LR
    rawdata[Raw data] --> outcome{Validation}
    outcome -->|pass| clean[Clean data]
    outcome -->|fail| audit[Returned for audit]
    audit -->|uploaded| redcap[REDCap form]
    redcap -->|audited| correct[Correct values]
    correct -->|updates| rawdata

Any observations that fail an audit rule and are from a paper response get audited

Checked against original scan
Updated values entered in REDCap
Easy for data entry team

Auditing

Feedback loop between the raw data and data validation output

Data entry team check the audit list when new audits are required
Fill in a REDCap form after checking the scanned original
This gets incorporated back into cleaning

For example:

id	variable	old_value	new_value
3	children	one	1
3	age	^2	72

Auditing

First pivot dataset into the right format

data_prepared <- data_raw |> 
  pivot_longer(
    all_of(data_audit$variable),
    names_to = "variable",
    values_to = "value"
    )

data_prepared

# A tibble: 6 × 5
  id    source employment variable value
  <chr> <chr>  <chr>      <chr>    <chr>
1 1     online 1          children 1    
2 1     online 1          age      25   
3 2     online 1          children <NA> 
4 2     online 1          age      55.2 
5 3     paper  4          children one  
6 3     paper  4          age      ^2

Then update the rows

data_prepared <- data_prepared |> 
  rows_update(
    data_audit |> 
      rename(value = new_value) |> 
      select(value, variable, id),
    by = c("id","variable")
  )
data_prepared

# A tibble: 6 × 5
  id    source employment variable value
  <chr> <chr>  <chr>      <chr>    <chr>
1 1     online 1          children 1    
2 1     online 1          age      25   
3 2     online 1          children <NA> 
4 2     online 1          age      55.2 
5 3     paper  4          children 1    
6 3     paper  4          age      72

Then pivot_wider() back to the original shape.

Labelling

Automating the labels was fun - first generate a list of the value/label correspondence

# Collapse the labels (which are currently in rows) into lists
dictionary_label_lists <- dictionary_labels |> 
  # For each unique variable
  group_by(variable) |> 
  # Generate the label/value correspondence as a list
  summarise(label_values = list(setNames(value, value_label)))

dictionary_label_lists

# A tibble: 1 × 2
  variable   label_values
  <chr>      <list>      
1 employment <chr [4]>

unlist(dictionary_label_lists$label_values)

Full-time Part-time    Casual   Retired 
      "1"       "2"       "3"       "4"

Labelling

Then apply the list using across() and cur_column()

data_prepared |> mutate(across(any_of(dictionary_clean$variable[dictionary_clean$type == "factor"]), ~
    {
      # Get the labels for the current variable
      labels <- dictionary_label_lists |>
        filter(variable == cur_column()) |>
        pull(label_values) |>
        first()
      # Apply the labels to the current variable
      labelled(.x, labels)
    }))

# A tibble: 6 × 5
  id    source employment    variable value
  <chr> <chr>  <chr+lbl>     <chr>    <chr>
1 1     online 1 [Full-time] children 1    
2 1     online 1 [Full-time] age      25   
3 2     online 1 [Full-time] children <NA> 
4 2     online 1 [Full-time] age      55.2 
5 3     paper  4 [Retired]   children 1    
6 3     paper  4 [Retired]   age      72

Extra stuff

Integrate new data, make new variables

Create commonly computed variables
- Age brackets
- Employment and study categories
Calculate survey weights
Integrate external data
- Address
- Geographies

Not just data cleaning

Overview

flowchart TD
    code[Codebase] --> main[Maintainer manual]
    data[Clean data] -->  dict[Data dictionary]
    data --> qual[Data cleaning report]
    qual --> rede[Survey redesign]
    data --> main
    qual --> user[User manual]
    dict --> user

Code maintainer manual

A technical document that covers:

All of the technology involved
- Why
- How to set it up
Explainers of the code
- Conceptual goal
- Snippets
- Reasons for given implementation

Data cleaning reports

As the cleaning code is run, it plucks out summaries along the way:

report_data_quality <- tibble(
  stage = "Import",
  item = "Count of audit records",
  value = as.character(nrow(data_audit))
) |> 
  add_row(
    stage = "Import",
    item = "Count of records from paper instrument",
    value = as.character(nrow(data_raw_paper))
  )

pointblank can also create validation reports

Report on:

Number of observations
- At each step
- Failing validation
- Dropped, with reasons
Missingness
Distribution of responses
- Tables/figures
- Statistical tests

Data user manual

A relatively plain-language, accessible user manual

Mayi Kuwayu details
- Study design
- Implementation

Cleaning details
- Overview
- External data sources
- Calculated variables
- Important decisions & idiosyncrasies

Data details
- Context
- Validation rules
- Quality
- Dictionary

Uses Quarto, incorporates output from the pipeline

Variable naming convention

See blog post: https://benharrap.com/post/2025-03-03-variable-naming-convention/

Landed on {wave}_{theme}_{question wording}_{optional_identifier}, where:

{wave} = current wave, 3 characters
{theme} = question theme, 7 characters
{wording} = question wording, 7 characters
{optional identifier} = optional suffix, maximum 6 characters

For example:

w02_culture_activty
w02_contact_phoneno_1
w02_contact_phoneno_2

Survey redesign

Shiny dashboard with relevant information about each wave:

Question performance
- Missingness
Response distributions
- Limited variability
- Floor/ceiling effects
Missingness patterns
Question sequence violations
- Paper only

Code generalisability

Currently working on code for Wave 2, but intend for the code to:

Generalise across waves
- E.g. validation rules for variable types
Generalise across studies
Provide a framework for future studies

Pretty cool huh

Problems (and solutions?)

Microsoft Excel
- Don’t use it
I’m an academic
- Square peg |> round hole
I’m busy
- ???
Someone else needs to maintain it
- Train people
- Write good code

Questions for you

What metrics do you use when checking data cleaning output?

When you inherit code, what are the problems you face and how would you mitigate them?

What makes for a good code maintainer manual?

What makes for a good data user manual?