Can you over-engineer data cleaning? Let’s find out!

Ben Harrap

2025-04-17

Intro

About me

Doing data things since 2017:

  • Randomised trials
  • Observational studies
  • Longitudinal surveys
  • Administrative data
  • Data linkage
  • REDCap (and similar)

Currently work at Yardhura Walani on the Mayi Kuwayu Study

About the Mayi Kuwayu Study

The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.

  • ~12,000 responses to Wave 1
  • Currently at the tail-end of Wave 2
  • Questionnaire provided in paper and online formats
  • Indigenous Data Sovereignty and Governance
  • Won the NHMRC David Cooper Clinical Trials and Cohort Studies Award

About the data cleaning rework

Guiding principles:

  • Write code that I would like to inherit
  • Talk to people interacting with data
    • Data entry, data analysis, participant admin, survey design
  • Bring people along
    • Exposure to new ways of doing stuff
    • Professional development/upskilling
  • Be open to being wrong

About the data cleaning rework

Packages & stuff:

  • tidyverse
  • targets
  • testthat
  • pointblank
  • REDCapR
  • janitor
  • conflicted
  • renv
  • OCR tools
  • REDCap
  • Quarto
    • _brand.yml
  • git/GitHub

Data setup

flowchart TD
    paper[Paper response] -->|digitsed| scan[Compiled in csv]
    scan -->|uploaded| redcap[REDCap]
    online[Online response] -->|direct| redcap
    dictionary[Dictionary] --> redcap
    audit[Audits] --> redcap

Data dictionary

Data dictionary

Exported from REDCap using the API

dictionary_raw <- redcap_metadata_read(
  redcap_uri = credential$redcap_uri, 
  token = credential$token)$data

The ‘raw’ dictionary might look like:

field_name field_type select_response_option_values
id text
age text
children text
online_employment radio 1, Full-time | 2, Part-time | 3, Casual | 4, Retired
paper_employment text 1, Full-time | 2, Part-time | 3, Casual | 4, Retired

Need to have online and paper data separately

  • Minor differences between questionnaire formats
  • Can’t force field validation on paper format

Data dictionary

Raw dictionary from REDCap needs cleaning:

  • Dual purpose output
    • Reference for end-users
    • Used in data cleaning
  • Lots of regex
  • Coding it makes the dictionary reproducible
variable type label values
id character ID
age numeric Age in years
children numeric Number of children
employment factor Employment category 1, Full-time | 2, Part-time | 3, Casual | 4, Retired

Data dictionary

From the cleaned dictionary, also create a value-label correspondence

variable type value value_label
employment factor 1 Full-time
employment factor 2 Part-time
employment factor 3 Casual
employment factor 4 Retired

Use it to apply labels to data once it’s been cleaned

Data cleaning

Overview

flowchart TD
    online[Raw online] -->|pre-cleaning| combined[Harmonised data]
    paper[Raw paper] -->|pre-cleaning| combined
    combined --> preclean[Ready for cleaning]
    audit[Audit data] --> preclean
    preclean -->|cleaning| int[Integer]
    preclean -->|cleaning| num[Numeric]
    preclean -->|cleaning| dte[Date]
    preclean -->|cleaning| fct[Factor]
    preclean -->|cleaning| str[String]
    int -->|re-combine| clean[Clean data]
    num -->|re-combine| clean
    dte -->|re-combine| clean
    fct -->|re-combine| clean
    str -->|re-combine| clean
    ext[External data] -->|integrate| clean

Pre-cleaning

Apply consistent coding

data_prepared_paper <- data_raw_paper |>
   mutate(
    # The paper data is coded as 1/2 (yes/no) whereas the online is coded as 0/1 (no/yes)
    consent = case_when(
      consent == "2" ~ "0",
      consent == "1" ~ "1",
      .default = consent
      )) 

Rename/remove variables

data_prepared_online <- data_raw_online |>
  # Remove `__1` from variable names which REDCap creates for checkbox questions
  rename_with(~ str_replace(., "___1", ""), matches("___1")) |>
  # remove the leading "online_" so we can combine with the paper data
  rename_with(~ str_replace(., "online_", ""), starts_with("online_")) |> 
  select(-starts_with("honeypot"))

Check with compare_df_cols()

Example data

For the purposes of illustration:

id source age children employment
1 online 25 1 1
2 online 55.2 NA 1
3 paper ^2 one 4

Validation

First step, reshape the data to run validation on one column:

data_validate_integer <- data_raw |> 
  select(id, source, matches(dictionary_clean$variable[dictionary_clean$type == "numeric"])) |>
  pivot_longer(!c(id, source),
               names_to = "variable",
               values_to = "response")

data_validate_integer
# A tibble: 6 × 4
  id    source variable response
  <chr> <chr>  <chr>    <chr>   
1 1     online age      25      
2 1     online children 1       
3 2     online age      55.2    
4 2     online children <NA>    
5 3     paper  age      ^2      
6 3     paper  children one     

Validation

Cleaning & validation occurring at the same time, following the pattern of rule_* and validate_*:

data_validate_integer <- data_validate_integer |> 
  mutate(
    rule_only_digits = if_else(str_detect(response,"[^-|\\d+|\\.]+"),FALSE,TRUE,FALSE),
    validate_only_digits = case_when(
      rule_only_digits == FALSE ~ str_remove(response,"\\D+"),
      .default = response
    ))

data_validate_integer
# A tibble: 6 × 6
  id    source variable response rule_only_digits validate_only_digits
  <chr> <chr>  <chr>    <chr>    <lgl>            <chr>               
1 1     online age      25       TRUE             "25"                
2 1     online children 1        TRUE             "1"                 
3 2     online age      55.2     TRUE             "55.2"              
4 2     online children <NA>     FALSE             <NA>               
5 3     paper  age      ^2       FALSE            "2"                 
6 3     paper  children one      FALSE            ""                  

Auditing

flowchart LR
    rawdata[Raw data] --> outcome{Validation}
    outcome -->|pass| clean[Clean data]
    outcome -->|fail| audit[Returned for audit]
    audit -->|uploaded| redcap[REDCap form]
    redcap -->|audited| correct[Correct values]
    correct -->|updates| rawdata

Any observations that fail an audit rule and are from a paper response get audited

  • Checked against original scan
  • Updated values entered in REDCap
  • Easy for data entry team

Auditing

Feedback loop between the raw data and data validation output

  • Data entry team check the audit list when new audits are required
  • Fill in a REDCap form after checking the scanned original
  • This gets incorporated back into cleaning

For example:

id variable old_value new_value
3 children one 1
3 age ^2 72

Auditing

First pivot dataset into the right format

data_prepared <- data_raw |> 
  pivot_longer(
    all_of(data_audit$variable),
    names_to = "variable",
    values_to = "value"
    )

data_prepared
# A tibble: 6 × 5
  id    source employment variable value
  <chr> <chr>  <chr>      <chr>    <chr>
1 1     online 1          children 1    
2 1     online 1          age      25   
3 2     online 1          children <NA> 
4 2     online 1          age      55.2 
5 3     paper  4          children one  
6 3     paper  4          age      ^2   

Then update the rows

data_prepared <- data_prepared |> 
  rows_update(
    data_audit |> 
      rename(value = new_value) |> 
      select(value, variable, id),
    by = c("id","variable")
  )
data_prepared
# A tibble: 6 × 5
  id    source employment variable value
  <chr> <chr>  <chr>      <chr>    <chr>
1 1     online 1          children 1    
2 1     online 1          age      25   
3 2     online 1          children <NA> 
4 2     online 1          age      55.2 
5 3     paper  4          children 1    
6 3     paper  4          age      72   

Then pivot_wider() back to the original shape.

Labelling

Automating the labels was fun - first generate a list of the value/label correspondence

# Collapse the labels (which are currently in rows) into lists
dictionary_label_lists <- dictionary_labels |> 
  # For each unique variable
  group_by(variable) |> 
  # Generate the label/value correspondence as a list
  summarise(label_values = list(setNames(value, value_label)))

dictionary_label_lists
# A tibble: 1 × 2
  variable   label_values
  <chr>      <list>      
1 employment <chr [4]>   
unlist(dictionary_label_lists$label_values)
Full-time Part-time    Casual   Retired 
      "1"       "2"       "3"       "4" 

Labelling

Then apply the list using across() and cur_column()

data_prepared |> mutate(across(any_of(dictionary_clean$variable[dictionary_clean$type == "factor"]), ~
    {
      # Get the labels for the current variable
      labels <- dictionary_label_lists |>
        filter(variable == cur_column()) |>
        pull(label_values) |>
        first()
      # Apply the labels to the current variable
      labelled(.x, labels)
    }))
# A tibble: 6 × 5
  id    source employment    variable value
  <chr> <chr>  <chr+lbl>     <chr>    <chr>
1 1     online 1 [Full-time] children 1    
2 1     online 1 [Full-time] age      25   
3 2     online 1 [Full-time] children <NA> 
4 2     online 1 [Full-time] age      55.2 
5 3     paper  4 [Retired]   children 1    
6 3     paper  4 [Retired]   age      72   

Extra stuff

Integrate new data, make new variables

  • Create commonly computed variables
    • Age brackets
    • Employment and study categories
  • Calculate survey weights
  • Integrate external data
    • Address
    • Geographies

Not just data cleaning

Overview

flowchart TD
    code[Codebase] --> main[Maintainer manual]
    data[Clean data] -->  dict[Data dictionary]
    data --> qual[Data cleaning report]
    qual --> rede[Survey redesign]
    data --> main
    qual --> user[User manual]
    dict --> user

Code maintainer manual

A technical document that covers:

  • All of the technology involved
    • Why
    • How to set it up
  • Explainers of the code
    • Conceptual goal
    • Snippets
    • Reasons for given implementation

Data cleaning reports

As the cleaning code is run, it plucks out summaries along the way:

report_data_quality <- tibble(
  stage = "Import",
  item = "Count of audit records",
  value = as.character(nrow(data_audit))
) |> 
  add_row(
    stage = "Import",
    item = "Count of records from paper instrument",
    value = as.character(nrow(data_raw_paper))
  )

pointblank can also create validation reports

Report on:

  • Number of observations
    • At each step
    • Failing validation
    • Dropped, with reasons
  • Missingness
  • Distribution of responses
    • Tables/figures
    • Statistical tests

Data user manual

A relatively plain-language, accessible user manual

  • Mayi Kuwayu details
    • Study design
    • Implementation
  • Cleaning details
    • Overview
    • External data sources
    • Calculated variables
    • Important decisions & idiosyncrasies
  • Data details
    • Context
    • Validation rules
    • Quality
    • Dictionary

Uses Quarto, incorporates output from the pipeline

Variable naming convention

See blog post: https://benharrap.com/post/2025-03-03-variable-naming-convention/

Landed on {wave}_{theme}_{question wording}_{optional_identifier}, where:

  • {wave} = current wave, 3 characters
  • {theme} = question theme, 7 characters
  • {wording} = question wording, 7 characters
  • {optional identifier} = optional suffix, maximum 6 characters

For example:

  • w02_culture_activty
  • w02_contact_phoneno_1
  • w02_contact_phoneno_2

Survey redesign

Shiny dashboard with relevant information about each wave:

  • Question performance
    • Missingness
  • Response distributions
    • Limited variability
    • Floor/ceiling effects
  • Missingness patterns
  • Question sequence violations
    • Paper only

Code generalisability

Currently working on code for Wave 2, but intend for the code to:

  • Generalise across waves
    • E.g. validation rules for variable types
  • Generalise across studies
  • Provide a framework for future studies

Pretty cool huh

Problems (and solutions?)

  • Microsoft Excel
    • Don’t use it
  • I’m an academic
    • Square peg |> round hole
  • I’m busy
    • ???
  • Someone else needs to maintain it
    • Train people
    • Write good code

Questions for you

What metrics do you use when checking data cleaning output?

When you inherit code, what are the problems you face and how would you mitigate them?

What makes for a good code maintainer manual?

What makes for a good data user manual?