flowchart TD
paper[Paper response] -->|digitsed| scan[Compiled in csv]
scan -->|uploaded| redcap[REDCap]
online[Online response] -->|direct| redcap
dictionary[Dictionary] --> redcap
audit[Audits] --> redcap
2025-04-17
Doing data things since 2017:
Currently work at Yardhura Walani on the Mayi Kuwayu Study
The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.
Guiding principles:
Packages & stuff:
tidyversetargetstestthatpointblankREDCapRjanitorconflictedrenv_brand.ymlflowchart TD
paper[Paper response] -->|digitsed| scan[Compiled in csv]
scan -->|uploaded| redcap[REDCap]
online[Online response] -->|direct| redcap
dictionary[Dictionary] --> redcap
audit[Audits] --> redcap
Exported from REDCap using the API
The ‘raw’ dictionary might look like:
| field_name | field_type | select_response_option_values |
|---|---|---|
| id | text | |
| age | text | |
| children | text | |
| online_employment | radio | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
| paper_employment | text | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
Need to have online and paper data separately
Raw dictionary from REDCap needs cleaning:
| variable | type | label | values |
|---|---|---|---|
| id | character | ID | |
| age | numeric | Age in years | |
| children | numeric | Number of children | |
| employment | factor | Employment category | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
From the cleaned dictionary, also create a value-label correspondence
| variable | type | value | value_label |
|---|---|---|---|
| employment | factor | 1 | Full-time |
| employment | factor | 2 | Part-time |
| employment | factor | 3 | Casual |
| employment | factor | 4 | Retired |
Use it to apply labels to data once it’s been cleaned
flowchart TD
online[Raw online] -->|pre-cleaning| combined[Harmonised data]
paper[Raw paper] -->|pre-cleaning| combined
combined --> preclean[Ready for cleaning]
audit[Audit data] --> preclean
preclean -->|cleaning| int[Integer]
preclean -->|cleaning| num[Numeric]
preclean -->|cleaning| dte[Date]
preclean -->|cleaning| fct[Factor]
preclean -->|cleaning| str[String]
int -->|re-combine| clean[Clean data]
num -->|re-combine| clean
dte -->|re-combine| clean
fct -->|re-combine| clean
str -->|re-combine| clean
ext[External data] -->|integrate| clean
Apply consistent coding
Rename/remove variables
data_prepared_online <- data_raw_online |>
# Remove `__1` from variable names which REDCap creates for checkbox questions
rename_with(~ str_replace(., "___1", ""), matches("___1")) |>
# remove the leading "online_" so we can combine with the paper data
rename_with(~ str_replace(., "online_", ""), starts_with("online_")) |>
select(-starts_with("honeypot"))Check with compare_df_cols()
For the purposes of illustration:
| id | source | age | children | employment |
|---|---|---|---|---|
| 1 | online | 25 | 1 | 1 |
| 2 | online | 55.2 | NA | 1 |
| 3 | paper | ^2 | one | 4 |
First step, reshape the data to run validation on one column:
data_validate_integer <- data_raw |>
select(id, source, matches(dictionary_clean$variable[dictionary_clean$type == "numeric"])) |>
pivot_longer(!c(id, source),
names_to = "variable",
values_to = "response")
data_validate_integer# A tibble: 6 × 4
id source variable response
<chr> <chr> <chr> <chr>
1 1 online age 25
2 1 online children 1
3 2 online age 55.2
4 2 online children <NA>
5 3 paper age ^2
6 3 paper children one
Cleaning & validation occurring at the same time, following the pattern of rule_* and validate_*:
data_validate_integer <- data_validate_integer |>
mutate(
rule_only_digits = if_else(str_detect(response,"[^-|\\d+|\\.]+"),FALSE,TRUE,FALSE),
validate_only_digits = case_when(
rule_only_digits == FALSE ~ str_remove(response,"\\D+"),
.default = response
))
data_validate_integer# A tibble: 6 × 6
id source variable response rule_only_digits validate_only_digits
<chr> <chr> <chr> <chr> <lgl> <chr>
1 1 online age 25 TRUE "25"
2 1 online children 1 TRUE "1"
3 2 online age 55.2 TRUE "55.2"
4 2 online children <NA> FALSE <NA>
5 3 paper age ^2 FALSE "2"
6 3 paper children one FALSE ""
flowchart LR
rawdata[Raw data] --> outcome{Validation}
outcome -->|pass| clean[Clean data]
outcome -->|fail| audit[Returned for audit]
audit -->|uploaded| redcap[REDCap form]
redcap -->|audited| correct[Correct values]
correct -->|updates| rawdata
Any observations that fail an audit rule and are from a paper response get audited
Feedback loop between the raw data and data validation output
For example:
| id | variable | old_value | new_value |
|---|---|---|---|
| 3 | children | one | 1 |
| 3 | age | ^2 | 72 |
First pivot dataset into the right format
data_prepared <- data_raw |>
pivot_longer(
all_of(data_audit$variable),
names_to = "variable",
values_to = "value"
)
data_prepared# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr> <chr> <chr>
1 1 online 1 children 1
2 1 online 1 age 25
3 2 online 1 children <NA>
4 2 online 1 age 55.2
5 3 paper 4 children one
6 3 paper 4 age ^2
Then update the rows
data_prepared <- data_prepared |>
rows_update(
data_audit |>
rename(value = new_value) |>
select(value, variable, id),
by = c("id","variable")
)
data_prepared# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr> <chr> <chr>
1 1 online 1 children 1
2 1 online 1 age 25
3 2 online 1 children <NA>
4 2 online 1 age 55.2
5 3 paper 4 children 1
6 3 paper 4 age 72
Then pivot_wider() back to the original shape.
Automating the labels was fun - first generate a list of the value/label correspondence
# Collapse the labels (which are currently in rows) into lists
dictionary_label_lists <- dictionary_labels |>
# For each unique variable
group_by(variable) |>
# Generate the label/value correspondence as a list
summarise(label_values = list(setNames(value, value_label)))
dictionary_label_lists# A tibble: 1 × 2
variable label_values
<chr> <list>
1 employment <chr [4]>
Full-time Part-time Casual Retired
"1" "2" "3" "4"
Then apply the list using across() and cur_column()
data_prepared |> mutate(across(any_of(dictionary_clean$variable[dictionary_clean$type == "factor"]), ~
{
# Get the labels for the current variable
labels <- dictionary_label_lists |>
filter(variable == cur_column()) |>
pull(label_values) |>
first()
# Apply the labels to the current variable
labelled(.x, labels)
}))# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr+lbl> <chr> <chr>
1 1 online 1 [Full-time] children 1
2 1 online 1 [Full-time] age 25
3 2 online 1 [Full-time] children <NA>
4 2 online 1 [Full-time] age 55.2
5 3 paper 4 [Retired] children 1
6 3 paper 4 [Retired] age 72
Integrate new data, make new variables
flowchart TD
code[Codebase] --> main[Maintainer manual]
data[Clean data] --> dict[Data dictionary]
data --> qual[Data cleaning report]
qual --> rede[Survey redesign]
data --> main
qual --> user[User manual]
dict --> user
A technical document that covers:
As the cleaning code is run, it plucks out summaries along the way:
pointblank can also create validation reports
Report on:
A relatively plain-language, accessible user manual
Uses Quarto, incorporates output from the pipeline
See blog post: https://benharrap.com/post/2025-03-03-variable-naming-convention/
Landed on {wave}_{theme}_{question wording}_{optional_identifier}, where:
{wave} = current wave, 3 characters{theme} = question theme, 7 characters{wording} = question wording, 7 characters{optional identifier} = optional suffix, maximum 6 charactersFor example:
w02_culture_activtyw02_contact_phoneno_1w02_contact_phoneno_2Shiny dashboard with relevant information about each wave:
Currently working on code for Wave 2, but intend for the code to:
What metrics do you use when checking data cleaning output?
When you inherit code, what are the problems you face and how would you mitigate them?
What makes for a good code maintainer manual?
What makes for a good data user manual?

Slides available from https://benharrap.github.io/workshops/2025-04-17-data-cleaning-bdsi/slides.html