flowchart TD paper[Paper response] -->|digitsed| scan[Compiled in csv] scan -->|uploaded| redcap[REDCap] online[Online response] -->|direct| redcap dictionary[Dictionary] --> redcap audit[Audits] --> redcap
2025-04-17
Doing data things since 2017:
Currently work at Yardhura Walani on the Mayi Kuwayu Study
The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.
Guiding principles:
Packages & stuff:
tidyverse
targets
testthat
pointblank
REDCapR
janitor
conflicted
renv
_brand.yml
flowchart TD paper[Paper response] -->|digitsed| scan[Compiled in csv] scan -->|uploaded| redcap[REDCap] online[Online response] -->|direct| redcap dictionary[Dictionary] --> redcap audit[Audits] --> redcap
Exported from REDCap using the API
The ‘raw’ dictionary might look like:
field_name | field_type | select_response_option_values |
---|---|---|
id | text | |
age | text | |
children | text | |
online_employment | radio | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
paper_employment | text | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
Need to have online
and paper
data separately
Raw dictionary from REDCap needs cleaning:
variable | type | label | values |
---|---|---|---|
id | character | ID | |
age | numeric | Age in years | |
children | numeric | Number of children | |
employment | factor | Employment category | 1, Full-time | 2, Part-time | 3, Casual | 4, Retired |
From the cleaned dictionary, also create a value-label correspondence
variable | type | value | value_label |
---|---|---|---|
employment | factor | 1 | Full-time |
employment | factor | 2 | Part-time |
employment | factor | 3 | Casual |
employment | factor | 4 | Retired |
Use it to apply labels to data once it’s been cleaned
flowchart TD online[Raw online] -->|pre-cleaning| combined[Harmonised data] paper[Raw paper] -->|pre-cleaning| combined combined --> preclean[Ready for cleaning] audit[Audit data] --> preclean preclean -->|cleaning| int[Integer] preclean -->|cleaning| num[Numeric] preclean -->|cleaning| dte[Date] preclean -->|cleaning| fct[Factor] preclean -->|cleaning| str[String] int -->|re-combine| clean[Clean data] num -->|re-combine| clean dte -->|re-combine| clean fct -->|re-combine| clean str -->|re-combine| clean ext[External data] -->|integrate| clean
Apply consistent coding
Rename/remove variables
data_prepared_online <- data_raw_online |>
# Remove `__1` from variable names which REDCap creates for checkbox questions
rename_with(~ str_replace(., "___1", ""), matches("___1")) |>
# remove the leading "online_" so we can combine with the paper data
rename_with(~ str_replace(., "online_", ""), starts_with("online_")) |>
select(-starts_with("honeypot"))
Check with compare_df_cols()
For the purposes of illustration:
id | source | age | children | employment |
---|---|---|---|---|
1 | online | 25 | 1 | 1 |
2 | online | 55.2 | NA | 1 |
3 | paper | ^2 | one | 4 |
First step, reshape the data to run validation on one column:
data_validate_integer <- data_raw |>
select(id, source, matches(dictionary_clean$variable[dictionary_clean$type == "numeric"])) |>
pivot_longer(!c(id, source),
names_to = "variable",
values_to = "response")
data_validate_integer
# A tibble: 6 × 4
id source variable response
<chr> <chr> <chr> <chr>
1 1 online age 25
2 1 online children 1
3 2 online age 55.2
4 2 online children <NA>
5 3 paper age ^2
6 3 paper children one
Cleaning & validation occurring at the same time, following the pattern of rule_*
and validate_*
:
data_validate_integer <- data_validate_integer |>
mutate(
rule_only_digits = if_else(str_detect(response,"[^-|\\d+|\\.]+"),FALSE,TRUE,FALSE),
validate_only_digits = case_when(
rule_only_digits == FALSE ~ str_remove(response,"\\D+"),
.default = response
))
data_validate_integer
# A tibble: 6 × 6
id source variable response rule_only_digits validate_only_digits
<chr> <chr> <chr> <chr> <lgl> <chr>
1 1 online age 25 TRUE "25"
2 1 online children 1 TRUE "1"
3 2 online age 55.2 TRUE "55.2"
4 2 online children <NA> FALSE <NA>
5 3 paper age ^2 FALSE "2"
6 3 paper children one FALSE ""
flowchart LR rawdata[Raw data] --> outcome{Validation} outcome -->|pass| clean[Clean data] outcome -->|fail| audit[Returned for audit] audit -->|uploaded| redcap[REDCap form] redcap -->|audited| correct[Correct values] correct -->|updates| rawdata
Any observations that fail an audit rule and are from a paper response get audited
Feedback loop between the raw data and data validation output
For example:
id | variable | old_value | new_value |
---|---|---|---|
3 | children | one | 1 |
3 | age | ^2 | 72 |
First pivot dataset into the right format
data_prepared <- data_raw |>
pivot_longer(
all_of(data_audit$variable),
names_to = "variable",
values_to = "value"
)
data_prepared
# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr> <chr> <chr>
1 1 online 1 children 1
2 1 online 1 age 25
3 2 online 1 children <NA>
4 2 online 1 age 55.2
5 3 paper 4 children one
6 3 paper 4 age ^2
Then update the rows
data_prepared <- data_prepared |>
rows_update(
data_audit |>
rename(value = new_value) |>
select(value, variable, id),
by = c("id","variable")
)
data_prepared
# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr> <chr> <chr>
1 1 online 1 children 1
2 1 online 1 age 25
3 2 online 1 children <NA>
4 2 online 1 age 55.2
5 3 paper 4 children 1
6 3 paper 4 age 72
Then pivot_wider()
back to the original shape.
Automating the labels was fun - first generate a list of the value/label correspondence
# Collapse the labels (which are currently in rows) into lists
dictionary_label_lists <- dictionary_labels |>
# For each unique variable
group_by(variable) |>
# Generate the label/value correspondence as a list
summarise(label_values = list(setNames(value, value_label)))
dictionary_label_lists
# A tibble: 1 × 2
variable label_values
<chr> <list>
1 employment <chr [4]>
Full-time Part-time Casual Retired
"1" "2" "3" "4"
Then apply the list using across()
and cur_column()
data_prepared |> mutate(across(any_of(dictionary_clean$variable[dictionary_clean$type == "factor"]), ~
{
# Get the labels for the current variable
labels <- dictionary_label_lists |>
filter(variable == cur_column()) |>
pull(label_values) |>
first()
# Apply the labels to the current variable
labelled(.x, labels)
}))
# A tibble: 6 × 5
id source employment variable value
<chr> <chr> <chr+lbl> <chr> <chr>
1 1 online 1 [Full-time] children 1
2 1 online 1 [Full-time] age 25
3 2 online 1 [Full-time] children <NA>
4 2 online 1 [Full-time] age 55.2
5 3 paper 4 [Retired] children 1
6 3 paper 4 [Retired] age 72
Integrate new data, make new variables
flowchart TD code[Codebase] --> main[Maintainer manual] data[Clean data] --> dict[Data dictionary] data --> qual[Data cleaning report] qual --> rede[Survey redesign] data --> main qual --> user[User manual] dict --> user
A technical document that covers:
As the cleaning code is run, it plucks out summaries along the way:
pointblank
can also create validation reports
Report on:
A relatively plain-language, accessible user manual
Uses Quarto, incorporates output from the pipeline
See blog post: https://benharrap.com/post/2025-03-03-variable-naming-convention/
Landed on {wave}_{theme}_{question wording}_{optional_identifier}
, where:
{wave}
= current wave, 3 characters{theme}
= question theme, 7 characters{wording}
= question wording, 7 characters{optional identifier}
= optional suffix, maximum 6 charactersFor example:
w02_culture_activty
w02_contact_phoneno_1
w02_contact_phoneno_2
Shiny dashboard with relevant information about each wave:
Currently working on code for Wave 2, but intend for the code to:
What metrics do you use when checking data cleaning output?
When you inherit code, what are the problems you face and how would you mitigate them?
What makes for a good code maintainer manual?
What makes for a good data user manual?
Slides available from https://benharrap.github.io/workshops/2025-04-17-data-cleaning-bdsi/slides.html