And other difficult questions your children might ask
Doing data things since 2017:
Currently work at Yardhura Walani on the Mayi Kuwayu Study
The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.
Daddy, where do data come from?
Several consenting adults get together and decide to make a…

The first one arrives, up it goes
More appear, yay!
Wow, when is it going to stop?
Much like drawings on the fridge, every one is precious
Unlike those drawings, every response is retained and stored somewhere

We need a platform that does
Online distribution changes things
Daddy, how do you turn the paper questionnaires into numbers on the screen?

Daddy, can we give the data a bath tonight?
The data dictionary is essential for cleaning and documentation:
Use the dictionary instead of hardcoding values!
| variable | label | type | options | min | max |
|---|---|---|---|---|---|
| id | Participant ID | string | | | |
| age | Age in years | numeric | | 16 | 130 |
| dob | Date of birth | date | | 1900-01-01 | 2006-01-01 |
| employment | Employment status | radio | 1, Full-time; 2, Part-time; 3, Casual; 4, Retired | 1 | 4 |
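As a rough illustration of using the dictionary instead of hardcoding values, the sketch below labels the `employment` codes from the dictionary's `options` column. It assumes the options are stored as `code, label` pairs separated by `; ` (the separator will depend on the survey platform's export), and both `dictionary` and `responses` here are toy objects made up for the example.

```r
library(dplyr)

# Toy dictionary mirroring the employment row of the table above
dictionary <- tibble(
  variable = c("age", "employment"),
  type     = c("numeric", "radio"),
  options  = c(NA_character_, "1, Full-time; 2, Part-time; 3, Casual; 4, Retired")
)

# Toy responses, with codes read in as character
responses <- tibble(
  id         = c("a", "b", "c"),
  employment = c("1", "4", "2")
)

# Split the options string into code/label pairs
employment_options <- dictionary |>
  filter(variable == "employment") |>
  pull(options)
pairs  <- strsplit(strsplit(employment_options, "; ", fixed = TRUE)[[1]], ", ", fixed = TRUE)
codes  <- vapply(pairs, \(p) p[1], character(1))
labels <- vapply(pairs, \(p) p[2], character(1))

# Label the codes using the dictionary, not a hand-typed list of levels
responses <- responses |>
  mutate(employment = factor(employment, levels = codes, labels = labels))
```

If an option is added or reworded in the questionnaire, only the dictionary changes and the labelling code stays the same.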
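```r
library(dplyr)

# `responses` is a placeholder name for the raw data, with every column read in as character
responses |>
  mutate(
    across(
      .cols = any_of(dictionary |> filter(type %in% c("numeric", "integer")) |> pull(variable)),
      .fns = \(x) {
        # Look up the valid range for the current column in the dictionary
        lower <- dictionary |> filter(variable == cur_column()) |> pull(min) |> as.numeric()
        upper <- dictionary |> filter(variable == cur_column()) |> pull(max) |> as.numeric()
        # Keep in-range values, flag out-of-range values, leave missing as missing
        if_else(
          condition = between(as.numeric(x), lower, upper),
          true = x,
          false = "-777777",
          missing = NA_character_
        )
      }
    )
  )
```

Deriving from existing data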
Incorporating external data
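For the first of these, deriving from existing data, a minimal sketch might compute age in whole years from `dob`. The toy `responses` data and the `reference_date` (a hypothetical survey completion date) are assumptions for illustration, as is the use of the lubridate package.

```r
library(dplyr)
library(lubridate)

# Toy data; the reference date is a hypothetical survey completion date
responses <- tibble(
  id  = c("1", "2"),
  dob = c("1990-06-15", "2000-12-31")
)
reference_date <- ymd("2025-01-01")

responses <- responses |>
  mutate(
    dob = ymd(dob),
    # Whole years elapsed between date of birth and the reference date
    age_derived = floor(time_length(interval(dob, reference_date), "years"))
  )
```

A derived age like this could also be compared with the self-reported `age` as a consistency check.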
Daddy, have you thought about using metadata to inform the questionnaire redesign process?
| id | name_first | name_last | age | x1 | x2 | x3 | x_text | y1 | y2 | y3 | z1 | z2 | z3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tom | Smith | 62 | 1 | Lorem | 1 | |||||||
| 2 | Penny | Jones | 52 | 0 | 1 | 1 | 0 | 1 | 0 | ||||
| 3 | Trevor | 37 | 1 | 0 | |||||||||
| 4 | Ursula | Smith | 44 | 1 | Lorem | 1 | 1 | 0 | |||||
| 5 | Jenny | Jones | 0 | 1 | 1 | 0 |
Check invalid responses
Check valid responses
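One possible way to turn these checks into metadata is to count, per question, how many responses were missing and how many were flagged as invalid. The sketch below uses made-up toy data and assumes invalid values have already been recoded to `"-777777"`; the `item_summary` name is hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy cleaned responses (made up for illustration): "-777777" marks values
# flagged as out of range, NA marks questions left unanswered
responses <- tibble(
  id  = c("1", "2", "3", "4", "5"),
  age = c("62", "52", "37", "-777777", NA),
  x1  = c("1", "0", NA, "1", "0")
)

item_summary <- responses |>
  summarise(
    across(
      .cols = -id,
      .fns = list(
        missing = \(x) sum(is.na(x)),
        invalid = \(x) sum(x == "-777777", na.rm = TRUE)
      ),
      .names = "{.col}__{.fn}"
    )
  ) |>
  # One row per question, with a column for each count
  pivot_longer(
    cols = everything(),
    names_to = c("variable", ".value"),
    names_sep = "__"
  )
```

Questions with unusually high counts of missing or invalid responses may be worth a closer look when the questionnaire is next revised.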

Slides available from https://benharrap.github.io/workshops/2025-where-do-data-come-from/slides.html