Where do data come from?

And other difficult questions your children might ask

Ben Harrap

Acknowledgement

About me

Doing data things since 2017:

  • Randomised trials
  • Observational studies
  • Longitudinal surveys
  • Administrative data
  • Data linkage

Currently work at Yardhura Walani on the Mayi Kuwayu Study

Mayi Kuwayu

The largest national study of Aboriginal and Torres Strait Islander culture, health, and wellbeing.

  • ~12,000 responses to Wave 1
  • Currently at the tail-end of Wave 2
  • Questionnaire provided in paper and online formats
  • Indigenous Data Sovereignty and Governance
  • Won the NHMRC David Cooper Clinical Trials and Cohort Studies Award

Getting data

Daddy, where do data come from?

The birds and the bees

Several consenting adults get together and decide to make a…

The birds and the bees

Several consenting adults get together and decide to make a…

Baby’s first steps

  • They’re very tall for their age
    • (it’s too long)
  • What an interesting outfit
    • (the design has issues)
  • They’re gorgeous
    • (perfect, good to go)
  • It looks just like you
    • ???

Time to collect data

  • Physical mail
    • Initially a sample of Medicare database
    • Here’s a copy for a friend!
  • Electronic mail
    • Past participants
    • Newsletters
  • Social media
  • Conferences
  • Community engagement
    • Partners
    • Roadshow

Paper responses

The first one arrives, up it goes

More appear, yay!

Wow when is it going to stop

Much like drawings on the fridge, every one is precious

Unlike those drawings, every response is retained and stored somewhere

Please hold while data is being entered

Online responses

We need a platform that does

  • Questionnaire design
  • Survey administration
  • Data management

Online distribution changes things

  • More control (better data) 🙂
  • No data entry time (less work) 😀
  • Easier to stop (less data) 😕

Please hold while data is being entered

Digitising data

Daddy, how do you turn the paper questionnaires into numbers on the screen

Computer has a go

  • Checkboxes
  • Radios
    • ✅✅
  • Handwriting
    • åååååååååååååååååååååå

Human has a go

  • Check the computer’s work
    • Fix mistakes
    • Enter hand-written text
  • Approx 10min per questionnaire
    • 6 per hour
    • 20 per day (breaks required!)
    • 100 per week…
    • That’s a lot of people hours!

Human has another go

  • Data auditing
    • Failure of a validation rules
    • Unmatched addresses
  • Refer back to scanned questionnaire (!!)
    • Provide updated value

Nobody has a go

Cleaning data

Daddy, can we give the data a bath tonight?

It’s this simple

Start reading early

The data dictionary is essential for cleaning and documentation:

  • Splitting by type
  • Automating validation
  • Labelling values and variables

Use the dictionary instead of hardcoding values!

variable label type options min max
id Participant ID string
age Age in years numeric 16 130
dob Date of birth date 1900-01-01 2006-01-01
employment Employment status radio 1, Full-time
2, Part-time
3, Casual
4, Retired
1 4
Example
across(
  .cols = any_of(dictionary |> filter(type %in% c("numeric", "integer")) |> pull(variable)),
  .fns = \(x) {
    min <- dictionary |> filter(variable == cur_column()) |> pull(minimum) |> as.numeric()
    max <- dictionary |> filter(variable == cur_column()) |> pull(maximum) |> as.numeric()
    if_else(
      condition = between(as.numeric(x), min, max),
      true = x,
      false = "-777777",
      missing = NA
    )
  }

5 minutes until bath time!

  1. Read in data
  2. Harmonise the data
    • Check variable names
    • Apply consistent coding
  3. Apply exclusions
    • Repeat entries
    • Empty responses
  4. Incorporate audits
  5. Get cleaning

Can you count to 010?

Tidy up whitespace
str_remove_all(
  string = response,
  pattern = "[:space:]"
)
Remove audit text
str_detect(
  string = response,
  pattern = "\\{.+\\}"
)
Check for non-digits
str_detect(
  string = response,
  pattern = "[^-|\\d+|\\.]+"
)
Check for digits
str_detect(
  string = response,
  pattern = "\\d"
)
Check for decimals
str_detect(
  string = response,
  pattern = "\\."
)

as.character(round_half_up(as.numeric(response)))
Does it parse as int/num?
is.na(as.integer(response))

My favourite month is…

Did the scanner struggle?
str_detect(
  string = response,
  pattern = "\\^"
)
Check for non-digits
str_detect(
  string = response,
  pattern = "[^[:digit:]-]"
)
Check for separators
str_detect(
  string = response,
  pattern = "-"
)
Check the value of each element
map_dbl(
  .x = response,
  .f = \(x) {
    min(
      get_numbers_from_string(
        string = x,
        separating_pattern = "-"
      )
    )
  }
)
Check for ambiguous dates
is_valid_date_format(
        variable = response,
        format = "%d-%m",
        exact = TRUE
      ) &
        # month-year not missing
        is_valid_date_format(
          variable = response,
          format = c("%m-%y", "%m-%Y"),
          exact = TRUE
        )
Check for missing days
is_valid_date_format(
        variable = validate_unambiguous_date,
        format = c("%m-%y", "%m-%Y"),
        exact = TRUE
      )

Can you write your name?

Standardise strings
str_to_lower(condense_whitespace(response))
Check for empty strings
str_remove_all(
  string = response,
  pattern = "[:space:]"
) == ""
Did the scanner struggle?
str_remove_all(
  string = response,
  pattern = "[:space:]"
) == ""
Did the scanner hallucinate?
str_detect(
  string = response,
  pattern = "speech or language impairment"
)
Did the scanner hallucinate?
str_detect(
  string = response,
  pattern = "[:alnum:]"
)
Check for unusual characters
str_detect(
        string = response,
        pattern = '[^\\w\\d\\s\\!\\?\\-\\(\\)\\[\\]\'"\\.,;:/@#&$+%=><*~\\|]'
      )
Check specific formatting
variable == "email" & is_email(response)
variable == "postcode" & nchar(response) != 4
variable == "phone" & !(nchar(response) %in% c(8, 10, 11))

Playing with friends

Deriving from existing data

  • Back page text
    • Transcription and coding
  • Survey weights
  • Standardised variables
    • Commonly used demographics
    • Validated measures

Incorporating external data

  • Address/geography
    • G-NAF/Geoscape via API
    • ABS data products
    • Check for pre-existing matched addresses
  • (Internal) data linkage
    • Within- and across- wave responders
    • Across studies
    • Unique identifier

Nearly time for bed

  • Encode missingness
  • Convert to appropriate data type
  • Final validation
  • Variable naming
  • Labelling
  • Deidentification

Data about data

Daddy, have you thought about using metadata to inform the questionnaire redesign process?

Missingness

id name_first name_last age x1 x2 x3 x_text y1 y2 y3 z1 z2 z3
1 Tom Smith 62 1 Lorem 1
2 Penny Jones 52 0 1 1 0 1 0
3 Trevor 37 1 0
4 Ursula Smith 44 1 Lorem 1 1 0
5 Jenny Jones 0 1 1 0
id name_first name_last age x1 x2 x3 x_text y1 y2 y3 z1 z2 z3
1 Tom Smith 62 1 Lorem 1
2 Penny Jones 52 0 1 1 0 1 0
3 Trevor 37 1 0
4 Ursula Smith 44 1 Lorem 1 1 0
5 Jenny Jones 0 1 1 0
id name_first name_last age x1 x2 x3 x_text y1 y2 y3 z1 z2 z3
1 Tom Smith 62 1 Lorem 1
2 Penny Jones 52 0 1 1 0 1 0
3 Trevor 37 1 0
4 Ursula Smith 44 1 Lorem 1 1 0
5 Jenny Jones 0 1 1 0
id name_first name_last age x1 x2 x3 x_text y1 y2 y3 z1 z2 z3
1 Tom Smith 62 1 Lorem 1
2 Penny Jones 52 0 1 1 0 1 0
3 Trevor 37 1 0
4 Ursula Smith 44 1 Lorem 1 1 0
5 Jenny Jones 0 1 1 0

Who what where why when

Check invalid responses

  • Wording
  • Response options
  • Context
  • Layout
  • Scanner

Check valid responses

  • Missingness
  • Response options
  • Question topic
  • Question number

Time for bed

Time for bed

  • Big job
  • Get the data dictionary sorted first
    • Write generalisable code
  • Look at the data
    • Write tests