Designing data infrastructure where people come first

Ben Harrap

2025-09-30

Acknowledgement

Mayi Kuwayu Study

The MK front page
  • Indigenous Data Sovereignty and Governance
  • First wave beginning in 2018
  • Living cohort
  • Over 16,000 responses to date
  • Majority paper-based

Data cleaning as perpetual stew

Source: Wikimedia

A very common method of data cleaning:

  • One (very busy) person
  • A mix of R scripts and spreadsheets
  • Multiple versions of these files
  • Minimal documentation
  • Requires the author's knowledge to run from start to finish

Making a new meal

Source: Wikimedia

A rare opportunity to redo everything: storage, cleaning, documentation

  • Keeping existing processes/elements that were working
  • Ideally very similar output
  • Incorporate long-term requirements
    • Data linkage
    • Improved participant management
    • Data quality reporting
    • Continuously updated datasets

Making a new meal

Source: Wikimedia

The human context:

  • Majority non-R team
    • Main exposure was the cleaning code
    • Too busy* to learn R
  • Existing processes that mostly* work
    • Spreadsheets for certain processes
    • Usable data
  • Only one of me
    • Academia >:(

What was I going to cook?

Source: StockSnap
  • REDCap/APIs instead of spreadsheets
  • renv to manage package versions/dependencies
  • targets to track the whole pipeline
  • tidyverse because… tidyverse
  • testthat and pointblank for testing and validation
  • git for version control
  • Quarto documentation
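
As a rough illustration of how targets and pointblank from the list above could fit together, here is a minimal sketch of a `_targets.R` file. It is not the study's actual pipeline: the file path, the `participant_id` column, and the helper functions `clean_responses()` and `validate_responses()` are all hypothetical.

```r
# _targets.R: hypothetical pipeline definition.
# File paths, column names, and helper functions are assumptions for illustration.
library(targets)

# Packages made available to every target
tar_option_set(packages = c("dplyr", "pointblank"))

# Hypothetical cleaning step: drop rows without a participant ID
clean_responses <- function(raw) {
  dplyr::filter(raw, !is.na(participant_id))
}

# Hypothetical validation step: TRUE only if every check passes
validate_responses <- function(cleaned) {
  agent <- pointblank::create_agent(tbl = cleaned) |>
    pointblank::col_vals_not_null(columns = pointblank::vars(participant_id)) |>
    pointblank::rows_distinct(columns = pointblank::vars(participant_id)) |>
    pointblank::interrogate()
  pointblank::all_passed(agent)
}

# The pipeline: each target is rebuilt only when its inputs change
list(
  tar_target(raw_file, "data/raw_responses.csv", format = "file"),
  tar_target(raw, read.csv(raw_file)),
  tar_target(cleaned, clean_responses(raw)),
  tar_target(checks_passed, validate_responses(cleaned))
)
```

With a file like this in the project root, `targets::tar_make()` runs only the steps that are out of date, and `targets::tar_visnetwork()` draws the dependency graph.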

Testing out the recipe

Celebrity chef Nick Tierney

The next step was to show people the code

  • Expert review
    • Integrity
    • New ideas
  • Novice review
    • Teaching opportunity
    • Readability
  • Team review
    • Bringing people along
    • Blowing minds (mine too)

What did I cook?

Source: StockSnap
  • REDCap/APIs instead of spreadsheets
  • renv to manage package versions/dependencies
  • targets to track the whole pipeline
  • tidyverse because… tidyverse
  • testthat and pointblank for testing and validation
  • git for version control
  • Quarto documentation

Now we’re cooking!

Closing thoughts:

  • install.packages("trust")
  • Code is only reproducible so long as it’s maintainable (sorry, targets)
  • Changing minds takes time and enthusiasm
  • Transparency and openness to changing my own plans

Stuff I thought was really cool

  • Cycle of cleaning -> audit -> cleaning
  • httr2
  • Converging to a pseudo-targets setup
  • The naming convention I proposed
  • Missing data
  • Including scripts in Quarto docs
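
For readers who have not used httr2, the sketch below shows roughly what a REDCap record export could look like. It is an assumption about how the pieces fit, not the code used in the study: the URL, the `REDCAP_API_TOKEN` environment variable, and the function name `redcap_export_request()` are hypothetical.

```r
library(httr2)

# Build (but do not send) a REDCap record-export request.
# REDCap's API takes its parameters as a form-encoded POST body.
redcap_export_request <- function(api_url, token) {
  request(api_url) |>
    req_body_form(
      token = token,
      content = "record",
      format = "json",
      type = "flat"
    ) |>
    req_retry(max_tries = 3)  # retry transient failures a few times
}

req <- redcap_export_request(
  api_url = "https://redcap.example.org/api/",  # hypothetical URL
  token = Sys.getenv("REDCAP_API_TOKEN")        # hypothetical environment variable
)

# To send the request and parse the JSON response into a data frame:
# records <- req |> req_perform() |> resp_body_json(simplifyVector = TRUE)
```

Keeping the token in an environment variable instead of the script means the code can go into version control without exposing credentials.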