(in Stata and R)
2025-02-14
In this session we’ll cover:
R is the language the code is written in.
RStudio is the IDE many people use to write R code.
At a minimum you need to install R (r-project.org)
IDE options include:
One of the greatest strengths of R is that it is open-source and there are an enormous number of packages available.
A package is a collection of functions usually written around a particular goal or task.
Packages I recommend:
- tidyverse, which includes:
  - dplyr for data manipulation
  - ggplot2 for data visualisation
  - stringr for working with text data
  - lubridate for working with dates
- janitor, which has many useful cleaning tools
- haven for reading Stata files

First you need to install packages.
Then you need to load them into your session using library():
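A minimal sketch of both steps, assuming an internet connection and a CRAN mirror are available. The variable names `pkgs` and `to_install` are only for illustration. Installing is needed once per machine; loading is needed in every new session.

```r
# Packages recommended above
pkgs <- c("tidyverse", "janitor", "haven")

# Install only the ones that are not already installed (once per machine)
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)

# Load them into the current session (every session)
library(tidyverse)
library(janitor)
library(haven)
```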
In Stata, you load one dataset and all commands are executed in relation to the currently loaded dataset.
In R, the default is for commands to output their result in the console.
This means you need to assign the output of your commands to something if you want to store it.
If we read in a dataset without assigning anything, it just gets displayed:
# A tibble: 74 × 12
   make        price   mpg rep78 headroom trunk weight length  turn displacement
   <chr>       <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl> <dbl>        <dbl>
 1 AMC Concord  4099    22     3      2.5    11   2930    186    40          121
 2 AMC Pacer    4749    17     3      3      11   3350    173    40          258
 3 AMC Spirit   3799    22    NA      3      12   2640    168    35          121
 4 Buick Cent…  4816    20     3      4.5    16   3250    196    40          196
 5 Buick Elec…  7827    15     4      4      20   4080    222    43          350
 6 Buick LeSa…  5788    18     3      4      21   3670    218    43          231
 7 Buick Opel   4453    26    NA      3      10   2230    170    34          304
 8 Buick Regal  5189    20     3      2      16   3280    200    42          196
 9 Buick Rivi… 10372    16     3      3.5    17   3880    207    43          231
10 Buick Skyl…  4082    19     3      3.5    13   3400    200    42          231
# ℹ 64 more rows
# ℹ 2 more variables: gear_ratio <dbl>, foreign <dbl+lbl>
Instead we have to give it a name and assign to the environment:
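A minimal sketch of assignment with `<-`. To keep it self-contained it first writes a small Stata file to a temporary location, using the `mtcars` data that ships with R; the file and the name `cars_data` are stand-ins for illustration, not the auto dataset shown above.

```r
library(haven)

# Write a small example .dta file so the sketch runs on its own
path <- tempfile(fileext = ".dta")
write_dta(mtcars, path)

# Without assignment the data would only be printed to the console:
# read_dta(path)

# With assignment it is stored in the environment under the name cars_data
cars_data <- read_dta(path)
nrow(cars_data)
```

Once assigned, `cars_data` appears in the environment and can be used by later commands.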
Being able to have many things assigned in the environment at once opens up many possibilities, including work that in Stata relies on frame, or on preserve and restore.
In Stata, you write lines of code that get executed one after the other, individually and sequentially.
In R, you can create pipelines using pipes (|>) to link functions together and avoid having to repeatedly refer to the data you want to work with.
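A short sketch of the same two steps written without and with a pipe. It assumes dplyr is installed and uses the built-in `mtcars` data; the weight cut-off is arbitrary. The native pipe `|>` needs R 4.1 or later.

```r
library(dplyr)

# Without a pipe: the data has to be named in every step
heavy <- filter(mtcars, wt > 3)
heavy_summary <- summarise(heavy, mean_mpg = mean(mpg))

# With a pipe: each result is passed on as the first argument of the next function
piped_summary <- mtcars |>
  filter(wt > 3) |>
  summarise(mean_mpg = mean(mpg))

# Both approaches give the same answer
all.equal(heavy_summary, piped_summary)
```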
The data you want to work with is contained in a file located somewhere.
To start using the data you need to consider:
- The file type (.dta, .xlsx, .csv, .txt, .SAV)
- Relative vs. absolute directories
  - Absolute: Users/Ben/Documents/GitHub/workshops/2025-02-14-coding-in-stata-and-r/data/auto.dta
  - Relative: ./data/auto.dta
- Network locations, root location different according to machine type.
Generally speaking, avoid using absolute paths when possible.
Navigate to other directories from the current working directory using .. to go ‘up’ a folder:
../2024-08-20-admin-data/img/aihw.PNG
In Stata you would set your working directory using cd.
In R you can use setwd() to set your working directory.
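A minimal sketch of setting the working directory in R and building relative paths from it. `tempdir()` stands in for a real project folder here, and the folder names passed to `file.path()` are only illustrative.

```r
# Remember where we started so the change can be undone
old_wd <- getwd()

# Set the working directory (tempdir() stands in for a real project folder)
setwd(tempdir())
getwd()

# Relative paths are resolved from the current working directory
file.path("data", "auto.dta")  # points at the same file as ./data/auto.dta
file.path("..", "img")         # ".." goes up one folder

# Go back to the original working directory
setwd(old_wd)
```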
Alternatively, use R projects (.Rproj)
- 2025-02-14-coding-in-stata-and-r
- _brand.yml
- 2025-02-14-coding-in-stata-and-r.Rproj
- slides.html
- slides.qmd
- data
- img
For a file in the same location as the current working directory:
For a file that is in another folder:
For a file that is online:
For a file in the same location as the current working directory:
For a file that is in another folder:
For a file that is online:
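In R, the three cases might look like the sketch below. It assumes readr is installed, creates its own small CSV files in a temporary folder so that it runs anywhere, and uses a hypothetical URL for the online case (left commented out). For Stata files, `haven::read_dta()` takes a path in the same way.

```r
library(readr)

# Set up a throwaway working directory containing two small CSV files
demo_dir <- file.path(tempdir(), "reading-demo")
dir.create(file.path(demo_dir, "data"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(demo_dir)
write_csv(head(mtcars), "cars.csv")
write_csv(head(mtcars), file.path("data", "cars.csv"))

# File in the same location as the current working directory
cars_here <- read_csv("cars.csv", show_col_types = FALSE)

# File in another folder, using a relative path
cars_data <- read_csv("data/cars.csv", show_col_types = FALSE)

# File that is online: the URL is given in place of the path
# (hypothetical URL, so this line is not run)
# cars_online <- read_csv("https://example.com/cars.csv")

# Restore the original working directory
setwd(old_wd)
```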
Raw data containing observations of three different types of penguins, made available through https://allisonhorst.github.io/palmerpenguins/
studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PAL0910 | 141 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | Adult, 1 Egg Stage | N80A1 | Yes | 2009-11-14 | 40.2 | 17.1 | 193 | 3400 | FEMALE | 9.28810 | -25.54976 | NA |
PAL0910 | 47 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N87A1 | Yes | 2009-11-27 | 50.1 | 17.9 | 190 | 3400 | FEMALE | 9.46819 | -24.45721 | NA |
PAL0809 | 54 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N14A2 | Yes | 2008-11-04 | 50.1 | 15.0 | 225 | 5000 | MALE | 8.50153 | -26.61414 | NA |
PAL0910 | 65 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N99A1 | No | 2009-11-21 | 43.5 | 18.1 | 202 | 3400 | FEMALE | 9.37608 | -24.40753 | Nest never observed with full clutch. |
PAL0910 | 97 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N20A1 | Yes | 2009-11-18 | 49.4 | 15.8 | 216 | 4925 | MALE | 8.03624 | -26.06594 | NA |
PAL0809 | 49 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N12A1 | Yes | 2008-11-02 | 44.9 | 13.3 | 213 | 5100 | FEMALE | 8.45167 | -26.89644 | NA |
PAL0809 | 64 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Biscoe | Adult, 1 Egg Stage | N28A2 | Yes | 2008-11-13 | 41.1 | 18.2 | 192 | 4050 | MALE | 8.62264 | -26.60023 | NA |
PAL0809 | 54 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Biscoe | Adult, 1 Egg Stage | N22A2 | Yes | 2008-11-09 | 42.0 | 19.5 | 200 | 4050 | MALE | 8.48095 | -26.31460 | NA |
PAL0708 | 13 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N37A1 | Yes | 2007-11-29 | 45.5 | 13.7 | 214 | 4650 | FEMALE | 7.77672 | -25.41680 | NA |
PAL0708 | 4 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N62A2 | Yes | 2007-11-26 | 45.4 | 18.7 | 188 | 3525 | FEMALE | 8.64701 | -24.62717 | NA |
Variables might be imported with names that are difficult to work with, so we want to rename them.
# A tibble: 6 × 17
  studyName `Sample Number` Species          Region Island Stage `Individual ID`
  <chr>               <dbl> <chr>            <chr>  <chr>  <chr> <chr>
1 PAL0708                 1 Adelie Penguin … Anvers Torge… Adul… N1A1
2 PAL0708                 2 Adelie Penguin … Anvers Torge… Adul… N1A2
3 PAL0708                 3 Adelie Penguin … Anvers Torge… Adul… N2A1
4 PAL0708                 4 Adelie Penguin … Anvers Torge… Adul… N2A2
5 PAL0708                 5 Adelie Penguin … Anvers Torge… Adul… N3A1
6 PAL0708                 6 Adelie Penguin … Anvers Torge… Adul… N3A2
# ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
#   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
#   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
#   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
Stata automatically renames variables that contain spaces and other invalid characters; R does not.
In Stata, the old name comes first and the new name comes second.
Like saying “this old name becomes this new name”.
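In dplyr's `rename()` the order is the other way round: the new name comes first, then the old name. A minimal sketch, assuming dplyr is installed; the two-row tibble is a made-up stand-in that borrows two column names from the raw penguins data.

```r
library(dplyr)

# Made-up stand-in for the imported data, with awkward column names
penguins_raw <- tibble(
  `Sample Number` = c(1, 2),
  `Body Mass (g)` = c(3750, 3800)
)

# rename(new_name = old_name); backticks are needed around names with spaces
penguins <- penguins_raw |>
  rename(
    sample_number = `Sample Number`,
    body_mass_g   = `Body Mass (g)`
  )

names(penguins)
```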
The janitor package contains a handy function called clean_names():
[1] "study_name" "sample_number" "species"
[4] "region" "island" "stage"
[7] "individual_id" "clutch_completion" "date_egg"
[10] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm"
[13] "body_mass_g" "sex" "delta_15_n_o_oo"
[16] "delta_13_c_o_oo" "comments"
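A minimal sketch of `clean_names()` on its own, assuming janitor is installed. The small data frame is made up for illustration and uses two of the column names from the raw penguins file.

```r
library(janitor)

# Made-up data frame; check.names = FALSE keeps the awkward names as they are
penguins_raw <- data.frame(
  `Culmen Length (mm)` = c(39.1, 39.5),
  `Delta 15 N (o/oo)`  = c(NA, 8.95),
  check.names = FALSE
)

# clean_names() converts every column name to lower-case snake_case
penguins <- clean_names(penguins_raw)
names(penguins)
```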
To keep only the specified variables, we use keep.
To drop the specified variables, we use drop.
select() has a lot of other useful features!
Ordering:
Matching patterns:
Selecting ranges:
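A sketch of each of these features using dplyr's `select()` with its helper functions `everything()` and `starts_with()`. The one-row tibble is made up for illustration and uses cleaned penguin-style names.

```r
library(dplyr)

# Made-up one-row tibble with cleaned column names
penguins <- tibble(
  study_name = "PAL0708", species = "Adelie", island = "Torgersen",
  culmen_length_mm = 39.1, culmen_depth_mm = 18.7, body_mass_g = 3750
)

# Ordering: put species first, then everything else in its existing order
ordered <- penguins |> select(species, everything())

# Matching patterns: keep columns whose names start with "culmen"
culmen <- penguins |> select(starts_with("culmen"))

# Selecting ranges: keep every column from species through to culmen_depth_mm
ranged <- penguins |> select(species:culmen_depth_mm)

# Dropping: a minus sign removes a column, similar to Stata's drop
no_study <- penguins |> select(-study_name)
```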
Observations (rows) are also dropped using drop.
In dplyr's filter(), multiple conditions are specified with commas, where all must be true:
You can also make use of %in% with a vector of values to make cleaner criteria:
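A minimal sketch of both ideas with dplyr's `filter()`, which keeps the rows that meet the conditions. The four rows reuse species, island and body mass values from the raw data shown earlier; the 4,700 g cut-off is arbitrary.

```r
library(dplyr)

penguins <- tibble(
  species     = c("Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  island      = c("Dream", "Dream", "Biscoe", "Biscoe"),
  body_mass_g = c(3400, 3400, 5000, 4650)
)

# Conditions separated by commas are combined with AND
heavy_biscoe <- penguins |>
  filter(island == "Biscoe", body_mass_g > 4700)

# %in% checks membership in a vector, instead of chaining == with |
adelie_or_gentoo <- penguins |>
  filter(species %in% c("Adelie", "Gentoo"))
```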
Several options in Stata:
- gen for making new variables
- egen for new variables based on a specific function
- replace for modifying existing variables

The if qualifier is where conditions are typically specified in Stata.
Two common ways of specifying conditions in R:
- ifelse(), in base R
- case_when() from dplyr
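A sketch of both approaches inside dplyr's `mutate()`, which creates or modifies variables. The body mass values and the cut-offs are made up for illustration.

```r
library(dplyr)

penguins <- tibble(body_mass_g = c(3400, 4050, 5000, NA))

penguins <- penguins |>
  mutate(
    # ifelse(): one condition, two outcomes
    heavy = ifelse(body_mass_g > 4000, "heavy", "light"),
    # case_when(): several conditions checked in order; the first match wins
    size = case_when(
      body_mass_g < 3500  ~ "small",
      body_mass_g < 4500  ~ "medium",
      body_mass_g >= 4500 ~ "large"
    )
  )
```

Rows where the condition cannot be evaluated (here the missing body mass) come out as NA in both new variables.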
Stata displays missing values in strings as blank "", and uses . in all other cases.
R displays all missing values as NA, meaning the empty string "" is not considered missing.
Stata treats . as larger than any non-missing number, which can cause issues in conditional statements.
R does not do this; missing is a separate category.
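A short base R sketch of how NA behaves in comparisons. In Stata a condition such as `x > 5` is true when `x` is missing; in R the result is NA instead. The vector `x` is made up for illustration.

```r
x <- c(10, NA, 3)

# Comparisons involving NA return NA rather than TRUE or FALSE
x > 5

# Missing values have to be tested for explicitly
is.na(x)

# An empty string is a value, not a missing value
is.na(c("a", "", NA))

# Many functions need to be told to skip missing values
mean(x, na.rm = TRUE)
```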
In Stata, generally speaking, everything is either a number or a string.
Numbers can be stored as byte, int, long, float, or double.
Categorical data can be represented as a string or as a label on a numeric variable.
Logical variables (TRUE or FALSE) do not exist in Stata.
R has more data types than Stata, including logical variables.
Categorical data can be stored as factors:
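A minimal base R sketch of a factor and a logical vector. The quintile labels are made up for illustration.

```r
# Character vector holding a categorical variable
quintile_chr <- c("3rd", "1st", "2nd", "1st")

# Convert to a factor, giving the levels in the order they should sort
quintile <- factor(quintile_chr, levels = c("1st", "2nd", "3rd"))

levels(quintile)      # the categories
as.integer(quintile)  # the underlying integer codes
table(quintile)       # counts per category, in level order

# Logical values are a type of their own
is_first <- quintile == "1st"
class(is_first)
```

A factor is similar in spirit to a labelled numeric variable in Stata: integer codes underneath, with labels attached.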
In Stata, the by command is used as a prefix for certain commands.
The data must be sorted before the by operation can be used.
In R, grouping can be done in two places.
As part of a pipe:
Or as part of a function, if it permits:
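A sketch of both options with dplyr: `group_by()` as a step in the pipe, and the `.by` argument of `summarise()`, which needs dplyr 1.1.0 or later. The data are made up for illustration. Unlike Stata's by, no sorting is needed first.

```r
library(dplyr)

penguins <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3400, 4050, 5000, 4650)
)

# Grouping as a step in the pipe
by_pipe <- penguins |>
  group_by(species) |>
  summarise(mean_mass = mean(body_mass_g))

# Grouping inside the function, via the .by argument
by_arg <- penguins |>
  summarise(mean_mass = mean(body_mass_g), .by = species)
```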
There are a few good options for making and formatting tables in R:
- gt
- gtsummary
- kable (with kableExtra)
- tinytable
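As a small illustration of one of these options, the sketch below uses `kable()` from the knitr package on a summary of the built-in `mtcars` data. The column labels are made up, and the other packages listed offer far more control over formatting.

```r
library(knitr)

# Small summary table built from a dataset that ships with R
tab <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# kable() turns a data frame into a formatted table (Markdown in this case)
tab_md <- kable(
  tab,
  digits = 1,
  col.names = c("Cylinders", "Mean mpg"),
  format = "pipe"
)
tab_md
```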
You are drafting a report or paper using Microsoft Word.
You have:
Oh no! You now have to update everything because of:
Use Quarto to write your report/paper and automatically generate the tables, figures, and in-text reporting!
This has several benefits:
Quarto is an open-source scientific and technical publishing system. It lets you:
Text written with Quarto uses markup to apply styling. In this sentence, the word **bold** is bolded and *italic* is italicised.
{fig-align="left" width=20%}
You can do all the normal stuff:
1. Make lists
2. Add citations
3. etc.
Text written with Quarto uses markup to apply styling. In this sentence, the word bold is bolded and italic is italicised.
You can do all the normal stuff:
You can create a workflow that suits you. I can imagine people:
I recently had a paper published: https://doi.org/10.1016/j.anzjph.2025.100249
I did everything we’ve just covered when writing up this paper:
My entire thesis is like this actually — the code and files are available: https://github.com/benharrap/thesis
Every Quarto document starts with the parameters:
Then the files get read in
Demographic and other characteristics for the matched cohort of never- and ever-placed children

| | Never-placed | Ever-placed |
|---|---|---|
| N | 7,442 | 3,721 |
| IRSAD at birth | | |
| 1st quintile | 3,559 (47.8%) | 1,781 (47.9%) |
| 2nd quintile | 1,336 (18.0%) | 675 (18.1%) |
| 3rd quintile | 1,717 (23.1%) | 836 (22.5%) |
| 4th quintile | 655 (8.8%) | 334 (9.0%) |
| 5th quintile | 175 (2.4%) | 95 (2.6%) |
| Remoteness area at birth | | |
| Major cities | 3,802 (51.1%) | 1,899 (51.0%) |
| Inner regional | 368 (4.9%) | 189 (5.1%) |
| Outer regional | 1,101 (14.8%) | 552 (14.8%) |
| Remote | 1,371 (18.4%) | 671 (18.0%) |
| Very remote | 800 (10.7%) | 410 (11.0%) |
| All hospital periods | | |
| All ages | 43,962 | 29,369 |
| Ages 0-4 | 22,048 (50.2%) | 15,586 (53.1%) |
| Ages 5-9 | 12,997 (29.6%) | 7,576 (25.8%) |
| Ages 10-14 | 8,913 (20.3%) | 6,207 (21.1%) |
| All PPH periods | | |
| All ages | 7,630 | 6,491 |
| Ages 0-4 | 5,366 (70.3%) | 4,850 (74.7%) |
| Ages 5-9 | 1,752 (23%) | 1,227 (18.9%) |
| Ages 10-14 | 512 (6.7%) | 414 (6.4%) |
While government departments in Australia [@aihw2020b;@dohwa2017;@ahmac2017] and researchers [@li2009;@guthrie2012] report rates of PPHs...
After excluding `r n_missing` who had missing data on the matching variables, a total of `r n_matching` children were included in the matching process. There were `r n_ever` ever-placed children who were matched with `r n_never` never-placed children.
When considering the mechanisms for prevention suggested by Anderson et al. [@anderson2012], most conditions were preventable through early access to primary care.
While government departments in Australia^1–3^ and researchers^4,5^ report rates of PPHs…
After excluding 125 who had missing data on the matching variables, a total of 33,403 children were included in the matching process. There were 3,721 ever-placed children who were matched with 7,442 never-placed children.
When considering the mechanisms for prevention suggested by Anderson et al.^6^, most conditions were preventable through early access to primary care.
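The inline values such as `n_missing` and `n_matching` are ordinary R objects created in an earlier code chunk. A minimal sketch of how they could be prepared is below; the counts are typed in by hand from the rendered text above, and the use of `format()` for the thousands separator is an assumption about the formatting step, not necessarily what was done for the paper.

```r
# Counts typed in by hand here; in a real analysis they would be
# computed from the data
n_missing  <- 125
n_matching <- 33403
n_ever     <- 3721
n_never    <- 7442

# Format the numbers before using them inline, adding a thousands separator
n_matching <- format(n_matching, big.mark = ",")
n_ever     <- format(n_ever, big.mark = ",")
n_never    <- format(n_never, big.mark = ",")

n_matching
```

When the data change, re-rendering the document recomputes these objects and updates every sentence that refers to them.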
Slides available from https://benharrap.github.io/workshops/2025-02-14-coding-in-stata-and-r/slides.html