Material based on Chapter 1 of Introduction to Modern Statistics
Tip
The tidyverse is a collection of R packages designed for data science!
The Big Question: Do stents reduce the risk of stroke?
The Experiment:
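library(tidyverse)
# Stent study counts, reconstructed from the printed output below
stent_data <- tibble(
  group = c("Control", "Treatment"),
  patients = c(227, 224),
  strokes_365_days = c(28, 45)
)
# Calculate stroke rates for each group
results <- stent_data |>
  mutate(
    stroke_rate = strokes_365_days / patients,
    percentage = round(stroke_rate * 100, 1)
  )
results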
# A tibble: 2 × 5
  group     patients strokes_365_days stroke_rate percentage
  <chr>        <dbl>            <dbl>       <dbl>      <dbl>
1 Control        227               28       0.123       12.3
2 Treatment      224               45       0.201       20.1
Surprising finding: The treatment group had MORE strokes within 365 days! (20.1% vs 12.3%)
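To put a number on that gap, the two rates can be compared directly. The sketch below uses only the counts from the table above; the variable names and the choice of summaries (difference and ratio of the two rates) are illustrative assumptions, not part of the original analysis.

```r
# Counts taken from the results table above
strokes  <- c(Control = 28, Treatment = 45)
patients <- c(Control = 227, Treatment = 224)

# Proportion of patients in each group who had a stroke within 365 days
stroke_rate <- strokes / patients

# Difference in rates (treatment minus control), in percentage points
rate_diff <- (stroke_rate[["Treatment"]] - stroke_rate[["Control"]]) * 100

# Ratio of rates: how many times higher the treatment rate is
rate_ratio <- stroke_rate[["Treatment"]] / stroke_rate[["Control"]]

round(rate_diff, 1)   # about 7.8 percentage points
round(rate_ratio, 2)  # about 1.63
```

In words: the treatment group's stroke rate is roughly 7.8 percentage points higher, or about 1.6 times the control group's rate.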
Data are observations collected from a study or experiment.
# Example: Let's create a small dataset
students <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  height_cm = c(165, 178, 172, 168),
  siblings = c(0, 2, 1, 3),
  stats_before = c("No", "Yes", "No", "No")
)
students
# A tibble: 4 × 4
  name    height_cm siblings stats_before
  <chr>       <dbl>    <dbl> <chr>
1 Alice         165        0 No
2 Bob           178        2 Yes
3 Charlie       172        1 No
4 Diana         168        3 No
Note
Each row is an observation (also called a case)
Each column is a variable
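The row/column idea can be checked in code. This is a minimal sketch that rebuilds the same students table (so it runs on its own) and then asks R for its dimensions; only the `tibble` package is assumed to be installed.

```r
library(tibble)

# Same data as the students example above
students <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  height_cm = c(165, 178, 172, 168),
  siblings = c(0, 2, 1, 3),
  stats_before = c("No", "Yes", "No", "No")
)

nrow(students)   # number of observations (rows): 4
ncol(students)   # number of variables (columns): 4
names(students)  # the variable names

# One observation: every variable recorded for Bob
students[2, ]

# One variable: the height of every student
students$height_cm
```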
# Load a dataset about loans
loans <- read_csv("https://www.openintro.org/data/csv/loan50.csv")
# Look at the first few values for each variable
glimpse(loans)
Rows: 50
Columns: 18
$ state <chr> "NJ", "CA", "SC", "CA", "OH", "IN", "NY", "MO"…
$ emp_length <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
$ term <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
$ homeownership <chr> "rent", "rent", "mortgage", "rent", "mortgage"…
$ annual_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ verified_income <chr> "Not Verified", "Not Verified", "Verified", "N…
$ debt_to_income <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
$ total_credit_limit <dbl> 95131, 51929, 301373, 59890, 422619, 349825, 1…
$ total_credit_utilized <dbl> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
$ num_cc_carrying_balance <dbl> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
$ loan_purpose <chr> "debt_consolidation", "credit_card", "debt_con…
$ loan_amount <dbl> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
$ grade <chr> "B", "B", "E", "B", "B", "B", "D", "A", "A", "…
$ interest_rate <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
$ public_record_bankrupt <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ loan_status <chr> "Current", "Current", "Current", "Current", "C…
$ has_second_income <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ total_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
# Basic summary statistics
loans |>
  summarize(
    avg_loan = mean(loan_amount),
    avg_interest = mean(interest_rate),
    min_loan = min(loan_amount),
    max_loan = max(loan_amount)
  )
# A tibble: 1 × 4
  avg_loan avg_interest min_loan max_loan
     <dbl>        <dbl>    <dbl>    <dbl>
1    17083         11.6     3000    40000
Tip
The |> symbol is called a “pipe”: it passes data from one step to the next!
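A small sketch of what the pipe does, using the student heights from earlier as example values. It assumes R 4.1 or later, where the native `|>` pipe is available; the variable names are illustrative.

```r
heights_cm <- c(165, 178, 172, 168)

# Nested calls read inside-out
nested <- round(mean(heights_cm), 1)

# The same computation with the pipe reads left to right:
# x |> f(y) is evaluated as f(x, y)
piped <- heights_cm |> mean() |> round(1)

identical(nested, piped)  # TRUE
```

Both versions do the same work; the piped form simply lists the steps in the order they happen.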
Key Question: Do variables relate to each other?
Associated Variables
Independent Variables
Clear association: Lower grades (letters later in the alphabet, meaning riskier loans) have higher interest rates!
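One way to see this pattern is to average the interest rate within each grade. The sketch below uses only the first seven loans, copied from the `glimpse()` output above, so it runs without downloading anything; `loans_head` and `by_grade` are illustrative names, and seven rows are far too few to draw conclusions from.

```r
library(dplyr)
library(tibble)

# First seven loans, copied from the glimpse() output above
loans_head <- tibble(
  grade = c("B", "B", "E", "B", "B", "B", "D"),
  interest_rate = c(10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09)
)

# Average interest rate within each grade
by_grade <- loans_head |>
  group_by(grade) |>
  summarize(
    avg_interest = mean(interest_rate),
    n_loans = n()
  )
by_grade
```

Even in this tiny subset the average rate rises from grade B to D to E. Swapping `loans_head` for the full `loans` data gives the same summary for all 50 loans.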
When we think one variable influences another:
# Example: Does loan term affect interest rate?
loans |>
  group_by(term) |>
  summarize(
    avg_interest = mean(interest_rate),
    n_loans = n()
  )
# A tibble: 2 × 3
   term avg_interest n_loans
  <dbl>        <dbl>   <int>
1    36         10.8      36
2    60         13.6      14
Here, term is the explanatory variable and interest_rate is the response variable.
Observational Study
Experiment
Remember This!
Association ≠ Causation
Only well-designed experiments can establish causal relationships!
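A key ingredient of a well-designed experiment is random assignment: chance, not the researcher or the patient, decides who gets the treatment. The sketch below shows one way to do this in base R. The patient IDs are hypothetical, the seed is arbitrary, and the group sizes are borrowed from the stent table earlier purely for illustration.

```r
set.seed(1)  # arbitrary seed so the shuffle is reproducible

# Hypothetical patient IDs; group sizes match the stent table above
patient_id <- 1:451
labels <- rep(c("Control", "Treatment"), times = c(227, 224))

# Shuffle the labels so chance alone decides who gets which group
assignment <- data.frame(
  patient_id = patient_id,
  group = sample(labels)
)

table(assignment$group)
```

Because the labels are shuffled at random, the two groups should be similar on average in everything except the treatment itself.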
# Load a fun dataset about penguins
penguins <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
# Your task: Explore this data!
glimpse(penguins)
Rows: 344
Columns: 8
$ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <chr> "male", "female", "female", NA, "female", "male", "f…
$ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Questions
# Remove missing values and create a plot
penguins |>
  drop_na(flipper_length_mm, body_mass_g) |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             color = species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Penguin Size by Species",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)",
       color = "Species")
What do you notice about the relationship?
# Calculate summary statistics by species
penguins |>
  drop_na(body_mass_g) |>
  group_by(species) |>
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    min_mass = min(body_mass_g),
    max_mass = max(body_mass_g)
  ) |>
  arrange(desc(avg_mass))
# A tibble: 3 × 6
  species   count avg_mass sd_mass min_mass max_mass
  <chr>     <int>    <dbl>   <dbl>    <dbl>    <dbl>
1 Gentoo      123    5076.    504.     3950     6300
2 Chinstrap    68    3733.    384.     2700     4800
3 Adelie      151    3701.    459.     2850     4775
Note
group_by() lets us calculate statistics for each group separately!
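Both penguin pipelines above call `drop_na()` before doing anything else. A short sketch of why that matters, using the first five body masses shown in the `glimpse()` output (one of them is missing); this uses base R only.

```r
# First five body masses from the glimpse() output above
body_mass_g <- c(3750, 3800, 3250, NA, 3450)

mean(body_mass_g)                # NA: one missing value makes the mean missing
mean(body_mass_g, na.rm = TRUE)  # 3562.5: mean of the four recorded values
sum(is.na(body_mass_g))          # 1: how many values are missing
```

Dropping the incomplete rows first, or passing `na.rm = TRUE`, keeps a single missing value from turning the whole summary into `NA`.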
Try these with the penguins data:
Remember
Learning to code is like learning a new language - it takes practice! Be patient with yourself and don’t hesitate to ask for help.