1: Data

Material based on Chapter 1 of Introduction to Modern Statistics

Professor Elizabeth Stanny

Learning Objectives

  • Understanding what data looks like
  • Learning to ask good questions about data
  • Exploring relationships between variables
  • Setting up our R environment with tidyverse

Setting Up Our Tools

# Load the tidyverse package
library(tidyverse)
library(scales)

# Set a clean theme for plots
theme_set(theme_minimal())

Tip

The tidyverse is a collection of R packages designed for data science!

Case Study: Can Stents Prevent Strokes?

The Big Question: Do stents reduce the risk of stroke?

The Experiment:

  • 451 at-risk patients
  • Randomly assigned to two groups:
    • Treatment: Stent + medical care
    • Control: Medical care only

# Create the results data
stent_data <- tibble(
  group = c("Control", "Treatment"),
  patients = c(227, 224),
  strokes_365_days = c(28, 45)
)

stent_data
# A tibble: 2 × 3
  group     patients strokes_365_days
  <chr>        <dbl>            <dbl>
1 Control        227               28
2 Treatment      224               45

Calculating Proportions

# Calculate stroke rates for each group
results <- stent_data |>
  mutate(
    stroke_rate = strokes_365_days / patients,
    percentage = round(stroke_rate * 100, 1)
  )

results
# A tibble: 2 × 5
  group     patients strokes_365_days stroke_rate percentage
  <chr>        <dbl>            <dbl>       <dbl>      <dbl>
1 Control        227               28       0.123       12.3
2 Treatment      224               45       0.201       20.1

Surprising finding: The treatment group had more strokes than the control group (20.1% vs. 12.3%)!
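One way to quantify the gap is to compare the two stroke rates directly. The sketch below recomputes the rates from the counts in the table above and reports their difference and their ratio. The names `risk_difference` and `risk_ratio` are illustrative choices, not part of the original analysis.

```r
library(dplyr)  # dplyr is part of the tidyverse

# Counts from the stent study table above
stent_data <- tibble(
  group = c("Control", "Treatment"),
  patients = c(227, 224),
  strokes_365_days = c(28, 45)
)

# Stroke rate in each group
rates <- stent_data |>
  mutate(stroke_rate = strokes_365_days / patients)

control_rate <- rates$stroke_rate[rates$group == "Control"]
treatment_rate <- rates$stroke_rate[rates$group == "Treatment"]

# Compare the groups: difference and ratio of the two rates
# (column names are illustrative)
comparison <- tibble(
  risk_difference = treatment_rate - control_rate,
  risk_ratio = treatment_rate / control_rate
)

comparison
```

The difference is about 7.8 percentage points, and the stroke rate in the treatment group is roughly 1.6 times the rate in the control group.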

What is Data?

Data are observations collected from a study or experiment.

# Example: Let's create a small dataset
students <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  height_cm = c(165, 178, 172, 168),
  siblings = c(0, 2, 1, 3),
  stats_before = c("No", "Yes", "No", "No")
)

students
# A tibble: 4 × 4
  name    height_cm siblings stats_before
  <chr>       <dbl>    <dbl> <chr>       
1 Alice         165        0 No          
2 Bob           178        2 Yes         
3 Charlie       172        1 No          
4 Diana         168        3 No          

Note

Each row is an observation (also called a case)
Each column is a variable
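The row and column structure can be checked directly in R. A minimal sketch, rebuilding the same `students` tibble so that it runs on its own:

```r
library(dplyr)  # dplyr is part of the tidyverse

students <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  height_cm = c(165, 178, 172, 168),
  siblings = c(0, 2, 1, 3),
  stats_before = c("No", "Yes", "No", "No")
)

n_observations <- nrow(students)  # one row per student
n_variables <- ncol(students)     # one column per variable
variable_names <- names(students)

n_observations
n_variables
variable_names
```

Here there are 4 observations (students) and 4 variables.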

Types of Variables

Load Data

# Load a dataset about loans
loans <- read_csv("https://www.openintro.org/data/csv/loan50.csv")

# Look at the first few values for each variable 
glimpse(loans)
Rows: 50
Columns: 18
$ state                   <chr> "NJ", "CA", "SC", "CA", "OH", "IN", "NY", "MO"…
$ emp_length              <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
$ term                    <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
$ homeownership           <chr> "rent", "rent", "mortgage", "rent", "mortgage"…
$ annual_income           <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ verified_income         <chr> "Not Verified", "Not Verified", "Verified", "N…
$ debt_to_income          <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
$ total_credit_limit      <dbl> 95131, 51929, 301373, 59890, 422619, 349825, 1…
$ total_credit_utilized   <dbl> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
$ num_cc_carrying_balance <dbl> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
$ loan_purpose            <chr> "debt_consolidation", "credit_card", "debt_con…
$ loan_amount             <dbl> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
$ grade                   <chr> "B", "B", "E", "B", "B", "B", "D", "A", "A", "…
$ interest_rate           <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
$ public_record_bankrupt  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ loan_status             <chr> "Current", "Current", "Current", "Current", "C…
$ has_second_income       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ total_income            <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
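The type tags in the glimpse() output (chr, dbl, lgl) describe how R stores each column, which is not always the same as the statistical type of the variable. The sketch below copies the first four values of four columns from the output above and labels each one by hand. The A to G ordering of `grade` is an assumption about the grading scheme, and `loans_small` and `loans_typed` are illustrative names.

```r
library(dplyr)  # dplyr is part of the tidyverse

# First four values of selected columns, copied from the glimpse() output
loans_small <- tibble(
  homeownership = c("rent", "rent", "mortgage", "rent"),  # categorical, nominal
  grade = c("B", "B", "E", "B"),                          # categorical, ordinal
  interest_rate = c(10.90, 9.92, 26.30, 9.92),            # numerical, continuous
  num_cc_carrying_balance = c(8, 2, 14, 10)               # numerical, discrete
)

# Store the categorical variables as factors.
# ASSUMPTION: grades run from A (first) to G (last).
loans_typed <- loans_small |>
  mutate(
    homeownership = factor(homeownership),
    grade = factor(grade, levels = c("A", "B", "C", "D", "E", "F", "G"),
                   ordered = TRUE)
  )

# Ordered factors support comparisons:
# TRUE where the grade comes before C in the ordering
before_c <- loans_typed$grade < "C"
before_c
```

Marking `grade` as ordered records that its categories have a natural ranking, which a nominal variable such as `homeownership` does not.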

Exploring Our Data

# Basic summary statistics
loans |>
  summarize(
    avg_loan = mean(loan_amount),
    avg_interest = mean(interest_rate),
    min_loan = min(loan_amount),
    max_loan = max(loan_amount)
  )
# A tibble: 1 × 4
  avg_loan avg_interest min_loan max_loan
     <dbl>        <dbl>    <dbl>    <dbl>
1    17083         11.6     3000    40000

Tip

The |> symbol is called a “pipe”: it passes the result of one step to the next step as its first argument!
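To see what the pipe does, compare a nested call with its piped equivalent. This small sketch uses only base R (the native pipe requires R 4.1 or later); the vector `values` is made up for illustration.

```r
values <- c(1, 4, 9)  # illustrative numbers

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right
piped <- values |> sqrt() |> sum()

identical(nested, piped)
```

Both versions take the square roots and add them up (1 + 2 + 3 = 6); the piped version lists the steps in the order they happen.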

Relationships Between Variables

Key Question: Do variables relate to each other?

Association vs. Independence

Associated Variables

  • Show a pattern or relationship
  • Knowing one helps predict the other
  • Can be positive or negative

Independent Variables

  • No clear pattern
  • Knowing one doesn’t help predict the other
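The contrast between the two cases can be illustrated with simulated data. In this sketch, all values are randomly generated rather than taken from a real study, and the correlation coefficient is used as one rough measure of association (it only captures linear patterns).

```r
set.seed(1)  # make the simulated data reproducible

n <- 200
x <- rnorm(n)

# Associated: y is built from x, plus some noise
y_associated <- 2 * x + rnorm(n, sd = 0.5)

# Independent: y is generated without any reference to x
y_independent <- rnorm(n)

cor_associated <- cor(x, y_associated)    # close to 1: strong positive association
cor_independent <- cor(x, y_independent)  # close to 0: no linear pattern

cor_associated
cor_independent
```

Flipping the sign of the multiplier (for example, `-2 * x`) would produce a negative association instead.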

Clear association in the loans data: lower-grade (riskier) loans have higher interest rates!

Explanatory vs. Response Variables

When we think one variable influences another:

  • Explanatory variable (X): The potential cause
  • Response variable (Y): The potential effect

# Example: Does loan term affect interest rate?
loans |>
  group_by(term) |>
  summarize(
    avg_interest = mean(interest_rate),
    n_loans = n()
  )
# A tibble: 2 × 3
   term avg_interest n_loans
  <dbl>        <dbl>   <int>
1    36         10.8      36
2    60         13.6      14

Here, term is the explanatory variable and interest_rate is the response variable.

Observational Studies vs. Experiments

Observational Study

  • Observe without interfering
  • Can show associations
  • Cannot prove causation
  • Example: Survey existing loan data

Experiment

  • Actively assign treatments
  • Random assignment is key
  • CAN establish causation
  • Example: The stent study
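Random assignment means that chance alone decides which group each participant joins. The sketch below shows one way this could be done in R for 451 patients, using the group sizes from the stent study. It illustrates the idea and is not the procedure the researchers actually used; the seed value is arbitrary.

```r
set.seed(2024)  # arbitrary seed so the shuffle is reproducible

n_treatment <- 224
n_control <- 227

# One label per patient, then shuffle the labels into a random order
labels <- rep(c("Treatment", "Control"), times = c(n_treatment, n_control))
assignment <- sample(labels)

# Patient i receives assignment[i]
table(assignment)
```

Because the labels are shuffled at random, patient characteristics such as age or health status tend to balance out across the two groups.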

Remember This!

Association ≠ Causation

Only well-designed experiments can establish causal relationships!

Your Turn: Practice with Data!

# Load a fun dataset about penguins
penguins <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# Your task: Explore this data!
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Questions

  1. How many penguins are in the dataset?
  2. What variables do we have?
  3. Which are numerical? Which are categorical?

Creating Your First Visualization

# Remove missing values and create a plot
penguins |>
  drop_na(flipper_length_mm, body_mass_g) |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, 
             color = species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Penguin Size by Species",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)",
       color = "Species") 

What do you notice about the relationship?

Summary Statistics by Group

# Calculate summary statistics by species
penguins |>
  drop_na(body_mass_g) |>
  group_by(species) |>
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    min_mass = min(body_mass_g),
    max_mass = max(body_mass_g)
  ) |>
  arrange(desc(avg_mass))
# A tibble: 3 × 6
  species   count avg_mass sd_mass min_mass max_mass
  <chr>     <int>    <dbl>   <dbl>    <dbl>    <dbl>
1 Gentoo      123    5076.    504.     3950     6300
2 Chinstrap    68    3733.    384.     2700     4800
3 Adelie      151    3701.    459.     2850     4775

Note

group_by() lets us calculate statistics for each group separately!
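The effect of group_by() is easiest to see by running the same summary with and without it. A minimal sketch, reusing the small `students` tibble from earlier so that it runs on its own:

```r
library(dplyr)  # dplyr is part of the tidyverse

students <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  height_cm = c(165, 178, 172, 168),
  siblings = c(0, 2, 1, 3),
  stats_before = c("No", "Yes", "No", "No")
)

# Without group_by(): one summary row for the whole dataset
overall <- students |>
  summarize(avg_height = mean(height_cm))

# With group_by(): one summary row per group
by_group <- students |>
  group_by(stats_before) |>
  summarize(avg_height = mean(height_cm))

overall
by_group
```

The first result has a single row (an average height of 170.75 cm); the second has one row for each value of `stats_before`.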

Key Concepts to Remember

  1. Data Structure: Rows are observations, columns are variables
  2. Variable Types: Numerical (continuous/discrete) vs. Categorical (nominal/ordinal)
  3. Relationships: Variables can be associated or independent
  4. Causation: Only experiments with random assignment can prove causation
  5. R/Tidyverse: Our toolkit for exploring and visualizing data

Practice Problems

Try these with the penguins data:

  1. Create a plot showing bill length vs. bill depth
  2. Calculate the average flipper length for each species
  3. Make a boxplot of body mass by island
  4. Find which species has the most variation in bill length

# Starter code for problem 1
penguins |>
  ggplot(aes(x = ___, y = ___)) +
  geom_point()

Solutions to Practice

# Problem 1: Bill length vs depth
penguins |>
  drop_na(bill_length_mm, bill_depth_mm) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, 
             color = species)) +
  geom_point(alpha = 0.6) +
  labs(title = "Bill Dimensions by Species")

# Problem 2: Average flipper length
penguins |>
  drop_na(flipper_length_mm) |>
  group_by(species) |>
  summarize(avg_flipper = mean(flipper_length_mm))
# A tibble: 3 × 2
  species   avg_flipper
  <chr>           <dbl>
1 Adelie           190.
2 Chinstrap        196.
3 Gentoo           217.
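Problems 3 and 4 can be approached in the same style. The sketch below is one possible solution, not the only one. It reloads the penguins data from the URL used earlier so that it runs on its own (an internet connection is needed), and it uses the standard deviation as the measure of variation, which is a choice rather than a requirement. The names `mass_by_island` and `bill_variation` are illustrative.

```r
library(tidyverse)

penguins <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# Problem 3: Boxplot of body mass by island
mass_by_island <- penguins |>
  drop_na(body_mass_g) |>
  ggplot(aes(x = island, y = body_mass_g)) +
  geom_boxplot() +
  labs(title = "Body Mass by Island",
       x = "Island",
       y = "Body Mass (g)")

mass_by_island

# Problem 4: Variation in bill length by species
# (standard deviation is one reasonable measure of variation)
bill_variation <- penguins |>
  drop_na(bill_length_mm) |>
  group_by(species) |>
  summarize(sd_bill_length = sd(bill_length_mm)) |>
  arrange(desc(sd_bill_length))

bill_variation
```

The species in the first row of `bill_variation` has the largest standard deviation, and therefore the most variation in bill length by this measure.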

Resources

Remember

Learning to code is like learning a new language: it takes practice! Be patient with yourself and don’t hesitate to ask for help.