Effective communication of exploratory results

BUS 320

Outline

Steps in data analysis

Load the packages

library(tidyverse)
library(skimr)
library(scales) 
library(plotly)
theme_set(theme_minimal()) # set the theme for all plots

Read the dataset

life_expectancy  <- read_csv("https://bus320-quarto.netlify.app/data/life-expectancy.csv")

Glimpse the data

glimpse(life_expectancy)
Rows: 20,449
Columns: 4
$ Entity                                  <chr> "Afghanistan", "Afghanistan", …
$ Code                                    <chr> "AFG", "AFG", "AFG", "AFG", "A…
$ Year                                    <dbl> 1950, 1951, 1952, 1953, 1954, …
$ `Life expectancy at birth (historical)` <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, …
  • How many rows does life_expectancy have?

Check for missing values

  • Which variable has missing values? How many?
skim(life_expectancy)
Data summary
Name life_expectancy
Number of rows 20449
Number of columns 4
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Entity                 0           1.00    4   59      0       256           0
Code                1390           0.93    3    8      0       237           0

Variable type: numeric

skim_variable                          n_missing  complete_rate     mean     sd    p0     p25     p50   p75    p100  hist
Year                                           0              1  1976.53  37.74  1543  1962.0  1982.0  2002  2021.0  ▁▁▁▁▇
Life expectancy at birth (historical)          0              1    61.78  12.94    12    52.5    64.3    72    86.5  ▁▂▅▇▅
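
The missing-value counts that skim() reports can also be computed directly with dplyr. This is a minimal sketch, assuming a small toy tibble (`toy`, with made-up rows) that stands in for life_expectancy so the example runs without downloading the data.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical rows) standing in for life_expectancy
toy <- tibble(
  Entity = c("Africa", "Africa", "Norway", "Togo"),
  Code   = c(NA, NA, "NOR", "TGO"),
  Year   = c(1950, 1951, 1950, 1950)
)

# Count the NA values in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

On the real data, the same summarize() call applied to life_expectancy should agree with the n_missing column of the skim() output.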

Your turn

  • eliminate the rows that have a missing code

  • save the output to life_no_missing

  • glimpse life_no_missing

  • how many rows are left?
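
One possible approach to this exercise is to keep only the rows where Code is not NA. The sketch below uses a toy tibble (`toy`, with made-up rows) in place of life_expectancy, and `toy_no_missing` is a hypothetical name that plays the role of life_no_missing.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical rows) standing in for life_expectancy
toy <- tibble(
  Entity = c("Africa", "Africa", "Norway", "Togo"),
  Code   = c(NA, NA, "NOR", "TGO"),
  Year   = c(1950, 1951, 1950, 1950)
)

# Keep only the rows that have a Code
toy_no_missing <- toy |>
  filter(!is.na(Code))

glimpse(toy_no_missing)

# Number of rows left after dropping the missing codes
rows_left <- nrow(toy_no_missing)
rows_left
```

Going by the skim() output, the full dataset has 20,449 rows and 1,390 missing codes, so 20,449 - 1,390 = 19,059 rows should remain in life_no_missing.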

Count obs for each Entity

  • Do the entities have different numbers of observations?
life_no_missing  |> 
  count(Entity)
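
To answer the question without scrolling through every row, the counts can be sorted and summarized. This is a sketch on a toy tibble (`toy`, made-up rows); the column names `min_n` and `max_n` are illustrative choices, not part of the original slides.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical rows): Norway has 3 observations, Togo has 1
toy <- tibble(
  Entity = c("Norway", "Norway", "Norway", "Togo"),
  Year   = c(1950, 1951, 1952, 1950)
)

# Observations per entity, largest count first
entity_counts <- toy |>
  count(Entity, sort = TRUE)

# Smallest and largest number of observations across entities
count_range <- entity_counts |>
  summarize(min_n = min(n), max_n = max(n))

count_range
```

If min_n and max_n differ, the entities do not all have the same number of observations.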

Randomly select 6 entities to analyze

  • use set.seed() so your results will be replicable

  • save to entities

set.seed(3206) 

entities  <- life_no_missing  |> 
  count(Entity) |> 
  slice_sample(n=6)  |>
  pull(Entity)

Display entities

entities
[1] "Vatican"     "Isle of Man" "Togo"        "Serbia"      "Samoa"      
[6] "Norway"     
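
A quick way to see what set.seed() does: resetting the same seed before slice_sample() gives the same draw each time. This is a sketch, assuming a toy tibble of entity names (`toy_entities`) and a hypothetical helper `pick_three()`.

```r
library(dplyr)
library(tibble)

# Toy list of entities (hypothetical, for illustration only)
toy_entities <- tibble(
  Entity = c("Norway", "Samoa", "Serbia", "Togo", "Chile", "Kenya", "Japan", "Peru")
)

# Hypothetical helper: set the seed, then sample 3 entity names
pick_three <- function(seed) {
  set.seed(seed)
  toy_entities |>
    slice_sample(n = 3) |>
    pull(Entity)
}

first_draw  <- pick_three(3206)
second_draw <- pick_three(3206)

# TRUE when both draws contain the same entities in the same order
identical(first_draw, second_draw)
```

Without set.seed(), each run of slice_sample() would generally return a different set of entities.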

Prepare data for plotting

  • extract only the rows for the 6 entities

  • rename the last column to expectancy

  • drop the variable Code

  • save to object life_df

life_df  <- life_no_missing  |> 
  filter(Entity %in% entities)   |> 
  rename(expectancy = 4)  |> 
  select(-Code)  

Glimpse life_df

glimpse(life_df)
Rows: 536
Columns: 3
$ Entity     <chr> "Isle of Man", "Isle of Man", "Isle of Man", "Isle of Man",…
$ Year       <dbl> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,…
$ expectancy <dbl> 61.4, 62.3, 64.0, 64.9, 64.6, 65.3, 64.9, 65.0, 65.0, 65.5,…

Find the min and max year

  • for your entities find the min and max year
life_df  |> summarize(min(Year), max(Year))
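
Naming the summary columns makes the result easier to reuse, for example to build the plot title from the data instead of typing the years by hand. This is a sketch on a toy tibble (`toy`, made-up rows); `year_range` and `plot_title` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical rows) standing in for life_df
toy <- tibble(
  Entity     = c("Norway", "Norway", "Togo"),
  Year       = c(1876, 1950, 2021),
  expectancy = c(48.0, 71.0, 61.0)
)

# Named summary columns instead of `min(Year)` and `max(Year)`
year_range <- toy |>
  summarize(min_year = min(Year), max_year = max(Year))

# Build the title string from the computed years
plot_title <- paste0(
  "Life Expectancy, ", year_range$min_year, " to ", year_range$max_year
)

plot_title
```

The resulting string could then be passed to labs(title = plot_title), so the title stays correct if a different seed selects different entities.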

Plot Year vs expectancy for each entity

  • use ggplot2 to create a line plot for each entity with Year on the x-axis and expectancy on the y-axis

  • add points to the plot

  • format the y-axis to display the values with the suffix “years”

  • add a title, “Life Expectancy, 1876 to 2021”, where the two years are the min and max year

  • change the color using scale_color_

    • look at the help for ?scale_color_viridis_d to see how to change the color palette

life_df |>
  ggplot(aes(x = Year, y = expectancy, color = Entity)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = number_format(suffix = " years")) +
  labs(
    x = NULL, y = NULL, color = NULL,
    title = "Life Expectancy, 1876 to 2021",
    caption = "Source: Our World in Data (https://ourworldindata.org/life-expectancy)"
  ) +
  scale_color_viridis_d(option = "plasma", begin = 0, end = 0.8)
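
The labels argument takes a function, so the y-axis formatter can be tried on a few numbers before it goes into the plot. This is a small sketch; `years_label` is a hypothetical name, and accuracy = 1 is an added assumption (round to whole years) that the plot code above does not set.

```r
library(scales)

# number_format() returns a function that turns numbers into labels
years_label <- number_format(accuracy = 1, suffix = " years")

# Apply the formatter to a couple of example values
example_labels <- years_label(c(60, 72.4))
example_labels
```

The scales documentation lists label_number() as the newer name for number_format(); both accept the same suffix argument.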

Assign the plot to an object p_life_df

p_life_df  <- life_df |>
  ggplot(aes(x = Year, y = expectancy, color = Entity)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = number_format(suffix = " years")) +
  labs(
    x = NULL, y = NULL, color = NULL,
    title = "Life Expectancy, 1876 to 2021",
    caption = "Source: Our World in Data (https://ourworldindata.org/life-expectancy)"
  ) +
  scale_color_viridis_d(option = "plasma", begin = 0, end = 0.8)

Create a plotly object

ggplotly(p_life_df)

Add text to the plot

  • add a text annotation to the plot
p_life_df +
  annotate("text", x = 1900, y = 75,
           label = "For these countries, life expectancy\nat birth before 1950\nwas less than 60 years",
           color = "black", size = 3)

Quiz

  • 4 multiple-choice questions
  • 3 numeric questions
  • The quiz instructions will give you a dataset to read in, similar to life_expectancy