Exploring categorical data

based on IMS Ch 4: Exploratory data analysis

Outline

  • define categorical data
  • discuss the different types of categorical data
  • create tables (frequency and contingency) of counts of categorical data
  • why visualize data
  • visualization of categorical data

What is categorical data?

  • categorical data is data that is classified into groups or categories

What are the different types of categorical data?

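  • nominal data: categories with no natural order (e.g. homeownership)

  • ordinal data: categories with a natural order (e.g. grade)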
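
One way to see the nominal/ordinal distinction in R is through factors: an unordered factor behaves like nominal data, an ordered factor like ordinal data. The sketch below uses small made-up vectors (the names `home` and `grade` and their values are illustrative, not taken from loan50).

```{r}
# Hypothetical toy values for illustration only
home <- factor(c("rent", "own", "mortgage", "rent"))          # nominal: no order
grade <- factor(c("B", "A", "C", "A"),
                levels = c("A", "B", "C"), ordered = TRUE)     # ordinal: A < B < C

is.ordered(home)    # FALSE
is.ordered(grade)   # TRUE

# Order comparisons only make sense for the ordered factor
before_c <- grade < "C"
before_c            # TRUE TRUE FALSE TRUE
```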

Load the libraries

```{r}
library(tidyverse)
library(openintro)
```

Your turn - loan50

  • glimpse the data loan50

  • which variables are categorical (type: fct, lgl)

Frequency table code

```{r}
loan50 |>
  count(homeownership) 
```
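
A possible extension, sketched on a small made-up tibble (`toy` and its values are hypothetical): `count()` accepts `sort = TRUE` to put the most common category first, and a proportion column can be added with `mutate()`.

```{r}
library(dplyr)

# Hypothetical toy data standing in for a categorical column
toy <- tibble(home = c("rent", "own", "rent", "mortgage", "rent", "mortgage"))

freq <- toy |>
  count(home, sort = TRUE) |>   # most common category in the first row
  mutate(prop = n / sum(n))     # share of all rows in each category

freq
```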

Your turn - frequency table

  • Using dataset loan50

  • create a frequency table of loan_purpose

  • what is the most common reason for a loan?

    • the number of people that took a loan for this reason is:

Your turn - frequency table

  • Using dataset loan50

  • create a frequency table of grade

  • what is the most common grade for a loan?

    • how many people had loans with that grade?

Contingency table

  • counts for all combinations of two categorical variables
```{r}
table(purpose = loan50$loan_purpose, homeownership = loan50$homeownership)
```
                    homeownership
purpose              rent mortgage own
                        0        0   0
  car                   1        1   0
  credit_card           7        6   0
  debt_consolidation    8       12   3
  home_improvement      0        5   0
  house                 0        1   0
  major_purchase        0        0   0
  medical               0        0   0
  moving                0        0   0
  other                 3        1   0
  renewable_energy      1        0   0
  small_business        1        0   0
  vacation              0        0   0
  wedding               0        0   0

  • The count of the combination with the most observations is:

  • The value of loan_purpose is:

  • The value of homeownership is:

Your turn - contingency table

  • Using dataset loan50

  • create a contingency table of grade and verified_income

  • what is the most common combination of grade and verified_income?

  • How many loans have grade “A” and are “Not Verified”?

Using tidyverse - contingency table

```{r}
loan50 |> 
  count(loan_purpose, homeownership) |>
  pivot_wider(names_from = homeownership, values_from = n, values_fill = 0)
```

Why visualize data

  • spot patterns and trends
  • communicate data to others
  • a visual representation is often easier to read than a table
  • identify outliers
  • explore data

How to visualize categorical data

  • bar charts

ggplot

  • ggplot() defines plot object

    • 1st argument is data
    • 2nd argument is mapping: how variables in the dataset are mapped to visual properties (aesthetics) of the plot
      • mapping = aes(x = variable on x axis, y = variable on y axis)
  • add layers with geom_

    • bar chart: geom_bar()
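
The two steps above can be sketched separately on a small made-up data frame (`toy` and its column `home` are hypothetical): first build the plot object, then add the bar layer.

```{r}
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(home = c("rent", "own", "rent", "mortgage", "rent"))

# Step 1: ggplot() sets up the plot object with the data and aesthetic mapping
p <- ggplot(data = toy, mapping = aes(x = home))

# Step 2: adding geom_bar() draws one bar per category, with height = row count
p_bar <- p + geom_bar()

p_bar
```

Printing `p` on its own would show only empty axes, since no layer has been added yet.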

Bar chart 1 categorical variable

```{r}
ggplot(data = loan50, mapping = aes(x = homeownership)) +
  geom_bar() 

# ggplot(loan50, aes(x = homeownership)) +
#  geom_bar() 
```

Your turn - bar chart

  • create a bar chart of the variable loan_purpose

Bar chart 1 var, y-axis

  • plot frequency table
```{r}
ggplot(loan50, aes(y = homeownership)) +
  geom_bar() 
```

Your turn - bar chart

  • create a bar chart of the variable loan_purpose on the y axis

Bar plots with two variables

  • plot contingency table

Stacked

```{r}
ggplot(loan50, aes(x = homeownership,
                   fill = loan_status)) +
  geom_bar()
```

Dodged

```{r}
ggplot(loan50, aes(x = homeownership, 
                   fill = loan_status)) +
  geom_bar(position = "dodge") 
```

Bar plots with two variables

Standardized

```{r}
ggplot(loan50, aes(x = homeownership,
                   fill = loan_status)) +
  geom_bar(position = "fill")
```
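
With `position = "fill"` every bar is rescaled to height 1, so the segments show proportions within each x group rather than counts. A minimal sketch of the numbers behind such a chart, using base R on made-up vectors (`home` and `status` are hypothetical values, not loan50):

```{r}
# Hypothetical toy data for illustration only
home   <- c("rent", "rent", "rent", "own", "own", "mortgage")
status <- c("Current", "Current", "Late", "Current", "Late", "Current")

# Contingency table of counts: rows = status, columns = home
counts <- table(status = status, home = home)

# margin = 2 divides by column totals, so each home group sums to 1
props <- prop.table(counts, margin = 2)

props
```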

Your turn

  • Create a stacked bar chart for the verified_income and grade variables in the loan50 dataset

Your turn

  • Create a dodged bar chart for the verified_income and grade variables in the loan50 dataset

Your turn

  • Create a standardized bar chart for the verified_income and grade variables in the loan50 dataset

Mosaic plots

  • similar to a standardized stacked bar chart, but the bar widths still show the relative group sizes of the primary variable (homeownership)
```{r}
library(ggmosaic)  # provides geom_mosaic() and product(); not loaded above

ggplot(loan50) +
  geom_mosaic(aes(x = product(homeownership),
                  fill = verified_income))
```

Needed for lab

Filter for missing values

```{r}
ex_data <- tribble(
  ~x, ~y,
  1,  "a",
  2,  NA,
  NA, "c",
  4,  "d",
  NA, "e"
)
```

Filtering for rows where x has missing values

```{r}
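# is.na() is the reliable test for missing values
ex_data %>%
  filter(is.na(x))

# returns 0 rows: x == "NA" compares with the text "NA" and evaluates to NA for missing x
ex_data |>
  filter(x == "NA")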
```
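
Why a comparison cannot find missing values: any comparison involving `NA` evaluates to `NA`, and `filter()` keeps only rows where the condition is `TRUE`. A short base R sketch on a plain vector with the same values as the x column:

```{r}
x <- c(1, 2, NA, 4, NA)

# Comparing with NA never gives TRUE or FALSE, only NA
cmp <- x == NA
cmp              # NA NA NA NA NA

# is.na() returns TRUE exactly where the value is missing
missing <- is.na(x)
missing          # FALSE FALSE TRUE FALSE TRUE

n_missing <- sum(missing)
n_missing        # 2
```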

Filtering out rows where x has missing values

```{r}
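ex_data %>%
  filter(!is.na(x))

# same rows here, but only because filter() drops rows where the condition is NA;
# prefer !is.na(x)
ex_data %>%
  filter(x != "NA")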
```

Filtering out rows with missing values on multiple columns

```{r}
ex_data %>% 
  filter(!is.na(x) & !is.na(y))
```
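
An alternative sketch, assuming tidyr is available through the tidyverse: `drop_na()` removes rows with missing values in the listed columns, which gives the same result as the `filter()` call above. The example data is repeated so the chunk runs on its own.

```{r}
library(tidyverse)

ex_data <- tribble(
  ~x, ~y,
  1,  "a",
  2,  NA,
  NA, "c",
  4,  "d",
  NA, "e"
)

# Keep only rows where both x and y are present
complete_rows <- ex_data |>
  drop_na(x, y)

complete_rows
```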

Summary

  • identify factors
  • count levels of factors
    • frequency table - 1 factor
    • contingency table - 2 factors
  • plot factors
    • bar chart

Quiz

  • 9 questions using datasets in openintro package
    • Questions are numeric and multiple choice (make sure you understand the code covered today)