Exploring categorical data

based on IMS Ch 4: Exploratory data analysis

Outline

define categorical data
discuss the different types of categorical data
create tables (frequency and contingency) of counts of categorical data
why visualize data
visualization of categorical data

What is categorical data?

categorical data is data that is classified into groups or categories

What are the different types of categorical data?

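nominal data: categories with no natural order (e.g. homeownership)
ordinal data: categories with a natural order (e.g. grade)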
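
A minimal sketch of how the two types can be represented in R: nominal data as an unordered factor and ordinal data as an ordered factor. The vectors `ownership` and `grade` are made up for illustration and are not taken from loan50.

```{r}
# nominal: categories have no natural order
ownership <- factor(c("rent", "own", "mortgage", "rent"))
levels(ownership)       # levels are sorted alphabetically by default
is.ordered(ownership)   # FALSE

# ordinal: categories have a natural order, declared via `levels`
grade <- factor(c("B", "A", "C", "A"),
                levels = c("A", "B", "C"),
                ordered = TRUE)
is.ordered(grade)       # TRUE
grade < "C"             # comparisons are meaningful for ordered factors
```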

Load the libraries

```{r}
library(tidyverse)
library(openintro)
```

Your turn - `loan50`

glimpse the dataset loan50
which variables are categorical (type: fct, lgl)?

Frequency table code

```{r}
loan50 |>
  count(homeownership) 
```
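
As a sketch of two common extensions, `count()` can sort the table with `sort = TRUE`, and a `mutate()` step can add proportions. The tibble `toy_loans` is made up for illustration so the example runs without openintro.

```{r}
library(tidyverse)

# hypothetical data, not loan50
toy_loans <- tibble(
  homeownership = c("rent", "mortgage", "rent", "own", "mortgage", "rent")
)

freq <- toy_loans |>
  count(homeownership, sort = TRUE) |>   # most common category first
  mutate(prop = n / sum(n))              # share of all rows

freq
```

The same two steps should carry over to loan50 by swapping in the dataset and column name.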

Your turn - frequency table

Using dataset loan50
create a frequency table of loan_purpose
what is the most common reason for a loan?
- the number of people that took a loan for this reason is:

Your turn - frequency table

Using dataset loan50
create a frequency table of grade
what is the most common grade for a loan?
- how many people had loans with that grade?

Contingency table

counts for all combinations of two categorical variables

```{r}
table(purpose = loan50$loan_purpose, homeownership = loan50$homeownership)
```

                    homeownership
purpose              rent mortgage own
                        0        0   0
  car                   1        1   0
  credit_card           7        6   0
  debt_consolidation    8       12   3
  home_improvement      0        5   0
  house                 0        1   0
  major_purchase        0        0   0
  medical               0        0   0
  moving                0        0   0
  other                 3        1   0
  renewable_energy      1        0   0
  small_business        1        0   0
  vacation              0        0   0
  wedding               0        0   0

The count of the category with the most observations is:
The value of loan_purpose is:
The value of homeownership is:
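
One way to read off answers like these in code rather than by eye, sketched on made-up vectors (`purpose` and `home` are hypothetical): `addmargins()` appends row and column totals, and `which()` with `arr.ind = TRUE` locates the largest cell.

```{r}
# hypothetical data, not loan50
purpose <- c("car", "car", "debt", "debt", "debt", "other")
home    <- c("rent", "own", "rent", "rent", "own", "rent")

tab <- table(purpose = purpose, homeownership = home)

addmargins(tab)                  # adds a "Sum" row and column

# (row, column) position of the largest count
idx <- which(tab == max(tab), arr.ind = TRUE)
rownames(tab)[idx[1, 1]]         # purpose of the largest cell
colnames(tab)[idx[1, 2]]         # homeownership of the largest cell
max(tab)                         # its count
```

If several cells tie for the maximum, `idx` has one row per tied cell and this sketch reports only the first.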

Your turn - contingency table

Using dataset loan50
create a contingency table of grade and verified_income
what is the most common combination of grade and verified_income?
How many loans have grade “A” and are “Not Verified”?

Using tidyverse - contingency table

```{r}
loan50 |> 
  count(loan_purpose, homeownership) |>
  pivot_wider(names_from = homeownership, values_from = n, values_fill = 0)
```
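
To make the two steps visible, here is a sketch on a made-up tibble (`toy` and its columns are hypothetical): `count()` returns one row per observed combination, and `pivot_wider()` spreads one variable across the columns, with `values_fill = 0` filling combinations that never occur.

```{r}
library(tidyverse)

# hypothetical data, not loan50
toy <- tibble(
  purpose = c("car", "car", "debt", "debt", "other"),
  home    = c("rent", "own", "rent", "rent", "rent")
)

long <- toy |> count(purpose, home)   # one row per observed combination
long

wide <- long |>
  pivot_wider(names_from = home, values_from = n, values_fill = 0)
wide                                  # unobserved combinations become 0
```

Without `values_fill`, the unobserved combinations would appear as `NA` instead of 0.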

Why visualize data

spot patterns and trends
communicate data to others
a visual representation is often easier to read than a table
identify outliers
explore data

How to visualize categorical data

bar charts

`ggplot`

ggplot() defines the plot object
- 1st argument is data
- 2nd argument is mapping: how variables in the dataset are mapped to visual properties (aesthetics) of the plot
  - mapping = aes(x = variable on x-axis, y = variable on y-axis)
add layers with geom_*() functions
- bar chart: geom_bar()
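
The pieces listed above can be sketched as follows on a made-up tibble (`toy` is hypothetical). The plot object is built first and the layer is added afterwards, only to make the two steps visible.

```{r}
library(tidyverse)

# hypothetical data, not loan50
toy <- tibble(home = c("rent", "own", "rent", "mortgage", "rent"))

# step 1: data and aesthetic mapping define the plot object (no bars yet)
p <- ggplot(data = toy, mapping = aes(x = home))

# step 2: add a layer; geom_bar() counts the rows in each category
p_bar <- p + geom_bar()
p_bar
```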

Bar chart 1 categorical variable

```{r}
# full form, with argument names spelled out
ggplot(data = loan50, mapping = aes(x = homeownership)) +
  geom_bar()

# equivalent short form, relying on argument order:
# ggplot(loan50, aes(x = homeownership)) +
#   geom_bar()
```

Your turn - bar chart

create a bar chart of the variable loan_purpose

Bar chart 1 var, y-axis

plot frequency table

```{r}
ggplot(loan50, aes(y = homeownership)) +
  geom_bar() 
```
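
Horizontal bars are often easier to read when sorted by frequency. A possible sketch uses `fct_infreq()` and `fct_rev()` from forcats (loaded with the tidyverse); the tibble `toy` is made up for illustration.

```{r}
library(tidyverse)

# hypothetical data, not loan50
toy <- tibble(home = c("rent", "own", "rent", "mortgage", "rent", "mortgage"))

# fct_infreq() orders levels by frequency, most common first;
# fct_rev() reverses them so the most common bar is drawn at the top
toy_ordered <- toy |>
  mutate(home = fct_rev(fct_infreq(home)))

levels(toy_ordered$home)

ggplot(toy_ordered, aes(y = home)) +
  geom_bar()
```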

Your turn - bar chart

create a bar chart of the variable loan_purpose on the y axis

Bar plots with two variables

plot contingency table

Stacked

```{r}
ggplot(loan50, aes(x = homeownership, 
                  fill = loan_status)) +
  geom_bar() 
```

Dodged

```{r}
ggplot(loan50, aes(x = homeownership, 
                   fill = loan_status)) +
  geom_bar(position = "dodge") 
```

Bar plots with two variables

Standardized

```{r}
ggplot(loan50, 
      aes(x = homeownership, 
                    fill = loan_status)) +
  geom_bar(position = "fill") 
```
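
The heights drawn by `position = "fill"` are proportions within each bar. A sketch of the same calculation done by hand, on a made-up tibble (`toy`, `home`, and `status` are hypothetical):

```{r}
library(tidyverse)

# hypothetical data, not loan50
toy <- tibble(
  home   = c("rent", "rent", "rent", "rent", "own", "own"),
  status = c("Current", "Current", "Current", "Late", "Current", "Late")
)

props <- toy |>
  count(home, status) |>
  group_by(home) |>              # standardize within each bar
  mutate(prop = n / sum(n)) |>
  ungroup()

props
```

Within each value of `home` the proportions sum to 1, which is why every bar in the standardized chart has the same height.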

Your turn

Create a stacked bar chart for the verified_income and grade variables in the loan50 dataset

Your turn

Create a dodged bar chart for the verified_income and grade variables in the loan50 dataset

Your turn

Create a standardized bar chart for the verified_income and grade variables in the loan50 dataset

Mosaic plots

similar to a standardized stacked bar chart, but the bar widths still show the relative group sizes of the primary variable (homeownership)

```{r}
library(ggmosaic)  # provides geom_mosaic() and product()

ggplot(loan50) +
  geom_mosaic(aes(x = product(homeownership), 
                  fill = verified_income)) 
```
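
If ggmosaic is not installed, base R's `mosaicplot()` draws a comparable plot from a contingency table. A sketch on made-up vectors (`home` and `verified` are hypothetical):

```{r}
# hypothetical data, not loan50
home     <- c("rent", "rent", "rent", "own", "mortgage", "mortgage")
verified <- c("Verified", "Not Verified", "Verified",
              "Not Verified", "Verified", "Verified")

tab <- table(homeownership = home, verified_income = verified)

# tile widths follow the homeownership counts,
# tile heights the verified_income shares within each group
mosaicplot(tab, main = "Toy mosaic plot", color = TRUE)
```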

Needed for lab

Filter for missing values

```{r}
ex_data <- tribble(
  ~x, ~y,
  1,  "a",
  2,  NA,
  NA, "c",
  4,  "d",
  NA, "e"
)
```

Filtering for rows where x has missing values

```{r}
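# correct: is.na() tests for missing values
ex_data %>% 
  filter(is.na(x))

# does not work: compares x with the string "NA", so no rows are returned
ex_data |>
  filter(x == "NA")  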
```
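
The reason `is.na()` is needed can be seen on a plain vector (`x` below is a made-up stand-in for the column): any comparison with `NA` yields `NA` rather than `TRUE` or `FALSE`, and `filter()` keeps only rows where the condition is `TRUE`.

```{r}
x <- c(1, 2, NA, 4, NA)

x == NA        # NA for every element: comparing with NA is never TRUE
is.na(x)       # FALSE FALSE TRUE FALSE TRUE
sum(is.na(x))  # number of missing values: 2
```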

Filtering out rows where x has missing values

```{r}
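# correct: keep rows where x is not missing
ex_data %>% 
  filter(!is.na(x))

# avoid: happens to return the same rows here, but only because the
# comparison yields NA for missing x and filter() drops those rows
ex_data %>% 
  filter(x != "NA")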
```

Filtering out rows with missing values on multiple columns

```{r}
ex_data %>% 
  filter(!is.na(x) & !is.na(y))
```
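
tidyr, loaded as part of the tidyverse, also provides `drop_na()`, which should give the same rows as the filter above. A sketch that rebuilds `ex_data` so it runs on its own (the name `complete` is made up):

```{r}
library(tidyverse)

ex_data <- tribble(
  ~x, ~y,
  1,  "a",
  2,  NA,
  NA, "c",
  4,  "d",
  NA, "e"
)

complete <- ex_data |>
  drop_na(x, y)   # keep rows where neither x nor y is missing

complete
```

Calling `drop_na()` with no column names drops rows with a missing value in any column.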

Summary

identify factors
count levels of factors
- frequency table - 1 factor
- contingency table - 2 factors
plot factors
- bar chart

Quiz

9 questions using datasets in openintro package
- Questions are numeric and multiple choice (make sure you understand the code covered today)
