Data manipulation

More dplyr (package in the tidyverse)

Review data transformation with `dplyr`

expects tidy data
- each variable in its own column
- each observation in its own row
works with pipes |>
- x |> f(y) becomes f(x, y)
- |> takes the result on its left as the input to the next command
  - read it as “then”
functions covered
- glimpse()
- count()
- filter()
- select()
- mutate()
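
The pipe rule reviewed above can be checked directly. This is a minimal sketch, assuming a small made-up tibble called `toy_cars` (hypothetical values, not the real `cars93` data). It writes the same two steps once as a nested call and once with pipes, then compares the results.

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  type  = c("small", "small", "midsize", "large"),
  price = c(10, 12, 20, 30)
)

# Nested call: has to be read from the inside out
nested <- select(filter(toy_cars, price > 11), type, price)

# Piped call: read top to bottom, saying "then" at each |>
piped <- toy_cars |>
  filter(price > 11) |>
  select(type, price)

# Both versions should give the same result
identical(nested, piped)
```

The piped version reads in the order the steps happen, which is why the rest of these slides use it.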

Review key functions in `dplyr`

| `function()` | Action |
|---|---|
| `glimpse()` | gets a glimpse of your data |
| `count()` | counts the unique values of one or more variables |
| `filter()` | picks rows based on their values |
| `mutate()` | creates new variables (columns) |
| `select()` | picks variables (columns) |
| `summarize()` | reduces multiple values down to a single statistic |
| `arrange()` | changes the order of the rows based on their values |
| `group_by()` | creates groups of rows to apply functions to |

Outline

dplyr new functions we will cover today:
- summarize()
- arrange()
- group_by()

Load the packages

```{r}
library(tidyverse)
library(openintro)
library(skimr) # install first with install.packages("skimr")
```

Basic Usage of `summarize`

Find the average price of all cars:

```{r}
#| eval: true
cars93 |>
  summarize(avg_price = mean(price))
```

`summarize` to calculate summary statistics

Center

mean(): the average
median(): the middle value

Spread

sd(): the standard deviation

Range

min(): the minimum value (p0)
max(): the maximum value (p100)

Position

first() value
last() value
nth() value

Count

n()
n_distinct()
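
The summary functions listed above can all be used inside one `summarize()` call. This is a minimal sketch on a made-up tibble, `toy_cars` (hypothetical values chosen so the arithmetic is easy to check by hand, not the real `cars93` data).

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  type  = c("small", "small", "midsize", "large", "large"),
  price = c(10, 12, 20, 30, 28)
)

toy_summary <- toy_cars |>
  summarize(
    avg_price    = mean(price),      # center
    med_price    = median(price),    # center
    sd_price     = sd(price),        # spread
    min_price    = min(price),       # range
    max_price    = max(price),       # range
    first_price  = first(price),     # position: first row
    second_price = nth(price, 2),    # position: second row
    last_price   = last(price),      # position: last row
    n_cars       = n(),              # count of rows
    n_types      = n_distinct(type)  # count of unique values
  )

toy_summary
```

The result is a single row with one column per statistic, because the data are not grouped.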

Your turn

Find the maximum mpg_city of all cars:

```{r}
??? |>
  ???(max_mpg_city = ???(???))
```

Using `summarize` with `group_by`

Calculate average price for each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price))
```
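
It can help to see what `group_by()` does on its own. This is a minimal sketch, assuming a made-up tibble `toy_cars` (hypothetical values, not the real `cars93` data). Grouping does not change the rows; it only records which variable defines the groups, and `summarize()` then returns one row per group.

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  type  = c("small", "small", "midsize", "large", "large"),
  price = c(10, 12, 20, 30, 28)
)

# group_by() alone keeps every row; it just marks the grouping variable
grouped <- toy_cars |>
  group_by(type)

nrow(grouped)        # still the same number of rows as toy_cars
group_vars(grouped)  # the name of the grouping variable

# summarize() on grouped data returns one row per group
by_type <- grouped |>
  summarize(avg_price = mean(price))

by_type
```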

Your turn: `summarize` with `group_by`

Calculate maximum mpg_city for each drive_train:

```{r}
cars93 |>
  ???(???) |>
  ???(max_mpg_city = ???(???))
```

Multiple summaries

Calculate the average and maximum price for each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price),
            max_price = max(price))
```

Your turn: multiple summaries

Calculate the median and minimum weight for each drive_train

```{r}
??? |>
  ???(???) |>
  ???(med_weight = ???(???),
      min_weight = ???(???))
```

Summarize: `n()` and `group_by()`

Calculate the number of cars of each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(count = n())
```

Using count instead of `n()`

Calculate the number of cars of each type:

```{r}
#| eval: true
cars93 |>
  count(type)
```
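
`count()` does the grouping itself, so it can replace the `group_by()` plus `summarize(n())` pattern. This is a minimal sketch on a made-up tibble, `toy_cars` (hypothetical values, not the real `cars93` data). It compares the two approaches and also shows the `sort = TRUE` argument of `count()`, which puts the largest groups first.

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  type = c("small", "small", "small", "large", "large", "midsize")
)

# Long form: group, then count the rows in each group
with_n <- toy_cars |>
  group_by(type) |>
  summarize(n = n())

# Short form: count() groups and counts in one step
with_count <- toy_cars |>
  count(type)

# The two results should match
all.equal(with_n, with_count)

# sort = TRUE orders the output from most to least frequent
sorted_count <- toy_cars |>
  count(type, sort = TRUE)

sorted_count
```

Note that `count()` names its output column `n` by default.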

Your turn: `n()` and `group_by()`

Calculate the number of cars from each weight

```{r}
??? |>
  ???(???) |>
  ???(count  = ??? )
```

Sorting by a single column

Arrange cars based on their price:

```{r}
#| eval: true
cars93 |>
  arrange(price)
```

Your turn: sorting by a single column

Arrange cars based on their mpg_city:

```{r}
cars93 |>
  ???(???)
```

Descending sorting

Arrange cars in descending order based on their price:

```{r}
#| eval: true
cars93 |>
  arrange(desc(price))
```

Sorting by multiple columns

Arrange cars by passengers and then by price, then select passengers and price:

```{r}
#| eval: true
cars93 |>
  arrange(passengers, price)  |> 
  select(passengers, price)
```
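
Ascending and descending order can be mixed in one `arrange()` call. This is a minimal sketch, assuming a made-up tibble `toy_cars` (hypothetical values, not the real `cars93` data). Rows are sorted by `passengers` from low to high, and ties are broken by `price` from high to low.

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  passengers = c(4, 5, 4, 5),
  price      = c(10, 30, 12, 20)
)

# passengers ascending; within the same passengers value, price descending
sorted <- toy_cars |>
  arrange(passengers, desc(price)) |>
  select(passengers, price)

sorted
```

The second sorting column only matters for rows that tie on the first one.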

Summary

Recap of summarize

Collapses many rows into one row of summary statistics.
Returns one row per group when combined with group_by.
Computes several summary statistics in a single call.

Recap of arrange

Reorders the rows of a data frame by the values of one or more columns.
Sorts in ascending order by default; wrap a column in desc() for descending order.
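
The three functions from today can be chained into one pipeline. This is a minimal sketch on a made-up tibble, `toy_cars` (hypothetical values, not the real `cars93` data). It groups by `type`, summarizes each group, then sorts the summary so the most expensive type comes first.

```r
library(dplyr)

# Toy data invented for illustration (not taken from cars93)
toy_cars <- tibble(
  type  = c("small", "small", "midsize", "large", "large"),
  price = c(10, 12, 20, 30, 28)
)

price_by_type <- toy_cars |>
  group_by(type) |>                    # one group per type
  summarize(avg_price = mean(price),   # average price in each group
            n_cars    = n()) |>        # number of cars in each group
  arrange(desc(avg_price))             # highest average price first

price_by_type
```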

Quiz

Four questions on results of code run on datasets in openintro package
- Questions are numeric and multiple choice (make sure you understand the code covered today)