Data manipulation

More dplyr (package in the tidyverse)

Review data transformation with dplyr

  • expects tidy data

    • each variable in its own column

    • each observation in its own row

  • works with pipes |>

    • x |> f(y) becomes f(x, y)
    • |> passes the result on its left as the first argument of the next command
      • “then”
  • functions covered

    • glimpse()

    • count()

    • filter()

    • select()

    • mutate()
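
To make the pipe rule concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The `toy` data frame and its columns are made up for illustration and are not one of the course datasets; the native pipe `|>` assumes R 4.1 or later.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  score = c(10, 25, 40, 5)
)

# Nested call: has to be read from the inside out
nested <- select(filter(toy, score > 8), name)

# Piped call: read top to bottom, with |> as "then"
piped <- toy |>
  filter(score > 8) |>
  select(name)

# Both forms should give the same result
identical(nested, piped)
```

Both versions keep the rows with a score above 8 and then keep only the name column; the piped form simply lists the steps in the order they happen.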

Review key functions in dplyr

| Function    | Action                                                  |
|-------------|---------------------------------------------------------|
| glimpse()   | gets a glimpse of your data                             |
| count()     | counts the unique values of one or more variables       |
| filter()    | picks rows based on their values                        |
| mutate()    | creates new variables (columns)                         |
| select()    | picks variables (columns)                               |
| summarize() | reduces multiple values down to a single statistic      |
| arrange()   | changes the order of the rows based on their values     |
| group_by()  | splits the data into groups that later functions act on |

Outline

  • dplyr new functions we will cover today:

    • summarize()

    • arrange()

    • group_by()

Load the packages

```{r}
library(tidyverse)
library(openintro)
library(skimr) # install once first with install.packages("skimr")
```

Basic Usage of summarize

Find the average price of all cars:

```{r}
#| eval: true
cars93 |>
  summarize(avg_price = mean(price))
```

‘summarize’ to calculate summary statistics

Center

  • mean(): the average
  • median(): the middle value

Spread

  • sd(): the standard deviation

Range

  • min(): the minimum value (p0)
  • max(): the maximum value (p100)

Position

  • first(): the first value
  • last(): the last value
  • nth(): the value at the nth position

Count

  • n(): the number of rows
  • n_distinct(): the number of distinct values
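
The function families on this slide can all be used inside one summarize() call. The sketch below assumes a made-up `toy` data frame with a single `score` column (hypothetical, not a course dataset); the comments mark which family each function belongs to.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(score = c(10, 25, 40, 5, 25))

stats <- toy |>
  summarize(
    avg        = mean(score),       # center
    med        = median(score),     # center
    spread     = sd(score),         # spread
    lowest     = min(score),        # range
    highest    = max(score),        # range
    first_val  = first(score),      # position
    second_val = nth(score, 2),     # position
    last_val   = last(score),       # position
    n_rows     = n(),               # count
    n_unique   = n_distinct(score)  # count
  )

stats
```

The result is a single row with one column per statistic, which is what "reduces multiple values down to a single statistic" means in practice.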

Your turn


Find the maximum mpg_city of all cars:

```{r}
??? |>
  ???(max_mpg_city = ???(???))
```

Using summarize with group_by


Calculate average price for each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price))
```
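
It can help to see what group_by() does on its own before summarize() is applied. The following is a small sketch on a made-up `toy` data frame (hypothetical names, not a course dataset): grouping leaves the rows untouched and only records which variable defines the groups, and summarize() then returns one row per group.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(
  group = c("x", "y", "x", "z"),
  value = c(3, 10, 5, 7)
)

grouped <- toy |>
  group_by(group)

nrow(grouped)        # same number of rows as toy
group_vars(grouped)  # the grouping variable recorded on the data
n_groups(grouped)    # number of distinct groups

# summarize() now works group by group: one output row per group
per_group <- grouped |>
  summarize(avg_value = mean(value))

per_group
```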

Your turn: summarize with group_by


Calculate maximum mpg_city for each drive_train:

```{r}
cars93 |>
  ???(???) |>
  ???(max_mpg_city = ???(???))
```

Multiple summaries


Calculate the average and maximum price for each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price),
            max_price = max(price))
```

Your turn: multiple summaries


Calculate the median and minimum weight for each drive_train

```{r}
??? |>
  ???(???) |>
  ???(med_weight = ???(???),
      min_weight = ???(???))
```

Summarize: n() and group_by()


Calculate the number of cars of each type:

```{r}
#| eval: true
cars93 |>
  group_by(type) |>
  summarize(count = n())
```

Using count instead of n()


Calculate the number of cars of each type:

```{r}
#| eval: true
cars93 |>
  count(type) # count() groups by type itself, so no group_by() is needed
```
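
As a rough check that the two approaches agree, the sketch below counts groups both ways on a made-up `toy` data frame (hypothetical, not a course dataset). The column is named `n` in both versions so the results line up; `sort = TRUE` is an optional count() argument that lists the largest groups first.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(group = c("x", "y", "x", "z", "x", "y"))

# Long form: group, then count rows per group with n()
by_n <- toy |>
  group_by(group) |>
  summarize(n = n())

# Short form: count() does the grouping and counting in one step
by_count <- toy |>
  count(group)

# The two results should contain the same groups and counts
all.equal(by_n, by_count)

# Optional: largest groups first
by_count_sorted <- toy |>
  count(group, sort = TRUE)

by_count_sorted
```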

Your turn: n() and group_by()


Calculate the number of cars for each weight:

```{r}
??? |>
  ???(???) |>
  ???(count = ???)
```

Sorting by a single column


Arrange cars based on their price:

```{r}
#| eval: true
cars93 |>
  arrange(price)
```

Your turn: sorting by a single column


Arrange cars based on their mpg_city:

```{r}
cars93 |>
  ???(???)
```

Descending sorting


Arrange cars in descending order based on their price:

```{r}
#| eval: true
cars93 |>
  arrange(desc(price))
```

Sorting by multiple columns


Arrange cars by passengers and then by price

  • then select passengers and price


```{r}
#| eval: true
cars93 |>
  arrange(passengers, price) |>
  select(passengers, price)
```
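
Ascending and descending order can also be mixed across sorting columns. A minimal sketch, using a made-up `toy` data frame with hypothetical `seats` and `cost` columns: rows are ordered by seats from low to high, and ties within the same number of seats are broken by cost from high to low.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(
  seats = c(4, 2, 4, 2, 6),
  cost  = c(20, 35, 15, 30, 40)
)

# First key ascending, second key descending
sorted <- toy |>
  arrange(seats, desc(cost))

sorted
```

The order of the arguments matters: the first column is the main sort key, and later columns are only used to order rows that tie on the earlier ones.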

Summary

Recap of summarize

  • Powerful tool for data aggregation.
  • Enhances analysis when combined with group_by.
  • Allows for multiple summary operations at once.

Recap of arrange

  • Function for row ordering in data frames.
  • Compatible with single or multiple sorting variables.
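
The functions from today can be chained into one pipeline. The sketch below is one possible combination on a made-up `toy` data frame (hypothetical names, not a course dataset): group, summarize each group, then sort the summary from the highest average to the lowest.

```{r}
library(dplyr)

# Hypothetical toy data, not one of the course datasets
toy <- tibble(
  group = c("x", "y", "x", "z", "y", "x"),
  value = c(3, 10, 5, 7, 2, 4)
)

ranked <- toy |>
  group_by(group) |>                   # one group per distinct value of group
  summarize(avg_value = mean(value),   # average within each group
            n = n()) |>                # number of rows within each group
  arrange(desc(avg_value))             # highest average first

ranked
```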

Quiz

  • Four questions on results of code run on datasets in openintro package
    • Questions are numeric and multiple choice (make sure you understand the code covered today)