Data manipulation

More dplyr (package in the tidyverse)

Review data transformation with dplyr

  • expects tidy data

    • each variable in its own column

    • each observation in its own row

  • works with pipes |>

    • x |> f(y) becomes f(x, y)
    • |> takes the result on its left as the input for the next command
      • “then”
  • functions covered

    • glimpse()

    • count()

    • filter()

    • select()

    • mutate()
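
The pipe rule above can be checked directly. A minimal sketch, using a small made-up vector (not from the course data): the piped and nested forms call the same functions with the same arguments, so they should give the same result.

```r
# Hypothetical example values, not from the course data
x <- c(4, 9, 16)

# Piped form: read |> as "then"; x becomes the first argument of sqrt(),
# and that result becomes the first argument of round()
piped <- x |> sqrt() |> round(digits = 1)

# Nested form: the same calls written inside out
nested <- round(sqrt(x), digits = 1)

identical(piped, nested)  # TRUE
```

The native pipe |> needs R 4.1 or later.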

Review key functions in dplyr

function()    Action
glimpse()     gets a glimpse of your data
count()       counts the unique values of one or more variables
filter()      picks rows based on their values
mutate()      creates new variables (columns)
select()      picks variables (columns)
summarize()   reduces multiple values down to a single statistic
arrange()     changes the order of the rows based on their values
group_by()    creates subsets of data to apply functions to
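
As a quick refresher, the review functions can be chained on a tiny data frame. This is only a sketch: toy_cars and its values are invented for illustration and are not the course dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (weights in pounds)
toy_cars <- tibble(
  type   = c("small", "small", "large", "midsize"),
  price  = c(10, 12, 30, 20),
  weight = c(2000, 2200, 3800, 3000)
)

glimpse(toy_cars)        # column names, types, and first values
count(toy_cars, type)    # number of rows for each type

reviewed <- toy_cars |>
  filter(price < 25) |>                    # keep rows with price below 25
  mutate(weight_tons = weight / 2000) |>   # add a new column
  select(type, weight_tons)                # keep only these two columns

reviewed
```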


  • dplyr new functions we will cover today:

    • summarize()

    • arrange()

    • group_by()

Load the packages

library(tidyverse) # includes dplyr
library(openintro) # provides the cars93 dataset
library(skimr) # install first

Basic Usage of summarize

Find the average price of all cars:

cars93 |>
  summarize(avg_price = mean(price))

Use summarize() to calculate summary statistics

  • mean(): the average
  • median(): the median (p50)
  • sd(): the standard deviation


  • min(): the minimum value (p0)
  • max(): the maximum value (p100)


  • first(): the first value
  • last(): the last value
  • nth(): the value at a given position


  • n(): the number of rows
  • n_distinct(): the number of distinct values
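
Several of these helpers can go into a single summarize() call. A minimal sketch on a made-up data frame (toy_cars and its values are assumptions for illustration, not the course data):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_cars <- tibble(
  type  = c("small", "small", "large", "midsize", "large"),
  price = c(10, 12, 30, 20, 28)
)

stats <- toy_cars |>
  summarize(
    avg_price    = mean(price),       # 20
    median_price = median(price),     # 20
    sd_price     = sd(price),         # about 9.06
    min_price    = min(price),        # 10
    max_price    = max(price),        # 30
    first_type   = first(type),       # "small"
    n_cars       = n(),               # 5
    n_types      = n_distinct(type)   # 3
  )

stats
```

Without group_by(), the result is a single row with one column per summary.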

Your turn

Find the maximum mpg_city of all cars:

??? |>
  ???(max_mpg_city = ???(???))

Using summarize with group_by

Calculate average price for each type:

cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price))

Your turn: summarize with group_by

Calculate maximum mpg_city for each drive_train:

cars93 |>
  ???(???) |>
  ???(max_mpg_city = ???(???))

Multiple summaries

Calculate the average and maximum price for each type:

cars93 |>
  group_by(type) |>
  summarize(avg_price = mean(price),
            max_price = max(price))
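
To see the shape of a grouped result, here is the same pattern on a made-up data frame (toy_cars is hypothetical, not the course data). The output should have one row per group and one column per summary.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_cars <- tibble(
  type  = c("small", "small", "large", "midsize", "large"),
  price = c(10, 12, 30, 20, 28)
)

by_type <- toy_cars |>
  group_by(type) |>
  summarize(avg_price = mean(price),
            max_price = max(price))

by_type
# type     avg_price  max_price
# large           29         30
# midsize         20         20
# small           11         12
```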

Your turn: multiple summaries

Calculate the median and minimum weight for each drive_train

??? |>
  ???(???) |>
  ???(med_weight = ???(???),
      min_weight = ???(???))

Summarize: n() and group_by()

Calculate the number of cars from each type

cars93 |>
  group_by(type) |>
  summarize(count = n())

Using count instead of n()

Calculate the number of cars from each type

cars93 |>
  group_by(type) |>
  count()
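
count() can also take the grouping variable directly, which gives the same counts as group_by() followed by summarize(n = n()). A small sketch on made-up data (toy_cars is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_cars <- tibble(
  type = c("small", "small", "large", "midsize", "large")
)

# Long form: group, then count the rows in each group
by_n <- toy_cars |>
  group_by(type) |>
  summarize(n = n())

# Short form: count() groups and counts in one step
by_count <- toy_cars |>
  count(type)

identical(by_n$n, by_count$n)  # TRUE
```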

Your turn: n() and group_by()

Calculate the number of cars from each weight

??? |>
  ???(???) |>
  ???(count  = ??? )

Sorting by a single column

Arrange cars based on their price:

cars93 |>
  arrange(price)

Your turn: sorting by a single column

Arrange cars based on their mpg_city:

cars93 |>
  ???(???)

Descending sorting

Arrange cars in descending order based on their price:

cars93 |>
  arrange(desc(price))

Sorting by multiple columns

Arrange cars by passengers and then by price

  • then select passengers and price

cars93 |>
  arrange(passengers, price) |>
  select(passengers, price)
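
How ties are broken is easier to see on a tiny example. A minimal sketch with made-up values (toy_cars is hypothetical): the first column sets the main order, the second column orders rows that tie on the first, and desc() flips the direction of a single column.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_cars <- tibble(
  passengers = c(5, 4, 5, 4),
  price      = c(30, 12, 20, 10)
)

# Ascending passengers; ties ordered by ascending price
asc <- toy_cars |>
  arrange(passengers, price)
# passengers: 4 4 5 5
# price:      10 12 20 30

# Ascending passengers; ties ordered by descending price
mixed <- toy_cars |>
  arrange(passengers, desc(price))
# passengers: 4 4 5 5
# price:      12 10 30 20
```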


Recap of summarize

  • Powerful tool for data aggregation.
  • Enhances analysis when combined with group_by.
  • Allows for multiple summary operations at once.

Recap of arrange

  • Function for row ordering in data frames.
  • Compatible with single or multiple sorting variables.


  • Four questions on results of code run on datasets in openintro package
    • Questions are numeric and multiple choice (make sure you understand the code covered today)