Data manipulation

More dplyr (package in the tidyverse)

Review data transformation with dplyr

  • expects tidy data

    • each variable in its own column

    • each observation in its own row

  • works with pipes |>

    • x |> f becomes f(x, y)
    • |> take something as input for next command
      • “then”
  • functions covered last time

Data transformation with dplyr

  • first argument is a data frame (aka tibble)

  • Subsequent arguments describe what to do with the data frame

    • result is a new data frame

Review key functions in dplyr

function() Action
glimpse() get a glimpse of your data
count() count the unique values of one or more variables
filter() picks rows based on their values
mutate() creates new variables (columns)
select() picks variables (columns)
summarize() reduces multiple values down to a single statistic
arrange() changes the order of the rows based on their values
group_by() create subsets of data to apply functions to

Your turn review: library()

Replace ??? in code chunk below to access:

  • functions in tidyverse (includes dplyr)

  • data: cars93

```{r}
???(???)
???(???)
```

Your turn review: glimpse()

Replace ??? in code chunk below for an:

  • overview of dataset cars93
```{r}
???(???)
```

Your turn review: count()

Replace ??? in code chunk below:

  • to find out what mpg_city do the most cars get?
```{r}
cars93  |> 
  ???(???, sort=TRUE)
```

Data is tidy

  • Each row is a car (observation)
  • Variables (columns) contain information on a car

Filter: comparison operators

  • == equality

  • > greater than

  • < less than

  • >= greater than or equal to

  • <= less than or equal to

  • != not equal to

  • between numeric variable in a specified range

  • near compare 2 numeric vectors. Set tolerance

Filter: combine criteria

Combine criteria using operators that make comparisons:

  • | or

  • & and ,

Filter: between

```{r}
#| eval: true

cars93  |> 
  filter(between(price, 15, 30))
```
# A tibble: 23 × 6
   type    price mpg_city drive_train passengers weight
   <fct>   <dbl>    <int> <fct>            <int>  <int>
 1 small    15.9       25 front                5   2705
 2 midsize  30         22 rear                 4   3640
 3 midsize  15.7       22 front                6   2880
 4 large    20.8       19 front                6   3470
 5 large    23.7       16 rear                 6   4105
 6 midsize  26.3       19 front                5   3495
 7 midsize  15.9       21 front                6   3195
 8 large    18.8       17 rear                 6   3910
 9 large    18.4       20 front                6   3515
10 large    29.5       20 front                6   3570
# ℹ 13 more rows

Filter: near

```{r}
#| eval: true

cars93  |> 
  filter(near(price, 15, tol = 3))
```
# A tibble: 9 × 6
  type    price mpg_city drive_train passengers weight
  <fct>   <dbl>    <int> <fct>            <int>  <int>
1 small    15.9       25 front                5   2705
2 midsize  15.7       22 front                6   2880
3 midsize  15.9       21 front                6   3195
4 midsize  15.6       21 front                6   3080
5 small    12.2       29 front                5   2295
6 small    12.1       42 front                4   2350
7 midsize  13.9       20 front                5   2885
8 midsize  14.9       19 rear                 5   3610
9 midsize  16.3       23 front                5   2890

Your turn review: filter()

  • Extract the cars with the most common mpg_city

  • How many observations will you have?

```{r}
cars93  |>
  ???(??? == ???)
```

Your turn review: filter()

  • Extract the cars with the most common mpg_city that are large (type)
```{r}
cars93  |>
  ???(??? == ??? & ??? == ???)
```

Your turn review: filter()

  • Extract the cars that have the most common mpg_city OR are large (type)
```{r}
cars93  |>
  ???(??? == 19 ??? type == ???)
```

Outline

  • dplyr new functions we will cover today:

    • select()

    • mutate()

  • skimr

    • skim() summary statistics

Help and examples for a function

  • in console:

    • ?function
      • if package loaded (i.e., library(package))
      • ?mutate
    • ?package::function
      • if package not loaded
      • ?dplyr::filter()

Load skimr package

```{r}
library(skimr) 
```

skim for overview of data

```{r}
#| eval: true
cars93  |> 
  skim()
```
Data summary
Name cars93
Number of rows 54
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
type 0 1 FALSE 3 mid: 22, sma: 21, lar: 11
drive_train 0 1 FALSE 3 fro: 43, rea: 9, 4WD: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price 0 1 19.99 11.51 7.4 10.95 17.25 26.25 61.9 ▇▅▂▁▁
mpg_city 0 1 23.31 6.62 16.0 19.00 21.00 28.00 46.0 ▇▂▂▁▁
passengers 0 1 5.11 0.69 4.0 5.00 5.00 6.00 6.0 ▃▁▇▁▅
weight 0 1 3037.41 657.66 1695.0 2452.50 3197.50 3522.50 4105.0 ▂▆▃▇▃

Your turn

  • How many observations?

  • How many variables are categorical (fct)?

  • How many variables are numerical?

  • Are there any missing values for any of the variables?

  • What is the mean (average) price ?

  • What is the maximum mpg_city?

Using select with cars93

  • select is a function from the dplyr package (part of the tidyverse)

  • Used to select or rename columns

select() to select columns

Select the variables (columns) that are factors

```{r}
#| eval: true
cars93  |> 
  select(type, drive_train)
```
# A tibble: 54 × 2
   type    drive_train
   <fct>   <fct>      
 1 small   front      
 2 midsize front      
 3 midsize front      
 4 midsize rear       
 5 midsize front      
 6 large   front      
 7 large   rear       
 8 midsize front      
 9 large   front      
10 midsize front      
# ℹ 44 more rows

Your turn: select()

Select the variables that are numeric

```{r}
???  |> 
 ???(??? , ???, ???, ???  )
```

mutate with cars93

  • mutate is a function from the dplyr package (part of the tidyverse)

  • Used to create or modify columns

mutate() to create a new column

Create a new column price_d

```{r}
cars93  |> 
  mutate(price_d = price * 1000 )
```

Your turn: mutate()

Create a new column weight_kg that is weight in kiligrams

To convert pounds (lbs) to kilograms (kgs), use the following formula:

\(kgs = lbs \times 0.45\)

```{r}
???  |> 
  ???(weight_kg = ??? * 0.45 )
```

Summary

  • glimpse()

  • count()

  • filter()

  • select()

  • mutate()

Quiz

  • Two question on a result of code run on dataset from an OpenIntro Data Set

  • Seven questions on dplyr functions