Data manipulation

More dplyr (package in the tidyverse)

Review data transformation with `dplyr`

expects tidy data
- each variable in its own column
- each observation in its own row
works with pipes |>
- x |> f becomes f(x, y)
- |> take something as input for next command
  - “then”
functions covered last time
- glimpse()
- count()
- filter()

Data transformation with `dplyr`

first argument is a data frame (aka tibble)
Subsequent arguments describe what to do with the data frame
- result is a new data frame

Review key functions in `dplyr`

`function()`	Action
`glimpse()`	get a glimpse of your data
`count()`	count the unique values of one or more variables
`filter()`	picks rows based on their values
`mutate()`	creates new variables (columns)
`select()`	picks variables (columns)
`summarize()`	reduces multiple values down to a single statistic
`arrange()`	changes the order of the rows based on their values
`group_by()`	create subsets of data to apply functions to

Your turn review: `library()`

Replace ??? in code chunk below to access:

functions in tidyverse (includes dplyr)
data: cars93

```{r}
???(???)
???(???)
```

Your turn review: `glimpse()`

Replace ??? in code chunk below for an:

overview of dataset cars93

```{r}
???(???)
```

Your turn review: `count()`

Replace ??? in code chunk below:

to find out what mpg_city do the most cars get?

```{r}
cars93  |> 
  ???(???, sort=TRUE)
```

Data is tidy

Each row is a car (observation)

Variables (columns) contain information on a car

Filter: comparison operators

== equality
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
between numeric variable in a specified range
near compare 2 numeric vectors. Set tolerance

Filter: combine criteria

Combine criteria using operators that make comparisons:

| or
& and ,

Filter: `between`

```{r}
#| eval: true

cars93  |> 
  filter(between(price, 15, 30))
```

# A tibble: 23 × 6
   type    price mpg_city drive_train passengers weight
   <fct>   <dbl>    <int> <fct>            <int>  <int>
 1 small    15.9       25 front                5   2705
 2 midsize  30         22 rear                 4   3640
 3 midsize  15.7       22 front                6   2880
 4 large    20.8       19 front                6   3470
 5 large    23.7       16 rear                 6   4105
 6 midsize  26.3       19 front                5   3495
 7 midsize  15.9       21 front                6   3195
 8 large    18.8       17 rear                 6   3910
 9 large    18.4       20 front                6   3515
10 large    29.5       20 front                6   3570
# ℹ 13 more rows

Filter: `near`

```{r}
#| eval: true

cars93  |> 
  filter(near(price, 15, tol = 3))
```

# A tibble: 9 × 6
  type    price mpg_city drive_train passengers weight
  <fct>   <dbl>    <int> <fct>            <int>  <int>
1 small    15.9       25 front                5   2705
2 midsize  15.7       22 front                6   2880
3 midsize  15.9       21 front                6   3195
4 midsize  15.6       21 front                6   3080
5 small    12.2       29 front                5   2295
6 small    12.1       42 front                4   2350
7 midsize  13.9       20 front                5   2885
8 midsize  14.9       19 rear                 5   3610
9 midsize  16.3       23 front                5   2890

Your turn review: `filter()`

Extract the cars with the most common mpg_city
How many observations will you have?

```{r}
cars93  |>
  ???(??? == ???)
```

Your turn review: `filter()`

Extract the cars with the most common mpg_city that are large (type)

```{r}
cars93  |>
  ???(??? == ??? & ??? == ???)
```

Your turn review: `filter()`

Extract the cars that have the most common mpg_city OR are large (type)

```{r}
cars93  |>
  ???(??? == 19 ??? type == ???)
```

Outline

dplyr new functions we will cover today:
- select()
- mutate()
skimr
- skim() summary statistics

Help and examples for a function

in console:
- ?function
  - if package loaded (i.e., library(package))
  - ?mutate
- ?package::function
  - if package not loaded
  - ?dplyr::filter()

Load `skimr` package

```{r}
library(skimr) 
```

skim for overview of data

```{r}
#| eval: true
cars93  |> 
  skim()
```

Data summary
Name	cars93
Number of rows	54
Number of columns	6
_______________________
Column type frequency:
factor	2
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
type	0	1	FALSE	3	mid: 22, sma: 21, lar: 11
drive_train	0	1	FALSE	3	fro: 43, rea: 9, 4WD: 2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
price	1	19.99	11.51	7.4	10.95	17.25	26.25	61.9	▇▅▂▁▁
mpg_city	1	23.31	6.62	16.0	19.00	21.00	28.00	46.0	▇▂▂▁▁
passengers	1	5.11	0.69	4.0	5.00	5.00	6.00	6.0	▃▁▇▁▅
weight	1	3037.41	657.66	1695.0	2452.50	3197.50	3522.50	4105.0	▂▆▃▇▃

Your turn

How many observations?
How many variables are categorical (fct)?
How many variables are numerical?
Are there any missing values for any of the variables?
What is the mean (average) price ?
What is the maximum mpg_city?

Using `select` with `cars93`

select is a function from the dplyr package (part of the tidyverse)
Used to select or rename columns

`select()` to select columns

Select the variables (columns) that are factors

```{r}
#| eval: true
cars93  |> 
  select(type, drive_train)
```

# A tibble: 54 × 2
   type    drive_train
   <fct>   <fct>      
 1 small   front      
 2 midsize front      
 3 midsize front      
 4 midsize rear       
 5 midsize front      
 6 large   front      
 7 large   rear       
 8 midsize front      
 9 large   front      
10 midsize front      
# ℹ 44 more rows

Your turn: `select()`

Select the variables that are numeric

```{r}
???  |> 
 ???(??? , ???, ???, ???  )
```

`mutate` with `cars93`

mutate is a function from the dplyr package (part of the tidyverse)
Used to create or modify columns

`mutate()` to create a new column

Create a new column price_d

```{r}
cars93  |> 
  mutate(price_d = price * 1000 )
```

Your turn: `mutate()`

Create a new column weight_kg that is weight in kiligrams

To convert pounds (lbs) to kilograms (kgs), use the following formula:

\(kgs = lbs \times 0.45\)

```{r}
???  |> 
  ???(weight_kg = ??? * 0.45 )
```

Summary

glimpse()
count()
filter()
select()
mutate()

Quiz

Two question on a result of code run on dataset from an OpenIntro Data Set
Seven questions on dplyr functions

Data manipulation

Review data transformation with dplyr

Data transformation with dplyr

Review key functions in dplyr

Your turn review: library()

Your turn review: glimpse()

Your turn review: count()

Data is tidy

Filter: comparison operators

Filter: combine criteria

Combine criteria using operators that make comparisons:

Filter: between

Filter: near

Your turn review: filter()

Your turn review: filter()

Your turn review: filter()

Outline

Help and examples for a function

Load skimr package

skim for overview of data

Your turn

Using select with cars93

select() to select columns

Your turn: select()

mutate with cars93

mutate() to create a new column

Your turn: mutate()

Summary

Quiz

Review data transformation with `dplyr`

Data transformation with `dplyr`

Review key functions in `dplyr`

Your turn review: `library()`

Your turn review: `glimpse()`

Your turn review: `count()`

Filter: `between`

Filter: `near`

Your turn review: `filter()`

Your turn review: `filter()`

Your turn review: `filter()`

Load `skimr` package

Using `select` with `cars93`

`select()` to select columns

Your turn: `select()`

`mutate` with `cars93`

`mutate()` to create a new column

Your turn: `mutate()`