More dplyr (package in the tidyverse)
dplyr
expects tidy data
each variable in its own column
each observation in its own row
works with pipes |>
functions covered last time
glimpse()
count()
dplyr
first argument is a data frame (aka tibble)
Subsequent arguments describe what to do with the data frame
dplyr
function() |
Action |
|---|---|
glimpse() |
get a glimpse of your data |
count() |
count the unique values of one or more variables |
filter() |
picks rows based on their values |
mutate() |
creates new variables (columns) |
select() |
picks variables (columns) |
summarize() |
reduces multiple values down to a single statistic |
arrange() |
changes the order of the rows based on their values |
group_by() |
create subsets of data to apply functions to |
library()
Replace ??? in code chunk below to access:
functions in tidyverse (includes dplyr)
data: cars93
glimpse()
Replace ??? in code chunk below for an:
cars93
count()
Replace ??? in code chunk below:
mpg_city do the most cars get?== equality
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
between numeric variable in a specified range
near compare 2 numeric vectors. Set tolerance
| or
& and ,
between
# A tibble: 23 × 6
type price mpg_city drive_train passengers weight
<fct> <dbl> <int> <fct> <int> <int>
1 small 15.9 25 front 5 2705
2 midsize 30 22 rear 4 3640
3 midsize 15.7 22 front 6 2880
4 large 20.8 19 front 6 3470
5 large 23.7 16 rear 6 4105
6 midsize 26.3 19 front 5 3495
7 midsize 15.9 21 front 6 3195
8 large 18.8 17 rear 6 3910
9 large 18.4 20 front 6 3515
10 large 29.5 20 front 6 3570
# ℹ 13 more rows
near
# A tibble: 9 × 6
type price mpg_city drive_train passengers weight
<fct> <dbl> <int> <fct> <int> <int>
1 small 15.9 25 front 5 2705
2 midsize 15.7 22 front 6 2880
3 midsize 15.9 21 front 6 3195
4 midsize 15.6 21 front 6 3080
5 small 12.2 29 front 5 2295
6 small 12.1 42 front 4 2350
7 midsize 13.9 20 front 5 2885
8 midsize 14.9 19 rear 5 3610
9 midsize 16.3 23 front 5 2890
filter()
Extract the cars with the most common mpg_city
How many observations will you have?
filter()
mpg_city that are large (type)filter()
mpg_city OR are large (type)dplyr new functions we will cover today:
select()
mutate()
skimr
skim() summary statisticsin console:
?mutate?dplyr::filter()skimr package| Name | cars93 |
| Number of rows | 54 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| type | 0 | 1 | FALSE | 3 | mid: 22, sma: 21, lar: 11 |
| drive_train | 0 | 1 | FALSE | 3 | fro: 43, rea: 9, 4WD: 2 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 19.99 | 11.51 | 7.4 | 10.95 | 17.25 | 26.25 | 61.9 | ▇▅▂▁▁ |
| mpg_city | 0 | 1 | 23.31 | 6.62 | 16.0 | 19.00 | 21.00 | 28.00 | 46.0 | ▇▂▂▁▁ |
| passengers | 0 | 1 | 5.11 | 0.69 | 4.0 | 5.00 | 5.00 | 6.00 | 6.0 | ▃▁▇▁▅ |
| weight | 0 | 1 | 3037.41 | 657.66 | 1695.0 | 2452.50 | 3197.50 | 3522.50 | 4105.0 | ▂▆▃▇▃ |
How many observations?
How many variables are categorical (fct)?
How many variables are numerical?
Are there any missing values for any of the variables?
What is the mean (average) price ?
What is the maximum mpg_city?
select with cars93
select is a function from the dplyr package (part of the tidyverse)
Used to select or rename columns
select() to select columnsSelect the variables (columns) that are factors
select()
Select the variables that are numeric
mutate with cars93
mutate is a function from the dplyr package (part of the tidyverse)
Used to create or modify columns
mutate() to create a new columnCreate a new column price_d
mutate()
Create a new column weight_kg that is weight in kiligrams
To convert pounds (lbs) to kilograms (kgs), use the following formula:
\(kgs = lbs \times 0.45\)
glimpse()
count()
select()
mutate()
Two question on a result of code run on dataset from an OpenIntro Data Set
Seven questions on dplyr functions