More dplyr (package in the tidyverse)
dplyr
expects tidy data
each variable in its own column
each observation in its own row
works with pipes |>
functions covered last time
glimpse()
count()
dplyr
first argument is a data frame (aka tibble)
Subsequent arguments describe what to do with the data frame
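A minimal sketch of the "data frame first" convention and how it pairs with the pipe. The tibble cars_small is a hypothetical stand-in built from a few of the cars93 rows printed in these slides; it is not the full data set.

```r
library(dplyr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type     = c("small", "midsize", "midsize", "large", "large", "midsize"),
  mpg_city = c(25L, 22L, 22L, 19L, 16L, 19L)
)

# Data frame as the first argument, then what to do with it
direct <- count(cars_small, type)

# Same call with the pipe: |> passes cars_small in as the first argument
piped <- cars_small |> count(type)

# Both forms give the same result
identical(direct, piped)
```

Because every dplyr verb takes the data frame first, the piped form reads left to right as a sequence of steps.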
dplyr
function() | Action |
---|---|
glimpse() | get a glimpse of your data |
count() | count the unique values of one or more variables |
filter() | picks rows based on their values |
mutate() | creates new variables (columns) |
select() | picks variables (columns) |
summarize() | reduces multiple values down to a single statistic |
arrange() | changes the order of the rows based on their values |
group_by() | create subsets of data to apply functions to |
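The last three verbs in the table are not practiced elsewhere in these slides, so here is a rough sketch of how they might chain together. cars_small is a hypothetical mini data set using rows printed in these slides, and the column names n_cars and mean_mpg are made up for illustration.

```r
library(dplyr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type     = c("small", "midsize", "midsize", "large", "large", "midsize"),
  price    = c(15.9, 30, 15.7, 20.8, 23.7, 26.3),
  mpg_city = c(25L, 22L, 22L, 19L, 16L, 19L)
)

by_type <- cars_small |>
  group_by(type) |>                                    # one subset per type
  summarize(n_cars = n(), mean_mpg = mean(mpg_city)) |> # one row per subset
  arrange(desc(mean_mpg))                              # highest average mpg first

by_type
```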
library()
Replace ??? in code chunk below to access:
functions in tidyverse (includes dplyr)
data: cars93
glimpse()
Replace ??? in code chunk below for an:
cars93
count()
Replace ??? in code chunk below:
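Which mpg_city do the most cars get?
operator | meaning |
---|---|
== | equality |
> | greater than |
< | less than |
>= | greater than or equal to |
<= | less than or equal to |
!= | not equal to |
between() | numeric variable in a specified range |
near() | compare 2 numeric vectors, with a settable tolerance |
\| | or |
& (or a comma between conditions) | and |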
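A short sketch of these operators inside filter(). cars_small is a hypothetical mini data set built from rows printed in these slides, and the cutoffs (20 mpg, 2800 to 3500 lbs, a price near 16) are arbitrary values picked for illustration.

```r
library(dplyr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type     = c("small", "midsize", "midsize", "large", "large", "midsize"),
  price    = c(15.9, 30, 15.7, 20.8, 23.7, 26.3),
  mpg_city = c(25L, 22L, 22L, 19L, 16L, 19L),
  weight   = c(2705L, 3640L, 2880L, 3470L, 4105L, 3495L)
)

# Comparison operator: more than 20 mpg in the city
more_than_20 <- cars_small |> filter(mpg_city > 20)

# between(): weight from 2800 to 3500, both endpoints included
in_range <- cars_small |> filter(between(weight, 2800, 3500))

# near(): price within 0.5 of 16 (tol sets the tolerance)
close_to_16 <- cars_small |> filter(near(price, 16, tol = 0.5))

# AND: a comma between conditions behaves like &
and_comma <- cars_small |> filter(type == "large", mpg_city < 20)
and_amp   <- cars_small |> filter(type == "large" & mpg_city < 20)

# OR: keep a row if either condition holds
either <- cars_small |> filter(type == "large" | mpg_city >= 25)
```

Note that the OR result is never smaller than the AND result for the same pair of conditions, since every row that satisfies both also satisfies at least one.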
between
# A tibble: 23 × 6
type price mpg_city drive_train passengers weight
<fct> <dbl> <int> <fct> <int> <int>
1 small 15.9 25 front 5 2705
2 midsize 30 22 rear 4 3640
3 midsize 15.7 22 front 6 2880
4 large 20.8 19 front 6 3470
5 large 23.7 16 rear 6 4105
6 midsize 26.3 19 front 5 3495
7 midsize 15.9 21 front 6 3195
8 large 18.8 17 rear 6 3910
9 large 18.4 20 front 6 3515
10 large 29.5 20 front 6 3570
# ℹ 13 more rows
near
# A tibble: 9 × 6
type price mpg_city drive_train passengers weight
<fct> <dbl> <int> <fct> <int> <int>
1 small 15.9 25 front 5 2705
2 midsize 15.7 22 front 6 2880
3 midsize 15.9 21 front 6 3195
4 midsize 15.6 21 front 6 3080
5 small 12.2 29 front 5 2295
6 small 12.1 42 front 4 2350
7 midsize 13.9 20 front 5 2885
8 midsize 14.9 19 rear 5 3610
9 midsize 16.3 23 front 5 2890
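As a separate illustration of why near() exists (this is not the code behind the outputs above), the sketch below compares it with == on a value that picks up floating-point rounding error. The second call uses made-up numbers to show the tol argument.

```r
library(dplyr)

x <- sqrt(2)^2   # mathematically 2, but stored with a tiny rounding error

x == 2           # FALSE: exact comparison fails
near(x, 2)       # TRUE: equal within the default tolerance

# A wider tolerance: which values are within 3 of 15?
within_3 <- near(c(15.9, 18.4), 15, tol = 3)
within_3         # TRUE FALSE
```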
filter()
Extract the cars with the most common mpg_city
How many observations will you have?
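One possible way to approach this kind of question, sketched on a hypothetical mini data set (rows copied from the output printed in these slides) rather than the full cars93: count the values first, then filter on the most common one. The number of rows kept by filter() should match the n reported by count().

```r
library(dplyr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type     = c("small", "midsize", "midsize", "large", "large"),
  mpg_city = c(25L, 22L, 22L, 19L, 16L)
)

# sort = TRUE puts the most common value in the first row
mpg_counts <- cars_small |> count(mpg_city, sort = TRUE)
top_mpg <- mpg_counts$mpg_city[1]

# Keep only the cars with that mpg_city
most_common <- cars_small |> filter(mpg_city == top_mpg)

# The filtered row count equals the n from count()
nrow(most_common) == mpg_counts$n[1]
```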
filter()
Extract the cars with the most common mpg_city that are large (type)
filter()
Extract the cars with the most common mpg_city OR that are large (type)
dplyr
new functions we will cover today:
select()
mutate()
skimr
skim()
summary statistics
in console:
?mutate
?dplyr::filter()
skimr package
Name | cars93 |
Number of rows | 54 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
type | 0 | 1 | FALSE | 3 | mid: 22, sma: 21, lar: 11 |
drive_train | 0 | 1 | FALSE | 3 | fro: 43, rea: 9, 4WD: 2 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
price | 0 | 1 | 19.99 | 11.51 | 7.4 | 10.95 | 17.25 | 26.25 | 61.9 | ▇▅▂▁▁ |
mpg_city | 0 | 1 | 23.31 | 6.62 | 16.0 | 19.00 | 21.00 | 28.00 | 46.0 | ▇▂▂▁▁ |
passengers | 0 | 1 | 5.11 | 0.69 | 4.0 | 5.00 | 5.00 | 6.00 | 6.0 | ▃▁▇▁▅ |
weight | 0 | 1 | 3037.41 | 657.66 | 1695.0 | 2452.50 | 3197.50 | 3522.50 | 4105.0 | ▂▆▃▇▃ |
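A minimal sketch of a skim() call that would produce a summary in this shape, assuming the skimr package is installed (it is loaded separately from the tidyverse). cars_small is a hypothetical mini data set, so its numbers will not match the cars93 summary above.

```r
library(dplyr)
library(skimr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type     = factor(c("small", "midsize", "midsize", "large", "large", "midsize")),
  price    = c(15.9, 30, 15.7, 20.8, 23.7, 26.3),
  mpg_city = c(25L, 22L, 22L, 19L, 16L, 19L)
)

# One summary row per variable, grouped by variable type when printed
skimmed <- cars_small |> skim()
skimmed
```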
How many observations?
How many variables are categorical (fct)?
How many variables are numerical?
Are there any missing values for any of the variables?
What is the mean (average) price?
What is the maximum mpg_city?
select with cars93
select is a function from the dplyr package (part of the tidyverse)
Used to select or rename columns
select() to select columns
Select the variables (columns) that are factors
select()
Select the variables that are numeric
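A sketch of a few ways select() can be used, on a hypothetical mini data set built from rows printed in these slides. Columns can be picked by name or by type with where(); the new name body_type is made up to show renaming.

```r
library(dplyr)

# Hypothetical mini data set (a few rows copied from the cars93 output in these slides)
cars_small <- tibble(
  type        = factor(c("small", "midsize", "midsize", "large")),
  price       = c(15.9, 30, 15.7, 20.8),
  mpg_city    = c(25L, 22L, 22L, 19L),
  drive_train = factor(c("front", "rear", "front", "front"))
)

# Pick columns by name
by_name <- cars_small |> select(type, drive_train)

# Pick columns by type
factors  <- cars_small |> select(where(is.factor))
numerics <- cars_small |> select(where(is.numeric))

# Rename while selecting: new_name = old_name
renamed <- cars_small |> select(body_type = type, price)
```

Selecting by type avoids retyping column names and keeps working if more columns of that type are added later.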
mutate with cars93
mutate is a function from the dplyr package (part of the tidyverse)
Used to create or modify columns
mutate() to create a new column
Create a new column price_d
mutate()
Create a new column weight_kg that is weight in kilograms
To convert pounds (lbs) to kilograms (kgs), use the following formula:
\(kgs = lbs \times 0.45\)
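The conversion above might be sketched like this on a hypothetical mini data set (weights copied from rows printed in these slides). The second step, which rounds the new column, is an added illustration of modifying an existing column and is not part of the exercise.

```r
library(dplyr)

# Hypothetical mini data set (weights in pounds, copied from rows in these slides)
cars_small <- tibble(
  type   = c("small", "midsize", "midsize"),
  weight = c(2705L, 3640L, 2880L)
)

# Create a new column using the slide's conversion factor of 0.45
with_kg <- cars_small |> mutate(weight_kg = weight * 0.45)

# Modify an existing column by reusing its name
rounded <- with_kg |> mutate(weight_kg = round(weight_kg))
```

mutate() keeps all existing columns and adds the new one at the end, so with_kg has three columns.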
glimpse()
count()
select()
mutate()
Two questions on the result of code run on a data set from OpenIntro
Seven questions on dplyr functions