Introduction to dplyr (package in the tidyverse)
dplyr is a powerful R package for data manipulation.
- It is one of the packages in the tidyverse
- It provides a coherent set of verbs (functions) that help you solve most data manipulation challenges

dplyr expects tidy data:
- each variable in its own column
- each observation in its own row

dplyr works with pipes: |> (or %>%)

dplyr functions share a common pattern:
- the first argument is a data frame
- subsequent arguments describe what to do with the data frame
- the result is a new data frame
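
The pattern above can be sketched with a short example. This is a minimal illustration, assuming the dplyr and openintro packages are installed and R is version 4.1 or later (needed for the native pipe |>). The object names nested and piped are made up for this example.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

# Nested call: the data frame is the first argument of the verb.
nested <- count(cars93, type)

# Piped call: the pipe passes cars93 in as the first argument of count().
piped <- cars93 |> count(type)

# Both forms return the same new data frame; cars93 itself is not modified.
identical(nested, piped)
is.data.frame(piped)
```

Writing the data frame first and piping it into the verb reads left to right, which becomes more helpful as more verbs are chained together.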
dplyr functions

function() | Action |
---|---|
glimpse() | get a glimpse of your data |
count() | count the unique values of one or more variables |
filter() | picks rows based on their values |
mutate() | creates new variables (columns) |
select() | picks variables (columns) |
summarize() | reduces multiple values down to a single statistic |
arrange() | changes the order of the rows based on their values |
group_by() | creates subsets of data to apply functions to |
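
As a rough sketch of how the verbs in the table fit together, the pipeline below chains several of them. It assumes dplyr and openintro are installed; the column names weight_per_seat, n_cars, mean_mpg_city, and mean_weight_per_seat are hypothetical names chosen for this illustration.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

summary_by_type <- cars93 |>
  filter(drive_train == "front") |>                 # pick rows
  select(type, mpg_city, weight, passengers) |>     # pick columns
  mutate(weight_per_seat = weight / passengers) |>  # create a new column
  group_by(type) |>                                 # one subset per car type
  summarize(                                        # one row per subset
    n_cars = n(),
    mean_mpg_city = mean(mpg_city),
    mean_weight_per_seat = mean(weight_per_seat)
  ) |>
  arrange(desc(mean_mpg_city))                      # reorder the rows

summary_by_type
```

Each step takes a data frame and returns a new one, so the steps can be added or removed one at a time while exploring.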
Functions we will cover today:
glimpse()
count()
The cars93 data set comes from the openintro package. Run ?cars93 for the help page.
- Each row is a car (observation)
- Variables (columns) contain information on a car
Variables:
- type
  - levels large, midsize, and small
- price
- mpg_city
- drive_train
  - levels 4WD, front, and rear
- passengers
- weight
fct refers to categories or levels of data (a factor)
int and dbl refer to “integer” and “double”, that is, numerical data
Output of glimpse() on cars93:
Rows: 54
Columns: 6
$ type <fct> small, midsize, midsize, midsize, midsize, large, large, m…
$ price <dbl> 15.9, 33.9, 37.7, 30.0, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1…
$ mpg_city <int> 25, 18, 19, 22, 22, 19, 16, 19, 16, 16, 21, 17, 20, 20, 29…
$ drive_train <fct> front, front, front, rear, front, front, rear, front, fron…
$ passengers <int> 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 6, 6, 6, 6, 5, 5, 6, 5, 6, 4…
$ weight <int> 2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935…
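
A listing in this format can be produced with glimpse(). The snippet below is a minimal sketch, assuming the dplyr and openintro packages are installed.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

# Print one line per column: name, type (<fct>, <dbl>, <int>), first values.
glimpse(cars93)

# The row and column counts are also available directly.
nrow(cars93)
ncol(cars93)
```

glimpse() is handy as a first step because it shows every column, even in wide data sets where printing the data frame would cut columns off.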
Counting the values of type:

type | n |
---|---|
midsize | 22 |
small | 21 |
large | 11 |
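
A table like this one can be produced with count(). This is a sketch, assuming dplyr and openintro are installed; sort = TRUE is used here because the rows above appear ordered from most to least frequent. The names type_counts and type_by_drive are made up for this example.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

# Count the cars of each type, most frequent first.
type_counts <- cars93 |> count(type, sort = TRUE)
type_counts

# count() also accepts more than one variable:
# one row per combination of type and drive_train that occurs in the data.
type_by_drive <- cars93 |> count(type, drive_train)
type_by_drive
```

Without sort = TRUE, the rows follow the order of the factor levels instead of the size of the counts.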
Which type has the highest number of observations?
Which drive_train has the highest number of observations?
filter() picks rows based on their values, using comparison operators and helper functions:
- == : equality
- > : greater than
- < : less than
- >= : greater than or equal to
- <= : less than or equal to
- != : not equal to
- between() : numeric variable in a specified range
- near() : compares two numeric vectors within a set tolerance
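
The comparisons above can be tried out on cars93. This is a minimal sketch, assuming dplyr and openintro are installed; the cut-off values (25, 20, 30, 15.9) and the object names are arbitrary choices for illustration.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

# Equality: keep only the midsize cars.
midsize <- cars93 |> filter(type == "midsize")

# Greater than or equal to: cars with at least 25 mpg in the city.
efficient <- cars93 |> filter(mpg_city >= 25)

# between() keeps values inside a range, including both end points.
mid_price <- cars93 |> filter(between(price, 20, 30))

# near() compares doubles within a small tolerance,
# which is safer than == for decimal numbers.
near_price <- cars93 |> filter(near(price, 15.9))

nrow(midsize)
```

Each call returns a new data frame that contains only the rows where the condition is TRUE.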
Combine criteria using logical operators:
- | : or
- & or , : and
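
Combining criteria could look like the sketch below. It assumes dplyr and openintro are installed; the conditions and object names are arbitrary examples.

```r
library(dplyr)
library(openintro)  # assumption: installed; provides the cars93 data set

# AND with &: small cars that also get more than 30 mpg in the city.
small_and_efficient <- cars93 |>
  filter(type == "small" & mpg_city > 30)

# AND with a comma: filter() treats comma-separated conditions as "and".
same_with_comma <- cars93 |>
  filter(type == "small", mpg_city > 30)

# OR with |: cars that are small, or get more than 30 mpg, or both.
small_or_efficient <- cars93 |>
  filter(type == "small" | mpg_city > 30)

identical(small_and_efficient, same_with_comma)
```

An OR filter can never return fewer rows than the matching AND filter, which makes a quick sanity check when a result looks suspicious.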
Filtering for midsize cars returns these 22 rows:

type | price | mpg_city | drive_train | passengers | weight |
---|---|---|---|---|---|
midsize | 33.9 | 18 | front | 5 | 3560 |
midsize | 37.7 | 19 | front | 6 | 3405 |
midsize | 30.0 | 22 | rear | 4 | 3640 |
midsize | 15.7 | 22 | front | 6 | 2880 |
midsize | 26.3 | 19 | front | 5 | 3495 |
midsize | 40.1 | 16 | front | 5 | 3935 |
midsize | 15.9 | 21 | front | 6 | 3195 |
midsize | 15.6 | 21 | front | 6 | 3080 |
midsize | 20.2 | 21 | front | 5 | 3325 |
midsize | 13.9 | 20 | front | 5 | 2885 |
midsize | 47.9 | 17 | rear | 5 | 4000 |
midsize | 28.0 | 18 | front | 5 | 3510 |
midsize | 35.2 | 18 | rear | 4 | 3515 |
midsize | 34.3 | 17 | front | 6 | 3695 |
midsize | 61.9 | 19 | rear | 5 | 3525 |
midsize | 14.9 | 19 | rear | 5 | 3610 |
midsize | 26.1 | 18 | front | 5 | 3730 |
midsize | 21.5 | 21 | front | 5 | 3200 |
midsize | 16.3 | 23 | front | 5 | 2890 |
midsize | 18.5 | 19 | front | 5 | 3450 |
midsize | 18.2 | 22 | front | 5 | 3030 |
midsize | 26.7 | 20 | front | 5 | 3245 |
- Extract the cars that have front drive_train
- Extract the cars that are type large AND have 4WD drive_train
- Extract the cars that are type large OR have 4WD drive_train
glimpse()
count()
One question on an OpenIntro data set
Seven questions on dplyr functions