+ - 0:00:00
Notes for current slide
Notes for next slide

Data Wrangling with dplyr

1 / 53

Tidyverse

There are a few key ideas to be aware of about how the tidyverse works in general before we dive into dplyr

2 / 53

Tidyverse

There are a few key ideas to be aware of about how the tidyverse works in general before we dive into dplyr

  1. Packages are designed to be like grammars for their task. You can string these grammatical elements together to form more complex statements, just like with language.
3 / 53

Tidyverse

There are a few key ideas to be aware of about how the tidyverse works in general before we dive into dplyr

  1. Packages are designed to be like grammars for their task. You can string these grammatical elements together to form more complex statements, just like with language.

  2. The first argument of (basically) every function is data. This is very handy, especially when it comes to piping.

4 / 53

Tidyverse

There are a few key ideas to be aware of about how the tidyverse works in general before we dive into dplyr

  1. Packages are designed to be like grammars for their task. You can string these grammatical elements together to form more complex statements, just like with language.

  2. The first argument of (basically) every function is data. This is very handy, especially when it comes to piping.

  3. Variable names are usually not quoted (read more here)

5 / 53

Palmer penguins

library(palmerpenguins)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen,…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, …
## $ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
6 / 53

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of core verbs that help you solve the most common data manipulation challenges

7 / 53

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of core verbs that help you solve the most common data manipulation challenges


Manipulating observations

  • filter() picks cases based on their values.

  • arrange() changes the ordering of the rows.

8 / 53

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of core verbs that help you solve the most common data manipulation challenges


Manipulating observations

  • filter() picks cases based on their values.

  • arrange() changes the ordering of the rows.


Manipulating variables

  • select() picks variables based on their names.

  • mutate() adds new variables that are functions of existing variables

9 / 53

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of core verbs that help you solve the most common data manipulation challenges


Manipulating observations

  • filter() picks cases based on their values.

  • arrange() changes the ordering of the rows.


Manipulating variables

  • select() picks variables based on their names.

  • mutate() adds new variables that are functions of existing variables


Summarizing data

  • summarise() reduces multiple values down to a single summary.
10 / 53

Concept Map: dplyr

Source: rstudio/concept-maps

11 / 53

A review of pipes

x %>% f(y) is equivalent to f(x, y)

12 / 53

A review of pipes

x %>% f(y) is equivalent to f(x, y)


R Code

penguins %>%
filter(species == "Gentoo") %>%
select(bill_length_mm, bill_depth_mm) %>%
arrange(desc(bill_length_mm))
13 / 53

A review of pipes

x %>% f(y) is equivalent to f(x, y)


R Code

penguins %>%
filter(species == "Gentoo") %>%
select(bill_length_mm, bill_depth_mm) %>%
arrange(desc(bill_length_mm))

Translated into English

start with penguins data *AND THEN*
filter to include only observations from Gentoo penguins *AND THEN*
select only the columns `bill_length_mm` and `bill_depth_mm` *AND THEN*
arrange observations by descending order of `bill_length_mm`
14 / 53

A review of pipes

x %>% f(y) is equivalent to f(x, y)


R Code

penguins %>%
filter(species == "Gentoo") %>%
select(bill_length_mm, bill_depth_mm) %>%
arrange(desc(bill_length_mm))

Translated into English

start with penguins data *AND THEN*
filter to include only observations from Gentoo penguins *AND THEN*
select only the columns `bill_length_mm` and `bill_depth_mm` *AND THEN*
arrange observations by descending order of `bill_length_mm`

Read more on piping: https://magrittr.tidyverse.org/reference/pipe.html

15 / 53

Concept Map: %>%

Source: rstudio/concept-maps

16 / 53

Manipulating observations
(rows)

17 / 53

filter()

Subset observations (rows) with filter()

18 / 53

filter()

Artwork by @allison_horst

19 / 53

filter()

Logical operators

Source: R for Data Science

20 / 53

filter()

Comparisons

Operator Description Usage
< less than x < y
<= less than or equal to x <= y
> greater than x > y
>= greater than or equal to x >= y
== exactly equal to x == y
!= not equal to x != y
%in% group membership x %in% y
is.na is missing is.na(x)
!is.na is not missing !is.na(x)

Source: Alison Hill

21 / 53

filter()

filter(.data, ...)

.data = a data frame or tibble

. . . = Expressions that return a logical value, and are defined in terms of the variables in .data.

If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate to TRUE are kept.

penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55)
## # A tibble: 3 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 59.6 17 230 6050 male 2007
## 2 Gentoo Biscoe 55.9 17 228 5600 male 2009
## 3 Gentoo Biscoe 55.1 16 230 5850 male 2009
penguins %>%
filter(species %in% c("Adelie", "Gentoo"),
island %in% c("Dream", "Torgersen"))
## # A tibble: 108 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
## 3 Adelie Torgersen 40.3 18 195 3250 female 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 98 more rows
22 / 53

Concept Map: filter()

Source: rstudio/concept-maps

23 / 53

arrange()

Arrange rows by column values with arrange()

24 / 53

arrange()

arrange(.data, ...)

.data = a data frame or tibble

. . . = Variables to sort by. Use desc() to sort a variable in descending order.

penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g)
## # A tibble: 3 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 55.9 17 228 5600 male 2009
## 2 Gentoo Biscoe 55.1 16 230 5850 male 2009
## 3 Gentoo Biscoe 59.6 17 230 6050 male 2007
penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(desc(body_mass_g))
## # A tibble: 3 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 59.6 17 230 6050 male 2007
## 2 Gentoo Biscoe 55.1 16 230 5850 male 2009
## 3 Gentoo Biscoe 55.9 17 228 5600 male 2009
25 / 53

Your turn 1

05:00
  1. Import the file pragmatic_scales_data.csv (use best practices of a project-oriented workflow). Save it to an object called ps_data. This time we'll save it as a tibble (which I've done for you).

  2. Filter rows for cases in the "No Label" condition and arrange the resulting observations by descending order of age.

  3. Select observations from the “No Label” condition for kids 3 years old or younger.

  4. Select observations for kids between the ages of 3 and 4. Save the result to a new object called ps_filtered. (For an extra challenge, look up the documentation for between() and use this function instead of comparison operators).

26 / 53

Solution

# Q1.
ps_data <- rio::import(here::here("data", "pragmatic_scales_data.csv")) %>%
as_tibble()
# Q2.
ps_data %>%
filter(condition == "No Label") %>%
arrange(desc(age))
## # A tibble: 288 x 5
## subid item correct age condition
## <chr> <chr> <int> <dbl> <chr>
## 1 SCH23 faces 0 4.82 No Label
## 2 SCH23 houses 0 4.82 No Label
## 3 SCH23 pasta 0 4.82 No Label
## 4 SCH23 beds 0 4.82 No Label
## 5 SCH1 faces 0 4.82 No Label
## 6 SCH1 houses 0 4.82 No Label
## 7 SCH1 pasta 0 4.82 No Label
## 8 SCH1 beds 0 4.82 No Label
## 9 SCH19 faces 0 4.79 No Label
## 10 SCH19 houses 0 4.79 No Label
## # … with 278 more rows
# Q3.
ps_data %>%
filter(condition == "No Label" & age >= 3)
## # A tibble: 192 x 5
## subid item correct age condition
## <chr> <chr> <int> <dbl> <chr>
## 1 SCH35 faces 0 3.02 No Label
## 2 SCH35 houses 0 3.02 No Label
## 3 SCH35 pasta 0 3.02 No Label
## 4 SCH35 beds 0 3.02 No Label
## 5 MSCH40 faces 0 3.02 No Label
## 6 MSCH40 houses 1 3.02 No Label
## 7 MSCH40 pasta 0 3.02 No Label
## 8 MSCH40 beds 1 3.02 No Label
## 9 SCH34 faces 0 3.06 No Label
## 10 SCH34 houses 0 3.06 No Label
## # … with 182 more rows
# Q4.
ps_filtered <- ps_data %>%
filter(age >= 3 & age <= 4)
ps_filtered
## # A tibble: 204 x 5
## subid item correct age condition
## <chr> <chr> <int> <dbl> <chr>
## 1 T10 faces 0 3 Label
## 2 T10 houses 1 3 Label
## 3 T10 pasta 1 3 Label
## 4 T10 beds 1 3 Label
## 5 T3 faces 1 3.09 Label
## 6 T3 houses 1 3.09 Label
## 7 T3 pasta 1 3.09 Label
## 8 T3 beds 1 3.09 Label
## 9 T6 faces 1 3.1 Label
## 10 T6 houses 1 3.1 Label
## # … with 194 more rows
# Q4.
ps_filtered <- ps_data %>%
filter(between(age, 3, 4))
ps_filtered
## # A tibble: 204 x 5
## subid item correct age condition
## <chr> <chr> <int> <dbl> <chr>
## 1 T10 faces 0 3 Label
## 2 T10 houses 1 3 Label
## 3 T10 pasta 1 3 Label
## 4 T10 beds 1 3 Label
## 5 T3 faces 1 3.09 Label
## 6 T3 houses 1 3.09 Label
## 7 T3 pasta 1 3.09 Label
## 8 T3 beds 1 3.09 Label
## 9 T6 faces 1 3.1 Label
## 10 T6 houses 1 3.1 Label
## # … with 194 more rows
27 / 53

Manipulating variables
(columns)

28 / 53

select()

Select columns with select()

29 / 53

select()

select(.data, ...)

.data = a data frame or tibble

. . . = One or more unquoted expressions separated by commas.

Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.

penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g) %>%
select(species:bill_depth_mm)
## # A tibble: 3 x 4
## species island bill_length_mm bill_depth_mm
## <fct> <fct> <dbl> <dbl>
## 1 Gentoo Biscoe 55.9 17
## 2 Gentoo Biscoe 55.1 16
## 3 Gentoo Biscoe 59.6 17
penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g) %>%
select(species, starts_with("bill_"))
## # A tibble: 3 x 3
## species bill_length_mm bill_depth_mm
## <fct> <dbl> <dbl>
## 1 Gentoo 55.9 17
## 2 Gentoo 55.1 16
## 3 Gentoo 59.6 17
penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g) %>%
select(ends_with("_mm"))
## # A tibble: 3 x 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 55.9 17 228
## 2 55.1 16 230
## 3 59.6 17 230
30 / 53

Selection helpers

Selection helpers work in concert with select() to make it easier to select specific groups of variables.

31 / 53

Selection helpers

Selection helpers work in concert with select() to make it easier to select specific groups of variables.

Here are some commonly useful ones


everything(): Matches all variables.

last_col(): Select last variable, possibly with an offset.

starts_with(): Starts with a prefix.

ends_with(): Ends with a suffix.

contains(): Contains a literal string.

🔗 https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html#overview-of-selection-features

32 / 53

Concept Map: select()

Source: rstudio/concept-maps

33 / 53

mutate()

Create (or overwrite) variables with mutate()

34 / 53

mutate()

Artwork by @allison_horst

35 / 53

mutate()

mutate(.data, ...)

.data = a data frame or tibble

. . . = Name-value pairs. The name gives the name of the column in the output.

penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g) %>%
select(starts_with("bill_")) %>%
mutate(bill_length_m = bill_length_mm/1000)
## # A tibble: 3 x 3
## bill_length_mm bill_depth_mm bill_length_m
## <dbl> <dbl> <dbl>
## 1 55.9 17 0.0559
## 2 55.1 16 0.0551
## 3 59.6 17 0.0596
penguins %>%
filter(species == "Gentoo" & bill_length_mm > 55) %>%
arrange(body_mass_g) %>%
select(starts_with("bill_")) %>%
mutate(bill_length_mm = as.character(bill_length_mm))
## # A tibble: 3 x 2
## bill_length_mm bill_depth_mm
## <chr> <dbl>
## 1 55.9 17
## 2 55.1 16
## 3 59.6 17
36 / 53

Concept Map: mutate()

Source: rstudio/concept-maps

37 / 53

Your turn 2

04:00
  1. In ps_data, select only the variables age and condition.

  2. As we did with indexing in base R, you can use the minus sign (-) to "de-select" columns. Keep everything in ps_data except subid and condition.

  3. Select the columns correct and condition without naming them, using their positions or de-selecting other variables.

  4. Use mutate() to convert condition from type "character" to type "factor". Then check that you've done this successfully. Hint: what function could you pipe to directly to check this?

38 / 53

Solution

ps_data %>%
select(age, condition)
## # A tibble: 588 x 2
## age condition
## <dbl> <chr>
## 1 2 Label
## 2 2 Label
## 3 2 Label
## 4 2 Label
## 5 2.13 Label
## 6 2.13 Label
## 7 2.13 Label
## 8 2.13 Label
## 9 2.32 Label
## 10 2.32 Label
## # … with 578 more rows
ps_data %>%
select(-c(subid, condition))
## # A tibble: 588 x 3
## item correct age
## <chr> <int> <dbl>
## 1 faces 1 2
## 2 houses 1 2
## 3 pasta 0 2
## 4 beds 0 2
## 5 beds 0 2.13
## 6 faces 0 2.13
## 7 houses 1 2.13
## 8 pasta 1 2.13
## 9 pasta 0 2.32
## 10 faces 0 2.32
## # … with 578 more rows
ps_data %>%
select(starts_with("c"))
## # A tibble: 588 x 2
## correct condition
## <int> <chr>
## 1 1 Label
## 2 1 Label
## 3 0 Label
## 4 0 Label
## 5 0 Label
## 6 0 Label
## 7 1 Label
## 8 1 Label
## 9 0 Label
## 10 0 Label
## # … with 578 more rows
ps_data %>%
mutate(condition = as.factor(condition)) %>%
glimpse()
## Rows: 588
## Columns: 5
## $ subid <chr> "M22", "M22", "M22", "M22", "T22", "T22", "T22", "T22", "T17", "T17", "T17", "T17", …
## $ item <chr> "faces", "houses", "pasta", "beds", "beds", "faces", "houses", "pasta", "pasta", "fa…
## $ correct <int> 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ age <dbl> 2.00, 2.00, 2.00, 2.00, 2.13, 2.13, 2.13, 2.13, 2.32, 2.32, 2.32, 2.32, 2.38, 2.38, …
## $ condition <fct> Label, Label, Label, Label, Label, Label, Label, Label, Label, Label, Label, Label, …
39 / 53

Summarizing data

40 / 53

summarize()

summarize() reduces your raw data frame into to a smaller summary data frame that only contains the variables resulting from the summary functions that you specify within summarize()

41 / 53

summarize()

summarize() reduces your raw data frame into to a smaller summary data frame that only contains the variables resulting from the summary functions that you specify within summarize()


Summary functions take vectors as inputs and return single values as outputs

Common examples are mean(), sd(), max(), min(), sum(), etc...

42 / 53

summarize()

summarize(.data, ...)

.data = a data frame or tibble

. . . = Name-value pairs of summary functions. The name will be the name of the variable in the result.

penguins %>%
summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
max_flipper_length = max(flipper_length_mm, na.rm = TRUE))
## # A tibble: 1 x 2
## mean_bill_length max_flipper_length
## <dbl> <int>
## 1 43.9 231
43 / 53

group_by()

group_by() creates groups based on one or more variables in the data, and this affects any downstream operations -- most commonly, summarize()

44 / 53

group_by()

What happens if we combine group_by() and summarize()?

45 / 53

summarize()

Let's see a couple examples of how we can combine group_by() and summarize()

penguins %>%
group_by(species) %>%
summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## species mean_bill_length
## <fct> <dbl>
## 1 Adelie 38.8
## 2 Chinstrap 48.8
## 3 Gentoo 47.5
penguins %>%
group_by(species, island) %>%
summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## # A tibble: 5 x 3
## # Groups: species [3]
## species island mean_bill_length
## <fct> <fct> <dbl>
## 1 Adelie Biscoe 39.0
## 2 Adelie Dream 38.5
## 3 Adelie Torgersen 39.0
## 4 Chinstrap Dream 48.8
## 5 Gentoo Biscoe 47.5
46 / 53

Concept Map: summarize()

Source: rstudio/concept-maps

47 / 53

Your turn 3

07:00
  1. From ps_data, get the total number of correct trials for each subject and call this variable num_correct. Hint: you can use the summary function sum().

  2. Now get the total number of correct trials (num_trials) for each unique combination of condition and item and arrange the resulting rows by descending order of num_correct. Which combination of condition and item had the most/least correct responses?

  3. Lastly, calculate the proportion of correct responses for each condition and call this variable prop_correct (hint: because correct is coded as 0 and 1, you can do this by taking the mean of correct). In the same pipeline, create a bar plot that shows the differences between the mean proportion of correct responses between the two conditions, and color the bars by condition. What do you notice? (hint: it might be more straightforward to use geom_col than geom_bar)

48 / 53

Solution

# Q1.
ps_data %>%
group_by(subid) %>%
summarize(num_correct = sum(correct))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 147 x 2
## subid num_correct
## <chr> <int>
## 1 C1 4
## 2 C10 2
## 3 C11 3
## 4 C12 3
## 5 C13 2
## 6 C14 3
## 7 C15 4
## 8 C16 2
## 9 C17 2
## 10 C18 3
## # … with 137 more rows
# Q2.
ps_data %>%
group_by(condition, item) %>%
summarize(num_correct = sum(correct)) %>%
arrange(desc(num_correct))
## `summarise()` regrouping output by 'condition' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups: condition [2]
## condition item num_correct
## <chr> <chr> <int>
## 1 Label beds 60
## 2 Label pasta 55
## 3 Label faces 49
## 4 Label houses 41
## 5 No Label pasta 18
## 6 No Label beds 15
## 7 No Label faces 13
## 8 No Label houses 12
Q3.
ps_data %>%
group_by(condition) %>%
summarize(prop_correct = mean(correct)) %>%
ggplot(aes(x = prop_correct, y = condition, fill = condition)) +
geom_col() +
coord_flip()

49 / 53

Column-wise operations

What if we want to apply dplyr verbs across multiple columns simultaneously? Check out these slides for more 👇

🔗 columnwise-operations-dplyr.netlify.app

50 / 53

Q & A

05:00
51 / 53

Next up...

Data tidying with tidyr

52 / 53

Break!

10:00
53 / 53

Tidyverse

There are a few key ideas to be aware of about how the tidyverse works in general before we dive into dplyr

2 / 53
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow