+ - 0:00:00
Notes for current slide
Notes for next slide

Data Reshaping with tidyr

1 / 27

What is "tidy" data?

There are three interrelated rules which make a dataset tidy

2 / 27

What is "tidy" data?

There are three interrelated rules which make a dataset tidy

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each value must have its own cell.

3 / 27

What is "tidy" data?

There are three interrelated rules which make a dataset tidy

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each value must have its own cell.

Here's an image that shows these rules visually

Source: R for Data Science

4 / 27

Pivoting

5 / 27

Pivoting

Most of the data you encounter in the real world will unfortunately most likely not be in tidy format.

6 / 27

Pivoting

Most of the data you encounter in the real world will unfortunately most likely not be in tidy format.

This means you will need to reshape your data in order to tidy it.

7 / 27

Pivoting

Most of the data you encounter in the real world will unfortunately most likely not be in tidy format.

This means you will need to reshape your data in order to tidy it.


This is where the tidyr package comes in. It contains two crucial functions that you can use to reshape your data...

8 / 27

Pivoting

Most of the data you encounter in the real world will unfortunately most likely not be in tidy format.

This means you will need to reshape your data in order to tidy it.


This is where the tidyr package comes in. It contains two crucial functions that you can use to reshape your data...

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows.


Note: The old versions of these functions were called gather() and spread(), but were very confusing to use, and these newer functions are much more user-friendly.

9 / 27

Pivoting

Source: Mara Averick and Garrick Aden-Buie

10 / 27

Example data

Our example data is built into to the tidyr package.

It represents the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000

11 / 27

Example data

Our example data is built into to the tidyr package.

It represents the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000


We need to reshape this data to a longer format

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
12 / 27

Example data

Our example data is built into to the tidyr package.

It represents the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000


We need to reshape this data to a longer format

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
  • The column names 1999 and 2000 are not variables in our data, instead they represent values of the year variable

  • The values in the 1999 and 2000 columns represent values of the cases variable

  • Each row represents two observations, not one

13 / 27

pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
14 / 27

pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

We need 3 pieces of information to reshape (pivot) this data into a tidy format

15 / 27

pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

We need 3 pieces of information to reshape (pivot) this data into a tidy format

  • The set of columns in table 4a whose names are actually values, not variables. In this example, those are the columns 1999 and 2000.
16 / 27

pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

We need 3 pieces of information to reshape (pivot) this data into a tidy format

  • The set of columns in table 4a whose names are actually values, not variables. In this example, those are the columns 1999 and 2000.

  • The name of the variable to move the column names to. Here it is year.

17 / 27

pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

We need 3 pieces of information to reshape (pivot) this data into a tidy format

  • The set of columns in table 4a whose names are actually values, not variables. In this example, those are the columns 1999 and 2000.

  • The name of the variable to move the column names to. Here it is year.

  • The name of the variable to move the column values to. Here it’s cases.

18 / 27

pivot_longer()

pivot_longer(data, cols, names_to, values_to)

data = A data frame to pivot

cols = Columns to pivot into longer format

names_to = A string specifying the name of the column to create from the data stored in the column names of data

values_to = A string specifying the name of the column to create from the data stored in cell values

table4a %>%
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases")

Output: A longer data frame

## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766

Notes

  • We had to put the variable names 1999 and 2000 in backticks because column names are normally not allowed to start with numbers

  • year and cases do not exist in table4a so we put their names in quotes.

Here's a visual representation of what we just did.

19 / 27

pivot_wider()

Below is a different dataset (table2) that needs to be reshaped to a wider format.

A single observation is a country in a year, but each observation is spread across two rows.













## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
20 / 27

pivot_wider()

Below is a different dataset (table2) that needs to be reshaped to a wider format.

A single observation is a country in a year, but each observation is spread across two rows.













## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583













To tidy this up, we need two pieces of information:

-The column to take variable names from (type)

-The column to take values from (count)

21 / 27

pivot_wider()

pivot_wider(data, names_from, values_from)

data = A data frame to pivot

names_from = Which column (or columns) to get the name(s) of the output column(s) from

values_from = Which column (or columns) to get the cell values from

table2 %>%
pivot_wider(names_from = type,
values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Here's a visual representation of what we just did.

22 / 27

Your turn 1

05:00
  1. Import the file pragmatic_scales_data.csv (use best practices of a project-oriented workflow). Save it to an object called ps_data and convert to a tibble. View the data to remind yourself how it's structured.

  2. Use pivot_wider() to reformat the data so that there is a unique column for each item. The values in each of the four item columns should indicate whether or not the subject got that particular item right or wrong (i.e., what variable represents this?). Save this to a new object called ps_data_wide.

  3. Take ps_data_wide, which you just created, and reshape it back to the original longer format using pivot_longer().

23 / 27

Solution

# Q1.
ps_data <- rio::import(here::here("data", "pragmatic_scales_data.csv")) %>%
as_tibble()
# Q2.
ps_data_wide <- ps_data %>%
pivot_wider(names_from = item, values_from = correct)
ps_data_wide
## # A tibble: 150 x 7
## subid age condition faces houses pasta beds
## <chr> <dbl> <chr> <int> <int> <int> <int>
## 1 M22 2 Label 1 1 0 0
## 2 T22 2.13 Label 0 1 1 0
## 3 T17 2.32 Label 0 0 0 0
## 4 M3 2.38 Label 0 1 1 1
## 5 T19 2.47 Label 0 0 1 1
## 6 T20 2.5 Label 1 1 0 1
## 7 T21 2.58 Label 1 1 1 0
## 8 M26 2.59 Label 1 1 0 1
## 9 T18 2.61 Label 1 0 1 0
## 10 T12 2.72 Label 0 1 0 0
## # … with 140 more rows
# Q3.
ps_data_wide %>%
pivot_longer(c(faces, houses, pasta, beds),
names_to = "item",
values_to = "correct") %>%
select(subid, item, correct, everything()) # reorder columns
## # A tibble: 600 x 5
## subid item correct age condition
## <chr> <chr> <int> <dbl> <chr>
## 1 M22 faces 1 2 Label
## 2 M22 houses 1 2 Label
## 3 M22 pasta 0 2 Label
## 4 M22 beds 0 2 Label
## 5 T22 faces 0 2.13 Label
## 6 T22 houses 1 2.13 Label
## 7 T22 pasta 1 2.13 Label
## 8 T22 beds 0 2.13 Label
## 9 T17 faces 0 2.32 Label
## 10 T17 houses 0 2.32 Label
## # … with 590 more rows
24 / 27

Q & A

05:00
25 / 27

Next up...

Reproducible reporting with R Markdown

26 / 27

Break!

10:00
27 / 27

What is "tidy" data?

There are three interrelated rules which make a dataset tidy

2 / 27
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow