Lab 3B: Manipulating, Analyzing, and Exporting Data with tidyverse

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

The results from a base R function sometimes depend on the type of data.
Using R expressions in a non standard way, which can be confusing for new learners.
Hidden arguments, having default operations that new learners are not aware of.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# download data file from online 
download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

surveys <- read_csv("data_raw/portal_data_joined.csv")

## Parsed with column specification:
## cols(
##   record_id = col_double(),
##   month = col_double(),
##   day = col_double(),
##   year = col_double(),
##   plot_id = col_double(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_double(),
##   weight = col_double(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

# notice this is not read.csv(); read_csv() is in the readr package

# inspect the data 
str(surveys)

## tibble [34,786 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ record_id      : num [1:34786] 1 72 224 266 349 363 435 506 588 661 ...
##  $ month          : num [1:34786] 7 8 9 10 11 11 12 1 2 3 ...
##  $ day            : num [1:34786] 16 19 13 16 12 12 10 8 18 11 ...
##  $ year           : num [1:34786] 1977 1977 1977 1977 1977 ...
##  $ plot_id        : num [1:34786] 2 2 2 2 2 2 2 2 2 2 ...
##  $ species_id     : chr [1:34786] "NL" "NL" "NL" "NL" ...
##  $ sex            : chr [1:34786] "M" "M" NA NA ...
##  $ hindfoot_length: num [1:34786] 32 31 NA NA NA NA NA NA NA NA ...
##  $ weight         : num [1:34786] NA NA NA NA NA NA NA NA 218 NA ...
##  $ genus          : chr [1:34786] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
##  $ species        : chr [1:34786] "albigula" "albigula" "albigula" "albigula" ...
##  $ taxa           : chr [1:34786] "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type      : chr [1:34786] "Control" "Control" "Control" "Control" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   record_id = col_double(),
##   ..   month = col_double(),
##   ..   day = col_double(),
##   ..   year = col_double(),
##   ..   plot_id = col_double(),
##   ..   species_id = col_character(),
##   ..   sex = col_character(),
##   ..   hindfoot_length = col_double(),
##   ..   weight = col_double(),
##   ..   genus = col_character(),
##   ..   species = col_character(),
##   ..   taxa = col_character(),
##   ..   plot_type = col_character()
##   .. )

Differences between a dataframe and a tibble:

In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
Columns of class character are never converted into factors.

Filter and select:

# select certain columns 
select(surveys, plot_id, species_id, weight)

## # A tibble: 34,786 x 3
##    plot_id species_id weight
##      <dbl> <chr>       <dbl>
##  1       2 NL             NA
##  2       2 NL             NA
##  3       2 NL             NA
##  4       2 NL             NA
##  5       2 NL             NA
##  6       2 NL             NA
##  7       2 NL             NA
##  8       2 NL             NA
##  9       2 NL            218
## 10       2 NL             NA
## # … with 34,776 more rows

# select all columns except certain ones 
select(surveys, -record_id, -species_id)

## # A tibble: 34,786 x 11
##    month   day  year plot_id sex   hindfoot_length weight genus species taxa 
##    <dbl> <dbl> <dbl>   <dbl> <chr>           <dbl>  <dbl> <chr> <chr>   <chr>
##  1     7    16  1977       2 M                  32     NA Neot… albigu… Rode…
##  2     8    19  1977       2 M                  31     NA Neot… albigu… Rode…
##  3     9    13  1977       2 <NA>               NA     NA Neot… albigu… Rode…
##  4    10    16  1977       2 <NA>               NA     NA Neot… albigu… Rode…
##  5    11    12  1977       2 <NA>               NA     NA Neot… albigu… Rode…
##  6    11    12  1977       2 <NA>               NA     NA Neot… albigu… Rode…
##  7    12    10  1977       2 <NA>               NA     NA Neot… albigu… Rode…
##  8     1     8  1978       2 <NA>               NA     NA Neot… albigu… Rode…
##  9     2    18  1978       2 M                  NA    218 Neot… albigu… Rode…
## 10     3    11  1978       2 <NA>               NA     NA Neot… albigu… Rode…
## # … with 34,776 more rows, and 1 more variable: plot_type <chr>

# choose rows based on specific criterion 
filter(surveys, year == 1995)

## # A tibble: 1,180 x 13
##    record_id month   day  year plot_id species_id sex   hindfoot_length weight
##        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
##  1     22314     6     7  1995       2 NL         M                  34     NA
##  2     22728     9    23  1995       2 NL         F                  32    165
##  3     22899    10    28  1995       2 NL         F                  32    171
##  4     23032    12     2  1995       2 NL         F                  33     NA
##  5     22003     1    11  1995       2 DM         M                  37     41
##  6     22042     2     4  1995       2 DM         F                  36     45
##  7     22044     2     4  1995       2 DM         M                  37     46
##  8     22105     3     4  1995       2 DM         F                  37     49
##  9     22109     3     4  1995       2 DM         M                  37     46
## 10     22168     4     1  1995       2 DM         M                  36     48
## # … with 1,170 more rows, and 4 more variables: genus <chr>, species <chr>,
## #   taxa <chr>, plot_type <chr>

Piping:

What if you want to select and filter at the same time?

There are three ways to do this: use intermediate steps, nested functions, or pipes.

# Intermediate steps: 
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)

# Nest functions: 
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)

These can be difficult to read… Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_sml)

## # A tibble: 6 x 3
##   species_id sex   weight
##   <chr>      <chr>  <dbl>
## 1 PF         F          4
## 2 PF         F          4
## 3 PF         M          4
## 4 RM         F          4
## 5 RM         M          4
## 6 PF         <NA>       4

Challenge:

Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.

surveys_sub <- surveys %>% 
  filter(year > 1995) %>% 
  select(year, sex, weight)

head(surveys_sub)

## # A tibble: 6 x 3
##    year sex   weight
##   <dbl> <chr>  <dbl>
## 1  1996 M        127
## 2  1996 F        130
## 3  1996 M        148
## 4  1997 M        147
## 5  1997 F        144
## 6  1997 M        193

Mutate

allows us to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns.

# create a new column of weight in kg: 

kg_added <- surveys %>% 
  mutate(weight_kg = weight/1000)

kg_and_lb_added <- surveys %>% 
  mutate(weight_kg = weight/1000, 
         weight_lb = weight_kg * 2.2)

# if we wanted to remove NAs in weight column: 

surveys %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

## # A tibble: 6 x 14
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1       588     2    18  1978       2 NL         M                  NA    218
## 2       845     5     6  1978       2 NL         M                  32    204
## 3       990     6     9  1978       2 NL         M                  NA    200
## 4      1164     8     5  1978       2 NL         M                  34    199
## 5      1261     9     4  1978       2 NL         M                  32    197
## 6      1453    11     5  1978       2 NL         M                  NA    218
## # … with 5 more variables: genus <chr>, species <chr>, taxa <chr>,
## #   plot_type <chr>, weight_kg <dbl>

Challenge:

Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_cm containing the hindfoot_length values converted to centimeters. In this hindfoot_cm column, there are no NAs and all values are less than 3.

new_HFL <- surveys %>% 
  mutate(hindfoot_cm = hindfoot_length / 10) %>% 
  filter(!is.na(hindfoot_cm)) %>% 
  filter(hindfoot_cm < 3) %>% 
  select(species_id, hindfoot_cm)

head(new_HFL)

## # A tibble: 6 x 2
##   species_id hindfoot_cm
##   <chr>            <dbl>
## 1 NL                 2.8
## 2 NL                 2.1
## 3 NL                 2.1
## 4 NL                 2.9
## 5 NL                 2.9
## 6 DM                 2.5

Split-apply-combine data analysis and the summarize() function

Split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The group_by() function is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calcuate the summary statistics.

# to compute the mean weight by sex: 

surveys %>% 
  group_by(sex) %>% 
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>% 
  tail()

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   sex   mean_weight
##   <chr>       <dbl>
## 1 F            42.2
## 2 M            43.0
## 3 <NA>         64.7

We can see that the sex column contains NA values because some animals had escaped before their sex and body weights could be determined. The resulting mean_weight column does not contain NA but NaN (which refers to “Not a Number”) because mean() was called on a vector of NA values whileat the same time setting na.rm = TRUE. To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed first, we can omit na.rm = TRUE when computing the mean:

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 64 x 3
## # Groups:   sex [3]
##    sex   species_id mean_weight
##    <chr> <chr>            <dbl>
##  1 F     BA                9.16
##  2 F     DM               41.6 
##  3 F     DO               48.5 
##  4 F     DS              118.  
##  5 F     NL              154.  
##  6 F     OL               31.1 
##  7 F     OT               24.8 
##  8 F     OX               21   
##  9 F     PB               30.2 
## 10 F     PE               22.8 
## # … with 54 more rows

# specify number of rows to display with print(n=)

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  print(n = 15)

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 64 x 3
## # Groups:   sex [3]
##    sex   species_id mean_weight
##    <chr> <chr>            <dbl>
##  1 F     BA                9.16
##  2 F     DM               41.6 
##  3 F     DO               48.5 
##  4 F     DS              118.  
##  5 F     NL              154.  
##  6 F     OL               31.1 
##  7 F     OT               24.8 
##  8 F     OX               21   
##  9 F     PB               30.2 
## 10 F     PE               22.8 
## 11 F     PF                7.97
## 12 F     PH               30.8 
## 13 F     PL               19.3 
## 14 F     PM               22.1 
## 15 F     PP               17.2 
## # … with 49 more rows

Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight))

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 64 x 4
## # Groups:   sex [3]
##    sex   species_id mean_weight min_weight
##    <chr> <chr>            <dbl>      <dbl>
##  1 F     BA                9.16          6
##  2 F     DM               41.6          10
##  3 F     DO               48.5          12
##  4 F     DS              118.           45
##  5 F     NL              154.           32
##  6 F     OL               31.1          10
##  7 F     OT               24.8           5
##  8 F     OX               21            20
##  9 F     PB               30.2          12
## 10 F     PE               22.8          11
## # … with 54 more rows

# We can sort on min_weight to put the lighter species first 

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight)) %>%
  arrange(min_weight)

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 64 x 4
## # Groups:   sex [3]
##    sex   species_id mean_weight min_weight
##    <chr> <chr>            <dbl>      <dbl>
##  1 F     PF                7.97          4
##  2 F     RM               11.1           4
##  3 M     PF                7.89          4
##  4 M     PP               17.2           4
##  5 M     RM               10.1           4
##  6 <NA>  PF                6             4
##  7 F     OT               24.8           5
##  8 F     PP               17.2           5
##  9 F     BA                9.16          6
## 10 M     BA                7.36          6
## # … with 54 more rows

# and we can sort in descending order... 

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight)) %>%
  arrange(desc(mean_weight))

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 64 x 4
## # Groups:   sex [3]
##    sex   species_id mean_weight min_weight
##    <chr> <chr>            <dbl>      <dbl>
##  1 <NA>  NL               168.          83
##  2 M     NL               166.          30
##  3 F     NL               154.          32
##  4 M     SS               130          130
##  5 <NA>  SH               130          130
##  6 M     DS               122.          12
##  7 <NA>  DS               120           78
##  8 F     DS               118.          45
##  9 F     SH                78.8         30
## 10 F     SF                69           46
## # … with 54 more rows

Counting

We often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count().

# if we wanted to count the number of rows of data for each sex: 

surveys %>% 
  count(sex)

## # A tibble: 3 x 2
##   sex       n
##   <chr> <int>
## 1 F     15690
## 2 M     17348
## 3 <NA>   1748

# which is equivalent to: 

surveys %>% 
  group_by(sex) %>% 
  summarize(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   sex   count
##   <chr> <int>
## 1 F     15690
## 2 M     17348
## 3 <NA>   1748

# there is a sort argument for count() 

surveys %>% 
  count(sex, sort = T)

## # A tibble: 3 x 2
##   sex       n
##   <chr> <int>
## 1 M     17348
## 2 F     15690
## 3 <NA>   1748

# multiple factors: specify the order 

surveys %>%
  count(sex, species)

## # A tibble: 81 x 3
##    sex   species         n
##    <chr> <chr>       <int>
##  1 F     albigula      675
##  2 F     baileyi      1646
##  3 F     eremicus      579
##  4 F     flavus        757
##  5 F     fulvescens     57
##  6 F     fulviventer    17
##  7 F     hispidus       99
##  8 F     leucogaster   475
##  9 F     leucopus       16
## 10 F     maniculatus   382
## # … with 71 more rows

surveys %>%
  count(species, sex)

## # A tibble: 81 x 3
##    species         sex       n
##    <chr>           <chr> <int>
##  1 albigula        F       675
##  2 albigula        M       502
##  3 albigula        <NA>     75
##  4 audubonii       <NA>     75
##  5 baileyi         F      1646
##  6 baileyi         M      1216
##  7 baileyi         <NA>     29
##  8 bilineata       <NA>    303
##  9 brunneicapillus <NA>     50
## 10 chlorurus       <NA>     39
## # … with 71 more rows

# arrange species in alphabetical order and count in descending order 

surveys %>%
  count(sex, species) %>%
  arrange(species, desc(n))

## # A tibble: 81 x 3
##    sex   species             n
##    <chr> <chr>           <int>
##  1 F     albigula          675
##  2 M     albigula          502
##  3 <NA>  albigula           75
##  4 <NA>  audubonii          75
##  5 F     baileyi          1646
##  6 M     baileyi          1216
##  7 <NA>  baileyi            29
##  8 <NA>  bilineata         303
##  9 <NA>  brunneicapillus    50
## 10 <NA>  chlorurus          39
## # … with 71 more rows

Challenge:

How many animals were caught in each plot_type surveyed?

surveys %>% 
  count(plot_type)

## # A tibble: 5 x 2
##   plot_type                     n
##   <chr>                     <int>
## 1 Control                   15611
## 2 Long-term Krat Exclosure   5118
## 3 Rodent Exclosure           4233
## 4 Short-term Krat Exclosure  5906
## 5 Spectab exclosure          3918

Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id). Also add the number of observations (hint: see ?n).

spp_HFL <- surveys %>% 
  group_by(species_id) %>% 
  summarize(mean_length = mean(hindfoot_length),
            min_length = min(hindfoot_length),
            max_length = max(hindfoot_length))

## `summarise()` ungrouping output (override with `.groups` argument)

head(spp_HFL)

## # A tibble: 6 x 4
##   species_id mean_length min_length max_length
##   <chr>            <dbl>      <dbl>      <dbl>
## 1 AB                  NA         NA         NA
## 2 AH                  NA         NA         NA
## 3 AS                  NA         NA         NA
## 4 BA                  NA         NA         NA
## 5 CB                  NA         NA         NA
## 6 CM                  NA         NA         NA

# Oh, it can't calculate means with NAs; let's remove...  

spp_HFL <- surveys %>% 
  filter(!is.na(hindfoot_length)) %>%
  group_by(species_id) %>% 
  summarize(mean_length = mean(hindfoot_length),
            min_length = min(hindfoot_length),
            max_length = max(hindfoot_length),
            n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

head(spp_HFL)

## # A tibble: 6 x 5
##   species_id mean_length min_length max_length     n
##   <chr>            <dbl>      <dbl>      <dbl> <int>
## 1 AH                33           31         35     2
## 2 BA                13            6         16    45
## 3 DM                36.0         16         50  9972
## 4 DO                35.6         26         64  2887
## 5 DS                49.9         39         58  2132
## 6 NL                32.3         21         70  1074

What is the heaviest animal measured each year? Return the columns year, genus, species_id, and weight

spp_weights <- surveys %>% 
  filter(!is.na(weight)) %>%
  group_by(year) %>% 
  filter(weight == max(weight)) %>% # filter, not summarize! 
  select(year, species_id, genus, weight) %>% 
  arrange(year)

head(spp_weights)

## # A tibble: 6 x 4
## # Groups:   year [5]
##    year species_id genus     weight
##   <dbl> <chr>      <chr>      <dbl>
## 1  1977 DS         Dipodomys    149
## 2  1978 NL         Neotoma      232
## 3  1978 NL         Neotoma      232
## 4  1979 NL         Neotoma      274
## 5  1980 NL         Neotoma      243
## 6  1981 NL         Neotoma      264

Reshaping with gather and spread

spread() takes three principal arguments: 1. the data 2. the key column variable whose values will become new column names 3. the value column whose values will fill the new column variables

further arguments include fill which, if set, fills in missing values with the value provided

use spread() to transform surveys to find the mean weight of each genus in each plot over the entire period.

surveys_gw <- surveys %>% 
  filter(!is.na(weight)) %>% 
  group_by(plot_id, genus) %>% 
  summarize(mean_weight = mean(weight))

## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)

head(surveys_gw)

## # A tibble: 6 x 3
## # Groups:   plot_id [1]
##   plot_id genus       mean_weight
##     <dbl> <chr>             <dbl>
## 1       1 Baiomys            7   
## 2       1 Chaetodipus       22.2 
## 3       1 Dipodomys         60.2 
## 4       1 Neotoma          156.  
## 5       1 Onychomys         27.7 
## 6       1 Perognathus        9.62

str(surveys_gw)

## tibble [196 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ plot_id    : num [1:196] 1 1 1 1 1 1 1 1 2 2 ...
##  $ genus      : chr [1:196] "Baiomys" "Chaetodipus" "Dipodomys" "Neotoma" ...
##  $ mean_weight: num [1:196] 7 22.2 60.2 156.2 27.7 ...
##  - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ .rows  : list<int> [1:24] 
##   .. ..$ : int [1:8] 1 2 3 4 5 6 7 8
##   .. ..$ : int [1:9] 9 10 11 12 13 14 15 16 17
##   .. ..$ : int [1:9] 18 19 20 21 22 23 24 25 26
##   .. ..$ : int [1:8] 27 28 29 30 31 32 33 34
##   .. ..$ : int [1:9] 35 36 37 38 39 40 41 42 43
##   .. ..$ : int [1:8] 44 45 46 47 48 49 50 51
##   .. ..$ : int [1:7] 52 53 54 55 56 57 58
##   .. ..$ : int [1:7] 59 60 61 62 63 64 65
##   .. ..$ : int [1:8] 66 67 68 69 70 71 72 73
##   .. ..$ : int [1:7] 74 75 76 77 78 79 80
##   .. ..$ : int [1:8] 81 82 83 84 85 86 87 88
##   .. ..$ : int [1:8] 89 90 91 92 93 94 95 96
##   .. ..$ : int [1:8] 97 98 99 100 101 102 103 104
##   .. ..$ : int [1:8] 105 106 107 108 109 110 111 112
##   .. ..$ : int [1:8] 113 114 115 116 117 118 119 120
##   .. ..$ : int [1:7] 121 122 123 124 125 126 127
##   .. ..$ : int [1:8] 128 129 130 131 132 133 134 135
##   .. ..$ : int [1:9] 136 137 138 139 140 141 142 143 144
##   .. ..$ : int [1:9] 145 146 147 148 149 150 151 152 153
##   .. ..$ : int [1:10] 154 155 156 157 158 159 160 161 162 163
##   .. ..$ : int [1:9] 164 165 166 167 168 169 170 171 172
##   .. ..$ : int [1:8] 173 174 175 176 177 178 179 180
##   .. ..$ : int [1:8] 181 182 183 184 185 186 187 188
##   .. ..$ : int [1:8] 189 190 191 192 193 194 195 196
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Observations for each plot are spread across multiple rows, 196 obs of three variabes.

surveys_spread <- surveys_gw %>% 
  spread(key = genus, value = mean_weight)

head(surveys_spread)

## # A tibble: 6 x 11
## # Groups:   plot_id [6]
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
##     <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62       22.2
## 2       2    6           25.1      55.7    169.      26.9        6.95       22.3
## 3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
## 4       4   NA           23.0      57.5    164.      28.1        7.82       22.6
## 5       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
## 6       6   NA           24.9      58.6    180.      25.9        7.81       21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## #   Spermophilus <dbl>

str(surveys_spread)

## tibble [24 × 11] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ plot_id        : num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Baiomys        : num [1:24] 7 6 8.61 NA 7.75 ...
##  $ Chaetodipus    : num [1:24] 22.2 25.1 24.6 23 18 ...
##  $ Dipodomys      : num [1:24] 60.2 55.7 52 57.5 51.1 ...
##  $ Neotoma        : num [1:24] 156 169 158 164 190 ...
##  $ Onychomys      : num [1:24] 27.7 26.9 26 28.1 27 ...
##  $ Perognathus    : num [1:24] 9.62 6.95 7.51 7.82 8.66 ...
##  $ Peromyscus     : num [1:24] 22.2 22.3 21.4 22.6 21.2 ...
##  $ Reithrodontomys: num [1:24] 11.4 10.7 10.5 10.3 11.2 ...
##  $ Sigmodon       : num [1:24] NA 70.9 65.6 82 82.7 ...
##  $ Spermophilus   : num [1:24] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ .rows  : list<int> [1:24] 
##   .. ..$ : int 1
##   .. ..$ : int 2
##   .. ..$ : int 3
##   .. ..$ : int 4
##   .. ..$ : int 5
##   .. ..$ : int 6
##   .. ..$ : int 7
##   .. ..$ : int 8
##   .. ..$ : int 9
##   .. ..$ : int 10
##   .. ..$ : int 11
##   .. ..$ : int 12
##   .. ..$ : int 13
##   .. ..$ : int 14
##   .. ..$ : int 15
##   .. ..$ : int 16
##   .. ..$ : int 17
##   .. ..$ : int 18
##   .. ..$ : int 19
##   .. ..$ : int 20
##   .. ..$ : int 21
##   .. ..$ : int 22
##   .. ..$ : int 23
##   .. ..$ : int 24
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

surveys_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  head()

## # A tibble: 6 x 11
## # Groups:   plot_id [6]
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
##     <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62       22.2
## 2       2    6           25.1      55.7    169.      26.9        6.95       22.3
## 3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
## 4       4    0           23.0      57.5    164.      28.1        7.82       22.6
## 5       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
## 6       6    0           24.9      58.6    180.      25.9        7.81       21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## #   Spermophilus <dbl>

gather() can be used to do the opposite; takes in four principle arguments: 1. the data 2. the key column variable we wish to create from column names. #. the values column variable we wish to create and fill with values associated with the key. 4. the names of the columns we use to fill the key variable (or to drop).

surveys_gather <- surveys_spread %>%
  gather(key = "genus", value = "mean_weight", -plot_id)

head(surveys_gather)

## # A tibble: 6 x 3
## # Groups:   plot_id [6]
##   plot_id genus   mean_weight
##     <dbl> <chr>         <dbl>
## 1       1 Baiomys        7   
## 2       2 Baiomys        6   
## 3       3 Baiomys        8.61
## 4       4 Baiomys       NA   
## 5       5 Baiomys        7.75
## 6       6 Baiomys       NA

str(surveys_gather)

## tibble [240 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ plot_id    : num [1:240] 1 2 3 4 5 6 7 8 9 10 ...
##  $ genus      : chr [1:240] "Baiomys" "Baiomys" "Baiomys" "Baiomys" ...
##  $ mean_weight: num [1:240] 7 6 8.61 NA 7.75 ...
##  - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ .rows  : list<int> [1:24] 
##   .. ..$ : int [1:10] 1 25 49 73 97 121 145 169 193 217
##   .. ..$ : int [1:10] 2 26 50 74 98 122 146 170 194 218
##   .. ..$ : int [1:10] 3 27 51 75 99 123 147 171 195 219
##   .. ..$ : int [1:10] 4 28 52 76 100 124 148 172 196 220
##   .. ..$ : int [1:10] 5 29 53 77 101 125 149 173 197 221
##   .. ..$ : int [1:10] 6 30 54 78 102 126 150 174 198 222
##   .. ..$ : int [1:10] 7 31 55 79 103 127 151 175 199 223
##   .. ..$ : int [1:10] 8 32 56 80 104 128 152 176 200 224
##   .. ..$ : int [1:10] 9 33 57 81 105 129 153 177 201 225
##   .. ..$ : int [1:10] 10 34 58 82 106 130 154 178 202 226
##   .. ..$ : int [1:10] 11 35 59 83 107 131 155 179 203 227
##   .. ..$ : int [1:10] 12 36 60 84 108 132 156 180 204 228
##   .. ..$ : int [1:10] 13 37 61 85 109 133 157 181 205 229
##   .. ..$ : int [1:10] 14 38 62 86 110 134 158 182 206 230
##   .. ..$ : int [1:10] 15 39 63 87 111 135 159 183 207 231
##   .. ..$ : int [1:10] 16 40 64 88 112 136 160 184 208 232
##   .. ..$ : int [1:10] 17 41 65 89 113 137 161 185 209 233
##   .. ..$ : int [1:10] 18 42 66 90 114 138 162 186 210 234
##   .. ..$ : int [1:10] 19 43 67 91 115 139 163 187 211 235
##   .. ..$ : int [1:10] 20 44 68 92 116 140 164 188 212 236
##   .. ..$ : int [1:10] 21 45 69 93 117 141 165 189 213 237
##   .. ..$ : int [1:10] 22 46 70 94 118 142 166 190 214 238
##   .. ..$ : int [1:10] 23 47 71 95 119 143 167 191 215 239
##   .. ..$ : int [1:10] 24 48 72 96 120 144 168 192 216 240
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

# you can also specify what columns to include - but for some reason when I tried to change the range of genera, it got messed up
surveys_spread %>%
  gather(key = "genus", value = "mean_weight", Baiomys:Spermophilus)

## # A tibble: 240 x 3
## # Groups:   plot_id [24]
##    plot_id genus   mean_weight
##      <dbl> <chr>         <dbl>
##  1       1 Baiomys        7   
##  2       2 Baiomys        6   
##  3       3 Baiomys        8.61
##  4       4 Baiomys       NA   
##  5       5 Baiomys        7.75
##  6       6 Baiomys       NA   
##  7       7 Baiomys       NA   
##  8       8 Baiomys       NA   
##  9       9 Baiomys       NA   
## 10      10 Baiomys       NA   
## # … with 230 more rows

Challenge:

Spread the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.

genera_per_plot <- surveys %>% 
  group_by(plot_id, year) %>% 
  summarize(genera = n_distinct(genus))

## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)

  # I guess it knows to do distinct per plot_id and not year because it is listed first? 

# now lets add spread 

genera_per_plot <- surveys %>% 
  group_by(plot_id, year) %>% 
  summarize(genera = n_distinct(genus)) %>%
  spread(year, genera)

## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)

head(genera_per_plot)

## # A tibble: 6 x 27
## # Groups:   plot_id [6]
##   plot_id `1977` `1978` `1979` `1980` `1981` `1982` `1983` `1984` `1985` `1986`
##     <dbl>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1       1      2      3      4      7      5      6      7      6      4      3
## 2       2      6      6      6      8      5      9      9      9      6      4
## 3       3      5      6      4      6      6      8     10     11      7      6
## 4       4      4      4      3      4      5      4      6      3      4      3
## 5       5      4      3      2      5      4      6      7      7      3      1
## 6       6      3      4      3      4      5      9      9      7      5      6
## # … with 16 more variables: `1987` <int>, `1988` <int>, `1989` <int>,
## #   `1990` <int>, `1991` <int>, `1992` <int>, `1993` <int>, `1994` <int>,
## #   `1995` <int>, `1996` <int>, `1997` <int>, `1998` <int>, `1999` <int>,
## #   `2000` <int>, `2001` <int>, `2002` <int>

Now take that data frame and gather() it again, so each row is a unique plot_id by year combination.

gathered_genera <- genera_per_plot %>% 
  gather("year", "genera", -plot_id) 
# why does gather need columns to be in quotes? 

head(gathered_genera)

## # A tibble: 6 x 3
## # Groups:   plot_id [6]
##   plot_id year  genera
##     <dbl> <chr>  <int>
## 1       1 1977       2
## 2       2 1977       6
## 3       3 1977       5
## 4       4 1977       4
## 5       5 1977       4
## 6       6 1977       3

The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem.

First, use gather() to create a dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being gathered.

surveys_long <- surveys %>% 
  gather ("measurement", "value", hindfoot_length, weight)

str(surveys_long)

## tibble [69,572 × 13] (S3: tbl_df/tbl/data.frame)
##  $ record_id  : num [1:69572] 1 72 224 266 349 363 435 506 588 661 ...
##  $ month      : num [1:69572] 7 8 9 10 11 11 12 1 2 3 ...
##  $ day        : num [1:69572] 16 19 13 16 12 12 10 8 18 11 ...
##  $ year       : num [1:69572] 1977 1977 1977 1977 1977 ...
##  $ plot_id    : num [1:69572] 2 2 2 2 2 2 2 2 2 2 ...
##  $ species_id : chr [1:69572] "NL" "NL" "NL" "NL" ...
##  $ sex        : chr [1:69572] "M" "M" NA NA ...
##  $ genus      : chr [1:69572] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
##  $ species    : chr [1:69572] "albigula" "albigula" "albigula" "albigula" ...
##  $ taxa       : chr [1:69572] "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type  : chr [1:69572] "Control" "Control" "Control" "Control" ...
##  $ measurement: chr [1:69572] "hindfoot_length" "hindfoot_length" "hindfoot_length" "hindfoot_length" ...
##  $ value      : num [1:69572] 32 31 NA NA NA NA NA NA NA NA ...

This doubled the amount of rows. It added hindfootlength to one and then weight to another for each obs.

With this new data set, calculate the average of each measurement in each year for each different plot_type. Then spread() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the key and value columns for spread()

measurements <- surveys_long %>% 
  group_by(year, measurement, plot_type) %>% 
  summarize(mean_value = mean(value, na.rm=T)) %>% 
  spread(measurement, mean_value)

## `summarise()` regrouping output by 'year', 'measurement' (override with `.groups` argument)

head(measurements)

## # A tibble: 6 x 4
## # Groups:   year [2]
##    year plot_type                 hindfoot_length weight
##   <dbl> <chr>                               <dbl>  <dbl>
## 1  1977 Control                              36.1   50.4
## 2  1977 Long-term Krat Exclosure             33.7   34.8
## 3  1977 Rodent Exclosure                     39.1   48.2
## 4  1977 Short-term Krat Exclosure            35.8   41.3
## 5  1977 Spectab exclosure                    37.2   47.1
## 6  1978 Control                              38.1   70.8

Exporting Data

First, remove Nas

surveys_complete <- surveys %>% 
  filter(!is.na(weight),
         !is.na(hindfoot_length),
         !is.na(sex))

Remove rare species; first make a subset of the most common species (50 or more observations)

species_counts <- surveys_complete %>%
  count(species_id) %>% 
  filter(n >= 50)

Then subset the complete data set with those species names

surveys_complete <- surveys_complete %>% 
  filter(species_id %in% species_counts$species_id)

Double check we have the right dimensions (30,463 x 13)

dim(surveys_complete)

## [1] 30463    13

Write new csv file

write_csv(surveys_complete, path = "generated_data/surveys_complete.csv")

Lab 3B: Manipulating, Analyzing, and Exporting Data with tidyverse

Alison Fowler

Differences between a dataframe and a tibble:

Filter and select:

Piping:

Challenge:

Mutate

Challenge:

Split-apply-combine data analysis and the summarize() function

Counting

Challenge:

Reshaping with gather and spread

Challenge:

Exporting Data