The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:
The results from a base R function sometimes depend on the type of data.
Using R expressions in a non standard way, which can be confusing for new learners.
Hidden arguments, having default operations that new learners are not aware of.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# download data file from online
download.file(url = "https://ndownloader.figshare.com/files/2292169",
destfile = "data_raw/portal_data_joined.csv")
surveys <- read_csv("data_raw/portal_data_joined.csv")
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## plot_id = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
# notice this is not read.csv(); read_csv() is in the readr package
# inspect the data
str(surveys)
## tibble [34,786 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ record_id : num [1:34786] 1 72 224 266 349 363 435 506 588 661 ...
## $ month : num [1:34786] 7 8 9 10 11 11 12 1 2 3 ...
## $ day : num [1:34786] 16 19 13 16 12 12 10 8 18 11 ...
## $ year : num [1:34786] 1977 1977 1977 1977 1977 ...
## $ plot_id : num [1:34786] 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : chr [1:34786] "NL" "NL" "NL" "NL" ...
## $ sex : chr [1:34786] "M" "M" NA NA ...
## $ hindfoot_length: num [1:34786] 32 31 NA NA NA NA NA NA NA NA ...
## $ weight : num [1:34786] NA NA NA NA NA NA NA NA 218 NA ...
## $ genus : chr [1:34786] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
## $ species : chr [1:34786] "albigula" "albigula" "albigula" "albigula" ...
## $ taxa : chr [1:34786] "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr [1:34786] "Control" "Control" "Control" "Control" ...
## - attr(*, "spec")=
## .. cols(
## .. record_id = col_double(),
## .. month = col_double(),
## .. day = col_double(),
## .. year = col_double(),
## .. plot_id = col_double(),
## .. species_id = col_character(),
## .. sex = col_character(),
## .. hindfoot_length = col_double(),
## .. weight = col_double(),
## .. genus = col_character(),
## .. species = col_character(),
## .. taxa = col_character(),
## .. plot_type = col_character()
## .. )
In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
Columns of class character are never converted into factors.
# select certain columns
select(surveys, plot_id, species_id, weight)
## # A tibble: 34,786 x 3
## plot_id species_id weight
## <dbl> <chr> <dbl>
## 1 2 NL NA
## 2 2 NL NA
## 3 2 NL NA
## 4 2 NL NA
## 5 2 NL NA
## 6 2 NL NA
## 7 2 NL NA
## 8 2 NL NA
## 9 2 NL 218
## 10 2 NL NA
## # … with 34,776 more rows
# select all columns except certain ones
select(surveys, -record_id, -species_id)
## # A tibble: 34,786 x 11
## month day year plot_id sex hindfoot_length weight genus species taxa
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 7 16 1977 2 M 32 NA Neot… albigu… Rode…
## 2 8 19 1977 2 M 31 NA Neot… albigu… Rode…
## 3 9 13 1977 2 <NA> NA NA Neot… albigu… Rode…
## 4 10 16 1977 2 <NA> NA NA Neot… albigu… Rode…
## 5 11 12 1977 2 <NA> NA NA Neot… albigu… Rode…
## 6 11 12 1977 2 <NA> NA NA Neot… albigu… Rode…
## 7 12 10 1977 2 <NA> NA NA Neot… albigu… Rode…
## 8 1 8 1978 2 <NA> NA NA Neot… albigu… Rode…
## 9 2 18 1978 2 M NA 218 Neot… albigu… Rode…
## 10 3 11 1978 2 <NA> NA NA Neot… albigu… Rode…
## # … with 34,776 more rows, and 1 more variable: plot_type <chr>
# choose rows based on specific criterion
filter(surveys, year == 1995)
## # A tibble: 1,180 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 22314 6 7 1995 2 NL M 34 NA
## 2 22728 9 23 1995 2 NL F 32 165
## 3 22899 10 28 1995 2 NL F 32 171
## 4 23032 12 2 1995 2 NL F 33 NA
## 5 22003 1 11 1995 2 DM M 37 41
## 6 22042 2 4 1995 2 DM F 36 45
## 7 22044 2 4 1995 2 DM M 37 46
## 8 22105 3 4 1995 2 DM F 37 49
## 9 22109 3 4 1995 2 DM M 37 46
## 10 22168 4 1 1995 2 DM M 36 48
## # … with 1,170 more rows, and 4 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>
What if you want to select and filter at the same time?
There are three ways to do this: use intermediate steps, nested functions, or pipes.
# Intermediate steps:
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)
# Nest functions:
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
These can be difficult to read… Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
head(surveys_sml)
## # A tibble: 6 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.
surveys_sub <- surveys %>%
filter(year > 1995) %>%
select(year, sex, weight)
head(surveys_sub)
## # A tibble: 6 x 3
## year sex weight
## <dbl> <chr> <dbl>
## 1 1996 M 127
## 2 1996 F 130
## 3 1996 M 148
## 4 1997 M 147
## 5 1997 F 144
## 6 1997 M 193
allows us to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns.
# create a new column of weight in kg:
kg_added <- surveys %>%
mutate(weight_kg = weight/1000)
kg_and_lb_added <- surveys %>%
mutate(weight_kg = weight/1000,
weight_lb = weight_kg * 2.2)
# if we wanted to remove NAs in weight column:
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
## # A tibble: 6 x 14
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 588 2 18 1978 2 NL M NA 218
## 2 845 5 6 1978 2 NL M 32 204
## 3 990 6 9 1978 2 NL M NA 200
## 4 1164 8 5 1978 2 NL M 34 199
## 5 1261 9 4 1978 2 NL M 32 197
## 6 1453 11 5 1978 2 NL M NA 218
## # … with 5 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>, weight_kg <dbl>
Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_cm containing the hindfoot_length values converted to centimeters. In this hindfoot_cm column, there are no NAs and all values are less than 3.
new_HFL <- surveys %>%
mutate(hindfoot_cm = hindfoot_length / 10) %>%
filter(!is.na(hindfoot_cm)) %>%
filter(hindfoot_cm < 3) %>%
select(species_id, hindfoot_cm)
head(new_HFL)
## # A tibble: 6 x 2
## species_id hindfoot_cm
## <chr> <dbl>
## 1 NL 2.8
## 2 NL 2.1
## 3 NL 2.1
## 4 NL 2.9
## 5 NL 2.9
## 6 DM 2.5
Split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.
The group_by() function is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calcuate the summary statistics.
# to compute the mean weight by sex:
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
tail()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## sex mean_weight
## <chr> <dbl>
## 1 F 42.2
## 2 M 43.0
## 3 <NA> 64.7
We can see that the sex column contains NA values because some animals had escaped before their sex and body weights could be determined. The resulting mean_weight column does not contain NA but NaN (which refers to “Not a Number”) because mean() was called on a vector of NA values whileat the same time setting na.rm = TRUE. To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed first, we can omit na.rm = TRUE when computing the mean:
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 64 x 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 F BA 9.16
## 2 F DM 41.6
## 3 F DO 48.5
## 4 F DS 118.
## 5 F NL 154.
## 6 F OL 31.1
## 7 F OT 24.8
## 8 F OX 21
## 9 F PB 30.2
## 10 F PE 22.8
## # … with 54 more rows
# specify number of rows to display with print(n=)
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 64 x 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 F BA 9.16
## 2 F DM 41.6
## 3 F DO 48.5
## 4 F DS 118.
## 5 F NL 154.
## 6 F OL 31.1
## 7 F OT 24.8
## 8 F OX 21
## 9 F PB 30.2
## 10 F PE 22.8
## 11 F PF 7.97
## 12 F PH 30.8
## 13 F PL 19.3
## 14 F PM 22.1
## 15 F PP 17.2
## # … with 49 more rows
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 64 x 4
## # Groups: sex [3]
## sex species_id mean_weight min_weight
## <chr> <chr> <dbl> <dbl>
## 1 F BA 9.16 6
## 2 F DM 41.6 10
## 3 F DO 48.5 12
## 4 F DS 118. 45
## 5 F NL 154. 32
## 6 F OL 31.1 10
## 7 F OT 24.8 5
## 8 F OX 21 20
## 9 F PB 30.2 12
## 10 F PE 22.8 11
## # … with 54 more rows
# We can sort on min_weight to put the lighter species first
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight)
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 64 x 4
## # Groups: sex [3]
## sex species_id mean_weight min_weight
## <chr> <chr> <dbl> <dbl>
## 1 F PF 7.97 4
## 2 F RM 11.1 4
## 3 M PF 7.89 4
## 4 M PP 17.2 4
## 5 M RM 10.1 4
## 6 <NA> PF 6 4
## 7 F OT 24.8 5
## 8 F PP 17.2 5
## 9 F BA 9.16 6
## 10 M BA 7.36 6
## # … with 54 more rows
# and we can sort in descending order...
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(mean_weight))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 64 x 4
## # Groups: sex [3]
## sex species_id mean_weight min_weight
## <chr> <chr> <dbl> <dbl>
## 1 <NA> NL 168. 83
## 2 M NL 166. 30
## 3 F NL 154. 32
## 4 M SS 130 130
## 5 <NA> SH 130 130
## 6 M DS 122. 12
## 7 <NA> DS 120 78
## 8 F DS 118. 45
## 9 F SH 78.8 30
## 10 F SF 69 46
## # … with 54 more rows
We often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count().
# if we wanted to count the number of rows of data for each sex:
surveys %>%
count(sex)
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
# which is equivalent to:
surveys %>%
group_by(sex) %>%
summarize(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## sex count
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
# there is a sort argument for count()
surveys %>%
count(sex, sort = T)
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 M 17348
## 2 F 15690
## 3 <NA> 1748
# multiple factors: specify the order
surveys %>%
count(sex, species)
## # A tibble: 81 x 3
## sex species n
## <chr> <chr> <int>
## 1 F albigula 675
## 2 F baileyi 1646
## 3 F eremicus 579
## 4 F flavus 757
## 5 F fulvescens 57
## 6 F fulviventer 17
## 7 F hispidus 99
## 8 F leucogaster 475
## 9 F leucopus 16
## 10 F maniculatus 382
## # … with 71 more rows
surveys %>%
count(species, sex)
## # A tibble: 81 x 3
## species sex n
## <chr> <chr> <int>
## 1 albigula F 675
## 2 albigula M 502
## 3 albigula <NA> 75
## 4 audubonii <NA> 75
## 5 baileyi F 1646
## 6 baileyi M 1216
## 7 baileyi <NA> 29
## 8 bilineata <NA> 303
## 9 brunneicapillus <NA> 50
## 10 chlorurus <NA> 39
## # … with 71 more rows
# arrange species in alphabetical order and count in descending order
surveys %>%
count(sex, species) %>%
arrange(species, desc(n))
## # A tibble: 81 x 3
## sex species n
## <chr> <chr> <int>
## 1 F albigula 675
## 2 M albigula 502
## 3 <NA> albigula 75
## 4 <NA> audubonii 75
## 5 F baileyi 1646
## 6 M baileyi 1216
## 7 <NA> baileyi 29
## 8 <NA> bilineata 303
## 9 <NA> brunneicapillus 50
## 10 <NA> chlorurus 39
## # … with 71 more rows
surveys %>%
count(plot_type)
## # A tibble: 5 x 2
## plot_type n
## <chr> <int>
## 1 Control 15611
## 2 Long-term Krat Exclosure 5118
## 3 Rodent Exclosure 4233
## 4 Short-term Krat Exclosure 5906
## 5 Spectab exclosure 3918
spp_HFL <- surveys %>%
group_by(species_id) %>%
summarize(mean_length = mean(hindfoot_length),
min_length = min(hindfoot_length),
max_length = max(hindfoot_length))
## `summarise()` ungrouping output (override with `.groups` argument)
head(spp_HFL)
## # A tibble: 6 x 4
## species_id mean_length min_length max_length
## <chr> <dbl> <dbl> <dbl>
## 1 AB NA NA NA
## 2 AH NA NA NA
## 3 AS NA NA NA
## 4 BA NA NA NA
## 5 CB NA NA NA
## 6 CM NA NA NA
# Oh, it can't calculate means with NAs; let's remove...
spp_HFL <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
group_by(species_id) %>%
summarize(mean_length = mean(hindfoot_length),
min_length = min(hindfoot_length),
max_length = max(hindfoot_length),
n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(spp_HFL)
## # A tibble: 6 x 5
## species_id mean_length min_length max_length n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 AH 33 31 35 2
## 2 BA 13 6 16 45
## 3 DM 36.0 16 50 9972
## 4 DO 35.6 26 64 2887
## 5 DS 49.9 39 58 2132
## 6 NL 32.3 21 70 1074
spp_weights <- surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
filter(weight == max(weight)) %>% # filter, not summarize!
select(year, species_id, genus, weight) %>%
arrange(year)
head(spp_weights)
## # A tibble: 6 x 4
## # Groups: year [5]
## year species_id genus weight
## <dbl> <chr> <chr> <dbl>
## 1 1977 DS Dipodomys 149
## 2 1978 NL Neotoma 232
## 3 1978 NL Neotoma 232
## 4 1979 NL Neotoma 274
## 5 1980 NL Neotoma 243
## 6 1981 NL Neotoma 264
spread() takes three principal arguments: 1. the data 2. the key column variable whose values will become new column names 3. the value column whose values will fill the new column variables
further arguments include fill which, if set, fills in missing values with the value provided
use spread() to transform surveys to find the mean weight of each genus in each plot over the entire period.
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)
head(surveys_gw)
## # A tibble: 6 x 3
## # Groups: plot_id [1]
## plot_id genus mean_weight
## <dbl> <chr> <dbl>
## 1 1 Baiomys 7
## 2 1 Chaetodipus 22.2
## 3 1 Dipodomys 60.2
## 4 1 Neotoma 156.
## 5 1 Onychomys 27.7
## 6 1 Perognathus 9.62
str(surveys_gw)
## tibble [196 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ plot_id : num [1:196] 1 1 1 1 1 1 1 1 2 2 ...
## $ genus : chr [1:196] "Baiomys" "Chaetodipus" "Dipodomys" "Neotoma" ...
## $ mean_weight: num [1:196] 7 22.2 60.2 156.2 27.7 ...
## - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ .rows : list<int> [1:24]
## .. ..$ : int [1:8] 1 2 3 4 5 6 7 8
## .. ..$ : int [1:9] 9 10 11 12 13 14 15 16 17
## .. ..$ : int [1:9] 18 19 20 21 22 23 24 25 26
## .. ..$ : int [1:8] 27 28 29 30 31 32 33 34
## .. ..$ : int [1:9] 35 36 37 38 39 40 41 42 43
## .. ..$ : int [1:8] 44 45 46 47 48 49 50 51
## .. ..$ : int [1:7] 52 53 54 55 56 57 58
## .. ..$ : int [1:7] 59 60 61 62 63 64 65
## .. ..$ : int [1:8] 66 67 68 69 70 71 72 73
## .. ..$ : int [1:7] 74 75 76 77 78 79 80
## .. ..$ : int [1:8] 81 82 83 84 85 86 87 88
## .. ..$ : int [1:8] 89 90 91 92 93 94 95 96
## .. ..$ : int [1:8] 97 98 99 100 101 102 103 104
## .. ..$ : int [1:8] 105 106 107 108 109 110 111 112
## .. ..$ : int [1:8] 113 114 115 116 117 118 119 120
## .. ..$ : int [1:7] 121 122 123 124 125 126 127
## .. ..$ : int [1:8] 128 129 130 131 132 133 134 135
## .. ..$ : int [1:9] 136 137 138 139 140 141 142 143 144
## .. ..$ : int [1:9] 145 146 147 148 149 150 151 152 153
## .. ..$ : int [1:10] 154 155 156 157 158 159 160 161 162 163
## .. ..$ : int [1:9] 164 165 166 167 168 169 170 171 172
## .. ..$ : int [1:8] 173 174 175 176 177 178 179 180
## .. ..$ : int [1:8] 181 182 183 184 185 186 187 188
## .. ..$ : int [1:8] 189 190 191 192 193 194 195 196
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
Observations for each plot are spread across multiple rows, 196 obs of three variabes.
surveys_spread <- surveys_gw %>%
spread(key = genus, value = mean_weight)
head(surveys_spread)
## # A tibble: 6 x 11
## # Groups: plot_id [6]
## plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7 22.2 60.2 156. 27.7 9.62 22.2
## 2 2 6 25.1 55.7 169. 26.9 6.95 22.3
## 3 3 8.61 24.6 52.0 158. 26.0 7.51 21.4
## 4 4 NA 23.0 57.5 164. 28.1 7.82 22.6
## 5 5 7.75 18.0 51.1 190. 27.0 8.66 21.2
## 6 6 NA 24.9 58.6 180. 25.9 7.81 21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## # Spermophilus <dbl>
str(surveys_spread)
## tibble [24 × 11] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ plot_id : num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
## $ Baiomys : num [1:24] 7 6 8.61 NA 7.75 ...
## $ Chaetodipus : num [1:24] 22.2 25.1 24.6 23 18 ...
## $ Dipodomys : num [1:24] 60.2 55.7 52 57.5 51.1 ...
## $ Neotoma : num [1:24] 156 169 158 164 190 ...
## $ Onychomys : num [1:24] 27.7 26.9 26 28.1 27 ...
## $ Perognathus : num [1:24] 9.62 6.95 7.51 7.82 8.66 ...
## $ Peromyscus : num [1:24] 22.2 22.3 21.4 22.6 21.2 ...
## $ Reithrodontomys: num [1:24] 11.4 10.7 10.5 10.3 11.2 ...
## $ Sigmodon : num [1:24] NA 70.9 65.6 82 82.7 ...
## $ Spermophilus : num [1:24] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ .rows : list<int> [1:24]
## .. ..$ : int 1
## .. ..$ : int 2
## .. ..$ : int 3
## .. ..$ : int 4
## .. ..$ : int 5
## .. ..$ : int 6
## .. ..$ : int 7
## .. ..$ : int 8
## .. ..$ : int 9
## .. ..$ : int 10
## .. ..$ : int 11
## .. ..$ : int 12
## .. ..$ : int 13
## .. ..$ : int 14
## .. ..$ : int 15
## .. ..$ : int 16
## .. ..$ : int 17
## .. ..$ : int 18
## .. ..$ : int 19
## .. ..$ : int 20
## .. ..$ : int 21
## .. ..$ : int 22
## .. ..$ : int 23
## .. ..$ : int 24
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
surveys_gw %>%
spread(genus, mean_weight, fill = 0) %>%
head()
## # A tibble: 6 x 11
## # Groups: plot_id [6]
## plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7 22.2 60.2 156. 27.7 9.62 22.2
## 2 2 6 25.1 55.7 169. 26.9 6.95 22.3
## 3 3 8.61 24.6 52.0 158. 26.0 7.51 21.4
## 4 4 0 23.0 57.5 164. 28.1 7.82 22.6
## 5 5 7.75 18.0 51.1 190. 27.0 8.66 21.2
## 6 6 0 24.9 58.6 180. 25.9 7.81 21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## # Spermophilus <dbl>
gather() can be used to do the opposite; takes in four principle arguments: 1. the data 2. the key column variable we wish to create from column names. #. the values column variable we wish to create and fill with values associated with the key. 4. the names of the columns we use to fill the key variable (or to drop).
surveys_gather <- surveys_spread %>%
gather(key = "genus", value = "mean_weight", -plot_id)
head(surveys_gather)
## # A tibble: 6 x 3
## # Groups: plot_id [6]
## plot_id genus mean_weight
## <dbl> <chr> <dbl>
## 1 1 Baiomys 7
## 2 2 Baiomys 6
## 3 3 Baiomys 8.61
## 4 4 Baiomys NA
## 5 5 Baiomys 7.75
## 6 6 Baiomys NA
str(surveys_gather)
## tibble [240 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ plot_id : num [1:240] 1 2 3 4 5 6 7 8 9 10 ...
## $ genus : chr [1:240] "Baiomys" "Baiomys" "Baiomys" "Baiomys" ...
## $ mean_weight: num [1:240] 7 6 8.61 NA 7.75 ...
## - attr(*, "groups")= tibble [24 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ plot_id: num [1:24] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ .rows : list<int> [1:24]
## .. ..$ : int [1:10] 1 25 49 73 97 121 145 169 193 217
## .. ..$ : int [1:10] 2 26 50 74 98 122 146 170 194 218
## .. ..$ : int [1:10] 3 27 51 75 99 123 147 171 195 219
## .. ..$ : int [1:10] 4 28 52 76 100 124 148 172 196 220
## .. ..$ : int [1:10] 5 29 53 77 101 125 149 173 197 221
## .. ..$ : int [1:10] 6 30 54 78 102 126 150 174 198 222
## .. ..$ : int [1:10] 7 31 55 79 103 127 151 175 199 223
## .. ..$ : int [1:10] 8 32 56 80 104 128 152 176 200 224
## .. ..$ : int [1:10] 9 33 57 81 105 129 153 177 201 225
## .. ..$ : int [1:10] 10 34 58 82 106 130 154 178 202 226
## .. ..$ : int [1:10] 11 35 59 83 107 131 155 179 203 227
## .. ..$ : int [1:10] 12 36 60 84 108 132 156 180 204 228
## .. ..$ : int [1:10] 13 37 61 85 109 133 157 181 205 229
## .. ..$ : int [1:10] 14 38 62 86 110 134 158 182 206 230
## .. ..$ : int [1:10] 15 39 63 87 111 135 159 183 207 231
## .. ..$ : int [1:10] 16 40 64 88 112 136 160 184 208 232
## .. ..$ : int [1:10] 17 41 65 89 113 137 161 185 209 233
## .. ..$ : int [1:10] 18 42 66 90 114 138 162 186 210 234
## .. ..$ : int [1:10] 19 43 67 91 115 139 163 187 211 235
## .. ..$ : int [1:10] 20 44 68 92 116 140 164 188 212 236
## .. ..$ : int [1:10] 21 45 69 93 117 141 165 189 213 237
## .. ..$ : int [1:10] 22 46 70 94 118 142 166 190 214 238
## .. ..$ : int [1:10] 23 47 71 95 119 143 167 191 215 239
## .. ..$ : int [1:10] 24 48 72 96 120 144 168 192 216 240
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
# you can also specify what columns to include - but for some reason when I tried to change the range of genera, it got messed up
surveys_spread %>%
gather(key = "genus", value = "mean_weight", Baiomys:Spermophilus)
## # A tibble: 240 x 3
## # Groups: plot_id [24]
## plot_id genus mean_weight
## <dbl> <chr> <dbl>
## 1 1 Baiomys 7
## 2 2 Baiomys 6
## 3 3 Baiomys 8.61
## 4 4 Baiomys NA
## 5 5 Baiomys 7.75
## 6 6 Baiomys NA
## 7 7 Baiomys NA
## 8 8 Baiomys NA
## 9 9 Baiomys NA
## 10 10 Baiomys NA
## # … with 230 more rows
genera_per_plot <- surveys %>%
group_by(plot_id, year) %>%
summarize(genera = n_distinct(genus))
## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)
# I guess it knows to do distinct per plot_id and not year because it is listed first?
# now lets add spread
genera_per_plot <- surveys %>%
group_by(plot_id, year) %>%
summarize(genera = n_distinct(genus)) %>%
spread(year, genera)
## `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)
head(genera_per_plot)
## # A tibble: 6 x 27
## # Groups: plot_id [6]
## plot_id `1977` `1978` `1979` `1980` `1981` `1982` `1983` `1984` `1985` `1986`
## <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 2 3 4 7 5 6 7 6 4 3
## 2 2 6 6 6 8 5 9 9 9 6 4
## 3 3 5 6 4 6 6 8 10 11 7 6
## 4 4 4 4 3 4 5 4 6 3 4 3
## 5 5 4 3 2 5 4 6 7 7 3 1
## 6 6 3 4 3 4 5 9 9 7 5 6
## # … with 16 more variables: `1987` <int>, `1988` <int>, `1989` <int>,
## # `1990` <int>, `1991` <int>, `1992` <int>, `1993` <int>, `1994` <int>,
## # `1995` <int>, `1996` <int>, `1997` <int>, `1998` <int>, `1999` <int>,
## # `2000` <int>, `2001` <int>, `2002` <int>
gathered_genera <- genera_per_plot %>%
gather("year", "genera", -plot_id)
# why does gather need columns to be in quotes?
head(gathered_genera)
## # A tibble: 6 x 3
## # Groups: plot_id [6]
## plot_id year genera
## <dbl> <chr> <int>
## 1 1 1977 2
## 2 2 1977 6
## 3 3 1977 5
## 4 4 1977 4
## 5 5 1977 4
## 6 6 1977 3
First, use gather() to create a dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being gathered.
surveys_long <- surveys %>%
gather ("measurement", "value", hindfoot_length, weight)
str(surveys_long)
## tibble [69,572 × 13] (S3: tbl_df/tbl/data.frame)
## $ record_id : num [1:69572] 1 72 224 266 349 363 435 506 588 661 ...
## $ month : num [1:69572] 7 8 9 10 11 11 12 1 2 3 ...
## $ day : num [1:69572] 16 19 13 16 12 12 10 8 18 11 ...
## $ year : num [1:69572] 1977 1977 1977 1977 1977 ...
## $ plot_id : num [1:69572] 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : chr [1:69572] "NL" "NL" "NL" "NL" ...
## $ sex : chr [1:69572] "M" "M" NA NA ...
## $ genus : chr [1:69572] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
## $ species : chr [1:69572] "albigula" "albigula" "albigula" "albigula" ...
## $ taxa : chr [1:69572] "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr [1:69572] "Control" "Control" "Control" "Control" ...
## $ measurement: chr [1:69572] "hindfoot_length" "hindfoot_length" "hindfoot_length" "hindfoot_length" ...
## $ value : num [1:69572] 32 31 NA NA NA NA NA NA NA NA ...
This doubled the amount of rows. It added hindfootlength to one and then weight to another for each obs.
measurements <- surveys_long %>%
group_by(year, measurement, plot_type) %>%
summarize(mean_value = mean(value, na.rm=T)) %>%
spread(measurement, mean_value)
## `summarise()` regrouping output by 'year', 'measurement' (override with `.groups` argument)
head(measurements)
## # A tibble: 6 x 4
## # Groups: year [2]
## year plot_type hindfoot_length weight
## <dbl> <chr> <dbl> <dbl>
## 1 1977 Control 36.1 50.4
## 2 1977 Long-term Krat Exclosure 33.7 34.8
## 3 1977 Rodent Exclosure 39.1 48.2
## 4 1977 Short-term Krat Exclosure 35.8 41.3
## 5 1977 Spectab exclosure 37.2 47.1
## 6 1978 Control 38.1 70.8
First, remove Nas
surveys_complete <- surveys %>%
filter(!is.na(weight),
!is.na(hindfoot_length),
!is.na(sex))
Remove rare species; first make a subset of the most common species (50 or more observations)
species_counts <- surveys_complete %>%
count(species_id) %>%
filter(n >= 50)
Then subset the complete data set with those species names
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
Double check we have the right dimensions (30,463 x 13)
dim(surveys_complete)
## [1] 30463 13
Write new csv file
write_csv(surveys_complete, path = "generated_data/surveys_complete.csv")