Prepro 3: Demo

Published

March 5, 2024

In this demo, we introduce further tools from the tidyverse and explain them using examples. The tidyverse tools make working with data much easier and have become a must-have for data analysis in R.

We cannot show you everything the tidyverse can do. We will therefore focus on the most important components and also introduce some additional functions that we use often but that may not yet be known to you. If you want to delve deeper into the topic, you should read Wickham, Çetinkaya-Rundel, and Grolemund (2023), which is available online: https://r4ds.hadley.nz/

We will need the following packages:

library("dplyr")
library("tidyr")
library("lubridate")
library("readr")
library("ggplot2")

Load data

Let's load the weather data (source: MeteoSchweiz) from the last exercise.

weather <- read_delim("datasets/prepro/weather.csv", ",")

weather <- weather |>
  mutate(
    stn = as.factor(stn),
    time = as.POSIXct(as.character(time), format = "%Y%m%d%H")
  )

Calculate values

We would like to calculate the average of all measured temperature values. To do this, we could use the following command:

mean(weather$tre200h0, na.rm = TRUE)
## [1] 6.324744

The option na.rm = TRUE means that NA values are excluded from the calculation.

Other summary statistics can be calculated in the same way, for example the maximum with max(), the minimum with min(), or the median with median().
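As a minimal sketch of how na.rm affects these functions, the following uses a small made-up vector. The name temps and its values are hypothetical and not part of the weather dataset.

```r
# Hypothetical temperature readings with one missing value
temps <- c(2.5, NA, -1.0, 4.5)

mean(temps)                  # NA, because one value is missing
mean(temps, na.rm = TRUE)    # 2
max(temps, na.rm = TRUE)     # 4.5
min(temps, na.rm = TRUE)     # -1
median(temps, na.rm = TRUE)  # 2.5
```

Without na.rm = TRUE, a single missing value is enough to make the result NA.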

This approach only works well if we want to calculate values across all observations for a variable (column). As soon as we want to group the observations, for example to calculate the average temperature per month, it becomes difficult.

Convenience Variables

To solve this task, the month must first be extracted (the month is the convenience variable). For this we need the lubridate::month() function.
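As a quick illustration of what month() returns, here is a sketch with a single made-up timestamp in the same "%Y%m%d%H" format as the weather data. The variable name timestamp and the explicit tz = "UTC" are assumptions for this example only.

```r
library("lubridate")

# Hypothetical timestamp: 1 January 2000, 15:00
timestamp <- as.POSIXct("2000010115", format = "%Y%m%d%H", tz = "UTC")

month(timestamp)  # 1
hour(timestamp)   # 15
```

The function returns the month as a number, which is convenient for grouping.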

Now the month convenience variable can be created. Without using dplyr, a new column can be added as follows:

weather$month <- month(weather$time)

With dplyr the same command looks like this:

weather <- mutate(weather, month = month(time))

The main advantage of dplyr is not yet apparent at this point. However, this will become clear later.

Calculate values from groups

To calculate the average value per month with base R, you can first create a subset with [] and calculate the average value as follows:

mean(weather$tre200h0[weather$month == 1], na.rm = TRUE)
## [1] -1.963239

We would have to repeat this for every month, which is of course very cumbersome. That is why we use the dplyr package. It allows us to complete the task (calculating temperature means per month) as follows:

summarise(group_by(weather, month), temp_average = mean(tre200h0, na.rm = TRUE))
## # A tibble: 12 × 2
##    month temp_average
##    <dbl>        <dbl>
##  1     1       -1.96 
##  2     2        0.355
##  3     3        2.97 
##  4     4        4.20 
##  5     5       11.0  
##  6     6       12.4  
##  7     7       13.0  
##  8     8       15.0  
##  9     9        9.49 
## 10    10        8.79 
## 11    11        1.21 
## 12    12       -0.898
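To check that the grouped summary gives the same numbers as the manual subsetting, here is a small self-contained sketch. The data frame toy and its values are made up for illustration; only the column names follow the weather data.

```r
library("dplyr")

# Hypothetical mini dataset: two months, one missing value
toy <- data.frame(
  month    = c(1, 1, 2, 2),
  tre200h0 = c(-2, 0, 3, NA)
)

# Base R: one call per month
jan <- mean(toy$tre200h0[toy$month == 1], na.rm = TRUE)  # -1
feb <- mean(toy$tre200h0[toy$month == 2], na.rm = TRUE)  # 3

# dplyr: all months in a single call
toy_means <- summarise(
  group_by(toy, month),
  temp_average = mean(tre200h0, na.rm = TRUE)
)

toy_means$temp_average  # -1  3
```

Both approaches give the same values, but the dplyr version does not grow with the number of groups.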

Concatenate vs. Nest

Translated into English, the above operation is as follows:

  1. Take the weather dataset
  2. Form groups per month (group_by(weather, month))
  3. Calculate the mean temperature (mean(tre200h0))

The translation from R -> English looks different because we read the operation in a concatenated form in English (operation 1 → 2 → 3) while the computer reads it as a nested operation 3(2(1)). To make R closer to English, you can use the |> operator (see Wickham, Çetinkaya-Rundel, and Grolemund 2023, chap. 4.3).

# 1 take the dataset "weather"
# 2 form groups per month
# 3 calculate the average temperature

summarise(group_by(weather, month), temp_average = mean(tre200h0, na.rm = TRUE))
#                  \__1__/
#         \___________2__________/
# \___________________3________________________________________________________/

# becomes:

weather |>                                               # 1
  group_by(month) |>                                     # 2
  summarise(temp_average = mean(tre200h0, na.rm = TRUE)) # 3

This concatenation by means of |> (called the pipe) makes the code a lot easier to write and read, and we will use it in the following exercises. The |> pipe is built into base R (version 4.1.0 and later). The older %>% pipe, which works in much the same way, is provided by the magrittr package and is made available when dplyr is loaded. There are several online tutorials about dplyr (see Wickham, Çetinkaya-Rundel, and Grolemund 2023, Part “Transform” or this youtube tutorial).
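The pipe is not specific to dplyr. As a minimal sketch using only base R functions (and assuming R 4.1.0 or later for |>), the nested and the piped form of the same calculation give identical results. The numbers and the names nested and piped are made up for illustration.

```r
# Nested: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped: read from left to right
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

nested                    # 6
identical(nested, piped)  # TRUE
```

The pipe passes the result on its left as the first argument to the function on its right, so x |> f() is just another way of writing f(x).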

Therefore, we will not explain all of these tools in full detail. Instead, we will focus on the important differences between two main functions in dplyr: mutate() and summarise().

  • summarise() summarises a data set. The number of observations (rows) is reduced to the number of groups (e.g., one summarised observation (row) per month). In addition, the number of variables (columns) is reduced to the grouping variables and those specified in the summarise() call (e.g., temp_average).
  • mutate() adds additional variables (columns) to a data.frame (see example below).

# Maximum and minimum temperature per day of the month in January
weather_summary <- weather |>               # 1) take the dataset "weather"
  filter(month == 1) |>                     # 2) filter for the month of January
  mutate(day = day(time)) |>                # 3) create a new column "day"
  group_by(day) |>                          # 4) Use the new column to form groups
  summarise(
    temp_max = max(tre200h0, na.rm = TRUE), # 5) Calculate the maximum
    temp_min = min(tre200h0, na.rm = TRUE)  # 6) Calculate the minimum
  )

weather_summary
## # A tibble: 31 × 3
##      day temp_max temp_min
##    <int>    <dbl>    <dbl>
##  1     1      5.8     -4.4
##  2     2      2.8     -4.3
##  3     3      4.2     -3.1
##  4     4      4.7     -2.8
##  5     5     11.4     -0.6
##  6     6      6.7     -1.6
##  7     7      2.9     -2.8
##  8     8      0.2     -3.6
##  9     9      2.1     -8.8
## 10    10      1.6     -2.4
## # ℹ 21 more rows
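
To make the difference between mutate() and summarise() concrete, the following sketch applies both to the same grouped data and compares the number of rows. The data frame toy and its values are hypothetical; only the column names follow the weather data.

```r
library("dplyr")

# Hypothetical mini dataset: 5 observations in 2 months
toy <- data.frame(
  month    = c(1, 1, 2, 2, 2),
  tre200h0 = c(-2, 0, 3, 5, 4)
)

# mutate() keeps all 5 rows and adds the group mean as a new column
toy_mutated <- toy |>
  group_by(month) |>
  mutate(temp_average = mean(tre200h0)) |>
  ungroup()

# summarise() collapses the data to one row per group
toy_summarised <- toy |>
  group_by(month) |>
  summarise(temp_average = mean(tre200h0))

nrow(toy_mutated)      # 5
nrow(toy_summarised)   # 2
names(toy_summarised)  # "month" "temp_average"
```

With mutate(), every row receives the mean of its group, whereas summarise() returns only the grouping column and the newly calculated column.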