PrePro 1: Exercise

Published

February 20, 2024

Working with RStudio “Project”

We recommend using “Projects” within RStudio. RStudio then creates a folder for each project in which the project file is stored (file extension .rproj). If Rscripts are loaded or generated within the project, they are then also stored in the project folder. You can find out more about RStudio Projects here.

There are several benefits to using Projects. You can:

specify the Working Directory without using an explicit path (setwd()). This is useful because the path can change (when collaborating with other users, or executing the script at a later date)
automatically cache open scripts and restore open scripts in the next session
set different project-specific options
use version control systems (e.g., git)

Working with libraries / packages

R packages have become indispensable. The vast majority of packages are hosted on CRAN and can be easily installed using install.packages(). A very important collection of packages is being developed by RStudio. Tidyverse offers a range of packages that make everyday life enormously easier. We will discuss the “Tidy” universe in more detail later. For now, we can simply install the most important packages from tidyverse (we will only be using a small selection of them today).

There are two ways to use a package in R:

either you load it at the beginning of the R-session by means of library("dplyr") (without quotation marks).
or you call a function by prefixing it with the package name and two colons. dplyr::filter() calls the filter() function from the dplyr package.

The second method is particularly useful if two different functions with the same name exist in different packages. For example, filter() exists as a function in both the dplyr and stats packages. This is called masking.

To get started, we’ll load the necessary packages:

library("readr")
library("lubridate")

Task 1

Create a data.frame with the following data. Tipp: Create a vector for each column first.

Sample Solution

df <- data.frame(
  Species = c("Fox", "Bear", "Rabbit", "Moose"),
  Number = c(2, 5, 1, 3),
  Weight = c(4.4, 40.3, 1.1, 120),
  Sex = c("m", "f", "m", "m"),
  Description = c("Reddish", "Brown, large", "Small, with long ears", "Long legs, shovel antlers")
)

Species	Number	Weight	Sex	Description
Fox	2	4.4	m	Reddish
Bear	5	40.3	f	Brown, large
Rabbit	1	1.1	m	Small, with long ears
Moose	3	120.0	m	Long legs, shovel antlers

Task 2

What types of data were automatically accepted in the last task? Check this using str(), see whether they make sense and convert where necessary.

Sample Solution

str(df)
## 'data.frame':    4 obs. of  5 variables:
##  $ Species    : chr  "Fox" "Bear" "Rabbit" "Moose"
##  $ Number     : num  2 5 1 3
##  $ Weight     : num  4.4 40.3 1.1 120
##  $ Sex        : chr  "m" "f" "m" "m"
##  $ Description: chr  "Reddish" "Brown, large" "Small, with long ears" "Long legs, shovel antlers"
typeof(df$Number)
## [1] "double"
# Number was interpreted as `double`, but it is actually an `integer`.

df$Number <- as.integer(df$Number)

# We know sex only has two options:
df$Sex <- factor(df$Sex, levels = c("m","f"))

Task 3

On Moodle, you will find a folder called Datasets. Download the file and move it in your project folder. Import the weather.csv file. If you use the RStudio GUI for this, save the import command in your R-Script. Please use a relative path (i.e., not a path starting with C:\, or similar).)

Note

I use readr to import csv files and the read_delim function (with underscore) as an alternative to read.csv or read.delim (with a dot). However, this is a personal preference¹, and it is up to you which function you use. Remember that the two functions require slightly different parameters.

Sample Solution

weather <- read_delim("datasets/prepro/weather.csv", ",")

stn	time	tre200h0
ABO	2000010100	-2.6
ABO	2000010101	-2.5
ABO	2000010102	-3.1
ABO	2000010103	-2.4
ABO	2000010104	-2.5
ABO	2000010105	-3.0
ABO	2000010106	-3.7
ABO	2000010107	-4.4
ABO	2000010108	-4.1
ABO	2000010109	-4.1

Task 4

Check the feedback from read_delim(). Have the data been interpreted correctly?

Sample Solution

# The 'time' column was interpreted as 'integer'. However, it is 
# obviously a time indication.

Task 5

The time column is a date/time with a format of YYYYMMDDHH ( see meta.txt). In order for R to recognise the data in this column as date/time, it must be in the correct format (POSIXct). Therefore, we must tell R what the current format is. Use as.POSIXct() to read the column into R, remembering to specify both format and tz.

Tip

If no time zone is set, as.POSIXct() sets a default (based on sys.timezone()). In our case, however, these are values in UTC (see metadata.csv)
as.POSIXct requires a character input: If you receive the error message 'origin' must be supplied (or similar), you have probably tried to input a numeric into the function with.

Sample Solution

weather$time <- as.POSIXct(as.character(weather$time), format = "%Y%m%d%H", tz = "UTC")

The new table should look like this
stn	time	tre200h0
ABO	2000-01-01 00:00:00	-2.6
ABO	2000-01-01 01:00:00	-2.5
ABO	2000-01-01 02:00:00	-3.1
ABO	2000-01-01 03:00:00	-2.4
ABO	2000-01-01 04:00:00	-2.5
ABO	2000-01-01 05:00:00	-3.0
ABO	2000-01-01 06:00:00	-3.7
ABO	2000-01-01 07:00:00	-4.4
ABO	2000-01-01 08:00:00	-4.1
ABO	2000-01-01 09:00:00	-4.1

Task 6

Create two new columns for day of week (Monday, Tuesday, etc) and calendar week. Use the newly created POSIXct column and a suitable function from lubridate.

Sample Solution

weather$weekday <- wday(weather$time, label = T)
weather$week <- week(weather$time)

stn	time	tre200h0	weekday	week
ABO	2000-01-01 00:00:00	-2.6	Sat	1
ABO	2000-01-01 01:00:00	-2.5	Sat	1
ABO	2000-01-01 02:00:00	-3.1	Sat	1
ABO	2000-01-01 03:00:00	-2.4	Sat	1
ABO	2000-01-01 04:00:00	-2.5	Sat	1
ABO	2000-01-01 05:00:00	-3.0	Sat	1
ABO	2000-01-01 06:00:00	-3.7	Sat	1
ABO	2000-01-01 07:00:00	-4.4	Sat	1
ABO	2000-01-01 08:00:00	-4.1	Sat	1
ABO	2000-01-01 09:00:00	-4.1	Sat	1

Advantages of read_delim over read.csv: https://stackoverflow.com/a/60374974/4139249 ↩︎