PrePro 1: Exercise

Published

February 20, 2024

Working with RStudio “Project”

We recommend using “Projects” within RStudio. RStudio creates a folder for each project, in which the project file (file extension .Rproj) is stored. R scripts that are loaded or created within the project are stored in the project folder as well. You can find out more about RStudio Projects here.

There are several benefits to using Projects. You can:

  • specify the Working Directory without using an explicit path (setwd()). This is useful because the path can change (when collaborating with other users, or executing the script at a later date)
  • automatically cache open scripts and restore open scripts in the next session
  • set different project-specific options
  • use version control systems (e.g., git)
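
The first benefit can be illustrated with a minimal sketch. It assumes the script is run from inside an RStudio Project, so that the working directory is the project folder; the folder and file names are hypothetical.

```r
# Inside an RStudio Project the working directory is the project folder,
# so no call to setwd() with an absolute path is needed.
getwd()  # prints the current working directory

# Build a path relative to the working directory.
# "datasets" and "example.csv" are hypothetical names.
rel_path <- file.path("datasets", "example.csv")
rel_path

# TRUE only if such a file really exists in the project folder
file.exists(rel_path)
```

Because the path does not start at a drive letter or a user's home folder, the same script should run unchanged on another computer as long as the project folder is copied as a whole.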

Task 1

Create a data.frame with the following data. Tip: Create a vector for each column first.
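
As a hint, the vector-first approach could look roughly like this. The data below are made up for illustration and are not the data of this task.

```r
# One vector per column (hypothetical example data)
name      <- c("Oak", "Pine", "Birch")
height_m  <- c(21.5, 18.0, 12.3)
evergreen <- c(FALSE, TRUE, FALSE)

# Combine the vectors; the vector names become the column names
trees <- data.frame(name, height_m, evergreen)
trees
```

All vectors must have the same length, otherwise data.frame() stops with an error.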

Sample Solution
df <- data.frame(
  Species = c("Fox", "Bear", "Rabbit", "Moose"),
  Number = c(2, 5, 1, 3),
  Weight = c(4.4, 40.3, 1.1, 120),
  Sex = c("m", "f", "m", "m"),
  Description = c("Reddish", "Brown, large", "Small, with long ears", "Long legs, shovel antlers")
)
Species  Number  Weight  Sex  Description
Fox      2       4.4     m    Reddish
Bear     5       40.3    f    Brown, large
Rabbit   1       1.1     m    Small, with long ears
Moose    3       120.0   m    Long legs, shovel antlers

Task 2

Which data types were automatically assigned to the columns in the last task? Check this using str(), decide whether they make sense, and convert them where necessary.

Sample Solution
str(df)
## 'data.frame':    4 obs. of  5 variables:
##  $ Species    : chr  "Fox" "Bear" "Rabbit" "Moose"
##  $ Number     : num  2 5 1 3
##  $ Weight     : num  4.4 40.3 1.1 120
##  $ Sex        : chr  "m" "f" "m" "m"
##  $ Description: chr  "Reddish" "Brown, large" "Small, with long ears" "Long legs, shovel antlers"
typeof(df$Number)
## [1] "double"
# Number was interpreted as `double`, but it is actually an `integer`.

df$Number <- as.integer(df$Number)

# We know sex only has two options:
df$Sex <- factor(df$Sex, levels = c("m","f"))
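
To confirm that the conversions had the intended effect, the columns can be inspected again. The following sketch rebuilds only the two affected columns so that it runs on its own; the name df_check is chosen here for illustration.

```r
# Rebuild the two columns that were converted above
df_check <- data.frame(
  Number = c(2, 5, 1, 3),
  Sex = c("m", "f", "m", "m")
)

df_check$Number <- as.integer(df_check$Number)
df_check$Sex <- factor(df_check$Sex, levels = c("m", "f"))

typeof(df_check$Number)  # now "integer" instead of "double"
levels(df_check$Sex)     # the two allowed values, in the order given
table(df_check$Sex)      # counts per level
```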

Input: Libraries / packages

Libraries (aka packages) are “extensions” to the basic R functionality. R packages have become indispensable to using R. The vast majority of packages are hosted on CRAN and can be easily installed using install.packages("packagename"). This installation is done only once. To use the library, you must load it into each new R session using library(packagename).

For example, to import data, we recommend the readr package (see footnote 1). Install the package using the command install.packages("readr"). To use the package, load it into the current R session using library("readr").
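
One common pattern, shown here as an optional sketch rather than a required step, is to install a package only if it is missing and then load it. It assumes an internet connection and a configured CRAN mirror for the installation.

```r
# Install readr only when it is not yet available on this computer
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}

# Loading is needed in every new R session
library("readr")
```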

Task 3

On Moodle, you will find a folder called Datasets. Download the file and move it into your project folder. Import the weather.csv file. If you use the RStudio GUI for this, save the import command in your R script. Please use a relative path (i.e., not a path starting with C:\ or similar).

Sample Solution

library("readr")


weather <- read_delim("datasets/prepro/weather.csv", ",")
stn time tre200h0
ABO 2000010100 -2.6
ABO 2000010101 -2.5
ABO 2000010102 -3.1
ABO 2000010103 -2.4
ABO 2000010104 -2.5
ABO 2000010105 -3.0
ABO 2000010106 -3.7
ABO 2000010107 -4.4
ABO 2000010108 -4.1
ABO 2000010109 -4.1

Task 4

Have a look at your dataset in the console. Have the data been interpreted correctly?

Sample Solution
# The 'time' column was interpreted as a number. However, it is
# obviously a date/time indication.
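
The column types can also be checked programmatically. The sketch below writes a few hypothetical rows in the same layout as weather.csv to a temporary file, so that it runs without the course data.

```r
library("readr")

# Hypothetical sample in the same layout as weather.csv
demo_file <- tempfile(fileext = ".csv")
writeLines(
  c("stn,time,tre200h0",
    "ABO,2000010100,-2.6",
    "ABO,2000010101,-2.5",
    "ABO,2000010102,-3.1"),
  demo_file
)

weather_demo <- read_delim(demo_file, delim = ",")

# Structure of the imported table, one line per column
str(weather_demo)

# The time column is read as a number, not as a date/time
is.numeric(weather_demo$time)
inherits(weather_demo$time, "POSIXct")
```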

Task 5

The time column is a date/time with a format of YYYYMMDDHH. In order for R to recognise the data in this column as date/time, it must be in the correct format (POSIXct). Therefore, we must tell R what the current format is. Use as.POSIXct() to read the column into R, remembering to specify both format and tz.

Tip
  • If no time zone is set, as.POSIXct() uses a default (based on Sys.timezone()). In our case, however, the values are in UTC (see metadata.csv).
  • as.POSIXct() requires a character input here: if you receive the error message 'origin' must be supplied (or similar), you have probably passed a numeric to the function.
Sample Solution
weather$time <- as.POSIXct(as.character(weather$time), format = "%Y%m%d%H", tz = "UTC")
The new table should look like this:
stn time tre200h0
ABO 2000-01-01 00:00:00 -2.6
ABO 2000-01-01 01:00:00 -2.5
ABO 2000-01-01 02:00:00 -3.1
ABO 2000-01-01 03:00:00 -2.4
ABO 2000-01-01 04:00:00 -2.5
ABO 2000-01-01 05:00:00 -3.0
ABO 2000-01-01 06:00:00 -3.7
ABO 2000-01-01 07:00:00 -4.4
ABO 2000-01-01 08:00:00 -4.1
ABO 2000-01-01 09:00:00 -4.1
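
To see why both format and tz matter, it can help to parse a single value. This sketch uses the first time stamp of the table above; the second time zone, Europe/Zurich, is only an assumed example of a local default.

```r
raw_time <- 2000010100  # numeric, as it comes out of the import

# Convert to character first, then describe the layout with format
parsed_utc <- as.POSIXct(as.character(raw_time),
                         format = "%Y%m%d%H", tz = "UTC")
format(parsed_utc, "%Y-%m-%d %H:%M:%S")  # "2000-01-01 00:00:00"

# The same text read with a different time zone is a different instant
parsed_zurich <- as.POSIXct(as.character(raw_time),
                            format = "%Y%m%d%H", tz = "Europe/Zurich")
hours_apart <- as.numeric(difftime(parsed_utc, parsed_zurich, units = "hours"))
hours_apart  # 1: midnight in Zurich is one hour before midnight UTC in January
```

Omitting tz would therefore silently shift every measurement by the offset of the computer's local time zone.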

Task 6

Create two new columns for day of week (Monday, Tuesday, etc) and calendar week. Use the newly created POSIXct column and a suitable function from lubridate.

Sample Solution

library("lubridate")


weather$weekday <- wday(weather$time, label = TRUE)
weather$week <- week(weather$time)
stn time tre200h0 weekday week
ABO 2000-01-01 00:00:00 -2.6 Sat 1
ABO 2000-01-01 01:00:00 -2.5 Sat 1
ABO 2000-01-01 02:00:00 -3.1 Sat 1
ABO 2000-01-01 03:00:00 -2.4 Sat 1
ABO 2000-01-01 04:00:00 -2.5 Sat 1
ABO 2000-01-01 05:00:00 -3.0 Sat 1
ABO 2000-01-01 06:00:00 -3.7 Sat 1
ABO 2000-01-01 07:00:00 -4.4 Sat 1
ABO 2000-01-01 08:00:00 -4.1 Sat 1
ABO 2000-01-01 09:00:00 -4.1 Sat 1
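
Note that “calendar week” can be defined in more than one way. The sketch below compares week() with isoweek(), which is also part of lubridate, for the first time stamp of the dataset; which definition fits is a judgment call that depends on the analysis.

```r
library("lubridate")

first_hour <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")

# Day of week as a number (with the default setting, Sunday = 1)
wday(first_hour)                # 7, i.e. Saturday

# Day of week as a label; the text depends on the system locale
wday(first_hour, label = TRUE)

# week() counts seven-day blocks starting on 1 January
week(first_hour)                # 1

# isoweek() follows ISO 8601, where weeks start on Monday
isoweek(first_hour)             # 52 (this day belongs to the last week of 1999)
```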

  1. Advantages of read_delim over read.csv: https://stackoverflow.com/a/60374974/4139249