library("readr")
library("lubridate")
PrePro 1: Exercise
Working with RStudio “Project”
We recommend using “Projects” within RStudio. RStudio then creates a folder for each project in which the project file is stored (file extension .Rproj). If R scripts are loaded or generated within the project, they are then also stored in the project folder. You can find out more in the RStudio Projects documentation.
There are several benefits to using Projects. You can:
- specify the working directory without using an explicit path (setwd()). This is useful because the path can change (when collaborating with other users, or executing the script at a later date)
- automatically cache open scripts and restore them in the next session
- set different project-specific options
- use version control systems (e.g., git)
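As a minimal sketch of the first point: inside a Project, the working directory is the project folder, so files can be addressed relative to it. The path below is the one used later in this exercise. Whether the file exists on your machine depends on where you stored the data.

```r
# With an RStudio Project open, this prints the project folder
getwd()

# Build a relative path instead of hard-coding something like C:\...
rel_path <- file.path("datasets", "prepro", "weather.csv")
rel_path

# TRUE only if the file has already been placed in the project folder
file.exists(rel_path)
```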
Working with libraries / packages
R packages have become indispensable. The vast majority of packages are hosted on CRAN and can be easily installed using install.packages(). A very important collection of packages is being developed by RStudio. Tidyverse offers a range of packages that make everyday data work much easier. We will discuss the “Tidy” universe in more detail later. For now, we can simply install the most important packages from tidyverse (we will only be using a small selection of them today).
There are two ways to use a package in R:
- either you load it at the beginning of the R session by means of library("dplyr") (the quotation marks are optional).
- or you call a function by prefixing it with the package name and two colons. dplyr::filter() calls the filter() function from the dplyr package.
The second method is particularly useful if two different functions with the same name exist in different packages. For example, filter() exists as a function in both the dplyr and stats packages. This is called masking.
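The difference between the two filter() functions can be illustrated with a short sketch. The input values are made up for illustration, and the dplyr call is guarded so that the code also runs where dplyr is not installed.

```r
# stats::filter() applies a linear filter to a numeric series,
# here a 3-point moving average
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg

# dplyr::filter() keeps the rows of a data frame that match a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  large_x <- dplyr::filter(data.frame(x = 1:5), x > 3)
  print(large_x)
}
```

Writing the package prefix makes it unambiguous which function is meant, regardless of which packages were loaded with library() and in which order.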
To get started, we’ll load the necessary packages, readr and lubridate, with library().
Task 1
Create a data.frame with the following data. Tip: Create a vector for each column first.
Sample Solution
df <- data.frame(
  Species = c("Fox", "Bear", "Rabbit", "Moose"),
  Number = c(2, 5, 1, 3),
  Weight = c(4.4, 40.3, 1.1, 120),
  Sex = c("m", "f", "m", "m"),
  Description = c("Reddish", "Brown, large", "Small, with long ears", "Long legs, shovel antlers")
)
Species | Number | Weight | Sex | Description |
---|---|---|---|---|
Fox | 2 | 4.4 | m | Reddish |
Bear | 5 | 40.3 | f | Brown, large |
Rabbit | 1 | 1.1 | m | Small, with long ears |
Moose | 3 | 120.0 | m | Long legs, shovel antlers |
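The tip from the task (one vector per column first) could be sketched as follows. The lowercase vector names and the name animals are chosen here for illustration only. The result should have the same content as the sample solution.

```r
# One vector per column
species     <- c("Fox", "Bear", "Rabbit", "Moose")
number      <- c(2, 5, 1, 3)
weight      <- c(4.4, 40.3, 1.1, 120)
sex         <- c("m", "f", "m", "m")
description <- c("Reddish", "Brown, large", "Small, with long ears",
                 "Long legs, shovel antlers")

# Combine the vectors into a data.frame; the argument names become column names
animals <- data.frame(
  Species = species,
  Number = number,
  Weight = weight,
  Sex = sex,
  Description = description
)

dim(animals)  # rows, columns
```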
Task 2
Which data types were automatically assigned in the last task? Check this using str(), see whether they make sense and convert where necessary.
Sample Solution
str(df)
## 'data.frame': 4 obs. of 5 variables:
## $ Species : chr "Fox" "Bear" "Rabbit" "Moose"
## $ Number : num 2 5 1 3
## $ Weight : num 4.4 40.3 1.1 120
## $ Sex : chr "m" "f" "m" "m"
## $ Description: chr "Reddish" "Brown, large" "Small, with long ears" "Long legs, shovel antlers"
typeof(df$Number)
## [1] "double"
# Number was interpreted as `double`, but it is actually an `integer`.
df$Number <- as.integer(df$Number)

# We know sex only has two options:
df$Sex <- factor(df$Sex, levels = c("m", "f"))
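To confirm that the conversions took effect, the column classes can be inspected again. This is a small self-contained sketch on a reduced copy of the data. The name df_check is hypothetical.

```r
# Reduced copy of the data with only the two converted columns
df_check <- data.frame(
  Number = c(2, 5, 1, 3),
  Sex = c("m", "f", "m", "m")
)

df_check$Number <- as.integer(df_check$Number)
df_check$Sex <- factor(df_check$Sex, levels = c("m", "f"))

sapply(df_check, class)  # class of each column after conversion
levels(df_check$Sex)     # the allowed categories, in the order given
table(df_check$Sex)      # counts per category
```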
Task 3
On Moodle, you will find a folder called Datasets. Download it and move it into your project folder. Import the weather.csv file. If you use the RStudio GUI for this, save the import command in your R script. Please use a relative path (i.e., not a path starting with C:\ or similar).
I use the read_delim function (with an underscore) from readr to import CSV files, as an alternative to read.csv or read.delim (with a dot). However, this is a personal preference [1], and it is up to you which function you use. Remember that the two functions require slightly different parameters.
Sample Solution
weather <- read_delim("datasets/prepro/weather.csv", ",")
stn | time | tre200h0 |
---|---|---|
ABO | 2000010100 | -2.6 |
ABO | 2000010101 | -2.5 |
ABO | 2000010102 | -3.1 |
ABO | 2000010103 | -2.4 |
ABO | 2000010104 | -2.5 |
ABO | 2000010105 | -3.0 |
ABO | 2000010106 | -3.7 |
ABO | 2000010107 | -4.4 |
ABO | 2000010108 | -4.1 |
ABO | 2000010109 | -4.1 |
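The remark that read_delim and read.csv take slightly different parameters can be illustrated with a small sketch. It writes a made-up two-row sample in the same layout as weather.csv to a temporary file, so it does not depend on the real data set.

```r
library(readr)

# Hypothetical sample in the same layout as weather.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("stn,time,tre200h0",
    "ABO,2000010100,-2.6",
    "ABO,2000010101,-2.5"),
  csv_path
)

# readr: the delimiter is passed as `delim`; the result is a tibble
weather_readr <- read_delim(csv_path, delim = ",")

# base R: the separator is passed as `sep`; the result is a plain data.frame
weather_base <- read.csv(csv_path, sep = ",")

class(weather_readr)
class(weather_base)
```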
Task 4
Check the feedback from read_delim(). Have the data been interpreted correctly?
Sample Solution
# The 'time' column was interpreted as a number (`double`). However, it is
# obviously a timestamp.
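One way to inspect the guessed column types, and to override them if needed, is sketched below. It uses a made-up sample file rather than the real weather.csv. Reading time as character is only one possible choice, made here because as.POSIXct() works on character input.

```r
library(readr)

# Hypothetical sample in the same layout as weather.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("stn,time,tre200h0",
    "ABO,2000010100,-2.6",
    "ABO,2000010101,-2.5"),
  csv_path
)

# Let readr guess, then look at the parser chosen for each column
weather_guess <- read_delim(csv_path, delim = ",")
spec(weather_guess)

# Override the guess: read `time` as character for later date parsing
weather_typed <- read_delim(
  csv_path,
  delim = ",",
  col_types = cols(
    stn = col_character(),
    time = col_character(),
    tre200h0 = col_double()
  )
)
str(weather_typed)
```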
Task 5
The time column is a date/time with a format of YYYYMMDDHH (see meta.txt). In order for R to recognise the data in this column as date/time, it must be in the correct format (POSIXct). Therefore, we must tell R what the current format is. Use as.POSIXct() to read the column into R, remembering to specify both format and tz.
- If no time zone is set, as.POSIXct() sets a default (based on Sys.timezone()). In our case, however, these are values in UTC (see metadata.csv)
- as.POSIXct() requires a character input: If you receive the error message 'origin' must be supplied (or similar), you have probably tried to input a numeric into the function.
Sample Solution
weather$time <- as.POSIXct(as.character(weather$time), format = "%Y%m%d%H", tz = "UTC")
stn | time | tre200h0 |
---|---|---|
ABO | 2000-01-01 00:00:00 | -2.6 |
ABO | 2000-01-01 01:00:00 | -2.5 |
ABO | 2000-01-01 02:00:00 | -3.1 |
ABO | 2000-01-01 03:00:00 | -2.4 |
ABO | 2000-01-01 04:00:00 | -2.5 |
ABO | 2000-01-01 05:00:00 | -3.0 |
ABO | 2000-01-01 06:00:00 | -3.7 |
ABO | 2000-01-01 07:00:00 | -4.4 |
ABO | 2000-01-01 08:00:00 | -4.1 |
ABO | 2000-01-01 09:00:00 | -4.1 |
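The effect of the tz argument can be seen in a short sketch with a few made-up values in the YYYYMMDDHH layout. The zone "Europe/Zurich" is only an example of a non-UTC zone and is not taken from the data set.

```r
# Hypothetical sample values in the YYYYMMDDHH layout
time_raw <- c(2000010100, 2000010101, 2000010112)

# Convert to character first, then parse with explicit format and time zone
time_utc <- as.POSIXct(as.character(time_raw), format = "%Y%m%d%H", tz = "UTC")
time_utc
class(time_utc)

# The same instants displayed in a different time zone:
# the clock time shifts, the underlying point in time does not
format(time_utc, tz = "Europe/Zurich", usetz = TRUE)
```

Setting tz explicitly means the result does not depend on the time zone of the computer the script runs on.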
Task 6
Create two new columns for day of week (Monday, Tuesday, etc.) and calendar week. Use the newly created POSIXct column and a suitable function from lubridate.
Sample Solution
weather$weekday <- wday(weather$time, label = TRUE)
weather$week <- week(weather$time)
stn | time | tre200h0 | weekday | week |
---|---|---|---|---|
ABO | 2000-01-01 00:00:00 | -2.6 | Sat | 1 |
ABO | 2000-01-01 01:00:00 | -2.5 | Sat | 1 |
ABO | 2000-01-01 02:00:00 | -3.1 | Sat | 1 |
ABO | 2000-01-01 03:00:00 | -2.4 | Sat | 1 |
ABO | 2000-01-01 04:00:00 | -2.5 | Sat | 1 |
ABO | 2000-01-01 05:00:00 | -3.0 | Sat | 1 |
ABO | 2000-01-01 06:00:00 | -3.7 | Sat | 1 |
ABO | 2000-01-01 07:00:00 | -4.4 | Sat | 1 |
ABO | 2000-01-01 08:00:00 | -4.1 | Sat | 1 |
ABO | 2000-01-01 09:00:00 | -4.1 | Sat | 1 |
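As a brief sketch of the options involved: wday() can return full weekday names via abbr = FALSE, and lubridate offers isoweek() next to week(). The two week functions count differently, which matters around the turn of the year. Which one is appropriate depends on the analysis; the sample solution above uses week(). Weekday labels depend on the locale of the R session.

```r
library(lubridate)

# First timestamp of the weather data shown above
x <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")

# Full weekday name instead of the abbreviation
wday(x, label = TRUE, abbr = FALSE)

# week(): seven-day blocks counted from 1 January
week(x)

# isoweek(): ISO 8601 week number, where weeks start on Monday
isoweek(x)
```

For 1 January 2000, a Saturday, week() gives 1 while isoweek() gives 52, because under ISO 8601 that day still belongs to the last week of 1999.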
[1] Advantages of read_delim over read.csv: https://stackoverflow.com/a/60374974/4139249