Prepro 1: Demo

Published

February 20, 2024

This demo’s source code can also be downloaded as an R Script (right click → Save Target As..)

Data types

Doubles

There are two different numeric data types in R:

  • double: floating-point number (e.g. 10.3, 7.3)
  • integer (e.g. 10, 7)

A double / floating point number is assigned to a variable as follows:

x <- 10.3

x
[1] 10.3
class(x)
[1] "numeric"
Note

Either <- or = can be used. However, the latter is also easily confused with ==.

y = 7.3
y
[1] 7.3

Integer

A number is only stored as an integer if it is explicitly defined as one (using as.integer() or the L suffix).

d <- 8L

class(d)
[1] "integer"
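Both ways of defining an integer can be compared with typeof(). The following is a minimal sketch; the variable names are only illustrative.

```r
a <- 8                  # no suffix: stored as a double
b <- 8L                 # L suffix: stored as an integer
c_int <- as.integer(8)  # explicit conversion from double to integer

typeof(a)               # "double"
typeof(b)               # "integer"
identical(b, c_int)     # TRUE: both routes give the same integer

# as.integer() drops the decimal part instead of rounding
as.integer(7.9)         # 7
```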

Boolean

sunny <- FALSE
dry <- TRUE

sunny & dry
[1] FALSE
e <- 3
f <- 6

e > f
[1] FALSE

Character

Character strings contain text.

fname <- "Andrea"
lname <- "Muster"
class(fname)
[1] "character"

Connecting / concatenating character strings

paste(fname, lname)
[1] "Andrea Muster"
paste(fname, lname, sep = ",")
[1] "Andrea,Muster"
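Beyond sep, two related options are often useful: paste0() for joining without a separator, and the collapse argument for joining the elements of a single vector. A short sketch, with illustrative variable names:

```r
fname <- "Andrea"
lname <- "Muster"

# paste0() joins its arguments without any separator
no_sep <- paste0(fname, lname)                    # "AndreaMuster"

# sep is placed between the arguments
full <- paste(lname, fname, sep = ", ")           # "Muster, Andrea"

# collapse joins the elements of one vector into a single string
joined <- paste(c(fname, lname), collapse = "_")  # "Andrea_Muster"
```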

Date / time

In most parts of the world, we use the Gregorian Calendar to communicate a point in time. In this system, we track time as years, months, days, hours, minutes and seconds after a specific event (Anno Domini, “in the year of the Lord”).

R, like most other computer systems, does not store date / time information as years, months, days etc. Instead, R stores the number of seconds since a given point in time (midnight UTC on January 1st, 1970, also called the Unix epoch). This information is stored using the class POSIXct, which also helps us convert this number of seconds into more human-readable information. On 01.02.2024 at 13:45 (CET), 1’706’791’500 seconds had passed since the Unix epoch, so to store this timestamp, R stores the number 1’706’791’500.

# We may have a timestamp saved as a character string
today_txt <- "2024-02-01 13:45:00"

# as.POSIXct converts the string to POSIXct:
today_posixct <- as.POSIXct(today_txt)

# When printing a posixct date to the console, it is human readable
today_posixct
[1] "2024-02-01 13:45:00 CET"
# To see the internally stored value (# of seconds), convert it to numeric:
as.numeric(today_posixct)
[1] 1706791500
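The conversion also works in the other direction. The sketch below rebuilds the timestamp from the number of seconds; it sets the time zone explicitly (an assumption made here so that the result does not depend on the computer's settings).

```r
secs <- 1706791500

# origin is the Unix epoch; tz only controls how the value is displayed
ts_utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(ts_utc, "%Y-%m-%d %H:%M:%S")                        # "2024-02-01 12:45:00"
format(ts_utc, "%Y-%m-%d %H:%M:%S", tz = "Europe/Zurich")  # "2024-02-01 13:45:00"

# adding a number to a POSIXct value adds seconds
one_hour_later <- ts_utc + 3600
as.numeric(one_hour_later) - secs                          # 3600
```

The stored number is the same in both cases; only the displayed clock time changes with the time zone.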

If the character string is delivered in the above format (year-month-day hour:minute:second), as.POSIXct knows how to calculate the number of seconds since the Unix epoch. However, if the format is different, we have to tell R how to read our timestamp. This requires a special syntax, which is described in ?strptime.

date_txt <- "01.10.2017 15:15"

# converts character to POSIXct:
as.POSIXct(date_txt)
Error in as.POSIXlt.character(x, tz, ...): character string is not in a standard unambiguous format
date_posix <- as.POSIXct(date_txt, format = "%d.%m.%Y %H:%M")

date_posix
[1] "2017-10-01 15:15:00 CEST"
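One pitfall worth knowing: when a format is supplied but does not match the string, as.POSIXct does not raise an error and returns NA instead. The sketch below shows this and also passes tz explicitly, which is an addition made here for reproducibility.

```r
date_txt <- "01.10.2017 15:15"

# A format that does not match the string yields NA, not an error
wrong <- as.POSIXct(date_txt, format = "%Y-%m-%d %H:%M", tz = "UTC")
is.na(wrong)                                 # TRUE

# Matching format, with the time zone stated explicitly
right <- as.POSIXct(date_txt, format = "%d.%m.%Y %H:%M", tz = "Europe/Zurich")
format(right, "%Y-%m-%d %H:%M", tz = "UTC")  # "2017-10-01 13:15"
```

Checking for NA after a conversion is a cheap way to catch format mistakes early.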

Theoretically, strftime can also be used to extract specific components from a date. However, the functions from lubridate are much simpler and we recommend you use these. Note how strftime always returns strings while lubridate returns more useful datatypes such as integers or factors.

strftime(date_posix, format = "%m") # extracts the month as a number
strftime(date_posix, format = "%b") # extracts the month by name (abbreviated)
strftime(date_posix, format = "%B") # extracts the month by name (full)
## [1] "10"
## [1] "Oct"
## [1] "October"
library("lubridate")

month(date_posix) # extracts the month as a number
month(date_posix, label = TRUE, abbr = TRUE) # extracts the month by name (abbreviated)
month(date_posix, label = TRUE, abbr = FALSE) # extracts the month by name (full)
## [1] 10
## [1] Oct
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
## [1] October
## 12 Levels: January < February < March < April < May < June < ... < December
Time is hard

Handling date / time is tricky. We recommend the following practices to make life easier:

  • Always store time as POSIXct, not as text.
  • Always store time together with its corresponding date, never separately.
  • If you must extract time (e.g. to analyse daily patterns), store it as decimal time (e.g. store 15:45 as 15.75) in a numeric data type.
  • Try to be explicit about which timezone your data originates from.
  • If your observation period is affected by switching to or from daylight saving time, think about converting time to UTC.
  • Use lubridate rather than strftime()
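Some of these practices can be sketched in a few lines. The example below uses base R only, so that it runs without extra packages; the timestamp and variable names are made up for illustration. lubridate offers hour(), minute() and with_tz() for the same tasks.

```r
# Be explicit about the time zone the data originates from
arrival <- as.POSIXct("2017-10-01 15:45:00", tz = "Europe/Zurich")

# Decimal time: hours plus fractions of an hour, stored as a plain number
lt <- as.POSIXlt(arrival)
decimal_time <- lt$hour + lt$min / 60 + lt$sec / 3600
decimal_time                                             # 15.75

# Display the same instant in UTC (the stored value is unchanged)
arrival_utc <- format(arrival, tz = "UTC", usetz = TRUE)
arrival_utc                                              # "2017-10-01 13:45:00 UTC"
```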

Data structures

Vectors

Using c(), a set of values of the same data type can be assigned to a variable (as a vector).

vec <- c(10, 20, 33, 42, 54, 66, 77)
vec
[1] 10 20 33 42 54 66 77
# to extract the 5th element
vec[5]
[1] 54
# to extract elements 2 to 4
vec[2:4]
[1] 20 33 42
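Two further points are worth a quick look: elements can also be selected by a condition, and mixing data types inside c() silently converts all values to one common type. A small sketch:

```r
vec <- c(10, 20, 33, 42, 54, 66, 77)

vec[-1]         # all elements except the first
vec[vec > 50]   # 54 66 77
length(vec)     # 7

# Mixing types coerces every value to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
mixed           # "1" "a" "TRUE"
```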

Lists

A list is a collection of objects that do not need to be the same data type.

mylist <- list("q", TRUE, 3.14)

The individual elements in a list can also have assigned names.

mylist2 <- list(fav_letter = "q", fav_boolean = TRUE, fav_number = 3.14)

mylist2
$fav_letter
[1] "q"

$fav_boolean
[1] TRUE

$fav_number
[1] 3.14
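Named elements can be pulled out of a list in several ways. The following sketch shows the most common ones; note the difference between single and double square brackets.

```r
mylist2 <- list(fav_letter = "q", fav_boolean = TRUE, fav_number = 3.14)

mylist2$fav_number            # 3.14
mylist2[["fav_letter"]]       # "q"
mylist2[[3]] * 2              # 6.28

# Single brackets return a (shorter) list, not the element itself
class(mylist2["fav_letter"])  # "list"
class(mylist2[["fav_letter"]]) # "character"
```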

Data frames

If each entry in a list has the same length, the list can also be represented as a table, which is called a data frame in R.

# note how the names become column names
as.data.frame(mylist2)
  fav_letter fav_boolean fav_number
1          q        TRUE       3.14
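The requirement that all entries have the same length can be checked directly. In this sketch (with made-up lists), the conversion works for equal lengths and fails otherwise; tryCatch() is used only to keep the script running after the error.

```r
# All entries have length 3: conversion works
same_length <- list(id = 1:3, label = c("a", "b", "c"))
df_ok <- as.data.frame(same_length)
dim(df_ok)     # 3 2

# Lengths 3 and 2 cannot be combined into one table
different_length <- list(id = 1:3, label = c("a", "b"))
result <- tryCatch(
  as.data.frame(different_length),
  error = function(e) "conversion failed"
)
result         # "conversion failed"
```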

The data.frame function allows a table to be created without first having to create a list.

df <- data.frame(
  City = c("Zurich", "Geneva", "Basel", "Bern", "Lausanne"),
  Arrival = c(
    "1.1.2017 10:10", "5.1.2017 14:45",
    "8.1.2017 13:15", "17.1.2017 18:30", "22.1.2017 21:05"
  )
)

str(df)
'data.frame':   5 obs. of  2 variables:
 $ City   : chr  "Zurich" "Geneva" "Basel" "Bern" ...
 $ Arrival: chr  "1.1.2017 10:10" "5.1.2017 14:45" "8.1.2017 13:15" "17.1.2017 18:30" ...

The $ symbol can be used to query data:

df$City
[1] "Zurich"   "Geneva"   "Basel"    "Bern"     "Lausanne"
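Besides $, square brackets can be used to select rows and columns. The sketch below uses a small stand-in table (called cities here, with illustrative values) rather than df itself.

```r
cities <- data.frame(
  City = c("Zurich", "Geneva", "Basel"),
  Residents = c(400000, 200000, 175000)
)

cities[["City"]]    # same result as cities$City
cities[2, "City"]   # row 2, column "City": "Geneva"

# Rows can be filtered with a logical condition (note the trailing comma)
big <- cities[cities$Residents > 180000, ]
big$City            # "Zurich" "Geneva"
nrow(big)           # 2
```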

New columns can be added and existing ones can be changed:

df$Residents <- c(400000, 200000, 175000, 14000, 130000)
# A tibble: 5 × 3
  City     Arrival         Residents
  <chr>    <chr>               <dbl>
1 Zurich   1.1.2017 10:10     400000
2 Geneva   5.1.2017 14:45     200000
3 Basel    8.1.2017 13:15     175000
4 Bern     17.1.2017 18:30     14000
5 Lausanne 22.1.2017 21:05    130000

We need to convert the Arrival time to a time format (POSIXct).

# first, test the output of the "as.POSIXct"-function
as.POSIXct(df$Arrival, format = "%d.%m.%Y %H:%M")
[1] "2017-01-01 10:10:00 CET" "2017-01-05 14:45:00 CET"
[3] "2017-01-08 13:15:00 CET" "2017-01-17 18:30:00 CET"
[5] "2017-01-22 21:05:00 CET"
# if it works, we can save the output to a new column
df$Arrival_ct <- as.POSIXct(df$Arrival, format = "%d.%m.%Y %H:%M")


# We *could* overwrite the old column, but this is a destructive operation!

These columns can now help to create convenience variables. E.g., the day of the week of each arrival can be derived from the Arrival_ct column.

df$Arrival_day <- wday(df$Arrival_ct, label = TRUE, week_start = 1)

df$Arrival_day
[1] Sun Thu Sun Tue Sun
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun