Prepro 2: Exercise B

Published

February 27, 2024

Task 1

You have data from three sensors (sensor1.csv, sensor2.csv, sensor3.csv). Read the data sets into R using the readr package.

Sample Solution


library("readr")


sensor1 <- read_delim("datasets/prepro/sensor1.csv", delim = ";")
sensor2 <- read_delim("datasets/prepro/sensor2.csv", delim = ";")
sensor3 <- read_delim("datasets/prepro/sensor3.csv", delim = ";")
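
To see what the `delim` argument does without needing the files, the following minimal sketch reads a semicolon-separated string instead of a path. The two sample rows and the column names `Datetime` and `Temp` are assumptions inferred from the join code later in this exercise. The real files may contain more columns. Passing literal data via `I()` assumes readr 2.0 or newer.

```r
library("readr")

# Hypothetical sample text; column names Datetime and Temp are assumed,
# not taken from the real sensor files.
sample_text <- "Datetime;Temp\n16102017_1800;23.5\n17102017_1800;25.4\n"

# I() marks the string as literal data rather than a file path.
# col_types states the expected types explicitly instead of letting readr guess.
sensor_sample <- read_delim(
  I(sample_text),
  delim = ";",
  col_types = cols(Datetime = col_character(), Temp = col_double())
)

sensor_sample
```

Stating `col_types` is optional, but it keeps the Datetime strings as text and makes a wrong delimiter easier to spot, because the expected columns would then be missing.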

Task 2

From the three data frames, create a single data frame that looks like the table shown below. Use two dplyr joins to combine the three data frames, then tidy up the column names (how can we do that?).

Sample Solution


library("dplyr")


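# Join sensor1 and sensor2: both have a Temp column, so dplyr appends .x and .y
sensor1_2 <- full_join(sensor1, sensor2, by = "Datetime")

sensor1_2 <- rename(sensor1_2, sensor1 = Temp.x, sensor2 = Temp.y)

# Join sensor3: its Temp column no longer clashes, so it keeps its name
sensor_all <- full_join(sensor1_2, sensor3, by = "Datetime")

sensor_all <- rename(sensor_all, sensor3 = Temp)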

Datetime	sensor1	sensor2	sensor3
16102017_1800	23.5	13.5	26.5
17102017_1800	25.4	24.4	24.4
18102017_1800	12.4	22.4	13.4
19102017_1800	5.4	12.4	7.4
23102017_1800	23.5	13.5	NA
24102017_1800	21.3	11.3	NA
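
One possible alternative is to rename each Temp column before joining, so that no .x/.y suffixes appear in the first place. The sketch below is self-contained: the three small data frames are rebuilt from the values in the table above, assuming that each file has only the columns `Datetime` and `Temp` and that sensor3 lacks the last two dates. The names `s1`, `s2`, `s3` and `sensor_all_alt` are made up for this illustration.

```r
library("dplyr")

dates <- c("16102017_1800", "17102017_1800", "18102017_1800",
           "19102017_1800", "23102017_1800", "24102017_1800")

# Stand-ins for the imported files (values copied from the table above)
s1 <- tibble(Datetime = dates, Temp = c(23.5, 25.4, 12.4, 5.4, 23.5, 21.3))
s2 <- tibble(Datetime = dates, Temp = c(13.5, 24.4, 22.4, 12.4, 13.5, 11.3))
s3 <- tibble(Datetime = dates[1:4], Temp = c(26.5, 24.4, 13.4, 7.4))

# Rename first, then join: each Temp column already has a unique name
sensor_all_alt <- s1 |>
  rename(sensor1 = Temp) |>
  full_join(rename(s2, sensor2 = Temp), by = "Datetime") |>
  full_join(rename(s3, sensor3 = Temp), by = "Datetime")

sensor_all_alt
```

Because `full_join()` keeps all rows from both sides, the two dates missing from `s3` end up as NA in the sensor3 column.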

Task 3

Import the sensor_fail.csv file into R.

Sample Solution

sensor_fail <- read_delim("datasets/prepro/sensor_fail.csv", delim = ";")

sensor_fail.csv has a variable SensorStatus: 1 means the sensor is measuring, 0 means it is not. Where SensorStatus is 0, the recorded Temp value of 0 is not a real measurement and should be NA (not available). Correct the dataset accordingly.

Sensor	Temp	Hum_%	Datetime	SensorStatus
Sen102	0.6	98	16102017_1800	1
Sen102	0.3	96	17102017_1800	1
Sen102	0.0	87	18102017_1800	1
Sen102	0.0	86	19102017_1800	0
Sen102	0.0	98	23102017_1800	0
Sen102	0.0	98	24102017_1800	0
Sen102	0.0	96	25102017_1800	1
Sen103	-0.3	87	26102017_1800	1
Sen103	-0.7	98	27102017_1800	1
Sen103	-1.2	98	28102017_1800	1

Sample Solution

# with base R: copy Temp, then overwrite the values recorded while the sensor was off
sensor_fail$Temp_correct <- sensor_fail$Temp
sensor_fail$Temp_correct[sensor_fail$SensorStatus == 0] <- NA

# the same with dplyr:
sensor_fail <- sensor_fail |>
  mutate(Temp_correct = ifelse(SensorStatus == 0, NA, Temp))
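
To check the correction without the file, the following sketch rebuilds the Temp and SensorStatus columns from the table above (the name `demo` is made up for this illustration) and applies the same `ifelse()` rule. It shows that only the zeros recorded while the sensor was off become NA, while zeros measured by a working sensor are kept.

```r
library("dplyr")

# Temp and SensorStatus copied from the table above
demo <- tibble(
  Temp = c(0.6, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3, -0.7, -1.2),
  SensorStatus = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)
)

demo <- demo |>
  mutate(Temp_correct = ifelse(SensorStatus == 0, NA, Temp))

# Rows 4 to 6 (SensorStatus 0) are now NA;
# the zeros in rows 3 and 7 (SensorStatus 1) stay as real measurements
demo
```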

Task 4

Why does it matter whether 0 or NA is recorded? Calculate the mean temperature and humidity after you have corrected the dataset.

Sample Solution

# Mean of the uncorrected sensor data: the false zeros enter the calculation
# and distort the mean
mean(sensor_fail$Temp)
## [1] -0.13

# Mean values of the corrected sensor data: with na.rm = TRUE,
# NA values are removed from the calculation.
mean(sensor_fail$Temp_correct, na.rm = TRUE)
## [1] -0.1857143
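
The difference between the two results comes down to the denominator: the sum of the temperatures is -1.3 in both cases, but it is divided by 10 values when the false zeros are counted and by 7 when they are treated as NA. A small sketch that reproduces this by hand, with the vectors copied from the table in Task 3 (the names `temp`, `status` and `temp_correct` are made up for this illustration):

```r
temp <- c(0.6, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3, -0.7, -1.2)
status <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)
temp_correct <- ifelse(status == 0, NA, temp)

# Uncorrected: all 10 values count, including the 3 false zeros
sum(temp) / length(temp)

# Corrected: only the 7 real measurements count
sum(temp_correct, na.rm = TRUE) / sum(!is.na(temp_correct))

# Without na.rm = TRUE, a single NA makes the whole mean NA
mean(temp_correct)
```

The last call shows why `na.rm = TRUE` is needed: by default, `mean()` returns NA as soon as the input contains a missing value.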