You have data from three sensors (sensor1.csv, sensor2.csv, sensor3.csv). Read in the data sets using the library readr.
library("readr")
sensor1 <- read_delim("datasets/prepro/sensor1.csv", ";")
sensor2 <- read_delim("datasets/prepro/sensor2.csv", ";")
sensor3 <- read_delim("datasets/prepro/sensor3.csv", ";")

From the three data frames, create a single data frame that looks like the one shown below. Use two joins from dplyr to connect the three data frames, then tidy up the column names (how can we do that?).
library("dplyr")
sensor1_2 <- full_join(sensor1, sensor2, by = "Datetime")
sensor1_2 <- rename(sensor1_2, sensor1 = Temp.x, sensor2 = Temp.y)
sensor_all <- full_join(sensor1_2, sensor3, by = "Datetime")
sensor_all <- rename(sensor_all, sensor3 = Temp)

| Datetime | sensor1 | sensor2 | sensor3 |
|---|---|---|---|
| 16102017_1800 | 23.5 | 13.5 | 26.5 |
| 17102017_1800 | 25.4 | 24.4 | 24.4 |
| 18102017_1800 | 12.4 | 22.4 | 13.4 |
| 19102017_1800 | 5.4 | 12.4 | 7.4 |
| 23102017_1800 | 23.5 | 13.5 | NA |
| 24102017_1800 | 21.3 | 11.3 | NA |
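The two-step join can be tried out on small in-memory data. The following is a minimal sketch, not the course's solution: it renames `Temp` before joining, so the `.x`/`.y` suffixes never appear. The toy values are copied from two rows of the table above, and the object names `s1`, `s2`, `s3` are purely illustrative.

```r
library("dplyr")

# Toy stand-ins for the three sensor files (values taken from the table above)
s1 <- data.frame(Datetime = c("16102017_1800", "23102017_1800"), Temp = c(23.5, 23.5))
s2 <- data.frame(Datetime = c("16102017_1800", "23102017_1800"), Temp = c(13.5, 13.5))
s3 <- data.frame(Datetime = "16102017_1800", Temp = 26.5)

# Rename first, then join: no Temp.x / Temp.y columns are created
sensor_all <- s1 |>
  rename(sensor1 = Temp) |>
  full_join(rename(s2, sensor2 = Temp), by = "Datetime") |>
  full_join(rename(s3, sensor3 = Temp), by = "Datetime")

sensor_all
```

Because `full_join()` keeps every `Datetime` that occurs in either input, the timestamp missing from `s3` still appears in the result, with `NA` in the `sensor3` column.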
Import the sensor_fail.csv file into R.
sensor_fail <- read_delim("datasets/prepro/sensor_fail.csv", delim = ";")

sensor_fail.csv has a variable SensorStatus: 1 means the sensor is measuring, 0 means the sensor is not measuring. If SensorStatus is 0, the recorded value Temp = 0 is incorrect; it should be NA (not available). Correct the dataset accordingly.
| Sensor | Temp | Hum_% | Datetime | SensorStatus |
|---|---|---|---|---|
| Sen102 | 0.6 | 98 | 16102017_1800 | 1 |
| Sen102 | 0.3 | 96 | 17102017_1800 | 1 |
| Sen102 | 0.0 | 87 | 18102017_1800 | 1 |
| Sen102 | 0.0 | 86 | 19102017_1800 | 0 |
| Sen102 | 0.0 | 98 | 23102017_1800 | 0 |
| Sen102 | 0.0 | 98 | 24102017_1800 | 0 |
| Sen102 | 0.0 | 96 | 25102017_1800 | 1 |
| Sen103 | -0.3 | 87 | 26102017_1800 | 1 |
| Sen103 | -0.7 | 98 | 27102017_1800 | 1 |
| Sen103 | -1.2 | 98 | 28102017_1800 | 1 |
# with base R: start from a copy of Temp, then overwrite the failed readings
sensor_fail$Temp_correct <- sensor_fail$Temp
sensor_fail$Temp_correct[sensor_fail$SensorStatus == 0] <- NA
# the same with dplyr:
sensor_fail <- sensor_fail |>
  mutate(Temp_correct = ifelse(SensorStatus == 0, NA, Temp))

Why does it matter whether 0 or NA is recorded? Calculate the mean of the temperature / humidity after you have corrected the dataset.
# Mean values of the incorrect sensor data: 0 flows into the calculation
# and distorts the mean
mean(sensor_fail$Temp)
## [1] -0.13
# Mean values of the corrected sensor data: with na.rm = TRUE,
# NA values are removed from the calculation.
mean(sensor_fail$Temp_correct, na.rm = TRUE)
## [1] -0.1857143
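To see the effect in isolation, the two means can be reproduced without the CSV file. This is a small sketch in which the Temp and SensorStatus columns of the table are typed in by hand; the vector names `temp`, `status` and `temp_correct` are illustrative only.

```r
# Temp and SensorStatus columns from the sensor_fail table, entered manually
temp   <- c(0.6, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, -0.3, -0.7, -1.2)
status <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)

# Replace readings taken while the sensor was not measuring with NA
temp_correct <- ifelse(status == 0, NA, temp)

mean(temp)                        # failed readings are counted as 0 degrees
mean(temp_correct)                # returns NA unless missing values are dropped
mean(temp_correct, na.rm = TRUE)  # mean over the 7 valid readings only
```

The sum of the readings is -1.3 in both cases; what changes is the divisor. With the zeros left in, the sum is divided by 10 readings, and after the correction only by the 7 readings the sensor actually took.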