df_aqi |>
  glimpse()
Handling Time Series Datasets
Welcome to the world of time series analysis and forecasting!
1 Introduction
Once the data have been collected, data science principles recommend an exploratory data analysis to better understand the dataset and its characteristics. In time series analysis and forecasting projects, this takes the form of visualizations and the extraction of time series features. We shall work through these concepts using the datasets loaded so far in the workshop.
2 First things first: Ensure that you have a time series dataset.
On many occasions, even though the data were collected for a time series analysis, the timeline may not have been captured adequately to support one. The dataset may be imported into the analysis environment as a spreadsheet or a data frame. Therefore, at the outset, we should examine the variable containing the timeline and confirm that the dataset is in fact time series data.
2.1 Understanding the time stamps.
In any time series project, it is critical to understand whether the data were collected as a regular or an irregular time series. Further, depending on the research question and the needs of the study, all datasets must share a common time stamp before forecasting models can be built in the subsequent phases of analysis and interpretation. Let's look at the datasets to understand this further.
2.1.1 Time Series Air Quality Data of Manali (2010-2023)
We shall be using the function glimpse(), which makes it possible to see every column in a data frame and understand the class of each variable.
Rows: 458
Columns: 18
$ `Regional Office Lab` <chr> "RO-Kullu", "RO-Kullu", "RO-Kullu", "RO-Kullu", …
$ City <chr> "Manali", "Manali", "Manali", "Manali", "Manali"…
$ `Station Name` <chr> "Manali-I", "Manali-I", "Manali-I", "Manali-I", …
$ `Sample Date` <dttm> 2020-04-08, 2020-04-10, 2020-04-20, 2020-04-22,…
$ `PM10 (µg/m³)` <dbl> 12.31, 14.87, 9.45, 8.89, 9.09, 11.93, 11.04, 16…
$ `PM2.5 (µg/m³)` <dbl> 8.31, 6.23, 4.25, 6.24, 5.00, 8.33, 8.39, 8.39, …
$ `SO₂ (µg/m³)` <dbl> 1.08, 1.20, 1.09, 1.21, 1.13, 1.47, 1.62, 1.50, …
$ `NOₓ (µg/m³)` <dbl> 3.62, 4.19, 3.78, 4.00, 3.91, 4.31, 4.94, 4.57, …
$ `NH₃ (µg/m³)` <dbl> 2.16, 1.63, 1.64, 1.38, 1.62, 1.19, 1.84, 1.70, …
$ `O₃ (µg/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Pb (µg/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `CO (mg/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `C₆H₆ (µg/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `BaP (ng/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `As (ng/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Ni (ng/m³)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ AQI <chr> "14", "15", "9", "10", "9", "14", "14", "16", "2…
$ `AQI Condition` <chr> "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", …
Looking at the classes, Sample Date is stored as a dttm (date-time) variable. In the output shown above, PM10 (µg/m³) and the other pollutant columns appear as numeric (<dbl>), while AQI is stored as character (<chr>). We shall use the ymd() function from the lubridate package to store Sample Date as a date, and as.numeric() within tidyverse functions to ensure that PM10 (µg/m³) is numeric, which guards against the column being read in as character.
The lubridate package, part of the tidyverse, is a powerful tool for handling dates and times in a dataset.
df_aqi <- df_aqi |>
  mutate(`Sample Date` = ymd(`Sample Date`)) |>
  mutate(`PM10 (µg/m³)` = as.numeric(`PM10 (µg/m³)`))
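Converting a character column with as.numeric() can silently introduce missing values when an entry is not a number. The sketch below illustrates this in base R using the first few AQI values from the glimpse() output; the entry "BDL" is a hypothetical non-numeric value added only for illustration and is not taken from the dataset.

```r
# AQI values as shown in the glimpse() output, stored as character.
# "BDL" is a hypothetical non-numeric entry added only for illustration.
aqi_chr <- c("14", "15", "9", "10", "BDL")

# as.numeric() converts what it can and returns NA (with a warning) otherwise.
aqi_num <- suppressWarnings(as.numeric(aqi_chr))

# Keep track of the entries that could not be converted so they can be reviewed.
failed <- aqi_chr[is.na(aqi_num) & !is.na(aqi_chr)]

print(aqi_num)
print(failed)
```

Reviewing the failed entries before dropping or imputing them helps avoid losing observations unnoticed.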
Let us now determine the timestamp for the dataset.
df_aqi |>
  select(`Sample Date`, `PM10 (µg/m³)`) |>
  head()
Sample Date PM10 (µg/m³)
1 2020-04-08 12.31
2 2020-04-10 14.87
3 2020-04-20 9.45
4 2020-04-22 8.89
5 2020-04-24 9.09
6 2020-05-04 11.93
df_aqi <- df_aqi |>
  select(`Sample Date`, `PM10 (µg/m³)`)
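The spacing of the sampling dates can also be checked numerically. As a minimal sketch in base R, using only the six dates printed by head() above, the gaps between consecutive samples are:

```r
# The six sampling dates printed by head() above.
sample_dates <- as.Date(c("2020-04-08", "2020-04-10", "2020-04-20",
                          "2020-04-22", "2020-04-24", "2020-05-04"))

# Gap, in days, between each pair of consecutive samples.
gaps <- as.integer(diff(sample_dates))
print(gaps)

# A regularly sampled series would show a single distinct gap.
is_regular <- length(unique(gaps)) == 1
print(is_regular)
```

For these six dates the gaps alternate between 2 and 10 days, so the series is not regularly spaced.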
The sampling dates do not fall on daily time stamps and are irregular in nature. Therefore, to proceed, we shall convert them to monthly time stamps by averaging the readings within each month.
df_aqi <- df_aqi |>
  mutate(`Sample Date` = yearmonth(`Sample Date`)) |>
  group_by(`Sample Date`) |>
  summarise(`PM10 (µg/m³)` = mean(`PM10 (µg/m³)`,
                                  na.rm = TRUE))
# A tibble: 6 × 2
`Sample Date` `PM10 (µg/m³)`
<mth> <dbl>
1 2020 Apr 10.9
2 2020 May 19.8
3 2020 Jun 33.1
4 2020 Jul 33.5
5 2020 Aug 26.2
6 2020 Sep 43.0
Choosing a time stamp is of utmost importance in time series analysis when dealing with irregular time series data. It is advisable to take a parsimonious approach, basing the decision on both visualizations and statistical measures of noise and entropy.
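One simple input to this decision is how many samples fall into each candidate period. The sketch below is only an illustration and not a full selection procedure: it counts the six sampling dates shown earlier per week and per month, using yearweek() and yearmonth() from the tsibble package.

```r
library(tsibble)

# The six sampling dates shown earlier in this section.
sample_dates <- as.Date(c("2020-04-08", "2020-04-10", "2020-04-20",
                          "2020-04-22", "2020-04-24", "2020-05-04"))

# Number of samples falling in each candidate period.
per_week  <- table(format(yearweek(sample_dates)))
per_month <- table(format(yearmonth(sample_dates)))

print(per_week)
print(per_month)
```

With these dates, the samples should fall into only three of the five weeks they span, leaving empty weeks in between, whereas both calendar months contain at least one sample. A coarser time stamp therefore avoids gaps, at the cost of averaging away short-term variation.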
3 Creating time series objects.
tsibble objects extend tidy data frames (tibble objects) by introducing temporal structure. In the first example in this section, the time series index is set to the Year column, which associates the measurements with the time of recording.
Now that we are aware of the time stamps, the next step is to create tsibble objects, from which the time series characteristics of a dataset can be extracted. A time series can be thought of as a list of numbers (the measurements), along with some information about what times those numbers were recorded (the index). This information can be stored as a tsibble object in R. We shall be using the tsibble() function from the tsibble package for this purpose.
df_disaster <- df_disaster |>
  tsibble(
    index = Year
  )
# A tsibble: 6 x 2 [1Y]
Year deaths_earthquakes
<dbl> <dbl>
1 1900 560
2 1901 72
3 1902 31944
4 1903 26520
5 1904 612
6 1905 83473
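The tsibble package also provides as_tsibble() for coercing an existing data frame. The sketch below is self-contained and rebuilds only the six rows printed above, then inspects the index of the resulting object; the object names df and ts_quake are chosen here for illustration.

```r
library(tsibble)

# The six rows shown in the output above.
df <- data.frame(
  Year = 1900:1905,
  deaths_earthquakes = c(560, 72, 31944, 26520, 612, 83473)
)

# Coerce the data frame to a tsibble, declaring Year as the index.
ts_quake <- as_tsibble(df, index = Year)

print(index_var(ts_quake))   # name of the index column
print(is_regular(ts_quake))  # whether a regular interval was detected
```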
When the time stamps are quarterly, monthly, or weekly, the corresponding index functions from the tsibble package are used: yearquarter(), yearmonth(), and yearweek().
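As a brief illustration, the tsibble package provides yearquarter(), yearmonth(), and yearweek() for these three cases. The sketch below applies each of them to a single date, the first sampling date of the air quality data:

```r
library(tsibble)

# First sampling date in the air quality data.
d <- as.Date("2020-04-08")

q <- yearquarter(d)  # quarterly index
m <- yearmonth(d)    # monthly index
w <- yearweek(d)     # weekly index

print(q)
print(m)
print(w)
```

Any of these can then be passed as the index when creating the tsibble, in the same way as Year above.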
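A tsibble also allows multiple time series to be stored in a single object. The variables that identify each series are declared as the key. For example, time series data for men and women can be stored together, with the variable that distinguishes the two groups serving as the key.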
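A minimal sketch of a keyed tsibble follows. The column names (sex, cases) and all values are made up purely for illustration and do not come from the workshop datasets.

```r
library(tsibble)

# Made-up yearly counts for two groups, purely for illustration.
ts_by_sex <- tsibble(
  Year  = rep(2001:2003, times = 2),
  sex   = rep(c("men", "women"), each = 3),
  cases = c(10, 12, 9, 8, 11, 7),
  key   = sex,
  index = Year
)

print(key_vars(ts_by_sex))  # column(s) identifying each series
print(n_keys(ts_by_sex))    # number of distinct series
```

Each combination of key values must have unique index values, so the two groups can share the same years without conflict.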
df_rta <- df_rta |>
  mutate(timeline = dmy(timeline)) |>
  mutate(Month = yearmonth(timeline)) |>
  group_by(Month) |>
  summarise(accidents = sum(`Total Accident`)) |>
  tsibble(
    index = Month
  )
# A tsibble: 6 x 2 [1M]
Month accidents
<mth> <int>
1 2014 Jan 41954
2 2014 Feb 39899
3 2014 Mar 42524
4 2014 Apr 39867
5 2014 May 45404
6 2014 Jun 42448