Handling Time series datasets

Welcome to the world of Time Series Analysis and forecasting!!

1 Introduction

Once the data have been collected, data science practice recommends an exploratory data analysis to understand the dataset and its characteristics. In time series analysis and forecasting projects, this takes the form of visualizations and the extraction of time series features. We shall explore these concepts using the datasets loaded so far in the workshop.

2 First things first: Ensure that you have a time series dataset.

On many occasions, even though the data were collected for a time series analysis, the timeline may not have been captured adequately for one. The dataset may have been imported into the analysis environment as a spreadsheet or a data frame. Therefore, at the outset, we should look at the variable containing the timeline and ensure that the dataset is indeed time series data.

2.1 Understanding the time stamps.

In any time series project, it is critical to understand whether the data have been collected as a regular or an irregular time series. Further, depending upon the research question and the needs of the study, all the datasets require a common time stamp before forecasting models can be created in the subsequent phases of analysis and interpretation. Let's look at the datasets to understand this further.

2.1.1 Time Series Air Quality Data of Manali (2010-2023)

We shall be using the function glimpse() which makes it possible to see every column in a data frame and understand the class of the variables.

df_aqi |> 
  glimpse()

Rows: 458
Columns: 18
$ `Regional Office Lab` <chr> "RO-Kullu", "RO-Kullu", "RO-Kullu", "RO-Kullu", …
$ City                  <chr> "Manali", "Manali", "Manali", "Manali", "Manali"…
$ `Station Name`        <chr> "Manali-I", "Manali-I", "Manali-I", "Manali-I", …
$ `Sample Date`         <dttm> 2020-04-08, 2020-04-10, 2020-04-20, 2020-04-22,…
$ `PM10 (µg/m³)`        <dbl> 12.31, 14.87, 9.45, 8.89, 9.09, 11.93, 11.04, 16…
$ `PM2.5 (µg/m³)`       <dbl> 8.31, 6.23, 4.25, 6.24, 5.00, 8.33, 8.39, 8.39, …
$ `SO₂ (µg/m³)`         <dbl> 1.08, 1.20, 1.09, 1.21, 1.13, 1.47, 1.62, 1.50, …
$ `NOₓ (µg/m³)`         <dbl> 3.62, 4.19, 3.78, 4.00, 3.91, 4.31, 4.94, 4.57, …
$ `NH₃ (µg/m³)`         <dbl> 2.16, 1.63, 1.64, 1.38, 1.62, 1.19, 1.84, 1.70, …
$ `O₃ (µg/m³)`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Pb (µg/m³)`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `CO (mg/m³)`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `C₆H₆ (µg/m³)`        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `BaP (ng/m³)`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `As (ng/m³)`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Ni (ng/m³)`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ AQI                   <chr> "14", "15", "9", "10", "9", "14", "14", "16", "2…
$ `AQI Condition`       <chr> "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", …

Looking at the classes, the Sample Date is stored as a dttm (date-time) variable, whereas only the date is needed. The glimpse() output above shows PM10 (µg/m³) and the other pollutant concentrations as dbl (numeric), but AQI is stored as a character variable rather than numeric. Therefore, we shall use the ymd() function from the package lubridate to create a date variable, and as.numeric() to make sure that PM10 (µg/m³) is numeric; the same conversion applies to any variable that has been imported as character.

Note

The lubridate package from tidyverse is a powerful tool for handling dates and times in a dataset.

df_aqi <- df_aqi |> 
  mutate(`Sample Date` = ymd(`Sample Date`)) |> 
  mutate(`PM10 (µg/m³)` = as.numeric(`PM10 (µg/m³)`))
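Before converting a character column such as AQI to numeric, it can help to check which entries would fail the conversion. The following is a minimal base R sketch on a hand-typed vector; the name aqi_raw and the entry "n/a" are assumptions made for illustration and are not taken from the workshop data.

```r
# Hypothetical character values; "n/a" is an invented non-numeric entry
aqi_raw <- c("14", "15", "9", "n/a")

# as.numeric() returns NA (with a warning) for entries it cannot parse
aqi_num <- suppressWarnings(as.numeric(aqi_raw))

# Entries that were not missing originally but became NA after conversion
failed <- aqi_raw[is.na(aqi_num) & !is.na(aqi_raw)]

aqi_num
failed
```

Inspecting the failed entries first shows whether they are genuine missing values or typing errors that should be corrected at the source.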

Let us now look at the time stamps in the dataset.

df_aqi |> 
  select(`Sample Date`, `PM10 (µg/m³)`) |> 
  head()

  Sample Date PM10 (µg/m³)
1  2020-04-08        12.31
2  2020-04-10        14.87
3  2020-04-20         9.45
4  2020-04-22         8.89
5  2020-04-24         9.09
6  2020-05-04        11.93

df_aqi <- df_aqi |> 
  select(`Sample Date`, `PM10 (µg/m³)`)
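A quick way to see how the sampling dates are spaced is to look at the gaps between consecutive dates. The sketch below is only illustrative: it uses base R and the six dates printed above, typed in by hand as the vector sample_dates (a name assumed here, not part of the workshop code).

```r
# The six sampling dates shown in the output above, entered by hand
sample_dates <- as.Date(c("2020-04-08", "2020-04-10", "2020-04-20",
                          "2020-04-22", "2020-04-24", "2020-05-04"))

# Number of days between consecutive sampling dates
gaps <- as.integer(diff(sample_dates))
gaps

# A regular series would have a single, constant gap
is_regular <- length(unique(gaps)) == 1
is_regular
```

With the full dataset, the same idea can be applied to the whole `Sample Date` column instead of a hand-typed vector.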

The sampling dates are not daily and are irregularly spaced. Therefore, to proceed further, we shall aggregate them to monthly time stamps.

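# Collapse the irregular sampling dates to one value per month
# (yearmonth() comes from the tsibble package)
df_aqi <- df_aqi |> 
  mutate(`Sample Date` = yearmonth(`Sample Date`)) |> 
  group_by(`Sample Date`) |> 
  summarise(`PM10 (µg/m³)` = mean(`PM10 (µg/m³)`, 
                                  na.rm = TRUE))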

# A tibble: 6 × 2
  `Sample Date` `PM10 (µg/m³)`
          <mth>          <dbl>
1      2020 Apr           10.9
2      2020 May           19.8
3      2020 Jun           33.1
4      2020 Jul           33.5
5      2020 Aug           26.2
6      2020 Sep           43.0

Caution

Choosing a time stamp is of utmost importance in time series analysis when dealing with irregular time series data. It is advisable to use a parsimonious approach based on both visualizations and statistical measures for noise and entropy for decision making.
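As one simple aid to this decision, the number of empty periods under each candidate time stamp can be counted. The sketch below is a rough heuristic and not a replacement for the visual and statistical checks mentioned above; it uses base R and a hand-typed vector of the first six sampling dates (the name sample_dates is an assumption made for illustration).

```r
# First six sampling dates from the air quality data, entered by hand
sample_dates <- as.Date(c("2020-04-08", "2020-04-10", "2020-04-20",
                          "2020-04-22", "2020-04-24", "2020-05-04"))

# Count observations per calendar week and per calendar month;
# cut() keeps periods without observations as levels with a count of 0
weekly  <- table(cut(sample_dates, breaks = "week"))
monthly <- table(cut(sample_dates, breaks = "month"))

# Periods that would be missing under each candidate time stamp
empty_weeks  <- sum(weekly == 0)
empty_months <- sum(monthly == 0)

weekly
monthly
```

A time stamp that leaves many empty periods would need imputation before modelling, whereas a coarser one averages away more detail.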

3 Creating time series objects.

Now that we are aware of the time stamps, the next step is to create tsibble objects for extracting time series characteristics from the dataset. A time series can be thought of as a list of numbers (the measurements), along with some information about what times those numbers were recorded (the index). This information can be stored as a tsibble object in R. We shall be using the tsibble() function from the package tsibble for this.

The index variable

tsibble objects extend tidy data frames (tibble objects) by introducing temporal structure. In the example below, the time series index is set to the Year column, which associates the measurements with the time of recording.

df_disaster <- df_disaster |> 
  tsibble(index = Year)

# A tsibble: 6 x 2 [1Y]
   Year deaths_earthquakes
  <dbl>              <dbl>
1  1900                560
2  1901                 72
3  1902              31944
4  1903              26520
5  1904                612
6  1905              83473

When the time stamps are quarterly, monthly, or weekly, the index column is first created with the corresponding tsibble function: yearquarter(), yearmonth(), or yearweek().
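The index functions for these time stamps can be compared on the same dates. This is a small sketch assuming the tsibble package is installed; the vector d holds two of the sampling dates shown earlier, typed in by hand for illustration.

```r
library(tsibble)

# Two example dates, entered by hand
d <- as.Date(c("2020-04-08", "2020-05-04"))

# Convert the same dates to weekly, monthly and quarterly time stamps
wk  <- yearweek(d)
mth <- yearmonth(d)
qtr <- yearquarter(d)

# Show the three representations side by side
data.frame(date    = d,
           week    = format(wk),
           month   = format(mth),
           quarter = format(qtr))
```

Whichever function is used, the resulting column is then passed as the index when the tsibble is created.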

The key variable

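A tsibble also allows multiple time series to be stored in a single object. The variable or variables that identify each series are declared as the key; for example, a column separating the time series of men and women. The road traffic accident data below are aggregated into a single monthly series, so only the index is specified and no key is needed.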

df_rta <- df_rta |> 
  mutate(timeline = dmy(timeline)) |>              # parse day-month-year values as dates
  mutate(Month = yearmonth(timeline)) |>           # monthly time stamp
  group_by(Month) |> 
  summarise(accidents = sum(`Total Accident`)) |>  # total accidents per month
  tsibble(index = Month)

# A tsibble: 6 x 2 [1M]
     Month accidents
     <mth>     <int>
1 2014 Jan     41954
2 2014 Feb     39899
3 2014 Mar     42524
4 2014 Apr     39867
5 2014 May     45404
6 2014 Jun     42448
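Since the road traffic accident series above has no key, a separate sketch may help to show how one is declared. The data frame df_toy below is made up for illustration (the column names and counts are assumptions, not workshop data), and as_tsibble() from the tsibble package is used to convert it.

```r
library(tsibble)

# Made-up yearly counts for two groups; the values are illustrative only
df_toy <- data.frame(
  Year  = rep(2019:2021, times = 2),
  sex   = rep(c("men", "women"), each = 3),
  cases = c(10, 12, 9, 8, 11, 7)
)

# 'sex' identifies each series (the key); 'Year' orders the observations (the index)
ts_toy <- as_tsibble(df_toy, key = sex, index = Year)

key_vars(ts_toy)  # name(s) of the key variable(s)
n_keys(ts_toy)    # number of distinct series stored in the object
ts_toy
```

Each combination of key and index must be unique, so every group may have at most one observation per time stamp.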