Introduction to time series analysis

Welcome to the world of Time Series Analysis and forecasting!!

1 Unveiling the Dynamics of Time: An Introduction to Time Series Data Analysis and Forecasting.

In the ever-evolving landscape of data science, understanding the intricacies of time series data has become increasingly crucial. As we embark on a journey into the realm of temporal patterns and sequential dependencies, this workshop aims to demystify the concepts and methodologies that constitute time series analysis and forecasting. Time series data, a distinctive form of information, holds the key to unraveling trends, forecasting future events, and gaining profound insights into diverse fields such as finance, economics, climate science, and beyond. So, lets dive in!!.

2 What is time series data?

Time series data is a type of data that is collected or recorded over a sequence of time intervals. Time series data is one of the most common formats of data, and it is used to describe an event or phenomena that occurs over time. Time series data has a simple requirement—its values need to be captured at equally spaced time intervals, such as seconds, minutes, hours, days, months, and so on. This important characteristic is one of the main attributes of the series and is known as the frequency of the series. We usually add the frequency along with the name of the series. In healthcare, some of the examples of time series data include:

Daily patient records
Hourly test results
Monthly disease incidence rates
Daily air quality indicators
Weekly admissions to an emergency department
Annual expenditures on health care

A univariate time series is a sequence of measurements of the same variable collected over time.

Think! How data collection change your analysis strategy?

How would you collect data regarding hospitalizations? Would you like to have data wherein the last hospitalization record is maintained and updated or would you like to have a single data entry point for each hospitalization?

3 Why time series analysis and forecasting?

3.1 Time series analysis.

Time series analysis is the art of extracting meaningful insights and revealing patterns from time series data using statistical and data visualization approaches. These insights and patterns can then be utilized to explore past events and forecast future values in the series. There are three different aims of Time series Analysis:-

Descriptive analysis.
Explanatory analysis.
Forecasting.

3.2 Forecasting

Forecasting in healthcare is crucial for various aspects of planning, resource allocation, and managing patient care efficiently. Here are some examples where forecasting plays a vital role:

Disease Outbreaks and Epidemic Prediction: Forecasting the spread of infectious diseases, such as influenza or COVID-19, helps in preparing healthcare systems for potential surges in cases. It aids in vaccine distribution, setting up quarantine measures, and ensuring sufficient medical supplies and staff are available.
Patient Admission Rates: Hospitals and clinics use forecasting to predict patient admission rates. This helps in staffing decisions, ensuring there are enough healthcare professionals on duty to meet demand, and in planning bed occupancy rates to optimize the use of available resources.
Pharmaceutical Supply Chain: Forecasting is used to predict the demand for various medications, helping pharmacies and hospitals maintain an optimal inventory. This is crucial for managing costs, reducing waste, and ensuring that essential medicines are always in stock, especially for chronic conditions or in emergency situations.
Surgical and Procedure Needs: By predicting the demand for certain types of surgeries or medical procedures, healthcare providers can better schedule operating rooms, allocate medical staff, and ensure the necessary equipment and supplies are available, thereby improving patient care and operational efficiency.
Staffing Requirements: Forecasting helps predict staffing needs based on various factors, including seasonal trends in illnesses, epidemic outbreaks, or changing demographics. This ensures that there are enough healthcare workers, including doctors, nurses, and support staff, to provide quality care without overburdening the existing workforce.
Healthcare Policy and Infrastructure Planning: Long-term forecasts are used by policymakers to plan healthcare infrastructure, such as the construction of new hospitals or clinics, expansion of existing facilities, and investment in new technologies. These forecasts consider population growth, aging, and changes in disease prevalence.
Preventive Care Needs: By forecasting trends in various diseases or health conditions, healthcare providers can plan and implement preventive care measures more effectively. This could include vaccination campaigns, public health initiatives, or screening programs aimed at early detection of conditions like cancer, diabetes, or heart disease.

Forecasting in healthcare utilizes a variety of data sources, including historical health data, demographic trends, environmental factors, and epidemiological models. The goal is to make informed decisions that improve patient care, enhance operational efficiency, and effectively manage resources. However, it is important to understand that the predictability of an event or a quantity depends on several factors including:

How well we understand the factors that contribute to it;
How much data is available;
How similar the future is to the past;
Whether the forecasts can affect the thing we are trying to forecast.

4 Datasets used in the workshop.

4.1 Time Series Air Quality Data of Manali (2010-2023)

Table 1:
Overview: Air Quality
Regional Office Lab	City	Station Name	Sample Date	PM10 (µg/m³)	PM2.5 (µg/m³)	SO₂ (µg/m³)	NOₓ (µg/m³)	NH₃ (µg/m³)	O₃ (µg/m³)	Pb (µg/m³)	CO (mg/m³)	C₆H₆ (µg/m³)	BaP (ng/m³)	As (ng/m³)	Ni (ng/m³)	AQI	AQI Condition
RO-Kullu	Manali	Manali-I	2020-04-08 00:00:00	12.31	8.31	1.08	3.62	2.16								14	GOOD
RO-Kullu	Manali	Manali-I	2021-01-13 00:00:00	48.86		2.00	11.24	3.06								49	GOOD

The dataset shared for the workshop (Table 1) is a subset of air quality data from SPCB website. The shared dataset includes data for one randomly choosen station in Manali, HP. The dataset includes data on PM 10, Pm 2.5, AQI and multiple other parameters from the meterological station.

df_aqi <- rio::import(here::here("data",
                                 "aqidata_IPHACON2024.xlsx"))

4.2 Global Deaths from Earthquakes (1900-2023)

Table 2:
Overview: Earthquake deaths
Year	deaths_earthquakes
1,900	560
1,920	731,992
1,999	87,619
2,023	218,608

As shown in Table 2 the dataset used in the workshop is created to show global number of deaths due to earthquakes. Poverty, disease, hunger, climate change, war, existential risks, and inequality: The world faces many great and terrifying problems. It is these large problems that the work at Our World in Data focuses on. Our World in Data’s mission is to publish the “research and data to make progress against the world’s largest problems”.

df_disaster <- rio::import(here::here("data",
                       "earthquakes_IPHACON2024.xlsx"))

4.3 Monthly Road Traffic accidents in India (2014-2018)

Table 3:
Overview: RTAs in India
timeline	Total Accident
01/04/2014	39,867
01/08/2018	35,845
01/11/2018	38,417
01/09/2018	35,387

As shown in Table 3 the dataset used in the workshop is subset of data taken from https://data.gov.in/. It has the number of road accidents from 2014-2018 in India.

df_rta <- rio::import(here::here("data",
                       "RTA_monthly_India_IPHACON2024.csv"))

4.4 Maternal Mortality Ratio: India (2000-2020)

Table 4:
Overview: Reporting
Year	MMR
2,000	384
2,004	301
2,020	103

As shown in Table 4 the dataset used in the workshop is Maternal mortality ratio is the number of women who die from pregnancy-related causes while pregnant or within 42 days of pregnancy termination per 100,000 live births. The data are estimated with a regression model using information on the proportion of maternal deaths among non-AIDS deaths in women ages 15-49, fertility, birth attendants, and GDP measured using purchasing power parities (PPPs). The source of the dataset is WHO, UNICEF, UNFPA, World Bank Group, and UNDESA/Population Division. Trends in Maternal Mortality 2000 to 2020. Geneva, World Health Organization, 2023

df_mmr <- rio::import(here::here("data",
                       "mmr_IPHACON2024.xls"))

4.5 Campylobacter cases in Germany (2001-2011)

Table 5:
Overview: Campylobacter cases
date	case
2001-12-31 00:00:00	514
2002-01-28 00:00:00	815
2002-05-20 00:00:00	869

As shown in Table 5 the dataset used in the workshop is the counts of campylobacter cases reported in Germany between 2001 and 2011. The dataset is obtained from the surveillance package

df_campylo <- rio::import(here::here("data",
                       "campylobacter_IPHACON2024.xlsx"))

4.6 M750 data

Table 6:
Overview: M750
id	date	value
M750	1990-01-01 00:00:00	6,370
M750	1994-02-01 00:00:00	7,240
M750	2006-08-01 00:00:00	8,580
M750	2015-06-01 00:00:00	11,000

The Table 6 represents data from the fourth M Competition. M4, started on 1 January 2018 and ended in 31 May 2018. The competition included 100,000 time series datasets. This dataset includes The 750th Monthly Time Series used in the competition.

df_750 <- rio::import(here::here("data",
                       "m750_IPHACON2024.xlsx"))

4.7 Anti-Diabetes drug sales

Table 7:
Overview: Anti-Diabetic Drug sales
Month	TotalC
1991 Jul	3,526,591
1995 Aug	5,855,277
2003 Dec	16,503,966
2008 Feb	21,654,285

The Table 7 represents data from the fourth M Competition. M4, started on 1 January 2018 and ended in 31 May 2018. The competition included 100,000 time series datasets. This dataset includes The 750th Monthly Time Series used in the competition.

df_diab <- PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost))

4.8 COVID-19 data

We shall be using this dataset for exercises and hence is not introduced here.