Piecewise linear modelling and change-point analysis of COVID-19 outbreak in Malaysia

In Malaysia, COVID-19 was first detected as imported cases on 25 January 2020 and as a local infection on 4 February 2020. A surge of positive cases followed in March 2020, which led to a series of countrywide containment and mitigation measures known as the Movement Control Order (MCO). We study the direct effects of the MCO on the course of the epidemic by analyzing the cumulative and daily COVID-19 infection cases up to 31 December 2020 in Malaysia and its states, using piecewise linear regression and the segment neighborhood algorithm of change-point analysis, respectively. Piecewise regression on the nationwide cases suggests that the MCO almost flattened the epidemic curve within one month of its introduction. For the statewise cases, the series of concave-downward segments lasts six months on average before the curve turns concave upward, which indicates the period over which a deceleration of new cases can be expected. However, the start of the subsequent wave of COVID-19 can vary by about three months across states and federal territories. Together with the change-point analysis of daily cases, the statewise epidemic can be subdivided into two to four regimes, with the majority of phase transitions falling in April and the last quarter of 2020. Overall, the statistical modelling indicates that the MCO was effective in the short term.


Introduction
Coronavirus Disease 2019 (COVID-19) is an acute respiratory illness caused by a novel strain of coronavirus, namely Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), believed to be derived from bats [1−2]. The disease first emerged in December 2019 in Wuhan city of Hubei Province, the epicenter of the COVID-19 outbreak in China. Within three months, the disease had spread on a global scale, prompting the World Health Organization (WHO) to declare it a pandemic in March 2020 [3]. As of 23 February 2021, approximately 111,400,000 laboratory-confirmed cases of COVID-19 had been reported worldwide, with 2,470,772 fatalities, according to the WHO [4].
Malaysia joined the extensive list of countries affected by the novel coronavirus when three imported cases of COVID-19 were detected on 25 January 2020 among Chinese nationals who had entered Malaysia via Singapore on 23 January 2020 [5]. The first local infection was identified on 4 February 2020, in a patient who had travelled to Singapore for business purposes [6−7]. Two days later, Malaysia announced its first case of COVID-19 infected via local transmission [7]. A surge of positive cases followed in March 2020, and the first reported death was confirmed on 17 March 2020 [8]. In response to the rapid growth of cases in Malaysia, a country-wide lockdown formally known as the Movement Control Order (MCO) was implemented on the very next day. Subsequently, several different stages of the MCO have been enforced as the country's containment and mitigation strategies [9].
Prior modelling studies of the Malaysian COVID-19 situation have focused on deterministic ordinary differential equation (ODE) models rather than on a statistical approach, with several authors applying the compartmental framework to investigate the transmission dynamics of COVID-19 in Malaysia [9−15]. Five groups of researchers [10][11][12][13][15] developed the classic compartmental Susceptible-Infectious-Recovered (SIR) model, and the infection peak was predicted to fall in the middle of March 2020 [10][11][12][13]. Two groups of authors [9][14] accounted for the incubation period by including an additional Exposed compartment, thereby using the Susceptible-Exposed-Infectious-Recovered (SEIR) framework in their models.
At the time of writing and to our knowledge, only two groups of authors [16][17] have opted for a statistical approach in studying the transmissibility and transmission pattern of COVID-19 in Malaysia. Amiruzzaman et al. [16] employed the logistic growth curve to model the Malaysian COVID-19 infection and suggested that the first stage of the MCO had a more substantial impact in terms of virus containment. Erde et al. [17] carried out time series forecasting of COVID-19 infected cases using the autoregressive integrated moving average (ARIMA) model. Their findings indicate that the MCO measures could stabilize the growth rate of COVID-19 infection in Malaysia.
Given the dearth of such studies, our work aims to understand the various stages of the MCO in Malaysia using piecewise linear regression on the cumulative COVID-19 positive cases and change-point analysis on the daily cases. Statewise cumulative and daily cases for all thirteen states and three federal territories from 13 March to 31 December 2020 are included in the analysis. We aim to give a coarse description of the spatio-temporal variability of the COVID-19 spread in Malaysia through statistical analysis.
A time lag is commonly expected between the date on which epidemic mitigation measures are initiated and the abrupt change in the disease spreading rate. Such a lag can be caused by observation delay, which is a combination of the incubation period and delayed test results [18]. Consequently, it is difficult to predetermine the exact date on which the MCO shows an effect. Hence, we perform a time series structural change analysis by first partitioning the cumulative infection data into several linear segments with different slopes through a piecewise linear regression model, so as to analyze the trajectory of the COVID-19 infection curve in Malaysia in terms of its growth rate and the emergence of transition phases. As the observations within each phase could share common statistical properties such as mean and variance, we then perform change-point analysis of the daily infection cases to further assess the division of the cumulative infection curve obtained from piecewise regression.
Similar to the construction of the piecewise linear trend model in [19], our simple inspection of the cumulative infection data enables us to better understand the association between the growth patterns of COVID-19 cases and the dates on which the various stages of the MCO were implemented in Malaysia. This offers a timely evaluation of the effectiveness and impact of the MCO in combating the spread of COVID-19 in Malaysia as a whole and in its individual states, because the first or second breakpoint after the steepest segment in the piecewise linear regression is closely related to the date on which the containment measures were put in place. We use change-point analysis as a second approach to quantify where one period ends and the next begins. This is important for verifying whether the segmentation from piecewise regression is sensible once the statistical properties of the segments are taken into account.
In the following, we briefly describe the scope of the data used. We then model the logarithm of the cumulative confirmed cases using piecewise regression and study the statistical properties of the segmentation of daily infection cases through change-point analysis. The results show that several transition phases can be derived from both methods, which in turn helps in evaluating the effect of containment measures on the COVID-19 outbreak in Malaysia.

Data source
We focus on the cumulative COVID-19 infection cases for entire Malaysia as well as for all 13 states and 3 federal territories. We obtain the COVID-19 data for entire Malaysia from https://ourworldindata.org/coronavirus-source-data, maintained by "Our World in Data". At the state level, we analyze the data published by the Ministry of Health Malaysia on the cumulative number of COVID-19 infected cases from 13 March to 31 December 2020.
We first show the evolution of the cumulative infected cases in entire Malaysia in figure 1, which also depicts the stages of the Movement Control Order (MCO) (see table 1) implemented by the federal government of Malaysia in response to the COVID-19 pandemic. Even though the nation's first detection of COVID-19 was reported on 25 January 2020, the two states in the eastern part of Malaysia (Sabah and Sarawak) remained COVID-free until 13 and 12 March, respectively. Hence, we focus on the cumulative and daily confirmed cases of COVID-19 in all thirteen states and three federal territories from 13 March 2020 until 31 December 2020, which marks the end of one of the stages of the MCO.

Piecewise linear regression model
It is well known that the early stage of an epidemic typically exhibits exponential growth, $I(t) = I_0 e^{rt}$, which is the solution to the simple first-order ordinary differential equation

$$\frac{dI}{dt} = r\,I(t), \qquad (1)$$

where $I(t)$ is the time-dependent number of infected cases, $I_0$ is the initial number of infected cases and $r$ is the epidemic growth rate. Equation (1) implies that the rate of change of the cumulative infected cases is proportional to the current cumulative cases. To model this relationship by fitting a linear equation, we take the natural logarithm of both sides of $I(t) = I_0 e^{rt}$ and let $r \equiv \hat{b}$ and $\ln I_0 \equiv \hat{a}$ to arrive at

$$\ln \hat{I}(t) = \hat{a} + \hat{b}\,t, \qquad (2)$$

where $\hat{I}(t)$ is the predicted cumulative number of reported infected cases at day $t$, $\hat{a}$ is the estimated intercept on the vertical axis and $\hat{b}$ is the estimated slope coefficient. The left-hand side of equation (2) is in natural logarithmic scale, so the estimated slope $\hat{b}$ naturally measures the growth rate of the epidemic at day $t$. Differentiating both sides of equation (2) with respect to $t$ and dropping the hat symbols recovers the exponential model (1). The slope $\hat{b}$ can also be used to estimate the epidemic doubling time, $\Delta t = (\ln 2)/\hat{b}$. Hence, we carry out linear regression on the logarithmic transformation of the cumulative infected curves of Malaysia and of all states and federal territories, and the estimated slope gives the growth rate $r$ of the infection. The growth rate is important in measuring the spreading speed of an epidemic. If the rate is positive (resp. negative), the disease can (resp. cannot) invade a population, so a zero rate acts as the disease threshold. However, as linear regression assumes a uniform growth rate over the entire period, it may not be realistic for modelling an epidemic in the presence of interventions.
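The log-linear fit and the doubling-time formula above can be illustrated with a minimal sketch. It assumes NumPy is available; the function names `fit_log_linear` and `doubling_time` are hypothetical, and the series is synthetic and noise-free rather than real case data. The growth rate 0.0641 is borrowed from the Sabah segment discussed later in the paper so that the doubling time can be compared with the value quoted there.

```python
import numpy as np

def fit_log_linear(cumulative_cases):
    """Least-squares fit of ln I(t) = a + b*t; returns (a, b)."""
    y = np.log(np.asarray(cumulative_cases, dtype=float))
    t = np.arange(len(y), dtype=float)
    b, a = np.polyfit(t, y, 1)  # polyfit returns the highest degree first
    return float(a), float(b)

def doubling_time(slope):
    """Doubling time (ln 2) / slope, in the same time unit as t (days)."""
    return float(np.log(2.0) / slope)

# Synthetic, noise-free series (NOT real data): I(t) = 50 * exp(0.0641 * t).
t = np.arange(30)
synthetic_cases = 50.0 * np.exp(0.0641 * t)

a_hat, b_hat = fit_log_linear(synthetic_cases)
print(round(b_hat, 4), round(doubling_time(b_hat), 2))  # prints: 0.0641 10.81
```

On real, noisy counts the recovered slope would of course only approximate the underlying growth rate, and a single slope is adequate only while the growth rate stays roughly constant.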
Because the time-dependent cumulative case data behave non-linearly, no single linear function fits the cumulative infected cases in logarithmic scale, either for Malaysia as a whole or for any of the states and federal territories. Hence, we employ piecewise (a.k.a. segmented) linear regression to construct best-fit linear functions over different ranges of $t$. A segmented linear model is defined by slope parameters and by breakpoints at which the slope of the linear function changes between two adjacent segments. For a continuous fit it may be written as

$$\ln \hat{I}(t) = \hat{a} + S_1 t + \sum_{j=1}^{m} (S_{j+1} - S_j)\,(t - \tau_j)\,\mathbb{1}(t - \tau_j),$$

where $\hat{I}(t)$ is the predicted cumulative number of cases at day $t$, $\hat{a}$ is the intercept, $S_j$ is the slope of segment $j$ ($j = 1, 2, \ldots, m+1$), $\tau_j$ is the location of the $j$-th slope change (i.e. breakpoint), $\mathbb{1}(t - \tau_j)$ is an indicator function equal to one when $t > \tau_j$ and zero otherwise, and $m$ is the number of breakpoints.
Having said that, in piecewise curve fitting the number of breakpoints is typically unknown. Since the cumulative infection curves generally pass from an exponential regime to moderate growth and then to a flatter curve, we start the curve-fitting procedure with a minimum of two breakpoints, whose initial guesses are determined visually. R-squared measures how well the fit explains the variation in the data, with a value closer to 1 indicating a better fit. Whenever an adjusted R-squared greater than 0.99 is obtained, the segmented linear model is accepted as the best-fit piecewise function. Otherwise, the fitting procedure is repeated with an additional initial breakpoint guess placed on the portion of the fitted curve that does not match the data points well. The best-fit piecewise linear function for entire Malaysia is shown, together with its cumulative infection curve on a semi-log (log-y) plot, in figure 2. The corresponding plots for all states and federal territories are displayed in figure 3, in descending order of total cases as of 31 December 2020.
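The inner step of this procedure, fitting a continuous piecewise linear function for a given set of breakpoints and checking its adjusted R-squared against 0.99, can be sketched as follows. This is a simplified illustration under stated assumptions, not the fitting routine used for figures 2 and 3: the breakpoints are held fixed rather than estimated from initial guesses, the function name `fit_piecewise` is hypothetical, NumPy is assumed, and the data are synthetic.

```python
import numpy as np

def fit_piecewise(t, y, breakpoints):
    """Continuous piecewise-linear least-squares fit with FIXED breakpoints.

    Returns (intercept, slopes, adjusted_r2); slopes has one entry per
    segment, i.e. len(breakpoints) + 1 entries.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: constant, t, and one hinge max(t - bp, 0) per breakpoint.
    cols = [np.ones_like(t), t] + [np.maximum(t - bp, 0.0) for bp in breakpoints]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # coef[1] is the first slope; each hinge coefficient is a slope change.
    slopes = np.cumsum(coef[1:])
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, p = X.shape  # p includes the intercept column
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return float(coef[0]), slopes, adj_r2

# Synthetic log-cumulative curve (NOT real data): slopes 0.19, 0.05 and 0.005
# with breakpoints at days 10 and 40.
t = np.arange(100, dtype=float)
y = 3.0 + 0.19 * t - 0.14 * np.maximum(t - 10, 0) - 0.045 * np.maximum(t - 40, 0)

_, slopes, adj_r2 = fit_piecewise(t, y, breakpoints=[10, 40])
_, _, adj_r2_line = fit_piecewise(t, y, breakpoints=[])  # single straight line
print(np.round(slopes, 3), adj_r2 > 0.99, adj_r2_line > 0.99)
```

In this toy case the single straight line fails the 0.99 criterion while the two-breakpoint model passes it, which mirrors the rule of adding breakpoints only until the threshold is met.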

Change point analysis
Apart from partitioning the cumulative infection curves into several segments with different slopes (and intercepts), we examine the daily cases of COVID-19 in all states and federal territories through change-point analysis of their statistical properties (mean, variance, or both). In this paper, we reserve the term "change point" for an instant in time at which the statistical properties before and after differ, whereas the term "break point" denotes the transition between different slopes of the piecewise linear regression model, even though the two terms are usually interchangeable in time series structural analysis. With this distinction, we give a simple comparison of the abrupt variations seen from the perspective of the aggregated (cumulative) cases and of the individual (daily) cases. That is, we ask to what extent the shifts of the epidemic phase in terms of the disease growth rate agree with the shifts in the mean or variance of the daily cases.
Rather than searching for the optimal number and positions of change points in the mean and/or variance of the daily infected cases, we set the number of change points equal to the number of breakpoints in the fitted piecewise regression model for the corresponding cumulative infection curve. For this purpose, we employ the segment neighborhood algorithm [20] in the changepoint package [21], presetting the number of change-points to search for. We run the algorithm to detect abrupt changes in three statistical properties, namely the mean, the variance, and the combination of mean and variance. Only the property whose change-points lie closest to the breakpoints of the piecewise linear regression counterpart is used for comparison.
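The idea of searching for a preset number of change-points can be illustrated with a small dynamic-programming sketch. It is not the code of the changepoint package: it is a plain-Python illustration that assumes a squared-error cost and therefore detects changes in the mean only, the function name `segment_neighbourhood_mean` is hypothetical, and the series is synthetic.

```python
def segment_neighbourhood_mean(x, n_changepoints):
    """Exact search for a fixed number of change-points in the mean.

    The cost of a segment is its sum of squared deviations from its own mean.
    Returns the sorted indices k at which a new segment starts (x[k]).
    """
    n = len(x)
    # Prefix sums let each segment cost be evaluated in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):  # cost of the segment x[i:j], with j > i
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    n_seg = n_changepoints + 1
    # best[q][j]: minimal total cost of splitting x[:j] into q segments.
    best = [[inf] * (n + 1) for _ in range(n_seg + 1)]
    back = [[0] * (n + 1) for _ in range(n_seg + 1)]
    best[0][0] = 0.0
    for q in range(1, n_seg + 1):
        for j in range(q, n + 1):
            for i in range(q - 1, j):
                c = best[q - 1][i] + cost(i, j)
                if c < best[q][j]:
                    best[q][j] = c
                    back[q][j] = i
    # Trace back the start index of each segment after the first.
    cps, j = [], n
    for q in range(n_seg, 1, -1):
        j = back[q][j]
        cps.append(j)
    return sorted(cps)

# Synthetic daily counts (NOT real data) with mean shifts at indices 20 and 35.
daily = [2] * 20 + [10] * 15 + [3] * 25
change_points = segment_neighbourhood_mean(daily, n_changepoints=2)
print(change_points)  # prints: [20, 35]
```

The search is exact for the chosen cost but takes time proportional to the number of segments times the square of the series length, which is unproblematic for a series of a few hundred daily observations.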

Analysis of cumulative confirmed cases in entire Malaysia
At the nationwide level, the growth rate in the first segment ($S_1 = 0.1016$) in figure 2 reflects the first wave of COVID-19 in Malaysia, which can be attributed to imported cases among foreign nationals entering Malaysia through Singapore and China [5]. During this period, the government had to conduct extensive health screening with thermal scanners and check-ups at all entry points into the country. The effect of these containment measures is reflected in the reduced slope of the second segment ($S_2 = 0.0130$), during which the bulk of infected cases shifted from imported cases to local transmission. The largest slope is found on the third segment ($S_3 = 0.1924$), which lies between 29 February and 23 March. Since exponential laws appear as straight lines in logarithmic scale, this period can be regarded as the most significant exponential regime of the epidemic curve for entire Malaysia. The rapid growth of COVID-19 cases in this period was strongly associated with a religious mass gathering held at a Malaysian mosque from 27 February to 1 March, which started the second wave of COVID-19 infections in the country [22]. The Malaysian Health Ministry also declared it the first local transmission cluster in Malaysia [23]. To combat this second wave, the Movement Control Order (MCO) was initiated on 18 March 2020. From our piecewise linear model, we find that the epidemic growth rate became moderate after 23 March and declined further to an almost flat curve from 20 April onwards. However, from 21 September onwards, Malaysia as a whole can be considered to have entered the third wave of COVID-19, with a slope of $S_6 = 0.0255$ on the corresponding linear segment.

Analysis of cumulative confirmed cases in 13 states and 3 federal territories
At the state level, the semi-log (log-y) plots of cumulative infection cases for all states and federal territories, together with their fitted piecewise linear regression models, are shown in figure 3. Unlike nations with cumulative cases of the order of $10^5$ to $10^6$, which experienced a transition from an exponential to a power-law regime before the epidemic curve flattened [24], the cumulative cases for all states and federal territories in Malaysia up to 31 December 2020 range from tens to tens of thousands. Therefore, they can be fitted well by piecewise linear functions, with adjusted R-squared values all close to 1. As $m$ breakpoints split the data into $m+1$ segments, the slopes of these segments (i.e. $S_1, S_2, \ldots, S_m, S_{m+1}$) together with their breakpoints (BP) are given in table 2. From the slope of each line segment, the doubling period of COVID-19 cases can be estimated. For instance, at its third breakpoint (BP3) on 5 September 2020, the cumulative number of cases in Sabah is 433 and the estimated slope of line segment $S_4$ is 0.0641. Calculating $(\ln 2)/0.0641$ gives an estimated doubling period of 10.81 days. In fact, Sabah recorded 877 cumulative cases on 15 September 2020, ten days later, which is only slightly more than twice 433. This shows that the segmentation obtained from piecewise regression can be useful in capturing the spread of COVID-19 in the states and federal territories. Although we list the states and federal territories in decreasing order of total cases, the estimated number of breakpoints is apparently not related to that order. We highlight in bold the greatest slope among the linear segments in table 2 and find that all states and federal territories except Labuan have their highest slope in the first linear segment ($S_1$), and that all of these steepest segments (again except Labuan) end on or before 3 April 2020.
Generally speaking, this early stage of the pandemic often depicts the worst scenario before the interventions first show an effect. This suggests that the nationwide interventions by the federal government brought down the statewise slopes by a significant amount within roughly two weeks of the implementation of the MCO on 18 March 2020. Thereafter, the spread rates can be considered moderate or even slow for a window of more than four months. To further investigate these phase transitions, we underline in table 2 every slope that increases relative to the slope of the preceding segment, i.e. every segment at which the curve turns concave upward. On this account, we find that all states and federal territories have decreasing slopes (concave downward) over their first two segments. The average length of the series of slope-decreasing segments across all piecewise linear regressions is 6.6 months before the curve turns concave upward. However, the end of this series ranges from as early as the 134th day (25 July) for Kedah to as late as the 234th day (2 November) for Kuala Lumpur, which indicates that the start of the third wave of COVID-19 can vary by about three months across states and federal territories.
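The day numbers quoted above can be converted to calendar dates with a few lines of standard-library Python. The sketch assumes that days are counted from 13 March 2020 as day 0, an assumption that reproduces the two dates given in the text; the helper name `day_to_date` is hypothetical.

```python
from datetime import date, timedelta

START = date(2020, 3, 13)  # first day of the statewise series (assumed day 0)

def day_to_date(day_number, start=START):
    """Calendar date lying `day_number` days after `start`."""
    return start + timedelta(days=day_number)

kedah = day_to_date(134)
kuala_lumpur = day_to_date(234)
print(kedah, kuala_lumpur, (kuala_lumpur - kedah).days)
# prints: 2020-07-25 2020-11-02 100
```

The gap of 100 days between the two dates corresponds to the spread of about three months mentioned in the text.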
We then compare, in figure 4, the maximum slope (denoted $S_{max}$, the slopes in bold in table 2) with the slope after the latest breakpoint (denoted $S_{cur}$, the slope of the last segment in each row of table 2). Specifically, the vertical axis, $S_{max}$, gives the epidemic growth rate in the worst scenario for a particular state or federal territory, while the horizontal axis, $S_{cur}$, gives the ongoing epidemic growth rate on the cutoff date of this analysis (31 December 2020). By carrying out distance-based clustering, we find that states and federal territories within the same cluster tend to have similar maximum growth rates. It is interesting to note that several geographically neighboring states lie close to each other in figure 4. These include Selangor and Kuala Lumpur, Terengganu and Kelantan, as well as Perak and Pahang. Also, all six states and federal territories with a concave-upward slope in their last segment (namely Selangor, Kuala Lumpur, Johor, Perak, Penang and Melaka) appear at the rightmost positions of this plot.
The horizontal axis in figure 4 can be used for short-term prediction of statewise COVID-19 trajectories. If state A lies to the right of state B, then A has a greater potential for a surge in COVID-19 numbers after the cutoff date. For instance, Johor reported its highest daily count of 1069 cases on 27 January 2021 [25], more than twice its highest daily count in the preceding year. However, given the rapidly changing growth rate of COVID-19 [26], such a simple plot unavoidably underestimates the infection cases in certain states. As an example, Sarawak has continued to record much higher daily case counts [27], in contrast to its highest recorded daily count of just 32 throughout 2020. Sarawak is at the leftmost position in our plot, which could indicate a delay in the outbreak in that state.

Analysis of daily infection cases in 13 states and 3 federal territories
As the cumulative infection cases are highly non-stationary, we further investigate the possible underlying causes of the piecewise linear segmentation by applying the segment neighborhood algorithm of change-point analysis, which locates fundamental shifts in statistical properties such as the mean and variance of the daily infected cases. Three types of change-point are sought, namely an abrupt change in both mean and variance (denoted "meanvar"), an abrupt change in mean (denoted "mean") and an abrupt change in variance (denoted "var"). Among these three types, the one that best tallies with the corresponding breakpoints of the piecewise regression is chosen. Figure 5 gives the statewise COVID-19 daily infection cases with the best type of change-point for each state.
The time points (i.e. specific dates) of the change-points detected by the segment neighborhood algorithm are listed in table 3. For ease of comparison, we underline the change-points in table 3 that fall within 14 days of the corresponding breakpoint given in table 2. Except for Penang and Perak, all states and federal territories have at least two change-points that tally with their breakpoints. For certain states, the second-best type of change-point, which yields the same number of change-points tallying with the breakpoints, is also given in the column "Other" in table 3. Overall, the change in mean and variance (i.e. meanvar) gives the best results, suiting 7 out of 16 states and federal territories, especially those with a higher order of magnitude of cases. Despite the fact that the change in mean is very sensitive to outliers, five states have abrupt changes in the mean of their daily infection cases that best tally with their breakpoints. We then show the visual connection between breakpoints and change-points through a connected dot plot in figure 6, which is produced by plotting the underlined change-points in table 3 against their corresponding breakpoints in table 2. The positions of the pairs of dots illustrate the significant changes in both cumulative and daily infection cases for a particular state or federal territory. For instance, the trajectory of COVID-19 cases in Sabah can be segmented into three time windows on both measures, with the associated regime switches occurring in mid-April and early September 2020, respectively. On the other hand, both cumulative and daily cases in Perak and Penang underwent only one significant simultaneous structural change, in early October.
In a nutshell, for all states and federal territories, the period between 13 March and 31 December 2020 can be subdivided into two to four epidemic phases, with the majority of the phase transitions falling in April and the last quarter of 2020.
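The 14-day matching rule used to compare change-points with breakpoints can be sketched as below. The dates are hypothetical placeholders and are not taken from tables 2 or 3, the function name `match_within` is hypothetical, and treating a gap of exactly 14 days as a match is an assumption, since the text does not say whether the bound is inclusive.

```python
from datetime import date

def match_within(change_points, breakpoints, tolerance_days=14):
    """Pair each change-point with its nearest breakpoint if close enough."""
    matches = []
    for cp in change_points:
        nearest = min(breakpoints, key=lambda bp: abs((cp - bp).days))
        if abs((cp - nearest).days) <= tolerance_days:  # inclusive bound (assumed)
            matches.append((cp, nearest))
    return matches

# Hypothetical dates for illustration only; NOT values from tables 2 or 3.
breakpoints = [date(2020, 4, 3), date(2020, 7, 1), date(2020, 9, 10)]
change_points = [date(2020, 4, 10), date(2020, 8, 1), date(2020, 9, 6)]

matched = match_within(change_points, breakpoints)
print(len(matched))  # prints: 2
```

Here the first and third change-points lie 7 and 4 days from a breakpoint and are matched, whereas the second lies 31 days from its nearest breakpoint and is not.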

Conclusion
We have utilised piecewise linear regression and change-point analysis to analyse, respectively, the cumulative and daily COVID-19 infected cases in Malaysia and in all its states and federal territories. It is clear from figures 2 and 3 that piecewise linear regression provides a reasonably good fit to the cumulative infected case datasets. These fits, together with the change-point plots in figure 5, reveal structural changes in the epidemic trajectories with potentially relevant implications for the effectiveness of the various stages of the MCO implemented in 2020.
For the nationwide cumulative confirmed cases, our piecewise regression provides evidence that the MCO almost flattened the epidemic curve within one month of its introduction. At the state level, table 2 shows that almost all states and federal territories had decreasing (concave-downward) slopes over their first two segments. This indicates that the public health interventions contributed to the deceleration of the spread of COVID-19 in Malaysia.
Although there were promising signs that the MCO could curb the spread of COVID-19 in Malaysia until the third quarter of 2020, the clustering plot in figure 4 is useful as an early alert for states (or federal territories) with a potential surge in the number of infected cases as of the cutoff date (31 December 2020). This short-term forecasting shows promise in capturing the rapidly changing growth rate of COVID-19 in certain states, for instance Johor.
For the daily infection cases, table 3 shows that more than half of the change-points detected match their corresponding breakpoints in the piecewise regression. Almost all structural changes detected by the two approaches agree and give similar estimates of the timing of two to four abrupt transitions for each state and federal territory, as displayed by the connected dot plot in figure 6.
The segmentation of the COVID-19 outbreak data enables us to quantify the time window required by containment and mitigation measures such as the MCO to bring the epidemic growth rate from rapid to moderate and then to flatten the cumulative curve. This statistical modelling can be used not only to assess the direct effects of public health interventions such as the MCO on the course of the epidemic, but also in deterministic models, to identify the exact point (date) at which the parameters of concern should be calibrated.