Modeling urban household income in Malaysia using parametric approach

The parametric approach is commonly used to estimate the underlying income distribution and thereby measure income inequality. A parametric estimate of the income distribution is useful because it makes inferences about the estimated inequality easier. Using household income survey data from 2007, 2009, 2012, and 2014, this study compares six candidate parametric distributions for representing urban household income in Malaysia. The candidates are the gamma, lognormal, Weibull, Dagum, Singh-Maddala, and generalized beta of the second kind distributions, which have two, three, or four parameters. The analysis shows that the generalized beta of the second kind distribution is the most adequate model for urban household income in Malaysia. Income inequality among urban households was then assessed on the basis of this distribution using the Lorenz curve and the Gini coefficient.


Introduction
Human life and survival depend on income [1]. Basic needs and services, such as food, lodging, schooling, and clothing, are obtained with the money earned [1]. Individual incomes vary, however, and the resulting disparity in society gives rise to income inequality. The consequences of income inequality have been extensively explored in the literature. A high level of income inequality, which arises from a widely dispersed income distribution, leads to sluggish economic growth in a country [2,3]. Furthermore, research on the relationship between income inequality and crime has found that income inequality is positively correlated with crime, including violent crimes like burglary and robbery [4,5]. As a result, efforts to reduce income inequality are critical to a country's overall well-being and economic growth.
The Department of Statistics, Malaysia (DOSM) carries out income distribution analysis through an official survey known as the Household Income Survey (HIS) [1]. The key goal of the HIS is to gather data on household income and social background in order to assess the economic wellbeing of the Malaysian population. The data and statistics obtained are then used to establish policies and economic plans for the country, especially for eradicating poverty and developing strategies for a more equal distribution of income. To measure income inequality among Malaysian households, the DOSM usually employs an empirical approach based on the Lorenz curve (LC) and the Gini index. The DOSM's approach, however, ignores the heavy-tailed property that occurs in the upper part of the income distribution [6]. The parametric approach is commonly used to overcome this issue when modeling income data, and it has several benefits. Instead of describing an entire curve, a parametric distribution is defined by only a few parameters [7]. Furthermore, such a distribution can capture the heavy-tailed property in the upper part of the income distribution, which makes further analysis of income inequality easier [8-12]. Kleiber and Kotz [7] provided a thorough overview of parametric income distribution models, including the Pareto, gamma, lognormal, Weibull, Dagum, Singh-Maddala, and generalized beta of the second kind (GB2) distributions. Some scholars have considered the semiparametric method, which combines the empirical distribution and the Pareto model to describe the lower and upper portions of the income distribution, respectively [13,14]. For Malaysian household income data, the parametric approach based on the Pareto and reverse Pareto distributions has been used to model the upper and lower tails, respectively, whereas the semiparametric approach has been used to describe the whole income distribution [6,15-20].
In the HIS, the strata are divided into urban and rural areas. An urban area is a gazetted area, together with its adjoining built-up areas, with a combined population of 10,000 or more [1]. Rural areas, on the other hand, are gazetted areas with a population of fewer than 10,000 people, together with nongazetted areas [1]. The objective of this research is to use a parametric method to describe the income distribution of urban households using HIS data from 2007, 2009, 2012, and 2014. The gamma, lognormal, Weibull, Dagum, Singh-Maddala, and GB2 distributions, which have two, three, or four parameters, were considered. Furthermore, this study used the LC and the Gini coefficient to measure income inequality among urban households based on the best parametric model.

Source of data and descriptive statistics
The monthly gross income of urban households in 2007, 2009, 2012, and 2014 was obtained from the DOSM and used in this research. The DOSM collects the data through the HIS, which takes place twice every five years. Table 1 gives the descriptive statistics for urban household income: the mean, median, standard deviation, minimum, maximum, skewness, and kurtosis. The mean and median in table 1 both show a growing pattern, indicating that urban household income rose overall from 2007 to 2014. Across the years, the minimum and maximum values lie in the ranges 59.17-375.00 and 70400.00-186892.00, respectively. These figures show that the range of urban household income in each year is extremely wide, suggesting that the data are highly dispersed. The variability of urban household income is also high in every year, indicating that the data are widely dispersed around the mean. All of the skewness coefficients are positive, meaning that the urban household income distribution is skewed to the right and does not follow a normal distribution. The high kurtosis coefficients show that the distributions of urban household income in all years are heavy-tailed.

The maximum likelihood estimator (MLE) was used to estimate the parameters of the models. All of the models are defined below, including the probability density function (PDF), the cumulative distribution function (CDF), and the estimation of parameters based on the MLE.
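The summary measures reported in table 1 can be computed along the following lines. This is only a sketch: the function name `describe` and the sample values are hypothetical, and the population (1/n) moment convention is an assumption, since the text does not state which estimators were used.

```python
import math
import statistics

def describe(incomes):
    """Descriptive statistics of a list of incomes.

    Central moments use the 1/n convention (an assumption, not taken
    from the study).
    """
    n = len(incomes)
    mean = statistics.fmean(incomes)
    m2 = sum((x - mean) ** 2 for x in incomes) / n
    m3 = sum((x - mean) ** 3 for x in incomes) / n
    m4 = sum((x - mean) ** 4 for x in incomes) / n
    return {
        "mean": mean,
        "median": statistics.median(incomes),
        "sd": math.sqrt(m2),
        "min": min(incomes),
        "max": max(incomes),
        "skewness": m3 / m2 ** 1.5,   # > 0 means right-skewed
        "kurtosis": m4 / m2 ** 2,     # equals 3 for a normal distribution
    }

# Hypothetical monthly incomes, for illustration only (not HIS data).
sample = [900, 1500, 2100, 2800, 3500, 4200, 5600, 7400, 12000, 45000]
stats = describe(sample)
```

A mean well above the median, positive skewness, and kurtosis above 3 are the same signs of a right-skewed, heavy-tailed distribution that the text reads from table 1.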
3.1.1. Gamma distribution.
The PDF and CDF of the gamma distribution are as follows [21]:

$$f(x) = \frac{x^{a-1} e^{-x/b}}{b^{a}\,\Gamma(a)}, \quad x > 0, \tag{1}$$

$$F(x) = \frac{\gamma(a, x/b)}{\Gamma(a)}, \tag{2}$$

where a and b are the shape and scale parameters, respectively, Г(•) denotes the gamma function, and γ(•, •) denotes the lower incomplete gamma function. The log likelihood function of the gamma distribution is

$$\ell(a, b) = -n a \ln b - n \ln \Gamma(a) + (a - 1) \sum_{i=1}^{n} \ln x_i - \frac{1}{b} \sum_{i=1}^{n} x_i. \tag{3}$$

The MLE for b is given by

$$\hat{b} = \frac{1}{n a} \sum_{i=1}^{n} x_i. \tag{4}$$

Substituting this expression for b into the log likelihood function in equation (3) leaves a function of a alone. This function was then maximized in R using an optimization function to obtain the MLE for a.
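The substitution step for the gamma distribution can be illustrated with a short sketch. It assumes the standard shape-scale parameterization, in which the scale estimate for a given shape a is the sample mean divided by a. A simple ternary search stands in for the R optimization routine used in the study, and the function names and simulated data are hypothetical.

```python
import math
import random

def profile_loglik(a, n, mean_x, mean_log):
    """Gamma log likelihood with the scale replaced by b = mean_x / a."""
    b = mean_x / a
    return n * (-a * math.log(b) - math.lgamma(a)
                + (a - 1) * mean_log - mean_x / b)

def fit_gamma(xs, lo=1e-3, hi=100.0, iters=200):
    """Maximize the profile log likelihood over the shape a.

    Ternary search is used here as a stand-in for a general optimizer;
    it relies on the profile log likelihood having a single maximum in a.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if profile_loglik(m1, n, mean_x, mean_log) < profile_loglik(m2, n, mean_x, mean_log):
            lo = m1
        else:
            hi = m2
    a_hat = (lo + hi) / 2
    return a_hat, mean_x / a_hat

# Simulated data (shape 2, scale 3), for illustration only.
random.seed(0)
data = [random.gammavariate(2.0, 3.0) for _ in range(5000)]
a_hat, b_hat = fit_gamma(data)
```

Because the scale is tied to the shape through the sample mean, the product of the two estimates always reproduces the sample mean, which is a convenient sanity check on any fitted gamma model.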

3.1.2. Lognormal distribution.
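The PDF and CDF of the lognormal distribution are as follows [21]:

$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right], \quad x > 0, \tag{5}$$

$$F(x) = \Phi\!\left(\frac{\ln x - \mu}{\sigma}\right), \tag{6}$$

where μ is the scale parameter, σ is the shape parameter, and Φ(•) is the CDF of the standard normal distribution. The MLEs for the parameters μ and σ are

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \ln x_i, \tag{7}$$

$$\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\ln x_i - \hat{\mu})^2}. \tag{8}$$

3.1.3. Weibull distribution.
The PDF and CDF of the Weibull distribution are as follows [21]:

$$f(x) = \frac{\alpha}{\beta} \left(\frac{x}{\beta}\right)^{\alpha - 1} \exp\!\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right], \quad x > 0, \tag{9}$$

$$F(x) = 1 - \exp\!\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right], \tag{10}$$

where β denotes the scale parameter and α represents the shape parameter. The Weibull distribution is also described as the stretched exponential distribution in the literature [22]. The log likelihood function of the Weibull distribution is

$$\ell(\alpha, \beta) = n \ln \alpha - n \alpha \ln \beta + (\alpha - 1) \sum_{i=1}^{n} \ln x_i - \sum_{i=1}^{n} \left(\frac{x_i}{\beta}\right)^{\alpha}. \tag{11}$$

To obtain the MLE for the parameters α and β, the log likelihood function in equation (11) was negated and the resulting function was minimized. This was done iteratively using the optimization function in R.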
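The two estimation routes described above, a closed form for the lognormal distribution and numerical minimization of the negated log likelihood for the Weibull distribution, can be sketched as follows. The sketch assumes the standard parameterizations. To stay within the standard library, it reduces the Weibull problem to a one-dimensional search over the shape, using the fact that the best scale for a given shape is available in closed form. This is a substitute for the general-purpose R optimizer used in the study, and all function names and simulated data are hypothetical.

```python
import math
import random

def fit_lognormal(xs):
    """Closed-form MLE: mean and 1/n standard deviation of the log incomes."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def weibull_neg_loglik(shape, scale, xs):
    """Negated Weibull log likelihood."""
    n = len(xs)
    loglik = (n * math.log(shape) - n * shape * math.log(scale)
              + (shape - 1) * sum(math.log(x) for x in xs)
              - sum((x / scale) ** shape for x in xs))
    return -loglik

def fit_weibull(xs, lo=0.05, hi=20.0, iters=100):
    """Minimize the negated log likelihood by ternary search over the shape.

    For a fixed shape, the minimizing scale is (mean of x^shape)^(1/shape).
    """
    def scale_for(shape):
        return (sum(x ** shape for x in xs) / len(xs)) ** (1 / shape)

    def objective(shape):
        return weibull_neg_loglik(shape, scale_for(shape), xs)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) > objective(m2):
            lo = m1
        else:
            hi = m2
    shape = (lo + hi) / 2
    return shape, scale_for(shape)

# Simulated data, for illustration only.
random.seed(1)
ln_data = [random.lognormvariate(8.0, 0.7) for _ in range(2000)]
# Note: random.weibullvariate takes (scale, shape) in that order.
wb_data = [random.weibullvariate(3.0, 1.5) for _ in range(2000)]

mu_hat, sigma_hat = fit_lognormal(ln_data)
shape_hat, scale_hat = fit_weibull(wb_data)
```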

3.1.4. Dagum distribution.
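Camilo Dagum proposed the Dagum distribution for modeling income data [23]. The PDF and CDF of the Dagum distribution are as follows:

$$f(x) = \frac{a p\, x^{a p - 1}}{b^{a p} \left[1 + (x/b)^{a}\right]^{p + 1}}, \quad x > 0, \tag{12}$$

$$F(x) = \left[1 + (x/b)^{-a}\right]^{-p}, \tag{13}$$

where a and p are the shape parameters and b is the scale parameter. There is no closed-form expression for the MLE of a, b, and p. The log likelihood function of the Dagum distribution was therefore maximized using the optimization function in R to produce the MLE for a, b, and p. The log likelihood function of the Dagum distribution is

$$\ell(a, b, p) = n \ln a + n \ln p + (a p - 1) \sum_{i=1}^{n} \ln x_i - n a p \ln b - (p + 1) \sum_{i=1}^{n} \ln\!\left[1 + (x_i/b)^{a}\right]. \tag{14}$$

3.1.5. Singh-Maddala distribution.
Singh and Maddala [24] proposed the Singh-Maddala distribution, which has received considerable attention in the income distribution literature and is also known as the Burr distribution [23]. The PDF and CDF of the Singh-Maddala distribution are as follows:

$$f(x) = \frac{a q\, x^{a - 1}}{b^{a} \left[1 + (x/b)^{a}\right]^{q + 1}}, \quad x > 0, \tag{15}$$

$$F(x) = 1 - \left[1 + (x/b)^{a}\right]^{-q}, \tag{16}$$

where a and q are the shape parameters and b is the scale parameter. The MLE for the parameters a, b, and q was calculated by maximizing the log likelihood function in R with the optimization function. The log likelihood function of the Singh-Maddala distribution is

$$\ell(a, b, q) = n \ln a + n \ln q + (a - 1) \sum_{i=1}^{n} \ln x_i - n a \ln b - (q + 1) \sum_{i=1}^{n} \ln\!\left[1 + (x_i/b)^{a}\right]. \tag{17}$$

3.1.6. GB2 distribution.
McDonald [25] introduced the GB2 distribution, which is generally regarded as an excellent description of income distributions. The PDF and CDF of the GB2 distribution are as follows:

$$f(x) = \frac{a\, x^{a p - 1}}{b^{a p}\, B(p, q) \left[1 + (x/b)^{a}\right]^{p + q}}, \quad x > 0, \tag{18}$$

$$F(x) = I_{z}(p, q), \quad z = \frac{(x/b)^{a}}{1 + (x/b)^{a}}, \tag{19}$$

where a, p, and q are the shape parameters, b is the scale parameter, and $I_{z}(p, q)$ is the regularized incomplete beta function. Note that B(p, q) = Г(p)Г(q) / Г(p + q).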
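The relationship between the three- and four-parameter models can be checked numerically. The sketch below assumes the standard parameterizations given by Kleiber and Kotz [7], under which the GB2 density reduces to the Dagum density when q = 1 and to the Singh-Maddala density when p = 1. The function names and parameter values are hypothetical.

```python
import math

def dagum_pdf(x, a, b, p):
    z = (x / b) ** a
    return a * p * x ** (a * p - 1) / (b ** (a * p) * (1 + z) ** (p + 1))

def dagum_cdf(x, a, b, p):
    return (1 + (x / b) ** (-a)) ** (-p)

def singh_maddala_pdf(x, a, b, q):
    z = (x / b) ** a
    return a * q * x ** (a - 1) / (b ** a * (1 + z) ** (q + 1))

def singh_maddala_cdf(x, a, b, q):
    return 1 - (1 + (x / b) ** a) ** (-q)

def beta_fn(p, q):
    """B(p, q) = Gamma(p) Gamma(q) / Gamma(p + q), computed on the log scale."""
    return math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))

def gb2_pdf(x, a, b, p, q):
    z = (x / b) ** a
    return (a * x ** (a * p - 1)
            / (b ** (a * p) * beta_fn(p, q) * (1 + z) ** (p + q)))

# Hypothetical parameter values and income level, for illustration only.
x, a, b, p, q = 3000.0, 2.5, 2500.0, 0.8, 1.7
same_as_dagum = math.isclose(gb2_pdf(x, a, b, p, 1.0), dagum_pdf(x, a, b, p), rel_tol=1e-9)
same_as_sm = math.isclose(gb2_pdf(x, a, b, 1.0, q), singh_maddala_pdf(x, a, b, q), rel_tol=1e-9)
```

Because the Dagum and Singh-Maddala models are special cases of the GB2 model, the maximized GB2 likelihood can never be lower than theirs. A criterion that penalizes the extra parameter is therefore needed to compare them fairly.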

Model selection criteria
The coefficient of determination (R²) and the Akaike information criterion (AIC) were used to find the best model for representing urban household income. Under an assumed distribution, R² measures the degree of association between the observed and theoretical probabilities [19]. An R² value close to 1 means that the assumed theoretical model fits the data well. The R² can be calculated using the following formula [19]:

$$R^2 = \frac{\sum_{i=1}^{n} \left[\hat{F}(x_i) - \bar{F}\right]^2}{\sum_{i=1}^{n} \left[\hat{F}(x_i) - \bar{F}\right]^2 + \sum_{i=1}^{n} \left[F_n(x_i) - \hat{F}(x_i)\right]^2},$$

where $F_n(x_i)$ is the empirical cumulative probability for the i-th household income, $\hat{F}(x_i)$ is the estimated cumulative probability for the i-th household income under an assumed theoretical model, and $\bar{F}$ is the average of the $\hat{F}(x_i)$.
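As an illustration of this criterion, the sketch below computes an R² of the form explained variation divided by explained plus residual variation, on the cumulative probability scale. That form is assumed to be the intended one. The empirical probabilities are taken as i/n for the sorted sample, which is also an assumption, and a lognormal fit to a small hypothetical sample serves as the theoretical model.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF via the standard normal CDF (expressed with erf)."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

def r_squared(emp, fit):
    """R^2 between empirical and fitted cumulative probabilities."""
    f_bar = sum(fit) / len(fit)
    explained = sum((f - f_bar) ** 2 for f in fit)
    residual = sum((e - f) ** 2 for e, f in zip(emp, fit))
    return explained / (explained + residual)

# Hypothetical monthly incomes, for illustration only (not HIS data).
incomes = sorted([900, 1500, 2100, 2800, 3500, 4200, 5600, 7400, 12000, 45000])
n = len(incomes)
emp = [(i + 1) / n for i in range(n)]   # assumed convention: F_n(x_(i)) = i/n

# Lognormal MLE (closed form) as the assumed theoretical model.
logs = [math.log(x) for x in incomes]
mu = sum(logs) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
fit = [lognormal_cdf(x, mu, sigma) for x in incomes]

r2 = r_squared(emp, fit)
```

The value equals 1 only when the fitted probabilities coincide with the empirical ones, and it falls below 1 as the residual term grows.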
The AIC estimates how much information a model loses relative to the other candidate models. The less information a model loses, the better its quality. Simply put, the model with the lowest AIC value is chosen. The AIC can be determined using the following formula [26]:

$$\mathrm{AIC} = -2 \ln L + 2k,$$

where L is the maximized likelihood and k is the number of parameters of the fitted model.
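A small worked example may help to show how the penalty term acts. The maximized log likelihood values below are hypothetical and chosen only for illustration; they are not results from this study.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2 * k

# (maximized log likelihood, number of parameters); hypothetical values.
candidates = {
    "lognormal": (-1050.0, 2),
    "dagum": (-1030.0, 3),
    "gb2": (-1029.5, 4),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this made-up case the four-parameter model has the highest likelihood, yet the three-parameter model is selected because the gain in log likelihood (0.5) is smaller than the penalty of 1 per extra parameter on the log likelihood scale.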

Parametric Lorenz Curve and Gini index
The LC is a graphical tool that can be used to measure income inequality in a target group [27]. The more the LC deviates from the 45° line of equality, the more unequal the income distribution [27]. The Gini index is the ratio of the area between the 45° line of equality and the LC to the area of the triangle below the 45° line of equality. The Gini index varies from 0 to 1, with 0 indicating perfect income equality and 1 indicating perfect income inequality [28]. The LC and Gini index are defined as follows [27,28]:

$$L(p) = \frac{1}{E(x)} \int_{0}^{F^{-1}(p)} x f(x)\, dx, \quad 0 \le p \le 1,$$

$$G = 1 - 2 \int_{0}^{1} L(p)\, dp,$$

where E(x) represents the mean of a specific distribution, F⁻¹ denotes its quantile function, and f(x) represents its PDF. Table 2 shows the parametric forms of the LC and Gini index for the parametric distributions investigated in this research.

Table 3 displays the parameter estimates, R², and AIC values for all of the models considered. The fit of each distribution to the urban household income data can be seen in figure 1. In table 3, the subscripts "a" and "b" mark the highest R² and the lowest AIC values, respectively. As seen in table 3, the GB2 distribution has the highest R² and the lowest AIC among the models considered. According to these findings, the GB2 distribution is the most appropriate model for representing urban household income. Furthermore, the R² values of the GB2 distribution were greater than 0.99 in all years, implying that the GB2 model explained more than 99% of the variation in the data, while the residual variation (less than 1%) was due to error and cannot be explained by the model.

Figure 2 shows the LC of urban households based on the GB2 distribution for all of the years considered. Table 4 summarizes the share of overall income held by three groups: the bottom 40% (B40), the middle 40% (M40), and the top 20% (T20).
Figure 2 and table 4 indicate modest but steady growth in the share of income received by the B40 group, from 14.85% to 16.89% over the period of study. The share of income received by the M40 group fluctuated between 35.84% and 36.99%, showing no consistent trend. The T20 group received 49.31% of total income in 2007, compared to 46.40% in 2014, signaling a small decline in its share of total income. Thus, it can be concluded that the income distribution of urban households changed during the period of study, based on the rise and decline in the income shares of the B40 and T20 groups, respectively. However, there was still a significant gap between the shares of income received by the T20 and B40 groups, reflecting income disparities among urban households in Malaysia.

Table 4 also reports the estimated Gini coefficients of urban households based on the GB2 model. The Gini coefficients displayed a downward trend, from 0.4343 in 2007 to 0.3966 in 2014. The 2007 value corresponds to the level of inequality in a hypothetical population in which 56.57% of urban households share the total household income equally while the remaining 43.43% receive none. By 2014 the index of income inequality had fallen, and the corresponding figures are 60.34% of urban households sharing the total household income and the remaining 39.66% receiving none. These results support the notion that the urban household income distribution has changed over time.
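The way the LC, the group income shares, and the Gini index follow from a fitted parametric model can be sketched numerically. The study itself works with the GB2 model, whose quantile function requires the inverse incomplete beta function. To stay self-contained, the sketch below uses the Dagum distribution as a stand-in, because its quantile function has a closed form. The parameter values and function names are hypothetical, and the closed-form Dagum Gini expression used for comparison is the standard one from the literature, not a result of this study.

```python
import math

def dagum_quantile(u, a, b, p):
    """Inverse of the Dagum CDF F(x) = [1 + (x/b)^(-a)]^(-p)."""
    return b * (u ** (-1 / p) - 1) ** (-1 / a)

def lorenz_curve(quantile, grid=100_000):
    """Lorenz ordinates L(k/grid), k = 0..grid, by the midpoint rule."""
    h = 1.0 / grid
    cum = [0.0]
    for k in range(grid):
        cum.append(cum[-1] + quantile((k + 0.5) * h) * h)
    total = cum[-1]            # numerical approximation of the mean E(x)
    return [c / total for c in cum]

def gini_from_lorenz(lorenz):
    """G = 1 - 2 * (area under the LC), trapezoidal rule."""
    grid = len(lorenz) - 1
    area = sum((lorenz[k] + lorenz[k + 1]) / 2 for k in range(grid)) / grid
    return 1 - 2 * area

def income_shares(lorenz):
    """Income shares of the bottom 40%, middle 40%, and top 20%."""
    grid = len(lorenz) - 1
    l40 = lorenz[round(0.4 * grid)]
    l80 = lorenz[round(0.8 * grid)]
    return {"B40": l40, "M40": l80 - l40, "T20": 1 - l80}

# Hypothetical Dagum parameters, for illustration only.
a, b, p = 3.0, 2500.0, 0.8
lc = lorenz_curve(lambda u: dagum_quantile(u, a, b, p))
gini_numeric = gini_from_lorenz(lc)
shares = income_shares(lc)

# Standard closed-form Gini index of the Dagum distribution, for comparison.
gini_closed = math.exp(math.lgamma(p) + math.lgamma(2 * p + 1 / a)
                       - math.lgamma(2 * p) - math.lgamma(p + 1 / a)) - 1
```

The income shares are read directly off the LC at p = 0.4 and p = 0.8. The scale parameter b cancels out of both the shares and the Gini index, so only the shape parameters affect measured inequality.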

Conclusion
This research has compared six candidate parametric distributions for representing urban household income in Malaysia: the gamma, lognormal, Weibull, Dagum, Singh-Maddala, and GB2 distributions. The dataset used in this study comprised the monthly gross income of urban households in 2007, 2009, 2012, and 2014. The GB2 distribution was found to be the most adequate model for representing urban household income. Further, this study used the LC and the Gini coefficient, both based on the GB2 distribution, to measure income inequality among urban households. According to the fitted LCs, over the period of study from 2007 to 2014 the B40 group's share of total urban household income increased marginally, while the T20 group's share decreased slightly. The M40 group's share of total urban household income, however, showed no discernible trend. Finally, the estimated Gini coefficients indicate that urban households in Malaysia experienced a declining trend in income inequality.