Identification of Variables Affecting Levels of Salt Concentrations in Shatt Al-Arab Water Using Modified Kernel Principal Component Analysis

This paper presents a new treatment of the Gaussian function, which forms the basis for building the elements of the kernel matrix within the kernel principal component methodology that aims to reduce dimensionality and then determine the most influential variables. The approach also reduces the mathematical complexity that arises from multidimensionality, especially when the data suffer from non-linearity in the relationships among the variables. The paper treats the introductory (smoothing) parameter matrix H by adopting two types of matrices, namely H1 (diagonal) and H2 (hybrid diagonal). To demonstrate its usefulness, the method was applied to the phenomenon of salt concentrations in Shatt al-Arab water in Basra Governorate through a number of climatic variables, in order to identify the most influential ones. The modified Gaussian kernel (MGK) was used and compared with the traditional Gaussian kernel (TGK) under two methods of estimating the introductory parameter matrix H. The simulation results showed that MGK could not achieve better results than TGK for either type of matrix (H1 and H2) estimated by the two methods (NS-R and ROT). Nevertheless, MGK and TGK were consistent in identifying the climatic variables most associated with the rise in salt concentrations when the NS-R method was adopted, namely air temperature, minimum temperature, maximum temperature, and solar brightness. When the ROT method was adopted, MGK identified other variables, while the traditional method identified the same variables mentioned above.


Introduction
First of all, non-linearity is a genuine characteristic of the data of most phenomena, especially natural phenomena, which are marked by complexity and overlap in their relationships. Studying such phenomena under multidimensionality is therefore difficult, and many researchers have addressed highly non-linear problems with a variety of methods, including neural networks [1]. To reduce these problems and their complexities, non-linear high-dimensional data will be processed here by adopting kernel principal component analysis (KPCA), building a kernel matrix that depends mainly on the distances between the observed values and on the introductory parameter (h); the Gaussian function is adopted to obtain the kernel matrix. Various contributions have played a role in the development of these concepts. The study executed by [2] presented an arbitrary kernel principal component method combined with fuzzy FCM clustering for non-linear data processing, through which it achieved accurate classification of the training data.
Secondly, a study carried out by [3] pointed to the possibility of using principal component analysis and fuzzy clustering in evaluating air-quality monitoring stations. A study conducted by [4] used principal component analysis and cluster analysis to analyze the sediments in the Hammar Marsh water through a group of chemical elements; the paper was able to identify these factors and compare them with the environmental enrichment standard, and the results were consistent.
Besides, the study of [5] presented an analysis of the environmental factors affecting water quality in Erbil Governorate, especially the water of the Upper Zab, by adopting the principal component methodology. [6] presented a study on processing the higher dimensions of non-linear data using the KPCA methodology and compared the capabilities of the introductory parameter in forming the kernel matrix; the radial basis kernel function was adopted in estimating the kernel matrix.
Also, a study carried out by [7] sought a mechanism for processing cerebral nerve signals and improving the exploratory structure for the purpose of diagnosing disorders. The KPCA algorithm was used with the clustering algorithms K-means, FCM, Bayesian, and fuzzy maximum-likelihood estimation. The study highlighted the strength of the KPCA method in defining the variables, while FCM was used to improve the data-grouping structure. By the same token, [8] carried out a study comparing the PCA and KPCA methods in processing the higher dimensions of satellite images of the Shatt al-Arab and percolating channels in Basra Governorate and the surrounding areas. [9] presented a hybrid algorithm linking kernel principal component analysis with the Extreme Learning Machine (ELM) to address the wastewater problem from the point of view of environmental engineering and public health.
Moreover, a study performed by [10] presented high-dimensional non-linear data processing by adopting the KPCA method, where two types of kernel functions were used in estimating the kernel matrix; the experimental simulation showed that the larger the sample size, the greater the number of variables. [11] presented an approach contributing to non-linear data processing with kernel principal component analysis (KPCA); the objective was to use the inertia model with KPCA entropy to reach the KEPCA algorithm, which obtained more accurate results in forming the non-linear principal components.
In light of the foregoing, this paper contributes by proposing a kernel function that is compatible with all types of introductory parameter matrix. Two types of introductory parameter matrices were adopted (diagonal and hybrid diagonal). As for estimating the introductory parameter, two methods were adopted: the traditional normal-scale rule and the rule of thumb based on higher-order derivatives.

Methodology and Scientific Material
In this part of the study, general concepts of the kernel principal component analysis (KPCA) method are discussed.

Kernel Principal Component Analysis (KPCA)
Overlap and confusion in interpreting the relationships between variables, together with the high dimensionality of those variables, cause the data to behave non-linearly [12]. Under such conditions it is difficult to build a model through which the phenomenon can be simulated. To address and reduce the problem of high dimensionality, the KPCA method is used, since it can identify non-linear principal components along with the variables that have a significant impact [13].
Finding the principal components rests on the assumption that there is a non-linear transfer function φ(·) that maps the original high-dimensional data x ∈ R^q into a feature space F with fewer effective dimensions:

    φ : R^q → F,   x ↦ φ(x)                                      (1)

Assuming the mapped data are centered, the kernel covariance matrix in feature space can be estimated as [14]:

    C̃ = (1/n) Σ_{i=1}^{n} φ(x_i) φ(x_i)^T                        (2)

and the kernel principal components follow from its eigen decomposition

    λ̃ ṽ = C̃ ṽ                                                  (3)

where λ̃ are the eigenvalues and ṽ the eigenvectors. Since every solution ṽ lies in the span of φ(x_1), …, φ(x_n), it can be written as

    ṽ = Σ_{i=1}^{n} α_i φ(x_i)                                   (4)

where (α_i) are the coefficients of the function φ(·). Since φ(·) is a non-parametric function that cannot be evaluated directly, both sides of (3) are multiplied by φ(x_k)^T, which represents the transfer function of the sub-conditions within the low dimension, to obtain [15]:

    λ̃ ⟨φ(x_k), ṽ⟩ = ⟨φ(x_k), C̃ ṽ⟩,   k = 1, …, n               (5)

It is clear that calculating the eigenvalues within the space of the function φ(·) will be difficult, as calculations cannot be performed there easily. The problem is therefore addressed through the kernel matrix, taking advantage of the inner product

    K_{ij} = ⟨φ(x_i), φ(x_j)⟩                                    (6)

Substituting (2), (4), and (6) into (5) yields

    n λ̃ K α = K² α                                              (7)

and, assuming x_i and x_j are equivalent since they result from the same space, this simplifies to the formula for calculating the eigenvalues and eigenvectors through the kernel matrix [16]:

    n λ̃ α = K α                                                 (8)

where K is a symmetric square matrix of order (n × n). It allows the eigenvalues to be computed from the distances between the data points, and it is called the kernel (principal) matrix.
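As a sketch of the derivation above, the Python code below builds a Gaussian kernel matrix, centers it in feature space, and solves the eigenvalue problem n λ̃ α = K α. The bandwidth h is supplied directly; this is the traditional (TGK) form only, and the paper's modified kernel (MGK) is not reproduced here, since its exact formula is not recoverable from the text.

```python
import numpy as np

def gaussian_kernel_matrix(X, h):
    """Traditional Gaussian kernel: K_ij = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * h ** 2))

def kernel_pca(X, h, n_components=2):
    """Kernel PCA sketch: solve n*lambda*alpha = K_c*alpha on the centered kernel."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, h)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    vals = np.clip(vals, 0.0, None)              # clip tiny negative round-off
    # Normalize coefficients so the feature-space eigenvectors have unit norm
    alphas = vecs[:, :n_components] / np.sqrt(vals[:n_components] + 1e-12)
    scores = Kc @ alphas                         # kernel principal components
    ratios = vals / vals.sum()                   # contribution ratios
    return scores, ratios[:n_components]
```

The contribution ratios returned here correspond to the cumulative-contribution cutoffs discussed in the results tables.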

Kernel Density Estimator
Probability functions are among the basics of probability distributions; however, they differ according to the nature of the method, some being parametric and others non-parametric. The non-parametric approach is based on the principle that the functional form is unknown. The kernel density estimator f̂(x) therefore depends on the data and on a kernel function K [17], according to the formula

    f̂(x) = (1/(n h)) Σ_{i=1}^{n} K((x − x_i)/h)

The formula above is the one-dimensional kernel density estimator. If instead we assume a vector x = [x_1, x_2, …, x_q]^T of (q) dimensions, with (n) observations for each dimension, we obtain a kernel density estimator suitable for the multivariate case [18]:

    f̂(x) = (1/n) |H|^{−1/2} Σ_{i=1}^{n} K(H^{−1/2} (x − x_i))

where |H| is the determinant of the introductory parameter matrix, and (H) is a positive square matrix that can take several forms.
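A minimal Python sketch of the multivariate estimator above, assuming the standard multivariate Gaussian kernel and a positive-definite matrix H:

```python
import numpy as np

def mv_kde(x, data, H):
    """Multivariate KDE: f_hat(x) = (1/n)|H|^(-1/2) sum_i K(H^(-1/2)(x - x_i)),
    with K the standard multivariate Gaussian kernel."""
    n, q = data.shape
    detH = np.linalg.det(H)
    Hinv = np.linalg.inv(H)
    u = data - x                                   # differences x_i - x
    # quadratic forms u_i^T H^{-1} u_i for every observation
    m2 = np.einsum('ij,jk,ik->i', u, Hinv, u)
    kern = np.exp(-0.5 * m2) / (2.0 * np.pi) ** (q / 2.0)
    return kern.sum() / (n * np.sqrt(detH))
```

For a single data point at the origin with H = I, the estimate at x = 0 reduces to the Gaussian kernel's value at zero, which is a quick sanity check on the normalization.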
To reach an efficient density estimator, its consistency must be assessed with one of the consistency criteria, since the introductory methods depend mainly on how close the estimator is to the original function; among these criteria are MSE, MISE, and AMISE [19].

Estimation of the smoothing parameter matrix
Estimating the smoothing parameter matrix is one of the procedural aspects of obtaining the kernel matrix. It also contributes to determining the kernel density estimator and to creating a balance between bias and variance. Therefore, two methods were adopted in estimating and adjusting the introductory parameters: the traditional normal-scale rule and the rule of thumb based on higher-order derivatives. They are as follows:

Normal Scale Rule Method
The normal scale rule is one of the simple and clear methods for estimating the introductory parameter. It relies on minimizing the AMISE(f̂(x)) criterion, and its formulas are as follows:

    h = 1.06 σ̂ n^{−1/5}                                         (16)
    h_j = 1.06 σ̂_j n^{−1/5},   j = 1, …, q                      (17)

where σ̂_j is the sample standard deviation of dimension j.
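The per-dimension normal-scale bandwidths of formula (17) can be computed as below. The constant 1.06 is the standard normal-reference value; since the original formula is partly illegible, this exact constant is an assumption.

```python
import numpy as np

def ns_rule_bandwidths(X):
    """Normal-scale (NS-R) bandwidths, one per dimension:
    h_j = 1.06 * sigma_hat_j * n^(-1/5)   (cf. formula (17))."""
    n, q = X.shape
    sigma = X.std(axis=0, ddof=1)     # sample standard deviation per dimension
    return 1.06 * sigma * n ** (-1.0 / 5.0)
```

The resulting vector of bandwidths is what populates the diagonal introductory parameter matrix H1 in the NS-R analyses.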

The rule of thumb for higher-order derivatives
Using higher-order derivatives to determine appropriate formulas for adjusting the introductory parameter is one of the procedural approaches to the problem of high dimensionality (the curse of dimensionality), diagnosed by Bellman in 1961. It helps reduce the mathematical difficulties facing the formulas, and good introductory parameters are thereby reached [20]. If we assume (r) orders of higher derivatives and (v) the order of the Taylor expansion, a formula for the kernel density estimator can be reached through which the introductory parameter matrix is adjusted by the AMISE criterion. Expanding the bias with a second-order Taylor expansion of f(x − h u), using the kernel density properties (in particular ∫ u K(u) du = 0), gives the leading bias term; following the same method of derivation, the variance formula for the higher orders of derivatives is obtained [21]. The formula for the optimal smoothing parameter h_ROT then follows by treating the unknown quantity ∫ (∇^{(r+v)} f(x))² dx. Adopting the multivariate Gaussian kernel function, which is compatible with the ROT method [22, p. 91], and substituting its moments into the AMISE expression, we obtain the ROT formula created by Scott [23]:

    ĥ_j = σ̂_j n^{−1/(q+4)},   j = 1, …, q                       (22)

where σ̂_j represents the standard deviation of dimension j. Through this paper, two types of smoothing parameter matrices will be adopted [24]:
1. (H1), a diagonal matrix defined by the smoothing parameters (h_j) for each of the dimensions, with elements

    H1 = diag(h_1², h_2², …, h_q²)                               (23)

2. (H2), a hybrid diagonal matrix with smoothing parameters weighted by the pooled sampling variance (σ̂²_h), with elements

    H2 = diag(σ̂²_h h_1², σ̂²_h h_2², …, σ̂²_h h_q²)              (24)

To obtain the kernel matrix, the Gaussian kernel function is used in its traditional and proposed forms, as shown in the table below:
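A sketch of the two smoothing-parameter matrices described above. Scott's rule in `scott_rot_bandwidths` is the standard form; the exact weighting of the hybrid matrix H2 is not fully recoverable from the text, so the pooled-variance weighting used in `H2_hybrid` is an illustrative assumption.

```python
import numpy as np

def scott_rot_bandwidths(X):
    """Scott's rule of thumb: h_j = sigma_hat_j * n^(-1/(q+4))  (cf. formula (22))."""
    n, q = X.shape
    return X.std(axis=0, ddof=1) * n ** (-1.0 / (q + 4))

def H1_diagonal(h):
    """H1: diagonal smoothing matrix diag(h_1^2, ..., h_q^2)."""
    return np.diag(h ** 2)

def H2_hybrid(h, X):
    """H2 (assumed form): diagonal bandwidths weighted by the pooled
    sample variance, standing in for the paper's hybrid diagonal matrix."""
    pooled_var = X.var(axis=0, ddof=1).mean()   # pooled sampling variance
    return np.diag(pooled_var * h ** 2)
```

Either matrix can then be passed to a multivariate density or kernel-matrix routine wherever the introductory parameter matrix H is required.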

Experimental Methodology "Simulation"
It is important to mention that simulation is one of the significant software methods through which we try to imitate realistic systems. It is defined as "the process of building a system similar to the real system and then feeding it with information to reach results close to reality". In addition to the simulation method, artificial-intelligence algorithms and heuristic algorithms can be used to study complex systems [25]. Hence, the experimental simulation was carried out by taking the dimensions (q = 8, 15, 20, 30) with sample sizes (n = 20, 50, 100, 200), and we reached the following results:
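The simulation design can be sketched as follows, using the traditional Gaussian kernel and a cutoff on the cumulative contribution ratio to count retained components. The (q, n) grid below is a subset of the paper's grid, and the cutoff value of 0.80 is an assumption for illustration.

```python
import numpy as np

def retained_components(X, h, cutoff=0.80):
    """Number of kernel principal components needed to reach the cumulative
    contribution cutoff, using the traditional Gaussian kernel (a sketch)."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / (2.0 * h ** 2))
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one          # center in feature space
    vals = np.clip(np.sort(np.linalg.eigvalsh(Kc))[::-1], 0.0, None)
    ratios = np.cumsum(vals) / vals.sum()               # cumulative contribution
    return int(np.searchsorted(ratios, cutoff) + 1)

rng = np.random.default_rng(7)
results = {}
for q in (8, 15):              # subset of the paper's dimensions
    for n in (20, 50):         # subset of the paper's sample sizes
        X = rng.standard_normal((n, q))
        results[(q, n)] = retained_components(X, h=1.0)
```

A full replication would also require the modified kernel (MGK) and the two bandwidth rules, which are not reproduced here.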

Analysis for KPC by NS-R Method
The kernel principal components were analyzed according to the normal-scale method of estimating the introductory parameter h, held constant for all cases and forming the matrix H1; the kernel matrix was then estimated according to the traditional Gaussian function, formula (26), and the modified Gaussian function, formula (27), and the results indicated in table (3) were obtained. The results of table (3) summarize the analysis of the kernel principal components, whose methodology was based on estimating the diagonal introductory parameter matrix H1 by the normal-scale rule (NS-R). The results compare the kernel matrix K(x_i, x_j) computed with two formulas: the traditional Gaussian kernel (TGK), formula (26), and the modified Gaussian kernel (MGK), formula (27). The kernel principal components and the contribution ratios were obtained for each of the dimensions and sample sizes. It was evident that the modified formula could not achieve better results than the traditional formula except in certain cases: at q = 8 and n = 20, the MGK formula achieved a contribution ratio of (0.89703) with (4) components, while the TGK formula achieved a contribution ratio of (0.83497) with (3) components; at q = 15, 20, 30 with n = 50, MGK also achieved cumulative contribution ratios (0.89926, 0.88143, 0.89096) higher than TGK, while in the remaining cases TGK was preferred.

Analysis for KPC by ROT
The kernel principal components were analyzed according to the rule-of-thumb method for higher-order derivatives in estimating the smoothing parameter h, held constant for all cases, forming the diagonal matrix H1 defined by formula (23) and the hybrid matrix H2 defined by formula (24); the kernel matrix was then estimated according to the two definitions in formulas (26) and (27), respectively, and the results were obtained as in table (4). Source of results: Prepared by the researcher according to the outputs of Matlab V.2016.
The results of table (4) summarize the analysis of the kernel principal components, whose methodology was based on estimating the introductory parameter matrix through the rule of thumb for higher-order derivatives, with the hybrid matrix referred to in formula (24). The results show that the MGK formula achieved good results in the case of the diagonal matrix (H1): it achieved a cumulative contribution ratio of (0.89703) at q = 8, n = 20 with (8) components, while the TGK formula achieved a cumulative contribution ratio of (0.86321) with (4) components. At (q = 8, 15, 20, 30) and (n = 50), the MGK formula was also best in its cumulative contribution ratios, achieving (0.86020, 0.89926, 0.88143, 0.89096) with higher component numbers (8, 9, 10, 14), respectively, according to the dimensions (q). At (q = 15, 20, 30) and (n = 200), MGK was again best in its cumulative contribution ratios, achieving (0.89292, 0.89066, 0.89447) with relatively close numbers of components (10, 20, 21), respectively. In the case of the matrix (H2), however, the results showed that the MGK formula was not good, and there were cases in which results did not appear, as at (q = 15, n = 20), (q = 30, n = 50), and (q = 30, n = 100). From the above results, the MGK formula can be considered good in the case of the matrix (H1) estimated by the ROT method, but this cannot be generalized to the rest of the methods.

Practical Aspect
To uncover the importance of this method and achieve the goal of this paper, the study was applied to a significant natural phenomenon that affects life in all its forms and is considered one of the indicators of water and environmental sustainability, namely salt concentrations. It was represented by (8) climatic variables (air temperature X1, minimum temperature X2, maximum temperature X3, relative humidity X4, wind speed X5, amount of rain X6, amount of evaporation X7, solar brightness X8) for the Basra and Al-Fao stations on a monthly basis for the period (2010-2021). The data were obtained from the General Authority for Meteorology and Seismic Monitoring. Owing to the disparity between the results of the proposed formula and the traditional one on the experimental side, we decided to apply both formulas on the applied side, according to the smoothing parameter matrices (H1, H2).

Testing Suitability of the Data for Analysis
At this stage, the suitability of the climatic data for analysis is tested according to the KMO and Bartlett statistics; the results are shown in table (5). As table (5) discloses, the data are highly suitable for the analysis of the kernel principal components: the KMO measure recorded an appropriate ratio of (0.9733), and this was confirmed by Bartlett's sphericity test, whose Chi-Sq statistic recorded (2329.349), indicating the relevance and accuracy of the studied data at a significance level of (5%).

Table (5) Test Suitability of Data for Analysis

Table (6) makes it clear that the modified formula (MGK), according to the matrix (H1) estimated by the NS-R method, achieved a cumulative contribution ratio of (0.82508); this cut-off value identified (6) important principal components out of the (8) components obtained. The traditional formula (TGK), on the other hand, achieved a cumulative contribution ratio of (0.87680), a cut-off value that identified (2) components out of the (8). We can conclude here that although the MGK formula obtained a smaller cumulative percentage, it determined a cut-off value that included the largest number of components, which makes it possible to reduce the deletion of variables. The most influential variables were determined according to formula (28).

Table (6) illustrates the number of principal components achieved using the modified method by adopting the H1 matrix.
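Bartlett's sphericity statistic reported in table (5) can be computed from the sample correlation matrix as below; the chi-square form is the standard one, and the KMO measure is omitted for brevity.

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's test of sphericity:
    chi2 = -(n - 1 - (2q + 5)/6) * ln|R|, with df = q(q-1)/2,
    where R is the sample correlation matrix of the (n x q) data X."""
    n, q = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * q + 5) / 6.0) * np.log(np.linalg.det(R))
    df = q * (q - 1) // 2
    return stat, df
```

A large statistic (small determinant of R) rejects the hypothesis that the variables are uncorrelated, supporting the use of component analysis.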
Table (7) shows the saturation percentages for each variable in the resulting components. The MGK formula revealed the possibility of keeping the largest number of variables while excluding the unimportant ones. Both formulas retained the variables (X1, X2, X3, X8) with acceptable saturation proportions: (X1 = 0.45 & 0.48), (X2 = 0.45 & 0.47), (X3 = 0.44 & 0.48), and (X8 = 0.42 & 0.38). However, the TGK formula set the cut-off value at component Z2, while the MGK formula set it at Z6, which gives room to choose a larger number of variables. Table (7) also states the number of times each variable appears according to each formula.
As reflected in table (8), according to the MGK formula the variable X6, which stands for the amount of rain, was excluded; this is logical, because rain does not occur in summer and therefore cannot be studied throughout the year. Table (6) exhibits the degree of saturation for each variable according to its association with the (Z) matrix.

Table (8) Number of kernel principal components and explanatory ratios according to sample sizes and dimensions of variables, using the rule of thumb for higher derivatives in obtaining the introductory parameter
Table (9) summarizes the results of the analysis of the principal components according to the ROT method with the matrix (H1). The traditional formula (TGK) achieved a cumulative contribution ratio of (0.87680) and a cut-off value at component (2), meaning two components were chosen from among the (4) obtained. The MGK formula achieved a cumulative contribution ratio of (0.81841) in the same matrix, with a cut-off value at component (Z6), meaning (6) components were selected from among the (7) obtained. The most influential variables were determined according to formula (28).

Conclusions and recommendations
This study concludes from the foregoing that the modified formula (MGK) was not the best in all experimental and applied cases; the contribution ratios revealed that the TGK formula was the best in data processing, while the MGK formula retained the largest number of components and removed the unimportant variables. It is also concluded that the diagonal matrix (H1) gave better results than the hybrid diagonal matrix (H2) when analyzing the kernel principal components. The paper was able to identify the most influential climatic variables on salt concentrations, namely air temperature, minimum temperature, maximum temperature, and solar brightness.

The unknown quantity ∫ (∇^{(r+v)} f(x))² dx is treated using the normal distribution, taking advantage of the fact that its higher-order derivatives can be expressed through Hermite polynomials; Henderson et al. demonstrated in 2012 the possibility of finding a closed-form formula to estimate the introductory parameter on this basis.

Table (1) Gaussian Kernel function formulas
2. Set the introductory parameter h according to formulas (17) and (22).
3. Calculate the kernel functions K(x_i, x_j) using formulas (26) and (27).
4. Calculate the kernel matrix K_(n×n).
5. Compute the eigenvalues and eigenvectors of the matrix K.
6. Identify the principal components Z_i.
7. Determine the most influential variables as per formula (28).

