Dynamic real estate project cost prediction based on SIMCA-P

Taking real estate projects as case studies, this study explored the factors that influence dynamic cost control (contract term, project visa, project changes, etc.) and analyzed them numerically. The SIMCA-P software was used to perform partial least squares (PLS) regression on the actual cost, and a regression prediction model was built. The checked cost was compared with the cost forecast by the model to verify the reliability of the mathematical model for the residence projects. The results show that the explanatory capacity of the explanatory variables for the dependent variable reached 0.937, which means that all of these factors had significant impacts on dynamic cost control. The average error of the residence cost predicted by the model from actual data was 0.006582346, indicating high prediction accuracy and showing that the PLS regression method performed well in handling the multiple correlations among independent variables. Therefore, the PLS regression method could provide a good solution for predicting the dynamic cost of real estate projects.


Introduction
Dynamic cost is an important part of cost control and runs throughout the whole project. Good dynamic cost control helps enterprises manage capital, increase economic gains, maximize project profit and achieve precise cost control. In this study, dynamic cost was analyzed from three perspectives: project visa, project changes and contract terms.
Project overview: the Northern 2# plot project was taken as the study case. The plot covers a total area of 626,700 m² and belongs to a tourist resort of the city. The construction site takes up 11,948 m² and the buildings occupy 119,489.58 m². The project comprises 5- to 7-story residence buildings, a 2-story public building and underground garages.

Building the model
Many methods have been used to predict dynamic cost, such as gray correlation prediction, regression analysis, combination prediction and neural network prediction. Regression analysis is widely used, but in multiple linear regression the independent variables are often correlated with one another (multicollinearity). In dynamic cost prediction the sample data are usually limited, so the fitted models lack stability and produce large errors. The PLS regression method provides a solution to these problems. In this study, the SIMCA-P software was combined with the PLS regression method to predict dynamic cost. The data come from the dynamic cost database of the Northern 2# real estate project, and the residence building programs of that project are taken as the example. The variables used were the agreed charges in the contract, the project visa fee, the contract term and the project area, and the SIMCA-P software was used to analyze and fit the variables against the calculated costs. The statistics are shown in Table 1. The contract indicator in the table is the ratio of the actual contract charges to the building area, and the changing indicator is the ratio of the actual changing cost to the building area.

Selection of independent variables.
According to the features of dynamic cost control, three variables (project visa, project changes and contract term) were taken as the independent variables and the project's actual cost as the dependent variable. Multivariate linear regression was performed on the actual cost and the major indicators. Two issues are worth noting in the selection of independent variables. First, the major indicators selected in the aforementioned study should be taken into account, and the actual visa conditions of the Northern 2# project should be considered when determining the independent indicators. Second, the basic data should be adjusted to the actual engineering conditions of the Northern 2# project, and the quantifiability and measurability of each independent variable indicator should be fully considered. Given these two points and the major control indicators governing the change patterns of dynamic cost in the Northern 2# plot, the three independent variables chosen were the contract indicator (X1), the project change indicator (X2) and the contract term (X3). The building area served as the auxiliary calculation indicator for final accounting.
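The construction of the three independent variables can be sketched as follows. This is only an illustration of the ratios defined in the text: the record layout, the field names and the numbers are hypothetical and do not come from the Northern 2# database.

```python
from dataclasses import dataclass


@dataclass
class BuildingRecord:
    """One residence building; all field names are hypothetical."""
    contract_cost: float   # actual contract charges
    change_cost: float     # actual cost of project changes
    contract_term: float   # contract term, in whatever unit the database uses
    building_area: float   # building area in m^2


def to_indicators(rec: BuildingRecord) -> tuple[float, float, float]:
    """Return (X1, X2, X3) for one record."""
    if rec.building_area <= 0:
        raise ValueError("building area must be positive")
    x1 = rec.contract_cost / rec.building_area  # contract indicator
    x2 = rec.change_cost / rec.building_area    # project change indicator
    x3 = rec.contract_term                      # contract term, used directly
    return x1, x2, x3


# Made-up numbers, for illustration only.
example = BuildingRecord(contract_cost=2_400_000.0, change_cost=120_000.0,
                         contract_term=300.0, building_area=1_200.0)
print(to_indicators(example))  # (2000.0, 100.0, 300.0)
```

Dividing by the building area puts buildings of different sizes on a comparable per-square-metre scale before regression.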

Analysis of model parameters.
Two PLS models were fitted in the SIMCA-P software, with N=9 and N=14 observations respectively. Each model had three independent variables X and one dependent variable Y. Via model fitting, the PLS models for the residence dynamic costs were obtained, as shown in Figure 1.
Figure 1. Residence cost PLS model
Table 2 lists the parameter explanations of the model. The parameters of the model fitting are as follows:
• R2X - the percentage of the variation in the independent variables X explained by the model;
• R2X(cum) - the cumulative R2X up to a given component;
• Eigenvalue - the number of independent variables X multiplied by R2X;
• R2Y - the percentage of the variation in the dependent variable Y explained by the model;
• R2Y(cum) - the cumulative R2Y up to a given component;
• Q2 - the overall cross validation of the component, mainly used to verify the cross validity of R2;
• Limit - the threshold value of Q2; if the value equals 1, the component is insignificant;
• Q2(cum) - the cumulative Q2 of the extracted components. Unlike R2(cum), Q2(cum) is not a simple sum; it expresses the cross validity of the model built with m components (t1, t2, …, tm).
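How these fit statistics arise can be illustrated with a small sketch of single-response PLS (PLS1) fitted by the NIPALS algorithm. It is not the SIMCA-P implementation; it assumes autoscaled data, and the function names and the demonstration data are hypothetical.

```python
import math


def autoscale(values):
    """Center a list of numbers and scale it to unit sample variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pls1_summary(X, y, n_components):
    """Fit PLS1 by NIPALS and return one dict of statistics per component."""
    n, k = len(X), len(X[0])
    columns = [autoscale([row[j] for row in X]) for j in range(k)]
    E = [[columns[j][i] for j in range(k)] for i in range(n)]  # X residuals
    f = autoscale(y)                                            # y residuals
    ssx_total = sum(v * v for row in E for v in row)
    ssy_total = sum(v * v for v in f)
    summary, r2x_cum, r2y_cum = [], 0.0, 0.0
    for a in range(1, n_components + 1):
        # Weight vector: direction in X space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove the part of X and y explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        r2x = tt * sum(v * v for v in p) / ssx_total
        r2y = q * q * tt / ssy_total
        r2x_cum += r2x
        r2y_cum += r2y
        summary.append({"component": a, "R2X": r2x, "R2X(cum)": r2x_cum,
                        "Eigenvalue": k * r2x, "R2Y": r2y, "R2Y(cum)": r2y_cum})
    return summary


# Made-up data: columns are X1, X2, X3; y is the cost to be explained.
X_demo = [[2000, 100, 300], [2100, 120, 320], [1900, 90, 280],
          [2200, 150, 310], [2050, 110, 330], [1950, 95, 290]]
y_demo = [2.52, 2.68, 2.35, 2.88, 2.61, 2.43]
summary = pls1_summary(X_demo, y_demo, n_components=2)
for row in summary:
    print(row)
```

Because the score vectors of successive components are orthogonal, the cumulative R2X and R2Y are plain running sums, which is the sense in which R2(cum) is a simple accumulation.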
Based on the parameter definitions of the SIMCA-P model, the cross validity of the Y variable was analyzed. As Table 3 shows, when the second component was extracted, the Q2 was 0.979 and the Q2(cum) was 0.996, which means that the model reached the expected prediction accuracy and the partial least squares components could be extracted. Figure 2 shows the fitting histogram, which gives the results of the regression model fitting. The parameters show that the fitting results were satisfactory.
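For reference, the cross-validation statistics are commonly defined as below. This is the usual textbook form, stated here as an assumption about what the software reports: PRESS_a is the prediction error sum of squares from cross validation for component a, and SS_{a-1} is the residual sum of squares of Y after the previous component.

```latex
Q^2_a = 1 - \frac{\mathrm{PRESS}_a}{\mathrm{SS}_{a-1}},
\qquad
Q^2(\mathrm{cum})_m = 1 - \prod_{a=1}^{m} \frac{\mathrm{PRESS}_a}{\mathrm{SS}_{a-1}}
```

If this definition applies, the two reported values are mutually consistent: Q2 = 0.979 for the second component gives PRESS_2/SS_1 = 0.021, and Q2(cum) = 0.996 gives a product of 0.004, so the first component would have a ratio of about 0.004/0.021 ≈ 0.19. That figure is inferred from the reported numbers, not taken from the tables.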

Recognition of outliers in the model.
Outliers in the model data can be recognized from the score plot of t[1] against t[2]. Figure 2 shows the elliptical distribution for the residence cost analysis model; no observation fell outside the ellipse, so there were no outliers. Therefore, the model fitting effect was good and no change was needed. If observations fall outside the ellipse, they should be checked and removed according to the actual conditions.
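The ellipse check can be sketched as follows, assuming the ellipse is a Hotelling T2 confidence region for two uncorrelated score vectors. The function name and the scores are hypothetical, and the F critical value is passed in by the caller from a statistical table (about 4.737 for 2 and 7 degrees of freedom at the 95% level).

```python
def hotelling_outliers(t1, t2, f_crit):
    """Return indices of observations outside the T2 ellipse in the (t1, t2) plane.

    f_crit is the F critical value for (2, n - 2) degrees of freedom.
    The two score vectors are assumed to be uncorrelated, as PLS scores are.
    """
    n, a = len(t1), 2
    limit = a * (n ** 2 - 1) / (n * (n - a)) * f_crit
    m1, m2 = sum(t1) / n, sum(t2) / n
    var1 = sum((v - m1) ** 2 for v in t1) / (n - 1)
    var2 = sum((v - m2) ** 2 for v in t2) / (n - 1)
    flagged = []
    for i in range(n):
        d2 = (t1[i] - m1) ** 2 / var1 + (t2[i] - m2) ** 2 / var2
        if d2 > limit:
            flagged.append(i)
    return flagged


# Made-up scores for nine observations.
t1_demo = [-1.8, -1.1, -0.6, -0.2, 0.1, 0.4, 0.8, 1.0, 1.4]
t2_demo = [0.3, -0.5, 0.6, -0.2, 0.1, -0.4, 0.5, -0.6, 0.2]
print(hotelling_outliers(t1_demo, t2_demo, f_crit=4.737))  # []
```

An empty list corresponds to the situation described in the text, where every observation lies inside the ellipse.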

Correlation verification
To verify the linear correlation of the residence cost models, PLS correlation analysis was performed. Figure 3 shows the correlation between the Y-variable scores (u1) and the X-variable scores (t1) of the models. Whether X and Y are highly correlated is reflected in how closely the points scatter around the diagonal line; the spread of the points indicates the variability. As the figure shows, the residence model exhibits a strong linear correlation.

Verification of model overfitting
To check for model overfitting, 200 permutation sets were analyzed for the two models. Figure 4 shows the result for the residence model. The results show that the predictions of the original model were valid. The criteria for verifying the validity of the model were: all blue Q2 values on the left are lower than the original values on the right, or the blue regression line of Q2 intersects the vertical axis (on the left) at or below zero. The R2 value indicates the feasibility level of the model. When all green R2 values on the left are lower than the original values on the right, the original model is also proved valid.
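The idea behind this check can be sketched with a permutation test: the response is shuffled many times, the model is refitted each time, and the fit on shuffled data is compared with the fit on the original data. The sketch below is a simplification, not the SIMCA-P procedure: it uses a one-component PLS1 model and R2 only, and all names and data are hypothetical. Q2 would be handled the same way with a cross-validated fit.

```python
import math
import random


def autoscale(values):
    """Center a list of numbers and scale it to unit sample variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def r2y_one_component(X, y):
    """R2Y of a one-component PLS1 model fitted to autoscaled data."""
    n, k = len(X), len(X[0])
    cols = [autoscale([row[j] for row in X]) for j in range(k)]
    yc = autoscale(y)
    w = [sum(cols[j][i] * yc[i] for i in range(n)) for j in range(k)]
    t = [sum(cols[j][i] * w[j] for j in range(k)) for i in range(n)]
    tt = sum(v * v for v in t)
    if tt == 0.0:
        return 0.0
    q = sum(yc[i] * t[i] for i in range(n)) / tt
    ss_res = sum((yc[i] - q * t[i]) ** 2 for i in range(n))
    ss_tot = sum(v * v for v in yc)
    return 1.0 - ss_res / ss_tot


def permutation_check(X, y, n_permutations=200, seed=0):
    """Return the original R2Y and the R2Y of each model fitted to shuffled y."""
    rng = random.Random(seed)
    original = r2y_one_component(X, y)
    permuted = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)
        permuted.append(r2y_one_component(X, shuffled))
    return original, permuted


# Made-up data: columns are X1, X2, X3.
X_demo = [[2000, 100, 300], [2100, 120, 320], [1900, 90, 280],
          [2200, 150, 310], [2050, 110, 330], [1950, 95, 290]]
y_demo = [2.52, 2.68, 2.35, 2.88, 2.61, 2.43]
original_r2, permuted_r2 = permutation_check(X_demo, y_demo)
share = sum(r >= original_r2 for r in permuted_r2) / len(permuted_r2)
print(original_r2, share)
```

A small share of shuffled fits reaching the original R2 suggests that the original fit is not an artifact of chance.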

Correlation analysis between the independent variables and the dependent variables
By calling the command Plot/Lists→Lists in the software, we obtained the standard PLS model of the actual cost and the three independent variables:
Actual residence cost = 0.826639 × contract indicator × building area - 0.397194 × changing indicator × building area - 0.51684 × contract term (1)
The regression coefficient graph shows how the independent variables explain the actual cost.
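Equation (1) can be evaluated directly, as in the sketch below. The function and parameter names are hypothetical, the input values are made up, and the units are whatever the original data use; the text does not state whether the coefficients apply to raw or standardized variables, so the raw form is an assumption.

```python
def predicted_residence_cost(contract_indicator, changing_indicator,
                             contract_term, building_area):
    """Evaluate Equation (1) exactly as written in the text."""
    return (0.826639 * contract_indicator * building_area
            - 0.397194 * changing_indicator * building_area
            - 0.51684 * contract_term)


# Made-up inputs, for illustration only.
cost = predicted_residence_cost(contract_indicator=2000.0,
                                changing_indicator=100.0,
                                contract_term=300.0,
                                building_area=1000.0)
print(round(cost, 3))  # 1613403.548
```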

Importance of the independent variables in explaining the dependent variables
The VIP value reflects the importance of each independent variable in explaining the dependent variable. It is a weighted sum of squares of the PLS weights of each independent variable X. The sum of squares of all VIP values equals the number of terms in the model, so the average squared VIP is 1. If the VIP value exceeds 1, the X variable is considered "important", and if the value is below 0.5, the variable is considered "unimportant". The range between 0.5 and 1 is a gray area, where the importance level depends on the size of the data set. The VIP diagram is sorted from high to low and shows the confidence interval of the VIP values, usually at the 95% level.
Table 3, the VIP analysis table, shows the VIP values of each independent variable. The VIP value of the X1 variable in model M1 exceeds 1, and there is no large difference. In model M1, the difference between the X2 variable and the X3 variable is between 0.5 and 1, which means that the three factors, i.e. the project change, the project visa and the contract term, have a strong impact on the dynamic cost and are the major reasons for the changes in the operating cost. In other words, increases in these factors will increase the project target cost. In the residence VIP graph, the VIP value of the changing indicator is 0.486507, which is related to the large expenditure on underground reinforced concrete in the preliminary building stage.
Analysis of the models shows that a prediction model built with the PLS method handles the multicollinearity among independent variables well when predicting dynamic cost, and that the resulting model has strong prediction capacity.
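The VIP calculation and the thresholds described above can be sketched as follows, using the common textbook formula as an assumption about how the software computes VIP. The function names, the component weights and the explained sums of squares are all made up for illustration.

```python
import math


def vip_scores(weights, ssy_explained):
    """Compute VIP values.

    weights[a][j] is the weight of variable j in component a;
    ssy_explained[a] is the Y sum of squares explained by component a.
    """
    k = len(weights[0])
    total = sum(ssy_explained)
    vip = []
    for j in range(k):
        acc = 0.0
        for w, ssy in zip(weights, ssy_explained):
            norm_sq = sum(v * v for v in w)  # normalize each weight vector
            acc += ssy * (w[j] ** 2) / norm_sq
        vip.append(math.sqrt(k * acc / total))
    return vip


def classify(vip_value):
    """Apply the thresholds quoted in the text."""
    if vip_value > 1.0:
        return "important"
    if vip_value < 0.5:
        return "unimportant"
    return "gray area"


# Made-up weights for two components and three variables.
weights_demo = [[0.8, 0.0, 0.6], [0.0, 1.0, 0.0]]
ssy_demo = [0.9, 0.1]
vip_demo = vip_scores(weights_demo, ssy_demo)
print([round(v, 4) for v in vip_demo])
print([classify(v) for v in vip_demo])
print(sum(v * v for v in vip_demo))  # close to 3, the number of X variables
```

The last line illustrates the property stated in the text: the squared VIP values sum to the number of terms, so a variable with VIP above 1 contributes more than its equal share.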

Verification of the model results
Prediction model Equation (1) is used to predict the dynamic cost of the projects. By substituting the actual data into the equation, we obtain the difference between the predicted value and the actual value. The relative errors of the predicted values are shown in Table 4. Table 4 shows that the relative error of the predicted values is small for all engineering sites. Some items of the cost prediction may exceed the expected range, but the overall cost prediction error remains small, which supports the feasibility of the model.
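The error calculation can be sketched as below. The text does not state whether the average error is taken over signed or absolute relative errors; the sketch assumes absolute values, and the function names and numbers are made up.

```python
def relative_errors(actual, predicted):
    """Signed relative error of each prediction: (predicted - actual) / actual."""
    return [(p - a) / a for a, p in zip(actual, predicted)]


def mean_absolute_relative_error(actual, predicted):
    """Average of the absolute relative errors."""
    errors = relative_errors(actual, predicted)
    return sum(abs(e) for e in errors) / len(errors)


# Made-up costs, for illustration only.
actual_demo = [100.0, 200.0, 400.0]
predicted_demo = [101.0, 198.0, 404.0]
errors_demo = relative_errors(actual_demo, predicted_demo)
mare_demo = mean_absolute_relative_error(actual_demo, predicted_demo)
print(errors_demo, mare_demo)
```

Reporting the per-site errors alongside the average makes it visible when a few items exceed the expected range while the overall error stays small.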

Conclusions
Based on data on the dynamic cost, project changes, project visa and contract term of residence projects from the third quarter of 2018 to the fourth quarter of 2019, a PLS regression prediction model was built, and the following conclusions were drawn.
(1) The dynamic cost of the project is subject to the impacts of the contract indicator, the project change, the contract term and other factors. Correlation verification shows that these factors are highly correlated. The PLS method can solve the problem of multicollinearity.
(2) In the dynamic cost prediction model built with the PLS method, the explanatory capacity of the explanatory variables reaches 0.979, and the relative error of the predicted values is small for all engineering sites. This is a high accuracy, which means that the PLS regression method handles the multiple correlation problem among independent variables. Therefore, using the SIMCA-P software and the PLS regression method is a good solution for predicting the project's dynamic cost.