Feature engineering based intelligent wireless propagation model for RSRP prediction

Wireless propagation models are of great significance for accurate 5G network deployment. Based on the data set provided by the Huawei ModelArts platform, this paper uses typical feature engineering methods to establish an intelligent wireless propagation model based on a back propagation (BP) neural network. The first step is feature design: 12 features are designed from the traditional Cost 231-Hata empirical model and from geometric location. Considering the coupling between features, dimension reduction is applied to the angle-related and distance-related features. The second step is data processing, including logarithmic transformation, one-hot coding and missing value processing. The third step is feature selection: the final feature set of the model is extracted using the standard deviation, the quantized value of the Pearson coefficient and scatter plots of each feature against the reference signal receiving power (RSRP). Finally, the prediction model is built and verified. To assess the designed features, three BP neural network models with the same network structure but different inputs are established and tested to predict the RSRP at different geographical locations. The experimental results verify the superiority of the designed feature set, and a suitable machine learning model for wireless propagation is obtained.


Introduction
The wireless propagation model predicts the radio wave propagation characteristics in the target communication coverage area, making it possible to estimate cell coverage, inter-cell network interference and communication rate [1], which is very important for operators to deploy 5G networks accurately [2][3].
According to the research method, existing wireless propagation models can be divided into empirical models, theoretical models and improved empirical models. The most representative ones are Cost 231-Hata, Okumura, Volcano and SPM [4][5][6]. Empirical models are often not sufficiently accurate in practice, so the empirical formula must be corrected by collecting a large number of engineering parameters and measured values of reference signal receiving power (RSRP).
Wireless LTE networks have become popular around the world, with billions of users generating vast amounts of data all the time. These data can be used to assist in wireless network construction [7]. In addition, big-data-driven machine learning technology has made great progress in recent years [8]. With the development of parallel computing architectures, machine learning also has the capability of online computing, and its high real-time performance and low complexity make it possible to integrate it with wireless communication [9].

Introduction to the data set
This paper uses the data set provided by the Huawei ModelArts platform, which includes a training set, a test set and a verification set, for training and testing the AI algorithm model. Because feature engineering mainly uses the training data set, the training data set is described in detail below.
The training data set contains a total of 4000 files. Each row of each file represents the data of a fixed-size test area in the cell. The number of rows per file ranges from 2220 to 3800, and the number of columns is fixed at 18. The first 9 columns are the engineering parameter data of the site, the middle 8 columns are map data recording topographic information, and the last column is the measured RSRP, which serves as the label for training. To facilitate data processing, the map is rasterized, and each grid represents a 5 m × 5 m area. Table 1 shows one row as an example:

Feature design
The essence of feature engineering is to convert the parameters of the original data into features that best represent the target problem and to keep the dynamic range of each feature relatively stable, thereby improving the training efficiency of the machine learning model.
In this paper, the feature design is carried out from two aspects: traditional Cost 231-Hata empirical model and geometric location.

Designing features based on Cost 231-Hata model
The parameters involved in the classic Cost 231-Hata model can be included in the scope of feature engineering. The model is defined as follows:

$$PL = 46.3 + 33.9 \lg f - 13.82 \lg h_b - \alpha + (44.9 - 6.55 \lg h_b) \lg d + C_m$$

where $PL$ is the propagation path loss (dB), $f$ is the carrier frequency (MHz), $h_b$ is the effective height of the base station antenna (m), $\alpha$ is the height correction term of the user antenna (dB), $d$ is the link distance (km), and $C_m$ is the scene correction term (dB).

Feature three: the height of the user receiver relative to the ground.
Considering that the user can only be on the ground or inside a building, the effective height of the user antenna on the grid (X, Y) does not exceed the height of the building there. The effective height of the user antenna is therefore corrected using a continuous variable whose value lies between 0 and 1.

Feature four: transmitter transmitting power.
According to the Cost 231-Hata model, RSRP is related to the path loss $PL$ by $RSRP = P_t - PL$, where $P_t$ is the transmitter transmitting power. The transmitter transmitting power is given by equation (6).
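The Cost 231-Hata calculation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the small/medium-city form of the user antenna correction, the default scene correction of 0 dB, the link budget without antenna gains, and all numeric inputs are assumptions chosen only for the example.

```python
import math


def cost231_hata_path_loss(f_mhz, h_b, h_m, d_km, c_m=0.0):
    """Path loss PL (dB) from the Cost 231-Hata formula.

    The user antenna correction below is the standard small/medium-city
    form, and c_m (dB) is the scene correction term. Both are assumptions
    made for this sketch.
    """
    lg = math.log10
    alpha = (1.1 * lg(f_mhz) - 0.7) * h_m - (1.56 * lg(f_mhz) - 0.8)
    return (46.3 + 33.9 * lg(f_mhz) - 13.82 * lg(h_b) - alpha
            + (44.9 - 6.55 * lg(h_b)) * lg(d_km) + c_m)


def predict_rsrp(p_t_dbm, path_loss_db):
    """Link budget without antenna gains: RSRP = P_t - PL (dBm)."""
    return p_t_dbm - path_loss_db


# Hypothetical inputs: 1800 MHz carrier, 30 m base station antenna,
# 1.5 m user antenna, 1 km link distance, 18.2 dBm transmit power.
pl = cost231_hata_path_loss(f_mhz=1800.0, h_b=30.0, h_m=1.5, d_km=1.0)
rsrp = predict_rsrp(p_t_dbm=18.2, path_loss_db=pl)
print(f"PL = {pl:.2f} dB, RSRP = {rsrp:.2f} dBm")
```

The formula shows directly why the logarithms of frequency, antenna height and distance, rather than the raw values, are natural candidate features.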

Designing features based on geometric location
Feature five: Euclidean distance.
The Euclidean distance $d_E$ between (X, Y) and (Cell X, Cell Y) can be used as a feature of the wireless propagation model:

$$d_E = \sqrt{(X - \mathrm{Cell\ X})^2 + (Y - \mathrm{Cell\ Y})^2}$$

Feature six: the spatial distance between the measurement point and the base station vertex.
Because the spatial distance between the measurement point and the antenna center point is complicated to calculate, the spatial distance between the measurement point and the highest point of the base station is used as an approximation, as shown in equation (8).

Feature seven: the relative height of the grid to the main signal line ∆h.
The main signal line can be regarded as the direction of maximum signal propagation power. The relative height between the grid and the main signal line qualitatively reflects the signal strength. The relative height ∆h of the measurement point with respect to the main signal line can therefore be taken as a feature of the wireless propagation model, which has the following two cases:

1) The target grid is in the direction of propagation of the main signal line. In the propagation direction of the main signal line, the relative height between the grid and the main signal line is calculated as follows:
2) The target grid is not in the direction of propagation of the main signal line. For a grid outside the propagation direction of the main signal line, the signal line propagating to the grid (called the sub-signal line) carries less power than the main signal line, so the signal strength is somewhat weakened. Regardless of the height of the measuring point, the distance between the grid in the propagation direction of the sub-signal line and the main signal line is calculated from three quantities. The first is the distance between the target grid and the main signal line in the Y-axis direction in the X-Y plane. The second, ∆h′, is the relative height between the main signal line and the grid that has the same X coordinate as the target grid but lies in the direction of propagation of the main signal line. The third, d′, is the link distance between the base station and the grid that has the same X coordinate as the target grid but lies in the direction of propagation of the main signal line.
Feature eight: vertical distance from the grid to the main signal line.
This distance is obtained from the point-to-line distance formula in three-dimensional space.
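The point-to-line distance in space can be computed with a cross product. The sketch below is only an illustration under stated assumptions: the antenna position, the direction of the main signal line and the grid coordinates are hypothetical values, and the main signal line is treated as an infinite straight line through the antenna.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def point_to_line_distance(point, line_point, line_dir):
    """Distance from `point` to the line through `line_point` along `line_dir`."""
    rel = tuple(p - a for p, a in zip(point, line_point))
    return norm(cross(rel, line_dir)) / norm(line_dir)


# Hypothetical geometry (metres): antenna 30 m above the origin, main
# signal line pointing horizontally along +X, grid on the ground.
antenna = (0.0, 0.0, 30.0)
direction = (1.0, 0.0, 0.0)
grid = (100.0, 40.0, 0.0)
dist = point_to_line_distance(grid, antenna, direction)
print(f"distance to main signal line: {dist:.1f} m")
```

Dividing by the length of the direction vector means the direction does not need to be normalized beforehand.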

Feature nine: the angle between the measuring point and the signal line in the horizontal direction.
According to the horizontal direction angle of the cell transmitter (Azimuth) and the coordinates (X, Y) and (Cell X, Cell Y), the angle between the measurement point and the signal line in the horizontal direction can be determined.

Feature ten: the angle between the measuring point and the signal line in the vertical direction.
The calculation formula of the angle is shown in equation (15).

Feature twelve: terrain and climate correction factor.
Weather conditions differ, so the air density, humidity and temperature differ, and the efficiency and speed of signal transmission differ as well. In addition, a building in the grid where the cell transmitter is located partly occludes the transmitted signal, depending on its height. These effects are expressed together as the terrain and climate correction factor, in which ∆ denotes the influence of other weather and terrain factors on the signal propagation process.

The space vector angle $\theta_3$ is the angle between the unit direction vector of the main signal line and the unit direction vector from the measurement point to the antenna.
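The angle between two direction vectors follows from their dot product. The sketch below is a hedged illustration: the axis convention (X east, Y north, Z up), the azimuth measured clockwise from north, the downtilt measured below the horizontal, the orientation of the second vector (from the antenna toward the grid) and all numeric values are assumptions, not definitions taken from this paper.

```python
import math


def main_signal_direction(azimuth_deg, downtilt_deg):
    """Unit vector of the main signal line (assumed axes: X east, Y north, Z up)."""
    az = math.radians(azimuth_deg)
    tilt = math.radians(downtilt_deg)
    return (math.sin(az) * math.cos(tilt),
            math.cos(az) * math.cos(tilt),
            -math.sin(tilt))


def unit(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def space_vector_angle(u, v):
    """Angle in degrees between two direction vectors."""
    cos_angle = sum(a * b for a, b in zip(unit(u), unit(v)))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


# Hypothetical geometry (metres): antenna 30 m high, grid 30 m to the east.
antenna = (0.0, 0.0, 30.0)
grid = (30.0, 0.0, 0.0)
beam = main_signal_direction(azimuth_deg=90.0, downtilt_deg=0.0)
to_grid = tuple(g - a for g, a in zip(grid, antenna))
theta_3 = space_vector_angle(beam, to_grid)
print(f"space vector angle: {theta_3:.1f} degrees")
```

A single spatial angle of this kind combines the horizontal and vertical angles into one feature, which is the motivation for reducing the angle-related features.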

Distance dimension reduction.
Considering that the measurement point is far from the base station, the spatial distance $d_S$ is approximately equal to the Euclidean distance $d_E$:

$$d_S \approx d_E \tag{20}$$

Therefore, only one of the two features is needed, and the spatial distance is selected as the feature.
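A short numerical check illustrates why the two distances are nearly interchangeable at long range. This is a sketch under assumptions: the spatial distance is modelled here as the Euclidean distance combined with a height difference between the top of the base station and the measurement point, and the coordinates and the 30 m height difference are hypothetical.

```python
import math


def euclidean_distance(x, y, cell_x, cell_y):
    """Planar distance between the grid (x, y) and the cell site."""
    return math.hypot(x - cell_x, y - cell_y)


def spatial_distance(x, y, cell_x, cell_y, height_diff):
    """3D distance, assuming a vertical offset of `height_diff` metres."""
    return math.hypot(euclidean_distance(x, y, cell_x, cell_y), height_diff)


# Hypothetical case: grid 1000 m from the site, 30 m height difference.
d_e = euclidean_distance(1000.0, 0.0, 0.0, 0.0)
d_s = spatial_distance(1000.0, 0.0, 0.0, 0.0, height_diff=30.0)
relative_gap = (d_s - d_e) / d_s
print(f"d_E = {d_e:.2f} m, d_S = {d_s:.2f} m, gap = {relative_gap:.4%}")
```

The relative gap shrinks as the horizontal distance grows, so the approximation is weakest for grids very close to the base station.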

Abnormal data processing
Abnormal data need to be cleaned in advance. The abnormal data listed in table 2 were identified by consulting the relevant literature and considering practical conditions.

One-hot coding for qualitative features.
One-hot coding uses one binary bit per category to indicate the presence or absence of a qualitative feature value. The data are encoded with the "OneHotEncoder" class of the scikit-learn "preprocessing" module.
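The idea behind one-hot coding can be shown without any library. The pure-Python sketch below is a stand-in that illustrates the encoding itself, not the OneHotEncoder class; the clutter index values are hypothetical examples of a qualitative feature.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one bit is set per sample
        rows.append(row)
    return categories, rows


# Hypothetical clutter index values for four grids.
clutter = [10, 2, 10, 5]
categories, encoded = one_hot_encode(clutter)
for value, row in zip(clutter, encoded):
    print(value, row)
```

Each category becomes its own column, so the model does not interpret the category codes as ordered magnitudes.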

Missing value filling and visualization.
As can be seen from the measurement point distribution map of a cell, RSRP is not measured everywhere in the cell. For areas without measurement points, that is, areas where RSRP is missing, the weak coverage decision threshold (-103 dBm) is used as the fill value. Taking the CSV file of one cell (No. 2302410) as an example, its measurement location distribution map is shown in figure 1, and the RSRP heat map after filling the missing values is shown in figure 2. In figure 1, the red area shows the distribution of the target grid measurement points, and the green point marks the location of the base station in the cell. In figure 2, the blue background corresponds to the weak coverage decision threshold (-103 dBm); the deeper the blue, the weaker the signal at the measurement point. The red area indicates strong RSRP; the deeper the red, the stronger the RSRP.
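The filling step can be sketched as follows. This is a minimal illustration assuming the measurements are stored as a mapping from grid indices to RSRP values; the 3 × 3 grid and the two measured values are hypothetical.

```python
WEAK_COVERAGE_THRESHOLD_DBM = -103.0


def fill_rsrp_grid(measurements, n_rows, n_cols,
                   fill_value=WEAK_COVERAGE_THRESHOLD_DBM):
    """Build a dense RSRP grid, using `fill_value` where nothing was measured."""
    grid = [[fill_value] * n_cols for _ in range(n_rows)]
    for (row, col), rsrp in measurements.items():
        grid[row][col] = rsrp
    return grid


# Hypothetical measurements: only two of the nine grids were tested.
measured = {(0, 1): -85.5, (2, 2): -97.0}
rsrp_grid = fill_rsrp_grid(measured, n_rows=3, n_cols=3)
for row in rsrp_grid:
    print(row)
```

The dense grid can then be passed to any plotting routine to draw a heat map of the kind described for figure 2.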

Feature selection
After feature design is completed, it is usually necessary to select the meaningful features as input for training the machine learning model. For features designed by different methods, whether each feature is appropriate must be judged from several perspectives.

Existing feature availability assessment
We evaluated the usability of the designed features from three aspects: data acquisition difficulty, computational difficulty, and universality, and screened out $\Delta h$ and $h_T$.

Divergence analysis
The divergence of a feature is one of the important criteria for feature selection. Divergence analysis is performed on the designed features using the standard deviation, as shown in table 3. The space vector angle $\theta_3$ is expressed in degrees and ranges from 20° to 180°, indicating that the measurement points are widely distributed around the cell.

Correlation analysis
The Pearson correlation coefficient Υ is widely used to measure the degree of correlation between two variables, with values between -1 and 1. For $n$ samples it is calculated as:

$$\Upsilon = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{Y_i-\bar{Y}}{\sigma_Y}\right)$$

where $\frac{X_i-\bar{X}}{\sigma_X}$, $\bar{X}$ and $\sigma_X$ are the standard score, mean and standard deviation of the samples $X_i$, respectively, and likewise for $Y_i$. The correlation between each designed feature and the target is calculated, quantized and sorted, as shown in table 4.

Table 4. Correlation between features and targets.

It can be seen from table 4 that the logarithm of the spatial distance has the most obvious correlation with the target.
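Both the standard deviation used for divergence analysis and the Pearson coefficient can be computed in a few lines. The sketch below uses the sample forms (division by n - 1) and a hypothetical set of five (log distance, RSRP) pairs; the numbers are illustrative only and are not taken from the data set.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson(xs, ys):
    """Pearson coefficient as the averaged product of standard scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_std(xs), sample_std(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


# Hypothetical samples: log10 of distance (m) and RSRP (dBm).
log_distance = [2.0, 2.3, 2.6, 2.9, 3.2]
rsrp_dbm = [-80.0, -86.0, -91.0, -99.0, -104.0]
r = pearson(log_distance, rsrp_dbm)
print(f"std = {sample_std(log_distance):.3f}, Pearson = {r:.4f}")
```

A feature with a near-zero standard deviation or a near-zero coefficient carries little information about the target, which is the basis for removing it.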

Feature correlation visualization
The correlation between the designed features and the target is visualized in figure 3. Figure 3(a) shows that, although the value of RSRP varies with the logarithmic frequency, the samples are concentrated near two logarithmic frequency values, so the correlation between the logarithmic frequency and RSRP is very small. Considering that its standard deviation is 0.001485 and its Pearson coefficient is 0.000339, the logarithmic frequency is deleted. In figures 3(b)-(d), the horizontal axes show that the three plotted features, two of which are height features, are variables whose dynamic ranges are relatively stable. The vertical axes show that, for a fixed value of each feature, RSRP still changes when other factors change, but the range of RSRP differs for different feature values. For the height feature, higher values correspond to more concentrated RSRP values. These three features are therefore correlated with RSRP.
In figures 3(e)-(f), the horizontal axes show that the logarithm of the spatial distance and the space vector angle $\theta_3$ are variables whose dynamic ranges are relatively stable. The vertical axes show that, for a fixed value of each feature, RSRP still changes when other factors change, but the range of RSRP differs for different feature values. As the logarithm of the spatial distance increases, RSRP shows a clear overall decrease, indicating a strong negative correlation with RSRP. As $\theta_3$ increases, RSRP decreases slightly, so $\theta_3$ has a weak negative correlation with RSRP.
In summary, the logarithm of the spatial distance has the most obvious correlation with the target, which is consistent with the result of the Pearson coefficient calculation.

Final feature set
The final selected feature set is shown in table 5:

Model establishment
Based on the feature set established above and the training data set provided by Huawei cloud platform, an AI-based wireless propagation model is established to predict the RSRP in different geographical locations.

Modeling method
The back propagation (BP) neural network, a multi-layer feedforward neural network trained with the error back propagation algorithm, is one of the most widely used neural networks [10][11]. Because it is easy to operate, has strong nonlinear mapping ability and is suitable for engineering problems, the BP neural network is selected to train the wireless propagation model.

Model parameter settings
To obtain a satisfactory wireless propagation model, we built and tested three BP neural network models. To show that the features designed through feature engineering are more reasonable and useful, all models use the same network structure (hidden layer size 100, output layer size 1, MSE loss function, learning rate 0.01). The model parameter settings are shown in table 6.
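A network of this shape can be sketched in pure Python. The code below is an illustrative sketch, not the model trained in this paper: the tanh activation, the uniform weight initialization, the per-sample gradient update and the synthetic training data are all assumptions. Only the hidden layer size (100), the single output, the squared-error loss and the learning rate (0.01) follow the settings stated above.

```python
import math
import random


class BPNetwork:
    """One-hidden-layer feedforward network trained by error back propagation."""

    def __init__(self, n_inputs, n_hidden=100, learning_rate=0.01, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(n_inputs)
        s2 = 1.0 / math.sqrt(n_hidden)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-s2, s2) for _ in range(n_hidden)]
        self.b2 = 0.0
        self.lr = learning_rate

    def _forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def predict(self, x):
        return self._forward(x)[1]

    def train_step(self, x, target):
        """One gradient step on the squared error of a single sample."""
        hidden, output = self._forward(x)
        error = output - target  # gradient of 0.5 * error^2 w.r.t. output
        for j, h in enumerate(hidden):
            # Error propagated back through the tanh unit (uses old w2[j]).
            grad_hidden = error * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= self.lr * error * h
            self.b1[j] -= self.lr * grad_hidden
            row = self.w1[j]
            for i, xi in enumerate(x):
                row[i] -= self.lr * grad_hidden * xi
        self.b2 -= self.lr * error
        return error * error


def mse(net, samples):
    """Mean squared error of the network over (features, target) pairs."""
    return sum((net.predict(x) - y) ** 2 for x, y in samples) / len(samples)


# Synthetic stand-in data: 5 scaled features and a normalized target.
rng = random.Random(1)
samples = []
for _ in range(20):
    x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    samples.append((x, -0.8 * x[0] + 0.2 * x[1]))

net = BPNetwork(n_inputs=5)
loss_before = mse(net, samples)
for _ in range(30):
    for x, y in samples:
        net.train_step(x, y)
loss_after = mse(net, samples)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Keeping the structure fixed and changing only `n_inputs` and the feature columns mirrors the comparison design described above, where the three models differ only in their inputs.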

Training results
Combined with the scores given by the Huawei online scoring system, the training results are shown in table 7. The first model is obtained by feeding the 17 original parameters as features into the officially provided baseline. The second model is trained with simply designed features (4 highly correlated calculated relative heights are used, for a total of 10 input features). The third model is trained with five carefully designed and selected features.
The scores of the three models are 16.7258, 15.7867 and 12.7246, respectively. The scores show that the features designed by typical feature engineering methods are superior and that a better model can be obtained more quickly, which verifies the importance of feature design. Finally, the third model is selected as the target intelligent wireless propagation model.

Summary and outlook
Based on the mechanism of propagation loss in communication, this paper uses typical feature engineering methods for feature design, data processing and feature selection to reduce the number of input parameters of the neural network model, which reduces the training cost and yields a more efficient model. This verifies the importance of feature design and reflects the superiority of the designed feature set. For convenience of calculation, many approximations and simplifications were made in the experiment, so the application scenarios of the final model may be limited.

Acknowledgments
This work was financially supported by the National Natural Science Foundation of China (Grant No. 71871160), National Key R&D Program of China (Grant No. 2017YFE0100900) and 2018 industrial Internet innovation and development project from MIIT (The construction of the industrial Internet platform test bed for manufacturing business process optimization by network collaboration).