The use of intelligent technologies of Business Intelligence Platforms and Data Science and Machine Learning Platforms for monitoring the socio-economic indicators of the administrative districts of Moscow

The article discusses how the state of the social sphere in the administrative districts and city districts of Moscow can be studied from the repository of the Moscow Government open data portal using the intelligent technologies of Business Intelligence Platforms and Data Science and Machine Learning Platforms. It shows how machine learning technologies in business analytics platforms can be used to identify hidden patterns and so support informed management decisions.


Introduction
Currently, when data are collected and analyzed from the databases of state statistics services, open data portals and other sources, it is difficult to obtain high-quality, visual information. As a result, the information available to executive authorities and individual citizens is largely expert-based and heuristic in nature. There is therefore an objective need to use intelligent business intelligence (BI) platform technologies to monitor the socio-economic indicators of the administrative districts of Moscow [1,2].
Augmented analytics has become the main driver of development for BI platforms and for data science and machine learning platforms. Based on machine learning, it provides insight into very large volumes of data. It also includes natural language processing as a way to query data and to generate narratives that explain key drivers and charts. Today, up to half of all analytical queries in BI platforms are generated through search or natural language processing, or are generated automatically. Natural language processing and conversational analytics are spreading business intelligence among civilian professionals, including new classes of users, especially employees of front offices and city municipalities. Figure 1 shows a collection of visualizations created in the Power BI service.
Let us find "similar" objects among the 146 districts of Moscow. Searching for similar objects is one of the most common tasks in data analysis. We determine similarity from the characteristics of the data set "the number of families receiving subsidies in Moscow in various sections": the total number of families, low-income families, multi-child families, single-parent families, retired families, students' families and unemployed families, together with the number of requests to the open data portal. Clustering support in Power BI provides powerful analytical capabilities, and Power BI supports the k-means algorithm, the most popular clustering method. One of the most difficult tasks in clustering is determining the number of clusters (Figure 2). To facilitate this task, Power BI provides both automatic and manual control options. To check the results, we also cluster the Moscow districts using other clustering algorithms.
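The k-means step can also be illustrated outside Power BI. The following minimal Python sketch clusters a few hypothetical districts described by two subsidy counts. The feature values, the deterministic farthest-point initialization and the function names are assumptions made for the example; it does not reproduce the implementation used inside Power BI.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means with a deterministic farthest-point initialization."""
    # Assumption: the first point seeds the first centroid; every further
    # centroid is the point farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical districts: (families receiving subsidies, low-income families).
districts = [(120, 30), (130, 35), (125, 28), (900, 400), (950, 420), (880, 390)]

labels, centroids = kmeans(districts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]

# Comparing inertia for several cluster counts is one manual way to choose k.
for n_clusters in (1, 2, 3):
    lab, cen = kmeans(districts, n_clusters)
    print(n_clusters, round(inertia(districts, lab, cen), 1))
```

With real indicators the features would normally be standardized first, since k-means is sensitive to differences in scale between columns.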
As a tool for cluster analysis, we use the Kohonen neural network as implemented in the Deductor Studio analytical platform. Its advantage over other algorithms is the ability to analyze multidimensional data visually: similar objects fall into neighboring map cells (Figure 3).
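The property that similar objects fall into neighboring map cells can be illustrated with a very small self-organizing map. The sketch below is a one-dimensional Kohonen map in plain Python. The map size, learning-rate schedule, neighborhood width, linear initialization and sample data are all assumptions chosen for illustration; it is not the Deductor Studio implementation.

```python
import math

def best_cell(weights, x):
    """Index of the best-matching unit (the cell with the nearest weight vector)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_som(data, n_cells=4, epochs=50, lr0=0.5, sigma=1.0):
    """Train a 1-D Kohonen map (n_cells >= 2); returns one weight vector per cell."""
    dims = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dims)]
    hi = [max(row[d] for row in data) for d in range(dims)]
    # Assumption: linear initialization, with cells spread evenly from the
    # minimum corner to the maximum corner of the data.
    weights = [
        [lo[d] + (hi[d] - lo[d]) * i / (n_cells - 1) for d in range(dims)]
        for i in range(n_cells)
    ]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)  # learning rate decays linearly
        for x in data:
            bmu = best_cell(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood: cells close to the winner move more.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dims):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical two-feature objects forming two well-separated groups.
low = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
high = [(9.0, 10.0), (10.0, 9.0), (9.5, 9.5)]

weights = train_som(low + high)
low_cells = [best_cell(weights, x) for x in low]
high_cells = [best_cell(weights, x) for x in high]
print(low_cells, high_cells)
```

Objects from the same group land in the same or adjacent cells, while the two groups occupy opposite ends of the map.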

Visual data analysis using Business Intelligence Platforms
Clustering results obtained with Kohonen maps and k-means can be verified in the KNIME Analytics Platform (Figure 4), where it is easy to perform fuzzy c-means (FCM) clustering. Its distinguishing feature is that each data point is assigned to each cluster with a membership degree ranging from 0 to 1 [4-8]. Clusters are represented as fuzzy sets, and the boundaries between clusters are also fuzzy. The degree of membership is determined by the distance from the object to the corresponding centroids.
Correlations between the characteristics of our data set, which reflect the social status of residents of the city districts, are investigated using Power BI dashboards. These include a correlation matrix of the studied characteristics, a scatter chart of the population of the districts in each cluster against the number of requests from citizens to the open data portal, and a map of the city districts (Figure 6).
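The fuzzy c-means membership rule described earlier in this section, in which the degree of membership depends on the distances to the centroids, can be sketched as follows. The code uses the standard FCM membership formula, u_j = 1 / sum over k of (d_j / d_k)^(2 / (m - 1)), written in plain Python; it is not KNIME's implementation. The fuzzifier m = 2 and the example centroids are assumptions.

```python
import math

def fcm_memberships(point, centroids, m=2.0):
    """Membership degrees of one point in every cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)  # m > 1 is the fuzzifier (assumed value: 2)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


# Hypothetical cluster centers and objects.
centroids = [(0.0, 0.0), (10.0, 0.0)]

near_first = fcm_memberships((2.0, 0.0), centroids)
midway = fcm_memberships((5.0, 0.0), centroids)
print(near_first)
print(midway)  # [0.5, 0.5]
```

An object close to one centroid receives a membership near 1 for that cluster, whereas an object halfway between two centroids is shared equally, which is what makes the cluster boundaries fuzzy.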
For more in-depth data research, we use the Qlik Sense platform, which combines free-form associative exploration, provides context-sensitive suggestions and automatically constructs [...]
To identify key factors of influence, we use the corresponding visual in Power BI, which helps to understand which factors affect the metric under study (in our case, the number of the cluster to which a Moscow district belongs). Machine learning in Power BI helps to explore data: it performs deep analysis to search for patterns automatically, presents the results clearly and predicts outcomes. Using machine learning (a linear regression model), the factors of influence are ranked from the most to the least significant, and each is accompanied by a probability indicator and a text description explaining its influence. For example, for districts in cluster 4 (Figure 8), the most significant factor is the number of students' families receiving the subsidy.
For the districts of cluster 0, the most significant factor is the number of unemployed families receiving subsidies. For cluster 1, it is the number of single-parent families receiving subsidies. For cluster 3, it is the number of students' families receiving subsidies. For cluster 4, it is the number of families receiving subsidies. For cluster 5, it is the number of unemployed families receiving subsidies.
One of the fastest ways to get an answer from data is to ask a question in natural language. The Q&A feature in Power BI provides this capability. Q&A is interactive: one question often leads to another, because the visualizations reveal interesting paths to pursue. In the Power BI service, a dashboard contains tiles pinned from one or more datasets, so questions can be asked about any data contained in those datasets. Q&A recognizes the words entered and determines in which dataset to find the answer. It also helps to form a question through autocompletion, restatement and other textual and visual aids. The answer is displayed as an interactive visualization and is updated as the question changes.

Application of machine learning technologies
There is a close correlation between the number of families receiving subsidies and the number of requests from citizens to the open data portal (Figure 9). To identify hidden patterns in the data and automate the main machine learning tasks, we use the customizable artificial intelligence platform H2O Driverless AI. The platform provides automatic feature engineering, model validation and tuning, model selection and deployment, machine learning interpretability, custom recipes for model building, time-series and text processing, and automatic generation of pipelines for model scoring.
The best model in our experiment is LightGBM, a gradient boosting framework developed by Microsoft that uses decision-tree-based learning algorithms. It was specifically designed to reduce memory usage, increase training speed and improve efficiency. Like XGBoost, it is one of the best gradient boosting implementations available [9-13]. It is also used to fit Random Forest models inside Driverless AI. The results are shown in the experiment summary at the bottom right of the experiment page.
Once the best predictive model is found, we can interpret it. The MLI dashboard presents various types of explanations of the model and its results [14-16]. All charts on the dashboard are interactive. The numbers of single-parent families, unemployed families, students' families and multi-child families receiving subsidies are the most important features in the model for predicting the number of requests from citizens to the open data portal, although their contributions to the model differ. Because of their high correlation with the output variable, the following indicators do not take part in forming the model: the total number of families, retired families and low-income families receiving subsidies.
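The correlation between individual indicators and the output variable, discussed above, can be checked with a few lines of code. The sketch below computes the Pearson coefficient between each feature and the target in plain Python and flags features whose absolute correlation exceeds a threshold. The numbers, the feature names and the 0.95 threshold are hypothetical; H2O Driverless AI applies its own internal checks, which are not reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical data: requests to the portal and two subsidy indicators.
requests = [10, 20, 30, 40, 50]
features = {
    "total_families": [100, 200, 300, 400, 500],
    "student_families": [5, 9, 4, 12, 8],
}
THRESHOLD = 0.95  # assumed cut-off for "very high" correlation

correlations = {name: pearson(values, requests) for name, values in features.items()}
flagged = [name for name, r in correlations.items() if abs(r) > THRESHOLD]
print(flagged)  # ['total_families']
```

In this toy data the total number of families is an exact multiple of the target, so it is flagged, while the students' families indicator correlates only moderately and is kept.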
Additionally, H2O Driverless AI allows users to download automatically generated documents, such as the experiment report ("Download experiment report") and the "MLI Report", with a single click. The developed model can be used to forecast the number of requests from citizens to the Moscow Government portal.

Conclusion
The results confirm the hypothesis of a positive correlation between the number of families receiving subsidies in Moscow, in various sections, and the number of requests from citizens to the Moscow Government open data portal. This indicates that it is possible to influence how active Moscow citizens participate in the life of the city. The objective information obtained can be used to develop a strategy and make decisions on the city's development.

Acknowledgment
The work is partially supported by RFBR grant No. 18-413-770006.