Table of contents

Volume 1727

2021


Big Data and AI Conference 2020 17-18 September 2020, Moscow, Russian Federation

Accepted papers received: 03 December 2020
Published online: 19 January 2021

Preface

011001
The following article is Open access

The 2020 Big Data and Artificial Intelligence Conference was successfully held on September 17-18, 2020 online due to COVID-19 restrictions. The conference is devoted to current challenges in Big Data analytics & AI and comprises three tracks: business, technology and science.

There were 3 major sections of the scientific track of the conference:

-Cluster analysis

-Applied systems for data analysis

-Natural Language Processing

This volume of Journal of Physics: Conference Series (JPCS) is a compilation of the papers accepted for the Big Data and AI Conference 2020 and represents the contributions presented at the conference.

On behalf of the organizing committee, I would like to thank all of the conference sponsors, partners and volunteers who made the conference possible.

Looking forward to meeting you during Big Data and AI Conference 2021.

On behalf of the organizing and program committees of the Big Data and AI Conference 2020,

Igor Balk, Big Data and AI Conference 2020 co-chair.

011002

All papers published in this volume of Journal of Physics: Conference Series have been peer reviewed through processes administered by the Editors. Reviews were conducted by expert referees to the professional and scientific standards expected of a proceedings journal published by IOP Publishing.

Type of peer review: Single-blind

Single-anonymous: authors' identities are known to the reviewers, reviewers' identities are hidden from authors


Criteria used by reviewers when accepting/declining papers:

1. Does the paper address relevant scientific questions within the scope of the conference?

2. Does the paper present novel concepts, ideas, tools, or data?

3. Are the results sufficient to support the interpretations and conclusions reached?

4. Is the overall presentation well structured and clear, and the language fluent and precise?

5. Do the authors give proper credit to related work and clearly indicate their own new/original contribution?

Resubmission is allowed.

Conference submission management system:

Easychair.org and ai-conf.org

Number of submissions received:

32

Number of submissions sent for review:

32

Number of submissions accepted:

20

Acceptance rate (number of submissions accepted / number of submissions received × 100):

62.5%

Average number of reviews per paper:

2

Total number of reviewers involved:

4

Any additional info on the review process (e.g. plagiarism check system):

Plagiarism check was performed on all accepted papers.

Contact person for queries:

Igor Balk Global Innovation Labs, USA (science [at] ai-conf.org)

Papers

012001

In the current work we propose a method for extracting regimes from time series using unsupervised learning. The proposed method is based on a neural network with a variational autoencoder architecture and on clustering in the latent space. The method was validated by extracting regimes from a steam turbine telemetry data set and from human activity recognition data, which suggests that the proposed approach can extract regimes from time series obtained in different domains.
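The second step of the pipeline described above, clustering in the latent space, can be illustrated with a minimal sketch. The latent codes below are synthetic stand-ins (the paper obtains them from a trained variational autoencoder), and the plain k-means routine is a generic choice, not necessarily the clustering algorithm the authors use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for VAE latent codes: two synthetic "regimes" in a 2-D latent space.
latent = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[2.0, 0.0], scale=0.3, size=(50, 2)),
])

def kmeans(x, k, iters=20):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(latent, k=2)
# Each cluster id corresponds to one candidate operating regime of the time series.
```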

012002

In this paper the authors deal with the task of reliably recovering a hidden multi-dimensional model parameter from indirect observations of a process. Such a task is known as an inverse problem. Many inverse problems have practical value, for example in seismic wave propagation and low-dose tomography. To solve such problems in practice, this article proposes an approach based on running many simulations of the corresponding forward problem and using the resulting simulation data as the training dataset. Most physical processes have computer models that generate precise results, and the existing simulators predict the process output from the input parameters. A difficulty in solving most inverse problems is that the solution is sensitive to variations in the data, which is referred to as ill-posedness. From the broad spectrum of methods for overcoming ill-posedness, the authors use a machine learning model trained on specially simulated data. The paper describes a deep network model that uses regularization. The key idea is to use a Generative Adversarial Network (GAN) to generate correct input parameter values and to support uniqueness of the solution. This network is trained on parameter examples that are real solutions of the inverse problem. A small manually built dataset is automatically transformed into an unlimited one by the GAN. The augmented dataset feeds the simulator, whose output is used to train the deep learning network. The network has regularization layers to support stability. The paper describes the details of this model, which uses deep augmentation to solve inverse problems, on a simple example: the task of throwing a heavy ball at an angle to the horizon, taking into account the force of air friction.
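The closing example, a ball thrown at an angle with air friction, has a cheap forward simulator of the kind that such approaches sweep to generate training data. A minimal sketch, assuming a linear drag law (the paper does not specify its drag model):

```python
import numpy as np

def throw(v0, angle_deg, k=0.1, g=9.81, dt=1e-3):
    """Forward model: simulate a ball thrown at angle_deg with speed v0
    under gravity and linear air drag (acceleration = -k * velocity);
    return the horizontal range."""
    angle = np.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = v0 * np.cos(angle), v0 * np.sin(angle)
    while y >= 0.0:
        ax, ay = -k * vx, -g - k * vy
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
    return x  # horizontal distance when the ball returns to the ground

# The inverse problem is to recover (v0, angle) from an observed range;
# a training set for a learned inverse is built by sweeping this forward model.
d = throw(20.0, 45.0)
```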

012003

This paper describes a solution to the computer vision task of multiclass fire segmentation: determining and displaying the locations of red, yellow and orange flame. We use the UNet model as the best open-source convolutional neural network baseline. Based on this model we introduce the UUNet-concatenative and wUUNet models. Since the multiclass fire segmentation task is addressed here for the first time, we collect an appropriate dataset and align the dataset labeling via look-up tables. We also compare models trained with the Soft Dice and Jaccard indexes combined with binary cross-entropy as loss functions. The paper highlights the problem of accuracy loss at the boundary nodes when splitting the frame, and as a solution we introduce combinational methods over partially intersected areas. The models and calculation schemes used are compared, and the corresponding conclusions of the investigation are drawn.

012004

A description is given of a new method for the collective solution of local problems: the method of evolutionary coordination of solutions, based on an original use of genetic algorithms. Interaction rules developed on this basis coordinate the work of intelligent agents (actors). Based on the Rasch model, an absolute scale is introduced for measuring the intellectual power of actors and the cost of intellectual labor when solving local problems with a given probability of correct solution. The unit of measurement for these values, 1 INT, is introduced and justified. A number of theorems are presented that substantiate a new procedure for obtaining a collective solution (mesing), which both increases the intellectual power of a committee of neural networks by 150 times compared with a single neural network in the committee, and reduces the probability of an erroneous decision to zero under certain conditions. As a result of the committee's work, either the correct decision is formed, or the answer "no solution has been found" is returned with a low probability of an erroneous decision.

012005

One of the most important tasks of any platform for big data processing is storing the received data. Different systems impose different requirements on big data storage formats, which raises the problem of choosing the optimal storage format for the problem at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them.
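One simple experiment in the spirit of the evaluation above is to serialize the same table in several formats and compare file sizes. The sketch below uses only stdlib formats (CSV, gzipped CSV, JSON); the big data formats the paper likely covers, such as Parquet or Avro, require third-party libraries, so this is an illustration of the methodology rather than a reproduction of the paper's benchmark:

```python
import csv
import gzip
import io
import json

# A small synthetic table with repetitive columns, as a stand-in dataset.
rows = [{"id": i, "value": i * 0.5, "label": f"item{i % 10}"} for i in range(1000)]

def csv_bytes(rows):
    """Serialize the rows to CSV and return the encoded bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode()

sizes = {
    "csv": len(csv_bytes(rows)),
    "csv.gz": len(gzip.compress(csv_bytes(rows))),
    "json": len(json.dumps(rows).encode()),
}
# JSON repeats every key in every row, so it is the largest; the repetitive
# columns compress well, so gzipped CSV is the smallest of the three.
```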

012006

The article discusses symbolic regression methods as a machine learning technology. The technique is tested on a complex control system synthesis problem. A new type of control based on changing the position of a stable equilibrium point is proposed. The implementation of such control requires the construction of a double feedback loop. The inner loop ensures the stability of the control object relative to some point in the state space, while the outer loop provides optimal control of the position of the stable equilibrium point. To implement the control, symbolic regression methods are used as machine learning technologies. It is shown that such a control is the least sensitive to external disturbances and model uncertainties.

012007

An approach to two-stage classification is considered, with the 1-SVM classifier as the main classifier and the RF classifier as the auxiliary one. The proposed approach improves classification quality on imbalanced datasets. Results are presented of a comparative analysis between the proposed approach and an alternative two-stage approach in which a binary SVM classifier serves as the main classifier and the RF classifier as the auxiliary one.

012008

On the basis of diffusion theory, we suggest a model for forecasting events in news feeds that uses the stochastic dynamics of changes in the structure of non-stationary time series in news text clusters (states of the information space). Forecasting events in a news feed is based on their text description, vectorization, and finding the cosine of the angle between the given vector and the centroids of the various semantic clusters of the information space. Changes over time in the cosine of the angle between this vector and the centroids can be represented as a point wandering on the [0,1] segment. The segment contains a trap at the event-occurrence threshold point, into which the wandering point can fall over time. We consider the probability patterns of transitions between states of the information space. We derive a nonlinear second-order differential equation, then formulate and solve the boundary value problem of forecasting news events, obtaining a theoretical time dependence for the probability density function of the parameter distribution of the non-stationary time series that describe the evolution of the information space. Simulations of the time dependence of the event probability (with model parameter values determined experimentally for events that have already occurred) show that the model is consistent and adequate. Experimental verification of the proposed model was carried out on a corpus of texts written in Russian.
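The measurement step described above reduces to tracking the cosine between a document vector and cluster centroids over time. A minimal sketch of that step, with synthetic vectors standing in for the text embeddings and an assumed threshold value (the paper derives the event dynamics from a diffusion model, which is not reproduced here):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Centroid of one semantic cluster, and a stream of document vectors
# drifting toward that cluster over time.
centroid = np.array([1.0, 0.0, 0.0])
docs = [np.array([0.2, 1.0, 0.0]),
        np.array([0.6, 0.8, 0.0]),
        np.array([0.95, 0.2, 0.0])]

trajectory = [cosine(d, centroid) for d in docs]  # a point wandering on [0, 1]
event_threshold = 0.9  # the "trap": an event is flagged once the point crosses it
event = any(c >= event_threshold for c in trajectory)
```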

012009

The thresholding procedure for graph construction is one of the common steps in calculating networks of brain connections; however, it can make the results of different studies incomparable. In the present study we test the effect of thresholding, or algorithmic reduction of the number of connected nodes, on a set of widely used connectivity graph metrics derived from EEG data. 164 people, recruited via social networks, took part in our study. EEG was recorded during the resting state; at the beginning of the procedure each participant was asked to relax and not to think about anything. Source reconstruction was performed using the standard source localization pipeline from the MNE package. The Desikan-Killiany atlas was used for cortical parcellation, with 34 ROIs per hemisphere. Synchronization was estimated with the weighted phase lag index in the 4–30 Hz frequency range, separately for eyes closed and eyes open. We found that all metrics except the average participation coefficient vary monotonically as a function of density level (moreover, for the clustering coefficient more than 95%, and for the characteristic path length about 50%, of the variance is related to the thresholding cut-off). Different data-driven approaches to network construction lead to significant changes in group-level graph metrics and can eliminate variance in the data that may be crucial for studies of individual differences.
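The density-thresholding effect described above can be reproduced on a toy connectivity matrix: keep only the strongest fraction of edges, then compute a graph metric at each cut-off. A minimal numpy sketch using the global clustering coefficient via triangle counting (one illustrative metric, not the study's full metric set or its actual data):

```python
import numpy as np

rng = np.random.default_rng(1)

def threshold_by_density(w, density):
    """Binarize a symmetric weight matrix, keeping the strongest
    `density` fraction of edges."""
    iu = np.triu_indices_from(w, k=1)
    weights = w[iu]
    k = max(1, int(density * len(weights)))
    cut = np.sort(weights)[-k]
    a = (w >= cut).astype(int)
    np.fill_diagonal(a, 0)
    return np.triu(a, 1) + np.triu(a, 1).T  # enforce symmetry

def clustering_coefficient(a):
    """Global clustering coefficient: 3 * triangles / connected triples,
    with triangles counted from the trace of A^3."""
    triangles = np.trace(np.linalg.matrix_power(a, 3)) / 6
    deg = a.sum(axis=1)
    triples = (deg * (deg - 1) / 2).sum()
    return 3 * triangles / triples if triples else 0.0

w = rng.random((20, 20))
w = (w + w.T) / 2  # symmetric "connectivity" matrix
coeffs = [clustering_coefficient(threshold_by_density(w, d)) for d in (0.1, 0.3, 0.6)]
# The same data yields different clustering coefficients at different density cut-offs.
```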

012010

This article is devoted to the development of an artificial neural network model for predicting the level of nonverbal intelligence from brain EEG data. Cognitive functioning relies on synchronization between different brain structures; however, it is still unclear how individual differences in intelligence are related to the global characteristics of information transmission in brain networks. Resting-state functional connectivity studies show an association between patterns of interaction between brain regions and different levels of nonverbal intelligence. In this study, we present the development of a fully-connected neural network model used to predict the level of nonverbal intelligence from EEG data.

012011

The problem of consistency of medical data in hospital data management systems is considered in the context of the correctness of the medical images themselves, to minimize the possible harm from spurious DICOM files. The approach should be considered an addition to other securing techniques such as watermarking, encryption, and testing conformance with the standard. To achieve acceptable accuracy in practice, two aspects are taken into account: correctness of periodicity and correctness of the image data (time series) itself for the considered modality. This paper proposes an architecture for an information system with an integrated network filter that provides facilities for analysis and alert management. The architecture performs the analysis of incoming data streams in components working in a clustered manner, providing the horizontal scalability and fault tolerance needed to be Big Data-ready.

012012

The article proposes an approach that makes multi-model classification systems more suitable for solving practical problems in the field of information technologies. To achieve this, it was necessary, on the one hand, to provide a stable system response time to support the SLA (service-level agreement) and, on the other hand, to minimize the downtime of server hardware. The first requirement allows requests to be answered correctly and quickly for the client, and the second makes the cost of renting servers more justified. The main idea of the proposed approach is to obtain predictions with the highest possible accuracy in a fixed time. We considered three auction models: the Dutch auction (rate decreasing), the English auction (rate increasing) and an adapted version of the Vickrey auction (highest rate switched off). In all cases, we used the highest class probability as the rate, and we set a time parameter after which the prediction is recognized as final. The obtained results and their comparison with ensembling and balancing methods allow us to conclude that the proposed approaches can be useful in the development of multi-model classification systems.
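The Dutch-auction variant above can be sketched as an acceptance threshold that falls over time: each model reports its top-class probability as its "rate", and the first model to beat the current threshold supplies the final prediction. A toy sketch with stubbed-out model outputs; the fallback rule when the floor is reached is an assumption, not the authors' implementation:

```python
# Dutch auction over a pool of classifiers: the acceptance threshold ("rate")
# starts high and decreases each round; the first model whose top-class
# probability beats the current rate supplies the final prediction.

def dutch_auction(models, start_rate=0.99, step=0.05, floor=0.5):
    rate = start_rate
    while rate >= floor:
        for name, (label, confidence) in models.items():
            if confidence >= rate:
                return name, label
        rate -= step
    # Budget exhausted: fall back to the single most confident model (an
    # assumed tie-breaking rule for this sketch).
    name = max(models, key=lambda m: models[m][1])
    return name, models[name][0]

# Stub predictions: (predicted label, top-class probability).
models = {
    "fast_model": ("cat", 0.72),
    "slow_model": ("dog", 0.91),
}
winner, label = dutch_auction(models)
```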

012013

One of the most significant tasks of echocardiography is the automatic delineation of cardiac structures in 2D echocardiographic images. Over the past decades, the automation of this task has been the subject of intense research, and one of the most effective approaches is based on deep convolutional neural networks. Training such networks, however, requires echocardiogram frames of the cardiac muscle in which the boundaries of the cardiac structures have been labeled by expert cardiologists, and the number of databases containing the necessary information is relatively small. Therefore, generated echocardiogram frames, based on expert-annotated ultrasound images of the heart, are used to increase the number of training samples. The article proposes an improved method for generating echocardiograms using a generative adversarial network (GAN) with a patch-based conditional discriminator. It is demonstrated that the quality of generated echocardiogram frames in both two- and four-chamber views (AP4C, AP2C) can be improved using cardiac segmentation masks with a sub-pixel convolution layer (pixel shuffle), and that the proposed approach makes it possible to generate ultrasound images whose structure corresponds to the specified segmentation masks. It is expected that this method will improve the accuracy of the direct problem of automatic segmentation of the left ventricle.

012014

The paper discusses the results of the first stage of research and development of an innovative computer vision system for automatic control of asbestos content in stone veins at an asbestos processing factory. The system is based on the application of semantic segmentation artificial neural networks, in particular U-Net based architectures, for segmenting both the boundaries of the stones and the veins inside them. At the current stage, the following tasks were solved. 1. A prototype of the system was developed; it takes images of the asbestos stones on the conveyor belt in the near-infrared range (NIR), avoiding the influence of outside lighting, and processes the obtained images. 2. The training, validation and test datasets were collected. 3. The choice of a U-Net based neural network was substantiated. 4. It was proposed to estimate the resulting specific asbestos concentration as the average ratio of the total vein area to the total stone area in the image. 5. The resulting deviation between the obtained results and the laboratory results for asbestos concentration is about 0.058 in the slope of the graduation curve. Recommendations for further improvement of the developed system are given.
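The concentration estimate described above, the ratio of vein area to stone area, is a one-line computation once the segmentation masks are available. A minimal sketch on toy binary masks (in the system described, the masks would come from the U-Net outputs):

```python
import numpy as np

# Toy binary segmentation masks: 1 marks stone pixels / vein pixels.
stones = np.array([[1, 1, 1, 0],
                   [1, 1, 1, 0],
                   [1, 1, 1, 0]])
veins = np.array([[0, 1, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])

def concentration(veins_mask, stones_mask):
    """Specific asbestos concentration for one image: vein area over stone area."""
    stone_area = stones_mask.sum()
    return veins_mask.sum() / stone_area if stone_area else 0.0

c = concentration(veins, stones)  # 2 vein pixels / 9 stone pixels
```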

012015

When creating geoinformation systems on a city scale that relate various information from the Internet, the task arises of creating an ontology of urbanonyms that takes into account their historical changes. Accounting for historical changes is necessary, for example, to process messages about urban events from blogs: since more and more representatives of the middle and older generations are becoming active Internet users, the messages often contain the former names of urbanonyms. Note that it is precisely this accounting for historical changes that determines the need to create an ontology rather than a thesaurus, which is sufficient, as shown in [1], for handling geographical names commonly used (at least in natural science articles) in their actual form. Given the specifics of the task of creating an ontology of Almaty, it should be bilingual: in the Kazakh and Russian languages.

012016

Text analysis is a promising field of study with many unsolved problems; first of all, most methods are labor- and time-consuming. We pay special attention to patents. The most important thing in analyzing patents, as a reflection of a company's research activities, is not to be late: technology emerges very quickly, so the speed of response to changes in the world of scientific research is now very important. Therefore, we propose an alternative method of patent analysis based on clustering. Its main advantage is that it does not require separate train/test datasets and can be applied immediately. In this article, we compare different clustering algorithms, because the quality of the conclusions depends on them.

012017

In NLP tasks, high priority is accorded to the accumulation of vocabularies. To complement them, unknown words need to be found; unique words can then be added to the dictionary with the help of an expert. This paper presents a technique for finding unknown words in Named Entity Recognition (NER).

012018

There is currently an urgent need to develop massively scalable and efficient tools for Big Data processing. Even the smallest companies nowadays inevitably require more and more resources for data processing routines that could enhance decision making and reliably predict and simulate different scenarios. In the current paper we present our combined work on different massively scalable approaches to clustering and topic modeling of a dataset collected by crawling Kazakhstan news websites. In particular, we propose Apache Spark parallel solutions to the news clustering and topic modeling problems and, additionally, describe the results of implementing document clustering in the partitioned global address space MapReduce system we developed. We describe our experience in solving these problems and investigate the efficiency and scalability of the proposed solutions.

012019

The paper proposes a method for evaluating text documents by arbitrary criteria that combines topic modeling of the text corpus with multiple-criteria decision making. The evaluation is based on an analysis of the corpus as follows: after the topic model of the corpus is built, the conditional probability distribution of media over topics, properties and classes is calculated. Weights assigned by experts to each topic, together with the topic model, can be applied to evaluate each document in the corpus according to each of the considered criteria and classes. The proposed method was applied to a corpus of 804,829 news publications from 40 Kazakhstani sources published from 01.01.2018 to 31.12.2019, in order to classify negative information on socially significant topics. A BigARTM model with 200 topics was trained and the proposed model was applied. Experiments confirm that the sentiment of publications can be evaluated using the topic model of the text corpus, since a ROC AUC score of 0.93 was achieved on the classification task.
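The scoring step described above, combining p(topic | document) with expert-assigned topic weights, amounts to a weighted sum per document. A minimal sketch with synthetic distributions and an assumed decision threshold (not the 200-topic BigARTM model or the paper's actual weights):

```python
import numpy as np

# p(topic | document) for three documents over four topics (rows sum to 1).
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
# Expert-assigned weight of each topic for one criterion (e.g. "negative content").
topic_weight = np.array([0.9, 0.1, 0.0, 0.2])

scores = doc_topic @ topic_weight  # criterion score per document
flagged = scores > 0.5  # simple threshold for the classification decision
```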