Research on Intelligent Recommendation of Maintenance Scheme based on Multi-modal Data Learning

Military, medical, transportation and industrial equipment inevitably suffers degradation and damage to its performance and stability in daily use, especially after long-term operation or shutdown, and maintenance personnel cannot always reach the site in time. To address this, an intelligent recommendation technology for remote equipment maintenance based on multi-modal data learning is studied. Multi-modal data such as text, images and video are integrated, and, based on image recognition and learning over these multi-modal data, remote maintenance solutions can be recommended intelligently when maintenance personnel cannot arrive in time.

Equipment failure not only extends the time of hospital diagnosis and treatment, but also threatens the life safety of emergency patients when diagnosis and treatment equipment breaks down. Traditional maintenance methods are therefore increasingly unable to cope with hospital equipment failure. Rapidly and efficiently identifying failures and emergencies in large-scale medical imaging equipment, and providing intelligent solutions or remote engineer marking guidance to resolve them accurately, has become an urgent problem.
In the field of aviation and aerospace, according to ASN aviation safety database statistics, 363 aviation accidents have occurred worldwide since 2020, and crashes of US military aircraft are even more frequent; global aviation accidents thus remain a recurring danger. A warning case is the 1989 crash of United Airlines Flight 232 in the United States, a major accident caused by engine failure: because maintenance personnel did not find the cracks caused by metal fatigue in the fan disc, the disc burst during flight and disabled all three hydraulic systems, and 112 people ultimately died. Whether for military or civil aircraft, a failure can be devastating, and military equipment falling into the territory of another country may also cause serious leaks of secrets, threatening military security. In addition, precision instruments account for a large proportion of aviation equipment; they are expensive, and their operation and maintenance systems are complex. Whether in unmanned or manned space engineering, limitations of space and cost mean that large numbers of maintenance personnel cannot be carried on board, so only intelligent fault detection and maintenance methods can be relied upon. A remote fault detection and resolution approach is therefore urgently needed.
Under today's economic globalization and a complex international environment, the inconvenience of transnational travel has become a major obstacle to equipment maintenance, hindering the normal management, use and upkeep of important equipment and seriously affecting enterprises' efficiency-building needs. Intelligent remote maintenance can not only effectively reduce the workload of maintenance personnel but also improve the accuracy of maintenance, helping to increase efficiency, reduce cost and enhance enterprise competitiveness. It is an inevitable choice given the sharply increasing difficulty of complex equipment maintenance and the development and application of new technologies.

Feature extraction and object detection and recognition for high-dimensional data based on deep convolutional networks
Convolutional neural networks have strong expression and modeling ability. Using the 2D structure of an image, they learn feature representations of an object layer by layer and automatically, realizing abstraction and description of the object. The overall framework of this algorithm is shown in Figure 1. The input image is fed into the deep network, CNN features are extracted, and the category and position coordinates of the target are predicted directly by the deep network. For an input feature map of size $D_F \times D_F$ with $M$ input channels, $N$ output channels and a $D_K \times D_K$ kernel, the ratio of the computation of the depthwise-separable convolution module to that of the standard convolution module is

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}.$$

After the basic network, feature extraction layers with large size changes are added to perform further convolution on the CNN features extracted by the basic network. Different convolution layers have different receptive fields and can adapt to objects at different scales, achieving multi-scale detection. These layers are then deconvolved from the last layer and feature-fused with the previous layer.
After comprehensive consideration of shallow and deep features, combining detailed information with macro features, the resulting fused feature map is used as the basis for prediction. First, four convolutional layers are added after the basic network. The feature maps of the last two convolutional layers of the basic network are 19×19 and 10×10 respectively, and those of the new convolutional layers are 5×5, 3×3, 2×2 and 1×1. The last two layers of the base network and the new convolutional layers form the bottom-up convolutional layers. Starting from the convolution layer with the 1×1 feature map, features are fused layer by layer to form the top-down feature layers, as shown in Figure 3. Deconvolution is performed on the last layer to form the corresponding top-down convolution layer, which is then fused with the corresponding bottom-up convolution layer. In the fusion process, a deconvolution operation is first applied to the top-down convolution layer to restore the spatial resolution of the previous convolution layer. The result, which has the same spatial resolution as the previous (bottom-up) convolutional layer, is spliced together with it along the channel dimension. A 1×1 convolutional layer is then attached, integrating the information through a linear weighted combination of channels to form the fused feature layer. The fused feature layer acts as the next top-down convolution layer, and the same operations are repeated with the next bottom-up convolution layer.
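As an illustration, the following PyTorch sketch implements one such fusion step (deconvolution, channel-wise concatenation, then a 1×1 convolution). It is a minimal sketch under assumed channel sizes and upsampling parameters, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Fuse a top-down feature map with the corresponding bottom-up one."""
    def __init__(self, top_channels, lateral_channels, out_channels):
        super().__init__()
        # Deconvolution rebuilds the spatial resolution of the previous
        # (bottom-up) layer, e.g. 1x1 -> 2x2 with this kernel/stride choice.
        self.deconv = nn.ConvTranspose2d(top_channels, top_channels,
                                         kernel_size=2, stride=2)
        # 1x1 convolution linearly combines the concatenated channels
        # into the fused feature layer.
        self.fuse = nn.Conv2d(top_channels + lateral_channels,
                              out_channels, kernel_size=1)

    def forward(self, top_down, bottom_up):
        up = self.deconv(top_down)                # upsample top-down features
        merged = torch.cat([up, bottom_up], dim=1)  # splice along channels
        return self.fuse(merged)

# Example: fuse a 1x1 top layer with the 2x2 bottom-up layer.
top = torch.randn(1, 256, 1, 1)
lateral = torch.randn(1, 256, 2, 2)
fused = TopDownFusion(256, 256, 256)(top, lateral)  # -> (1, 256, 2, 2)
```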

Multi-modal entity alignment based on adaptive feature fusion
The goal of multimodal entity alignment is to align entities across two different multimodal knowledge graphs. A multimodal knowledge graph (MMKG) usually contains information in multiple modalities; in this paper, without loss of generality, we focus on the structural and visual information of the knowledge graph. We represent an MMKG as $G = (E, R, T, I)$, where $E$, $R$, $T$ and $I$ denote the entities, relations, triples and images respectively. A relational triple $(h, r, t) \in T$ satisfies $h, t \in E$ and $r \in R$. An entity in the graph may be associated with multiple images.
Given two multimodal knowledge graphs $G_1 = (E_1, R_1, T_1, I_1)$ and $G_2 = (E_2, R_2, T_2, I_2)$ and a set of seed entity pairs $S = \{(e_1, e_2) \mid e_1 \in E_1,\ e_2 \in E_2\}$, the task of multimodal entity alignment is to find the remaining potential matching entities based on the seed entity pairs.
At present, the images providing visual information in multimodal knowledge graphs come from Internet search engines, so noisy images are inevitable. Indiscriminate use of this image information leads to poor utilization of the visual information. Image-text matching models [2][3] can calculate the degree of similarity between an image and a text. Inspired by this, to address the poor use of visual information, a visual feature processing module is designed in this work to generate more accurate visual features for entities and thereby help align them. Figure 4 describes in detail the process of generating entity visual features. In the absence of supervised data, this paper uses a pre-trained image-text matching model to compute the similarity between images and entities. A similarity threshold is then set to filter out noisy images; based on the similarity scores, the remaining images are given corresponding weights, and finally the visual feature representation of the entity is generated. The pre-trained image-text matching model is used to calculate a similarity score for each image in the entity's image set. This paper utilizes Consensus-aware Visual-Semantic Embedding (CVSE) [4], which incorporates consensus information (common-sense knowledge shared between the two modalities) into image-text matching. The model is trained on the MSCOCO [5] and Flickr30k [6] datasets, and we calculate the image-entity similarity scores with the CVSE model and its trained parameters.
The input of the visual feature processing module is the name of an entity and the entity's corresponding image set, as shown on the left side of Figure 4. First, the image embeddings $V = \{v_1, \dots, v_n\}$ of the entity image set are generated, where $n$ is the number of images in the set. We use the object detection algorithm Faster-RCNN [7] to generate 36 feature vectors of 2,048 dimensions for each image. The entity name [Entity Name] is then expanded into a sentence {A photo of Entity Name}, which is fed into a bidirectional gated recurrent unit (Bi-GRU) to generate the text representation $t$ of the entity.
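The following is a minimal PyTorch sketch of this text branch; the vocabulary size, embedding width, hidden size and the mean-pooling over time are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode the sentence 'A photo of <Entity Name>' with a Bi-GRU."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        outputs, _ = self.bigru(x)         # (batch, seq_len, 2 * hidden_dim)
        # Mean-pool over time to obtain one text vector per sentence.
        return outputs.mean(dim=1)

tokens = torch.randint(0, 10000, (1, 6))   # token ids for the sentence
text_feature = TextEncoder()(tokens)        # (1, 1024)
```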
Next, the image embeddings $V$ and text representation $t$ are fed into the CVSE model. We remove the Softmax layer of the CVSE model to obtain the similarity scores of the images in the entity image set:

$$s = f_{\mathrm{CVSE}}(V, t) \qquad (2)$$

where $s = (s_1, \dots, s_n) \in \mathbb{R}^n$ contains the generated similarity scores and $n$ is the number of images in the entity's image set.
Filtering noisy images. Some images in the entity image set have low similarity and would degrade the accuracy of the visual information. In view of this, a similarity threshold $\alpha$ is set to filter out noisy images:

$$I'_e = \{\, i_k \mid s_k \ge \alpha,\ i_k \in I_e \,\} \qquad (3)$$

where $I_e$ denotes the initial image set of entity $e$ and $I'_e$ the image set after filtering out the noisy images.

Construction of a multimodal knowledge graph for equipment governance
The key to constructing a multimodal knowledge graph is to supplement entities with information from other modalities. Taking visual information as an example, the main task of the prototype system is to refine the image collections of entities acquired from the Internet. The image refinement flow for entities in the multimodal knowledge graph is shown in Figure 5. The traditional k-means [8] algorithm can divide an image set into K clusters. It is simple and fast, but the value of K (the number of clusters) must be set in advance, it is poorly suited to finding clusters of very different sizes, and it is sensitive to noise and isolated points. The numbers of similar images in the real image sets returned by search engines differ considerably: some images have more than ten similar images, others only two to five. There are also images that differ greatly from the rest, i.e. isolated points, which may describe the visual features of an entity from a different angle and therefore cannot simply be filtered out. The density-based DBSCAN algorithm can solve this problem to a certain extent. It divides clusters based on the distribution density of the vectors; unlike K-means, it is insensitive to cluster shape and size and can be applied to data sets with many outliers (isolated points). In addition, DBSCAN can find isolated points that belong to no cluster without the number of clusters being set in advance.
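As a minimal illustration, the sketch below clusters image feature vectors with scikit-learn's DBSCAN; the random stand-in features and the eps/min_samples values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for image embeddings; in the pipeline these would be the
# feature vectors of the entity's candidate images.
features = np.random.rand(40, 2048)

# DBSCAN needs no preset cluster count and labels isolated points as -1.
clustering = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(features)
labels = clustering.labels_                   # cluster id per image

clusters = {k: np.where(labels == k)[0] for k in set(labels) if k != -1}
isolated = np.where(labels == -1)[0]
print(f"{len(clusters)} clusters, {len(isolated)} isolated images")
```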
A similarity score is then assigned to the deduplicated image set, using the CVSE model and the data preprocessing method described above to generate text-image similarity scores. To address polysemy, we replace the sentence {A photo of Entity Name} expanded from the entity name with the entity description. For example, for the entity of the movie Man of Steel, the previous text input was {A photo of Man of Steel}, while the updated text input is {Man of Steel is a 2013 superhero film based on the DC Comics superhero Superman.}. Through this process we obtain the similarity scores of the images in the image collection.

Anomaly state mining based on association rules
Association rules are used to discover potential associations between data, the most typical application being shopping cart analysis in e-commerce. The historical running state data of equipment are similarly interrelated: the correlations among the parameters of equipment in the normal running state differ from those in the various fault states. Association rules are therefore used to mine the correlations among equipment running state data and thereby characterize abnormal states. There are three important terms in association rules: support, confidence and lift. Support is how often an item set appears in all shopping carts; for the association of two items, it is the frequency with which the two items appear together, and it measures the universality of a relationship. Confidence is the frequency with which the second item appears given that the first item appears.
Support: the proportion of occurrences of several associated data items in the total data set, i.e. the probability that the items occur together. For a data set $S$ containing items $X$ and $Y$, the support of the item set $\{X, Y\}$ is

$$\mathrm{Support}(X, Y) = \frac{|\{\, s \in S \mid X \in s \ \text{and}\ Y \in s \,\}|}{|S|}.$$

Confidence: the probability that $Y$ occurs given that $X$ occurs, i.e. the conditional probability

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X, Y)}{\mathrm{Support}(X)}.$$

Lift: the ratio of this confidence to the support of $Y$:

$$\mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Confidence}(X \Rightarrow Y)}{\mathrm{Support}(Y)}.$$

The Apriori flow chart is shown in Figure 6. As a simple example, if the support of the association of cola and potato chips is 20%, the support of cola purchases is 30%, and the support of potato chip purchases is 50%, then the lift is $(0.2 / 0.3) / 0.5 \approx 1.33 > 1$, so the rule cola ⇒ potato chips has a promoting effect.
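The following minimal Python sketch computes the three measures and reproduces the cola and potato chips example; the transaction data are illustrative:

```python
# Toy transactions chosen so that support({cola, chips}) = 20%,
# support(cola) = 30% and support(chips) = 50%, as in the example above.
transactions = [
    {"cola", "chips"}, {"cola", "chips"}, {"cola"},
    {"chips"}, {"chips"}, {"chips"},
    {"bread"}, {"milk"}, {"eggs"}, {"rice"},
]

def support(itemset, data):
    return sum(itemset <= basket for basket in data) / len(data)

def confidence(lhs, rhs, data):
    return support(lhs | rhs, data) / support(lhs, data)

def lift(lhs, rhs, data):
    return confidence(lhs, rhs, data) / support(rhs, data)

print(support({"cola", "chips"}, transactions))       # 0.2
print(confidence({"cola"}, {"chips"}, transactions))  # 0.667
print(lift({"cola"}, {"chips"}, transactions))        # 1.33 > 1: promoting
```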

Anomaly state mining based on time series analysis
The parameters of equipment in normal operation usually change continuously over time, so time series analysis is used to mine abnormal running states in the time dimension. Temporal anomaly detection requires historical data: given the data for the current day, it determines whether they are abnormal. The S-ESD algorithm is adopted for temporal anomaly detection; it consists of a time series decomposition part and an ESD part. The Grubbs test is the main basis of the ESD anomaly detection algorithm. Grubbs' test is a hypothesis test, often used to test for a single outlier in a univariate data set $Y$ that follows an approximately normal distribution; if there is an outlier, it must be the maximum or minimum value in the data set. The null and alternative hypotheses are:

H0: there are no outliers in the data set.
H1: there is exactly one outlier in the data set.

The test statistic used by the Grubbs test is

$$G = \frac{\max_{i} |Y_i - \bar{Y}|}{s},$$

where $\bar{Y}$ is the sample mean and $s$ the sample standard deviation. The null hypothesis H0 is rejected, i.e. an anomaly is detected, when

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}},$$

where $t_{\alpha/(2N),\,N-2}$ is the critical value of the t-distribution with $N-2$ degrees of freedom at significance level $\alpha/(2N)$. Because time series data have seasonal and trend components, anomaly detection cannot treat observations as isolated sample points. Twitter's engineers proposed the Seasonal ESD (S-ESD) and Seasonal Hybrid ESD (S-H-ESD) algorithms, which extend ESD to time series data. STL decomposes a time series into trend, periodic and residual components, as shown in Figure 7 below.
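The following Python sketch illustrates the S-ESD idea under stated assumptions: the series is decomposed with statsmodels' STL, and a generalized ESD test (repeated Grubbs-style testing) is run on the residual component. The period, significance level, maximum number of anomalies and synthetic data are illustrative:

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.seasonal import STL

def generalized_esd(x, max_anomalies=5, alpha=0.05):
    """Repeatedly apply the Grubbs-style test, removing one point each round."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    outliers = []
    for _ in range(max_anomalies):
        mean, std = x.mean(), x.std(ddof=1)
        dev = np.abs(x - mean)
        j = dev.argmax()
        g = dev[j] / std                      # Grubbs test statistic
        n = len(x)
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        threshold = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if g <= threshold:
            break                             # H0 kept: no further outliers
        outliers.append(idx[j])
        x, idx = np.delete(x, j), np.delete(idx, j)
    return outliers

values = np.sin(np.linspace(0, 20 * np.pi, 500)) + 0.1 * np.random.randn(500)
values[100] += 3.0                            # inject one anomaly
residual = STL(values, period=50).fit().resid # run ESD on the STL residual
print(generalized_esd(residual))              # expected to contain index 100
```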

Semi-supervised image anomaly detection based on autoencoder networks
The task of image anomaly detection is to detect abnormal samples and abnormal regions at both the sample level and the pixel level. In fields such as medicine and industry, accurate image labeling must be done by professionals and consumes considerable manpower and material resources. Semi-supervised learning reduces the requirement for accurate labels: no pixel-level labels are needed during training, only a data set consisting of normal samples. An autoencoder network is a model that learns a latent-space representation of samples in an unsupervised way; by constraining the output to be as close to the input as possible, it learns important representations of the input samples and achieves good reconstruction. We introduce a reconstruction strategy based on processing the input image. Inspired by the denoising autoencoder (DAE) [9], we apply certain processing operations to the input image so that the autoencoder learns more representative feature expressions during reconstruction. The semi-supervised image anomaly detection framework under this strategy is shown in Figure 8. Compared with the original image reconstruction strategy [10], the original image is no longer fed directly into the autoencoder; it first passes through an image processing module, and the processed image becomes the final input to the reconstruction network. The image processing module may apply a conventional image processing method or a neural network model. During training, the output of the network should be as close to the original image as possible, and the model parameters are constrained by a loss function such as the mean squared error described in the previous section. During testing, the test image first passes through the image processing module and is then fed into the autoencoder reconstruction network to obtain the reconstructed image; the difference between the reconstructed image and the original image is post-processed and used as the basis for judging sample-level and pixel-level anomalies. The original test image can also be fed directly into the trained autoencoder model [11]. Because of the reconstruction strategy adopted during training, the model captures more important image features and uses them to reconstruct the image, so it reconstructs normal images well, while for abnormal images it fills the abnormal regions with the corresponding normal texture.
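A minimal PyTorch sketch of this training strategy follows; the network shape and the use of additive Gaussian noise as the image processing module are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

original = torch.rand(8, 1, 64, 64)           # batch of normal images
# Image processing module: here simple additive noise, as one possible choice.
processed = original + 0.1 * torch.randn_like(original)

recon = model(processed)
loss = criterion(recon, original)             # reconstruct the ORIGINAL image
loss.backward()
optimizer.step()

# At test time the per-pixel error |recon - original| serves as the
# pixel-level anomaly score; its aggregate gives the sample-level score.
```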

Recommendation of multi-modal maintenance schemes based on a rule engine
Based on abnormal state identification, the fault type and location can be initially diagnosed, and structured fault data can be defined, such as fault type, device state parameters and related descriptions. Combined with the large volume of equipment maintenance records, fault cases and other data accumulated during historical maintenance support, fault maintenance rules are summarized, and the rule engine is then used to screen and match the fault cases that satisfy the conditions for recommendation.
Drools is an expert system implemented with rules. At its core is an inference engine capable of processing a large number of rules and facts. Its matching algorithm is an improved version of the Rete algorithm, a highly efficient forward-chaining reasoning algorithm that builds a pattern matching network from the rule base and records the state of nodes during matching, thereby achieving efficient parsing and high performance.
The Rete implementation in Drools is called ReteOO. Triggered from an initial fact, it continuously applies rules to reason or perform specified actions, matching step by step until the corresponding conclusion is reached. The Drools rule engine process is shown in Figure 9 below; its reasoning proceeds as follows:
1. Load the rules into the rule base.
2. Load the facts to be matched into working memory.
3. Match the loaded rules and facts in the pattern matcher.
4. Add the matched rules to the conflict set.
5. Resolve conflicts according to the conflict resolution strategy, determine the rules to be activated and their activation order, and add them to the agenda in order.
6. Execute the rules in sequence, repeating steps 3-6 until no active rules remain in the agenda. A minimal sketch of this cycle follows.
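Drools itself is a Java library; the Python sketch below only mirrors the match / conflict-resolution / fire cycle of steps 1-6. The rule contents and the salience-based conflict strategy are illustrative assumptions:

```python
# Each rule: (name, salience, condition over facts, action producing new facts).
rules = [
    ("overheat", 10,
     lambda f: f.get("temperature", 0) > 90,
     lambda f: {"fault_type": "cooling_failure"}),
    ("recommend", 5,
     lambda f: f.get("fault_type") == "cooling_failure",
     lambda f: {"recommendation": "replace cooling fan, check thermal paste"}),
]

working_memory = {"device": "CT scanner", "temperature": 95}  # loaded facts

fired = set()
while True:
    # Pattern matcher: collect rules whose conditions match current facts.
    agenda = [r for r in rules if r[0] not in fired and r[2](working_memory)]
    if not agenda:
        break                                  # no active rules remain
    agenda.sort(key=lambda r: -r[1])           # conflict resolution: salience
    name, _, _, action = agenda[0]
    working_memory.update(action(working_memory))  # fire the rule
    fired.add(name)

print(working_memory.get("recommendation"))
```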

Scheme query and recommendation based on learning to rank
Based on abnormal state identification, the type and location of a fault can be initially diagnosed. According to the disposal methods for the various faults in the maintenance manual, the disposal regulations matching the fault can be found and solutions recommended. The fault solution recommendation is thus reduced to a text matching problem: matching the fault description text of the query against the operation regulations. As shown in Figure 10, the fault information text and the regulation documents are first preprocessed to obtain query and regulation data, and a recall model quickly retrieves candidate query-regulation pairs with a high matching degree (a minimal recall sketch follows). A more complex matching model then extracts multi-level shallow and deep features from each query-regulation pair, the combined features are used as the input of the learning-to-rank model, and the recommendation ranking is learned by minimizing a classification loss or a ranking loss.
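As a minimal illustration of the recall stage, the sketch below scores candidate regulations against a fault query with TF-IDF and cosine similarity; the texts and the choice of TF-IDF are illustrative assumptions, since the paper does not specify the recall model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

regulations = [
    "Replace the hydraulic pump seal when pressure drops below threshold",
    "Inspect fan disc for metal fatigue cracks during engine overhaul",
    "Recalibrate the imaging detector after repeated artifact reports",
]
query = ["engine fan disc crack detected during inspection"]

# Cheap lexical recall before the heavier matching model reranks candidates.
vectorizer = TfidfVectorizer().fit(regulations + query)
scores = cosine_similarity(vectorizer.transform(query),
                           vectorizer.transform(regulations))[0]

top_k = scores.argsort()[::-1][:2]            # candidate query-regulation pairs
for i in top_k:
    print(f"{scores[i]:.3f}  {regulations[i]}")
```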

Figure 10. Framework of the short text matching model based on learning to rank

When features are combined using methods such as max pooling, the parameters of the matching models that extract deep features can be learned by minimizing a classification loss or a ranking loss. Alternatively, a learning-to-rank method based on a single deep feature can be used to learn the recommendation score of a query; for example, the pointwise model [12] framework and the pairwise model framework based on single-sentence-vector feature learning to rank are shown in Figure 11 and Figure 12 respectively.
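The following PyTorch sketch contrasts the two objectives on placeholder pair features; the linear scoring model, feature width, labels and margin are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

score_model = torch.nn.Linear(768, 1)          # maps a pair feature to a score

pair_features = torch.randn(4, 768)            # query-regulation pair vectors
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])    # 1 = relevant regulation

scores = score_model(pair_features).squeeze(-1)

# Pointwise: treat each pair independently as binary classification.
pointwise_loss = F.binary_cross_entropy_with_logits(scores, labels)

# Pairwise: a relevant pair should outscore an irrelevant one by a margin.
pos, neg = scores[labels == 1.0], scores[labels == 0.0]
pairwise_loss = F.relu(1.0 - (pos.unsqueeze(1) - neg.unsqueeze(0))).mean()

(pointwise_loss + pairwise_loss).backward()
```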

Conclusion
Intelligent recommendation of maintenance solutions based on multi-modal data learning addresses the problem of providing equipment maintenance solutions in emergencies when maintenance personnel cannot arrive in time. By integrating multi-modal data such as text, images and video, and building on image recognition and multi-modal data learning, intelligent remote maintenance can effectively reduce the workload of maintenance personnel and improve the accuracy of maintenance. It helps to increase efficiency, reduce cost and improve competitiveness, and is an inevitable choice given the sharply increasing difficulty of complex equipment maintenance and the development and application of new technologies.

Figure 1. Overall framework of the deep convolutional network

The basic network uses MobileNet v2 [1] as its structure and automatically extracts high-level features from the input image through convolution operations. The choice of parameters and layer types of the underlying network directly affects the speed, performance and memory required for feature extraction. As shown in Figure 2, the convolution kernel of each depthwise convolution is applied to only a single channel of the input; a pointwise convolution is then performed on the result of the depthwise convolution, applied across all channels to combine the single-channel results.
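A minimal PyTorch sketch of this decomposition follows; the channel counts and kernel size are illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one kernel per input channel) followed by a
    1x1 pointwise convolution that combines results across all channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels)   # one filter per channel
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)          # (1, 64, 56, 56)
```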

Figure 2. Standard convolution decomposed into a depthwise convolution and a pointwise convolution

Figure 4. Visual feature processing module

Entity visual feature representation generation. For the images in the filtered set $I'_e$, we assign weights based on their similarity scores to generate a more accurate visual feature representation $v_e$ for entity $e$:

$$v_e = \sum_{k=1}^{n'} w_k \cdot f_k \qquad (4)$$

where $v_e \in \mathbb{R}^{2048}$ is the visual feature of entity $e$, $f_k \in \mathbb{R}^{2048}$ is the image feature generated by the ResNet model, and $n'$ is the number of images after removing noise. $w_k$ is the attention weight of image $i_k$:

$$w_k = \frac{\exp(s_k)}{\sum_{j=1}^{n'} \exp(s_j)} \qquad (5)$$

where $s_k$ is the similarity score of image $i_k$.

Figure 5. Flow chart of entity image refinement for the multimodal knowledge graph

Figure 7. Time series decomposition

Applying ESD to the residual component of the STL decomposition yields the outliers of the time series.

Figure 8. Framework of semi-supervised image anomaly detection with an autoencoder network under the modified reconstruction strategy

Figure 11. Pointwise model framework based on sentence-vector feature learning to rank

Figure 12. Pairwise model framework based on sentence-vector feature learning to rank