Chinese image captioning with Inceptionv4 and double-layer GRUs based on an attention mechanism

In recent years, there has been a wave of research on English image captioning both in China and abroad. However, owing to the particularities of the Chinese image captioning task, research on Chinese image captioning has made little progress. To address this problem, a new Chinese image captioning model is implemented. Firstly, the AI Challenge dataset is augmented, and the Chinese text in the dataset is preprocessed with a Chinese word segmentation tool and encoded with word2vec. Secondly, based on the encoder-decoder framework, image visual features are extracted with the Inceptionv4 network, an attention mechanism is incorporated into the feature extraction process, and Chinese sentences are generated by a double-layer GRUs network. During training, the Adam optimizer is used. Finally, a GUI is designed to better present the experimental results. Experiments show that the new Chinese image captioning model automatically generates more fluent and more accurate Chinese caption sentences, and the trained model performs well on many evaluation metrics.

the original model, using high-level semantic features. Chen et al. [7] modified part of the recurrent neural network (RNN) structure of the decoder in the encoder-decoder architecture, so that the RNN can not only translate images into textual descriptions but can also be applied in reverse, recovering image features from the text descriptions. Li et al. [8] created the Chinese image caption dataset Flickr8k-CN and then designed the Chinese image captioning model CS-NIC. In the encoder of this model, the GoogLeNet network structure extracts image feature information, and a long short-term memory network (LSTM) serves as the decoder to generate the textual description of the image.
In summary, this article uses the AI Challenge dataset [16] after data augmentation and, based on the encoder-decoder framework, designs a Chinese image captioning network that integrates an attention mechanism. Using the four evaluation metrics BLEU, METEOR, CIDEr, and Perplexity [9], this paper makes a detailed comparison and evaluation of common image captioning models. Finally, this paper designs a GUI for the proposed network, which allows the operator to use it intuitively.

Inceptionv4 model
In 2015, Inceptionv3 was proposed by Szegedy et al. [10]; it is the third generation of the Inception network structure of GoogLeNet. Inceptionv4 was developed on the basis of Inceptionv3: it improves the model structure of Inceptionv3 and is a further development of the Inception model. The Inception structure is highly adjustable, and Szegedy et al. [11] carefully tuned the size of each layer of the model to balance the amount of computation among the subnetworks. The encoder uses a CNN to extract the visual features of the image, and the CNN used in this paper is Inceptionv4 [11].
Inceptionv4 is trained on ImageNet [12]. In this paper, the trained network is used to extract visual features, with Inceptionv4 serving as the encoder network. Finally, the image feature information collected by the network is fed into the decoder network.
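As a concrete illustration of how the encoder output reaches the decoder: for a 299×299×3 input, Inceptionv4's final convolutional output is an 8×8×1536 feature map, which can be flattened into per-region annotation vectors (the form an attention mechanism typically weights) plus a pooled global feature. The values below are random stand-ins for a real forward pass, and the attention step is a toy sketch rather than the trained module:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inceptionv4's last convolutional output for a 299x299x3 input is an
# 8 x 8 x 1536 feature map; random values stand in for a real forward pass.
feature_map = rng.standard_normal((8, 8, 1536))

# Flatten the spatial grid into 64 annotation vectors, one per image region;
# an attention mechanism can weight these regions at each decoding step.
annotations = feature_map.reshape(-1, 1536)        # (64, 1536)

# Mean pooling yields a single global feature vector for the decoder.
global_feature = annotations.mean(axis=0)          # (1536,)

# Toy attention step: score each region against a (random) decoder state,
# softmax the scores, and form a weighted context vector.
h = rng.standard_normal(1536)
scores = annotations @ h                           # (64,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ annotations                    # (1536,) attended context
```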

GRU model
In the original RNN, the network has difficulty memorizing information: because of vanishing gradients, it makes little use of information that appeared earlier in the sequence when it processes the input at each time step.
To solve the above problems, LSTM and GRU were proposed as extensions of the traditional RNN. The core of the LSTM model is a memory cell controlled by three gates: a forget gate, an input gate and an output gate. Compared with the LSTM network, the GRU model has a simpler structure, controlled by only two gates, a reset gate and an update gate. This reduces the number of training parameters and improves the accuracy and computational efficiency of the network [13]. The network structure of the GRU is shown in Fig. 1, where r denotes the reset gate, z the update gate, h_{t-1} the output state of the previous time step, h̃_t the candidate hidden state of the current time step, and h_t the output of the current time step. The gates and states are computed as follows:

r_t = σ(W_r · [h_{t-1}, x_t])
z_t = σ(W_z · [h_{t-1}, x_t])
h̃_t = tanh(W_h · [r_t * h_{t-1}, x_t])
h_t = (1 − z_t) * h_{t-1} + z_t * h̃_t

In these formulas, σ denotes the sigmoid function, * denotes element-wise multiplication of vectors, [·,·] denotes concatenation, and W_r, W_z, W_h are weight matrices.
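The update equations above can be sketched as a single NumPy step; the dimensions and weights below are toy values chosen purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step; weights act on the concatenated [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ concat)                       # reset gate
    z = sigmoid(W_z @ concat)                       # update gate
    concat_reset = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(W_h @ concat_reset)           # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde           # blend old and new state

# toy dimensions: input size 4, hidden size 3 (so weights are 3 x 7)
rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
h_prev = np.zeros(3)
W_r, W_z, W_h = [rng.standard_normal((3, 7)) for _ in range(3)]
h_t = gru_cell(x_t, h_prev, W_r, W_z, W_h)
```

Because the new state is a gated blend of the previous state and a tanh candidate, every component of `h_t` stays bounded, which is part of what makes the GRU easier to train than a plain RNN.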

I-GRUS model
The Chinese image captioning model in this article is based on the encoder-decoder structure of the NIC network model. In the name I-GRUs, I stands for Inceptionv4 and GRUs stands for the double-layer gated recurrent unit.
The I-GRUs model proposed in this paper follows the encoder-decoder architecture and consists of Inceptionv4 and a double-layer gated recurrent unit (GRUs). As shown in Fig. 2, in the encoder and image-feature-extraction stage, this paper uses Inceptionv4 pre-trained on the large-scale ImageNet dataset as the encoder to extract the visual feature information of the image. The attention mechanism [14] is incorporated into the feature extraction process, and the extracted visual feature information is sent to the GRUs network, which serves as the decoder. The decoder uses a double-layer GRUs structure to build the language model: the first GRUs layer receives the image feature information and the word embedding vectors produced by Word2vec, and performs a preliminary modeling and transformation of the features; the second GRUs layer receives the output of the first layer together with its own output from the previous time step to form multi-modal features, mapping the image features and the text features into the same feature space. The entire I-GRUs model is expressed as:

x_{-1} = CNN(I)
x_t = m_g w_t,  t ∈ {0, 1, ..., N−1}
p_{t+1} = GRUs(x_t),  t ∈ {0, 1, ..., N−1}

where I is the input image, x_{-1} is the extracted image feature, x_t is the input to the GRUs at each time step, N is the maximum number of time steps, m_g is the mapping matrix of the vocabulary, w_t is the Word2vec [15] code of the word entered into the GRUs network at each time step, w_0 is the start-of-sentence token, w_{N-1} is the end-of-sentence token, m_g w_t is the word embedding vector of the word, and p_{t+1} is the probability distribution over the vocabulary for the next time step.
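A schematic sketch of this decoding flow is given below. It shows only the data flow of the model, not its learned behavior: the trained GRU cells are replaced by a simple tanh update with the same interface, the attention step is omitted, and all sizes and weights are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, FEAT, HID = 50, 8, 12, 16          # toy sizes, illustration only

m_g   = rng.standard_normal((EMBED, VOCAB)) * 0.1  # vocabulary mapping matrix
W1    = rng.standard_normal((HID, HID + EMBED)) * 0.1  # layer-1 stand-in weights
W2    = rng.standard_normal((HID, HID + HID)) * 0.1    # layer-2 stand-in weights
W_img = rng.standard_normal((EMBED, FEAT)) * 0.1   # projects CNN features to x_-1
W_out = rng.standard_normal((VOCAB, HID)) * 0.1    # maps layer-2 state to logits

def step(W, h, x):
    """Stand-in for one trained GRU step (tanh RNN update, same interface)."""
    return np.tanh(W @ np.concatenate([h, x]))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode(image_feat, word_ids, N=5):
    h1, h2 = np.zeros(HID), np.zeros(HID)
    x = W_img @ image_feat                     # x_-1: image feature fed first
    probs = []
    for t in range(N):
        h1 = step(W1, h1, x)                   # layer 1: image/word features
        h2 = step(W2, h2, h1)                  # layer 2: multi-modal fusion
        probs.append(softmax(W_out @ h2))      # p_{t+1}: next-word distribution
        one_hot = np.zeros(VOCAB)
        one_hot[word_ids[t]] = 1.0
        x = m_g @ one_hot                      # x_t = m_g w_t (word embedding)
    return probs

probs = decode(rng.standard_normal(FEAT), word_ids=[1, 2, 3, 4, 5])
```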

Dataset
The AI Challenge dataset [16] used in this article is a large-scale, manually annotated Chinese dataset released for the 2017 AI Challenge competition. It contains 210,000 images in the training set, 30,000 images in the validation set, and 30,000 images in the test set.
Based on the AI Challenge dataset, this paper augments it with geometric image transformations (flipping, rotation, cropping, deformation, scaling), so that the experimental dataset has 1.05 million images in the training set, 150,000 images in the validation set, and 150,000 images in the test set.
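A minimal sketch of such geometric augmentations on a NumPy image array follows; deformation and scaling are omitted for brevity, and the crop fraction is an arbitrary choice for illustration:

```python
import numpy as np

def augment(img):
    """Return several geometric variants of an H x W x C image array."""
    h, w = img.shape[:2]
    return [
        img,                                   # original
        img[:, ::-1],                          # horizontal flip
        img[::-1, :],                          # vertical flip
        np.rot90(img),                         # 90-degree rotation
        img[h // 8 : h - h // 8,               # central crop (3/4 of each side)
            w // 8 : w - w // 8],
    ]

img = np.zeros((32, 32, 3), dtype=np.uint8)    # dummy 32x32 RGB image
views = augment(img)
```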

The Experimental Results
During the experiments, the deep learning framework used in this article is TensorFlow 1.8. The input image size of the Inceptionv4 network is 299×299×3, the language model is built with double-layer gated recurrent units (GRUs), and cross-entropy is used as the loss function. Instead of the baseline SGD optimizer [19], Adam [18] is used as the optimizer.
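For reference, the per-token cross-entropy loss on the decoder's outputs can be sketched in NumPy as follows (the Adam optimizer is then applied to the gradients of this loss; the uniform toy logits below are illustrative):

```python
import numpy as np

def softmax_cross_entropy(logits, target_ids):
    """Mean per-token cross-entropy, as used to train the caption decoder.
    logits: (T, V) unnormalized scores; target_ids: (T,) ground-truth word ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# A model that predicts uniformly over a 4-word vocabulary has loss ln 4.
logits = np.log(np.full((3, 4), 0.25))
loss = softmax_cross_entropy(logits, np.array([0, 1, 2]))
```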
The experimental results in this paper are evaluated with four metrics: BLEU, METEOR, CIDEr, and Perplexity [9]. BLEU is the current standard for machine translation evaluation, but it is a metric that focuses only on precision. METEOR considers not only the semantic accuracy of the generated sentence but also its recall. CIDEr reflects the semantic similarity between the generated sentence and the reference sentences. Perplexity is commonly used in natural language processing to evaluate the quality of a trained language model: the greater the sentence probability, the smaller the perplexity, and the better the language model. In this article, Tab
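Perplexity can be computed directly from the per-token log-probabilities the language model assigns to a reference sentence; the toy probabilities below are illustrative:

```python
import math

def perplexity(token_log_probs):
    """Sentence perplexity from per-token natural-log probabilities:
    PPL = exp(-mean log p). Higher sentence probability -> lower perplexity."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

confident = [math.log(0.9)] * 5   # model assigns 0.9 to every reference token
uncertain = [math.log(0.1)] * 5   # model assigns 0.1 to every reference token
```

With these inputs the confident model scores a perplexity of about 1.11 while the uncertain one scores 10, matching the rule that a better language model has lower perplexity.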

The Visual Results
We use the trained Inceptionv3-LSTM, Inceptionv3-Attention-LSTM and I-GRUs networks to build a clean, concise and user-friendly interactive software interface with Python's PyQt5 framework. The program is based on Qt's signal-slot mechanism and runs robustly. Qt can connect a signal function to multiple corresponding slot functions; when the program runs and a signal is emitted, all slot functions bound to that signal are executed. The buttons of the software interface are simply designed with a friendly graphical layout, which displays the detection results intuitively and meets users' needs. Processing a picture from input to result takes only 2-3 seconds, and the software is robust and adapts to the needs of different environments, as shown in Fig. 3.
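The signal-slot mechanism described above can be sketched in plain Python; this is an illustrative stand-in for the behavior, not the PyQt5 implementation, and the `caption_ready` signal and its slot actions are hypothetical:

```python
class Signal:
    """Minimal Qt-style signal: one signal may be connected to several slots,
    and emitting it calls every connected slot in connection order."""
    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in self._slots:
            slot(*args)

log = []
caption_ready = Signal()                                   # hypothetical signal
caption_ready.connect(lambda text: log.append(("display", text)))
caption_ready.connect(lambda text: log.append(("save", text)))
caption_ready.emit("一个女孩在海边奔跑")                     # both bound slots run
```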

Conclusion
Based on the encoder-decoder structure of the NIC model, this paper designs a new Chinese image captioning model that uses the AI Challenge dataset expanded by us, the Inceptionv4 network integrated with an attention mechanism, and a double-layer gated recurrent unit (GRU) network. Our model outperforms the other network models on BLEU, METEOR, CIDEr and Perplexity, and the generated image caption sentences are more accurate. Finally, we use PyQt to build a graphical user interface that clearly conveys the generated Chinese caption sentences to users. This research can help people with visual impairment perceive image content, is of significance to the early education of infants and to image retrieval, and has great reference value for future research on neural network models for Chinese image captioning.