Sentiment analysis of commodity reviews based on ALBERT-LSTM

Sentiment analysis of product reviews has become an important research direction in NLP: it helps merchants better understand user preferences and provides a decision-making basis for other users when purchasing products. Existing deep-learning-based sentiment analysis models for product reviews mostly use traditional word vector models, which struggle to capture the contextual semantics of words and suffer from the polysemy problem (one word, multiple meanings). This paper proposes a product review sentiment analysis model combining ALBERT and LSTM. First, the ALBERT pre-trained model is used to obtain word vectors containing sequence and semantic information. Then an LSTM model, which can capture long-distance semantic features, is trained on these vectors. Finally, the sentiment polarity of product reviews is classified and output by a sentiment polarity discrimination layer. Experimental results on a dataset of digital product reviews from JD Mall show that the F1 value of the ALBERT-LSTM model improves over both the ALBERT and LSTM models alone.


Introduction
With the rapid development of the Internet and e-commerce platforms, convenient and fast online shopping has attracted more and more consumers. People's consumption behavior on these platforms generates large amounts of data, such as purchase histories and review information. These data reflect consumers' true opinions and feelings about products, so efficiently and accurately mining their sentiment tendencies is very important for merchants' precision marketing.
Review text sentiment analysis uses algorithms to analyze and summarize the subjective review text posted by users and extract the users' sentiment tendency. Existing sentiment analysis methods fall into three main types: 1) Sentiment-dictionary-based. The sentiment of the review text is calculated according to the polarity intensity of sentiment words marked in an existing sentiment dictionary. Dictionary-based methods require a large amount of manual work, carry a domain bias, and their accuracy drops when applied across domains.
2) Machine-learning-based. After the review text is vectorized, it is classified with machine learning algorithms to obtain sentiment tendencies. Commonly used machine learning algorithms include naive Bayes, support vector machines, K-nearest neighbors, and maximum entropy. These methods perform well, but they rely heavily on feature selection and parameter tuning; selecting the most suitable parameter settings requires extensive testing. 3) Deep-learning-based. Because deep learning avoids the limitations of the above two approaches, it is widely used in sentiment analysis. This type of method uses word embedding technology (e.g., Word2Vec) to convert the review text into word vectors containing semantic information, and then uses CNN, RNN, LSTM and other neural networks to analyze the word vectors and obtain the text sentiment. Many deep learning models have achieved good results in text sentiment analysis. However, most existing deep learning models use static word embeddings, cannot encode words dynamically based on contextual semantics, and therefore suffer from the polysemy problem.
For this reason, this paper proposes a sentiment analysis model combining ALBERT and LSTM. ALBERT obtains word vectors and sentence relations enriched with semantic information, while LSTM captures contextual information over long distances. Experimental results show that the ALBERT-LSTM model performs well in sentiment analysis of product reviews.

Related work
Obtaining high-quality word vectors through a word embedding model is the premise of using deep learning to classify the sentiment polarity of text. Google's word2vec, launched in 2013, uses CBOW and Skip-gram to predict the semantics of words from the contextual information of the review text and convert them into word vectors. Word2vec removes redundant information and converts sparse one-hot word vectors into low-dimensional vectors that can express semantic information. However, Word2vec represents each word with a single static vector and therefore cannot solve the polysemy problem. In 2018, Google proposed BERT, a deep bidirectional Transformer pre-training model for language understanding [1]. The multi-head attention mechanism in BERT uses the context of the current word to extract features in parallel and dynamically adjusts the word vector according to different contexts, which solves the polysemy problem in word2vec.
The BERT model has good generalization ability. Fine-tuning the pre-trained BERT model for the current task yields task-specific word vectors, which can be conveniently used for downstream NLP tasks. Shi et al. proposed a BERT-CNN model combining pre-trained BERT and a convolutional neural network to perform sentiment analysis on JD mobile phone reviews, achieving better results than either BERT or TextCNN alone [2]. Li et al. merged a convolutional neural network with BiLSTM: the convolutional neural network extracts local features, while BiLSTM compensates for the CNN's neglect of contextual word meaning, improving the accuracy of text classification [3]. Dong et al. used BERT and CNN models to classify the sentiment polarity of product reviews [4]. The difference between this paper and those studies is that this paper uses the ALBERT model to obtain word vectors, whose resource consumption is greatly reduced compared with the BERT model.

ALBERT-LSTM model
This paper analyzes the sentiment polarity expressed in product review text. ALBERT is used as the word vector model and combined with an LSTM to form the ALBERT-LSTM model, which classifies the sentiment polarity of product reviews. The ALBERT-LSTM model is mainly composed of the ALBERT layer, the LSTM layer and the emotional tendency discrimination layer. Its framework is shown in Figure 1.

Figure 1. ALBERT-LSTM model framework.

ALBERT layer
In order to classify the sentiment of the review text, it first needs to be transformed into a vector form.
BERT is the first unsupervised, deeply bidirectional system for pre-training NLP representations. Compared with other methods, it enhances the generalization ability of the word vector model and captures relationships between sentences. However, BERT has a very large number of parameters and a long training time. Therefore, Lan et al. proposed the ALBERT model, which improves on BERT by greatly reducing the parameter scale while maintaining the training effect [5]. Like BERT, ALBERT uses a bidirectional Transformer structure. ALBERT's improvements over BERT fall into four main points:
• Factorization of embedding-layer parameters. In BERT, the word vector dimension E is the same as the encoder output dimension H. ALBERT argues that the hidden-layer output also carries contextual information beyond the original meaning of the word, so the word vector dimension can be reduced. Through factorization, the one-hot vector is first mapped into a low-dimensional space of size E and then projected up to the hidden dimension H, greatly reducing the parameter scale with little information loss.
• Cross-layer parameter sharing. The attention parameters of different BERT layers are observed to be similar, so ALBERT shares the parameters of the fully connected and attention layers across layers, avoiding the growth of the parameter scale as network depth increases. This not only stabilizes the network parameters but also further reduces the parameter scale.
• Sentence-coherence loss. The NSP task in BERT mixes two subtasks: topic prediction and coherence prediction. Topic prediction is much simpler than coherence prediction and overlaps with the masked language model loss, so ALBERT replaces the NSP task with the sentence-order prediction (SOP) task.
• Removing dropout. Dropout is mainly used to prevent overfitting, and ALBERT does not exhibit overfitting problems. Therefore, ALBERT removes the dropout part.
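To make the embedding factorization concrete, a small back-of-the-envelope calculation shows the parameter reduction. The values of V, E and H below are illustrative assumptions (roughly the BERT/ALBERT base configuration), not figures taken from the paper:

```python
# Parameter-count comparison for ALBERT's embedding factorization.
V = 30000   # vocabulary size (illustrative assumption)
H = 768     # hidden (encoder) dimension
E = 128     # factorized embedding dimension used by ALBERT

params_bert = V * H               # direct V x H embedding matrix
params_albert = V * E + E * H     # V x E projection, then E x H up-projection

print(params_bert)    # 23040000
print(params_albert)  # 3938304
```

With these dimensions the embedding table shrinks by roughly a factor of six, which is the main source of ALBERT's parameter savings at the input layer.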
The training of ALBERT-series models requires large amounts of data and powerful computing capability. Therefore, for product review sentiment analysis, the pre-trained ALBERT models released by Google are used directly. The parameters of the four ALBERT models provided by Google are shown in Table 1. Limited by hardware conditions, this paper selects the ALBERT base model, which has the fewest parameters.

LSTM layer
The LSTM unit controls the flow of information through a forget gate, an input gate and an output gate. The forget gate decides which information is discarded from the cell state, as in equation (1):

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)  (1)

The input gate is used to update the cell state: the information to be retained is obtained by the sigmoid function, as in equation (2), and the temporary cell state c̃_t is obtained by the tanh function, as in equation (3); the cell state c_t is then updated as in equation (4). The output gate determines the influence of c_t on the hidden state h_t, as in equation (5), and the output h_t of the unit is obtained from o_t and c_t, as in equation (6):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)  (2)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)  (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t  (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)  (5)
h_t = o_t ⊙ tanh(c_t)  (6)

Where σ is the sigmoid activation function, W is a weight matrix, and b is a bias vector. The LSTM layer takes the word vector matrix output by the ALBERT layer as input and, through multiple LSTM hidden units, outputs a set of feature vectors.
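A minimal NumPy sketch of one LSTM step, implementing equations (1)–(6); the stacked weight layout (the four gate matrices concatenated into one W) and the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equations (1)-(6).
    W maps the concatenated [h_{t-1}, x_t] to the four gate
    pre-activations (forget, input, candidate, output)."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])                 # forget gate, eq. (1)
    i = sigmoid(z[H:2*H])               # input gate, eq. (2)
    c_tilde = np.tanh(z[2*H:3*H])       # temporary cell state, eq. (3)
    c = f * c_prev + i * c_tilde        # cell state update, eq. (4)
    o = sigmoid(z[3*H:4*H])             # output gate, eq. (5)
    h = o * np.tanh(c)                  # hidden state, eq. (6)
    return h, c

# Run a short random sequence through the cell.
rng = np.random.default_rng(0)
H, D = 4, 3                             # hidden and input sizes (illustrative)
W = rng.normal(size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)  # (4,)
```

Since h = o ⊙ tanh(c) with o in (0, 1) and tanh in (−1, 1), every component of the hidden state stays strictly inside (−1, 1).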

Emotional tendency discrimination layer
The emotional tendency discrimination layer mainly includes two parts: a fully connected layer and softmax. First, the feature vector set {h_1, h_2, ..., h_d} representing the entire text, obtained by the LSTM, is input to the fully connected layer. The fully connected layer multiplies the input vector by a weight matrix and adds a bias, mapping d real numbers to K real numbers, as shown in equation (7):

z = W h + b  (7)

Where W is the weight matrix and b is the bias vector. Then softmax maps these real numbers to a probability distribution summing to 1, giving the final sentiment polarity classification, as shown in equation (8):

y_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j)  (8)
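The discrimination layer is small enough to sketch directly. Assuming illustrative dimensions d and K (and random weights, since the trained parameters are not given), a NumPy version of the fully connected layer and softmax is:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h, W, b):
    """Fully connected layer (eq. 7) followed by softmax (eq. 8):
    maps a d-dimensional feature vector to K class probabilities."""
    return softmax(W @ h + b)

d, K = 128, 2                          # feature size and number of polarities
rng = np.random.default_rng(1)
W = rng.normal(size=(K, d))
b = np.zeros(K)
h = rng.normal(size=d)

p = classify(h, W, b)
print(p.sum())                         # probabilities sum to 1
```

The predicted polarity is simply the argmax of p; with K = 2 this gives a positive/negative decision.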

Dataset
The experimental dataset consists of user reviews of digital products from JD Mall. It contains 10,000 reviews, including 5,000 positive and 5,000 negative reviews. The entire dataset is divided into training, test and validation sets at a ratio of 6:2:2.
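The 6:2:2 split can be sketched in a few lines of Python; the fixed seed and the synthetic review tuples below are illustrative assumptions, not the paper's actual data handling:

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split samples into train/test/validation at 6:2:2."""
    samples = samples[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_test = int(n * 0.6), int(n * 0.2)
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    valid = samples[n_train + n_test:]
    return train, test, valid

# 10,000 synthetic (text, label) pairs; label 1 = positive, 0 = negative.
data = [("review %d" % i, i % 2) for i in range(10000)]
train, test, valid = split_dataset(data)
print(len(train), len(test), len(valid))      # 6000 2000 2000
```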

Experimental environment and parameters
Experimental operating environment: the operating system is Ubuntu 18.04, the memory size is 16 GB, the deep learning framework is TensorFlow 1.12.0, and the programming language is Python 3.6.
Experimental parameters: The pre-training model is ALBERT Base, the vector dimension is 768, the number of layers is 12, the LSTM hidden layer dimension is 128, and the dropout is 0.5.

Evaluation index
In order to evaluate the classification effect of the model, precision (P), recall (R) and the F1 value are used as evaluation indicators. The confusion matrix is a table layout that places predicted results against true results to clearly tally the model's classifications, as shown in Table 2.

Table 2. Confusion matrix.
                  Predicted positive   Predicted negative
Actual positive   TP                   FN
Actual negative   FP                   TN

In Table 2, TP represents positive examples predicted as positive, FP negative examples predicted as positive, FN positive examples predicted as negative, and TN negative examples predicted as negative.
Precision and recall are computed as P = TP / (TP + FP) and R = TP / (TP + FN). There can be a conflict between P and R; in order to take both indicators into account, the F1 value, their harmonic mean, is introduced: F1 = 2PR / (P + R).
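The three indicators can be computed directly from the confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the formulas:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the F1 value
    (the harmonic mean of precision and recall)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical confusion-matrix counts for illustration.
p, r, f1 = precision_recall_f1(tp=900, fp=100, fn=150)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.857 0.878
```

Note that F1 is always between min(P, R) and max(P, R), and it penalizes a large imbalance between the two more strongly than the arithmetic mean would.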

Experimental results and analysis
In order to verify the effectiveness of the model, the ALBERT-LSTM model is compared with the ALBERT model and the LSTM model. The experimental results are shown in Table 3. Comparing the classification results of the three models, the ALBERT-LSTM model proposed in this paper outperforms both the single ALBERT model and the single LSTM model on all three indicators. This is because the ALBERT-LSTM model uses the ALBERT pre-trained model to encode word vectors, giving it stronger feature extraction capability. In addition, the LSTM model can better exploit the contextual information of the text, further improving the sentiment analysis accuracy over the ALBERT model alone.

Conclusion
Aiming at the task of sentiment analysis of product review text, this paper uses the ALBERT pre-trained language model to obtain contextual word vectors and combines it with the classic LSTM neural network to construct the ALBERT-LSTM model. Evaluated on the JD digital product review dataset, the proposed model achieves the best results among the compared models. The advantage of the ALBERT-LSTM model is that ALBERT pre-trains on contextual semantic information, learning word-level features, syntactic structure and contextual semantics, which gives it better performance than other word vector models. Meanwhile, training the LSTM to learn the semantic information in the ALBERT word vectors further improves the effect of product review sentiment analysis. In future work, we plan to apply the analyzed sentiment tendencies to the classification of review users, in order to provide a basis for business promotion decisions.