Cross-Domain Sequential Recommendation Based on Self-Attention and Transfer Learning

Cross-domain recommendation provides effective solutions to the problems of cold start and data sparsity by transferring information from one or more source domains to a target domain. However, people's interactive behaviour usually occurs in a sequence, yet most cross-domain recommendation models ignore this sequential information. At the same time, most sequential recommendation models only consider the single-domain case. In this paper, we propose a new cross-domain sequential recommendation model (SATLR) based on self-attention and transfer learning, which makes recommendations based on user preferences and sequential dependencies learned across domains. The experiment results on three groups of datasets from Amazon demonstrate that our model outperforms other recommendation models.


Introduction
With the development of information technology, it is becoming increasingly difficult for people to obtain the information they need from massive data. To solve this problem of information overload, recommender systems were proposed [1]. Traditional collaborative filtering relies mainly on the user's historical feedback, but it is difficult to make recommendations for new users or new items without such feedback, leading to problems such as cold start and data sparsity [2]. Cross-domain recommendation has become a solution to these problems [3]. It transfers information from one or more source domains to a target domain to alleviate the lack of data [4]. Studies have shown that the recommendation effect can be improved by combining information from different domains [5] [6].
The items purchased by one user usually come from multiple domains. As shown in Figure 1, one user may read several books and also watch a few movies in a certain period of time. User behaviour in different domains is often not independent, and much of it is related to user preferences. In cross-domain recommendation, user preferences are divided into two parts: the user preferences within each domain, and the cross-domain overlapping user preferences. However, most current cross-domain recommendation methods [7] [8] focus on factors such as item features and user preferences, ignoring the information in the user's behaviour sequences. The user's shopping behaviour usually occurs in a sequence, which means that the next behaviour may depend on the previous one. Sequential recommendation can capture the user's habits, changes in user preferences, etc., and based on the user's historical interactions it can improve the recommendation effect [9]. Current sequential recommendation methods include [10] [11], but most existing models only consider the user's interactions within the same domain, without the connections between different domains. The interaction sequences in other domains may also affect the interaction behaviour in this domain; for example, after watching a movie, a user may continue to read its original novel. Therefore, we combine sequential recommendation with cross-domain recommendation, addressing the shortcomings that most cross-domain recommendation models do not consider sequential dependencies and most sequential recommendation models only consider single-domain sequences, so as to obtain better recommendation results based on richer information.
In this paper, we propose a new cross-domain sequential recommendation model (SATLR) based on self-attention and transfer learning, which incorporates the interaction sequences into cross-domain recommendation. In addition to user preferences, the sequential dependencies of each domain also participate in the transfer. We first model the sequences of each domain, and then incorporate the sequential-dependency information into the encoding of the user's interaction behaviour. Transfer learning is adopted to model the joint distribution of information in different domains, so as to transfer both user preferences and sequential dependencies between domains.
In general, our contributions are summarized as follows:
- We propose a new cross-domain sequential recommendation model. The interaction sequences are integrated into cross-domain recommendation, so that both user preferences and sequential dependencies are involved in transfer learning.
- We integrate cross-domain recommendation and sequential recommendation, retaining the advantages of both and building a bridge between the two.
- The experiment results show that our model improves HR, NDCG and MRR compared to several single-domain and cross-domain recommendation models.

Related work
In this section, we provide a brief review of related works, divided into the following three parts.

Cross-domain recommendation
Generally, cross-domain recommendation can be divided into two categories [4]: one aggregates information across different domains [12], and the other transfers information from the source domain to the target domain by sharing latent features or rating patterns [13] [14]. To build more complex cross-domain models, several methods based on deep learning have been proposed. PPGN [15] constructs a preference propagation graph network to propagate user preferences. JSCN [8] uses a joint spectral convolution network to fuse information from multiple domains. BITGCF [16] proposes a novel bi-directional transfer learning method. In addition, CD-DNN [17] incorporates the user's reviews into cross-domain recommendation.
However, most previous studies mainly consider user preferences and item features in the transfer between domains, and seldom consider the user's interaction sequences.

Sequential Recommendation
In recent years, sequential recommendation has also been a research hotspot. It treats the interactions as dynamic sequences and transforms the sequential dependencies into dynamic user preferences to make better recommendations [18].
For low-order sequential dependencies, Garcin et al. [10] adopt the Markov model, and Hidasi et al. [11] use factorization machines for modeling, decomposing the interaction matrix into two low-rank matrices as latent factors. To learn high-order sequential dependencies, current models mainly use high-order Markov models [19], recurrent neural networks [20], etc. Recently, mainstream sequential recommendation models have also adopted several sequence modeling techniques that are widely used in the field of natural language processing (NLP) [21] [22]. But most existing models only consider the user's interaction history within one domain.

Cross-domain sequential recommendation
Recently, several studies have also attempted to consider both cross-domain recommendation and sequential recommendation. CDNST [23] uses the cross-domain novelty-seeking trait for sequential recommendation, modeling the user's preference tendency from the perspective of psychology according to their historical behaviour. SCLSTM [24] adds the rating similarity, feature similarity and time similarity across different domains to the sequences to make sequential recommendations, and improves the LSTM accordingly.
These models rely on the features of the items, whereas our model focuses on the interactions between users and items.

Method
In this section, we formulate the problem and introduce the details about the model SATLR we proposed.

Problem formulation
In this paper, the cross-domain sequential recommendation problem is first formulated over triplets of the form (u, i, t). The set of users shared by the two domains X and Y is denoted as U, so u ∈ U = {u₁, u₂, … , uⱼ}, where j represents the number of users. Which domain an item comes from is denoted as a, and the item set of each domain is denoted as Iᵃ. The item sets of domains X and Y are respectively denoted as iˣ ∈ Iˣ = {i₁ˣ, i₂ˣ, … , iₙˣ} and iʸ ∈ Iʸ = {i₁ʸ, i₂ʸ, … , iₘʸ}, where n and m are the numbers of items in domains X and Y respectively. The user-item interactions of domains X and Y are respectively represented by the interaction matrices Rˣ ∈ ℝ^(j×n) and Rʸ ∈ ℝ^(j×m). The timestamp of an interaction is denoted as t. A triplet (u, iᵃ, t) therefore records that in domain a, the user u interacted with the item iᵃ at time t. S represents the set of interaction sequences. By sorting the triplets of user u in each domain by timestamp, we obtain the sequence Sᵤᵃ = (s₁, s₂, … , s_b) ∈ S, where b represents the length of the interaction sequence.
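The construction of per-user, per-domain interaction sequences from timestamped triplets can be sketched as follows; the user and item names are illustrative only:

```python
from collections import defaultdict

def build_sequences(interactions):
    """Group (user, item, timestamp) triplets by user and sort each
    user's items by timestamp to obtain that user's interaction sequence."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    return {u: [item for _, item in sorted(events)]
            for u, events in by_user.items()}

# Toy triplets (u, i, t) for one domain.
logs = [("u1", "i3", 30), ("u1", "i1", 10), ("u1", "i2", 20), ("u2", "i5", 5)]
seqs = build_sequences(logs)
# seqs["u1"] -> ["i1", "i2", "i3"]
```

The same routine is applied to each domain separately, yielding one sequence per user per domain.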

Overview
In this paper, we first model the user's interaction sequences in different domains. This part of the modeling is based on self-attention. The information of sequential dependencies is merged with the user-item interaction information, and it is transferred between the two domains by transfer learning. Encoders and decoders are used to encode and decode the embeddings in the cross-domain transfer. Figure 2 shows the overall framework of the proposed model SATLR.
In the following parts, we introduce the details of the modeling.

Figure 2. The framework of SATLR.

Sequence modeling
For each domain, we build its own embedding layer. Figure 3 shows the framework of sequence modeling based on self-attention. Taking domain X as an example, the item embedding matrix of domain X is denoted as Mˣ ∈ ℝ^(n×d), where d is the dimension of the latent embedding vectors. According to the interaction sequence Sᵤˣ, the input sequence s = (s₁, s₂, … , s_N) is obtained, where N represents the length of the input sequence in the embedding layer. If the length of Sᵤˣ is greater than N, only the most recent N actions are kept; if it is less than N, the sequence is padded with 0 on the left until its length is N. In other words, only the latest N interactions are considered, since earlier interactions are assumed to have little reference value for the sequence. The input embedding matrix is denoted as E ∈ ℝ^(N×d).

Self-attention uses global information: it integrates all relevant interactions into the interaction being processed, but it cannot by itself perceive the positions of the interactions in the sequence. Therefore, a position embedding is added to the item embedding, and the resulting embedding is denoted as Ê. The scaled dot-product attention [21] is defined as:

Attention(Q, K, V) = softmax(QKᵀ / √d)V    (1)

where Q, K and V represent the queries, keys and values respectively, and the scaling factor √d prevents the inner product from growing too large, producing more stable gradients.
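The truncate-or-pad preprocessing and the scaled dot-product attention of Eq. (1) can be sketched as below; this is a minimal NumPy illustration, not the trained model:

```python
import numpy as np

def truncate_or_pad(seq, N, pad=0):
    """Keep the most recent N interactions; left-pad with `pad` to length N."""
    seq = list(seq)[-N:]
    return [pad] * (N - len(seq)) + seq

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V  (Eq. 1)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Each output row is a convex combination of the value rows, weighted by the softmax-normalized query-key similarities.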
The self-attention layer [25] is defined as:

S = SA(Ê) = Attention(ÊWᵠ, ÊWᵏ, ÊWᵛ)    (2)

where Wᵠ, Wᵏ, Wᵛ ∈ ℝ^(d×d) are learnable linear projection matrices, which are adjusted during model training.
The point-wise two-layer feed-forward network (FFN) [25] is defined as:

F = FFN(S) = ReLU(SW₁ + b₁)W₂ + b₂    (3)

where W₁, W₂ ∈ ℝ^(d×d), and b₁, b₂ are d-dimensional bias vectors. As shown in Figure 3, one block consists of a self-attention layer and a feed-forward network, and blocks can be stacked multiple times.
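A stacked self-attention block, Eqs. (2)-(3), can be sketched as follows; the random parameters stand in for the learned weights, and residual connections and layer normalization used in practice are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 5  # embedding dimension and input sequence length

def self_attention(E, Wq, Wk, Wv):
    """S = SA(E) = Attention(E Wq, E Wk, E Wv)  (Eq. 2)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def ffn(S, W1, b1, W2, b2):
    """Point-wise two-layer feed-forward network with ReLU  (Eq. 3)."""
    return np.maximum(S @ W1 + b1, 0.0) @ W2 + b2

def block(E, params):
    """One block: a self-attention layer followed by an FFN."""
    Wq, Wk, Wv, W1, b1, W2, b2 = params
    return ffn(self_attention(E, Wq, Wk, Wv), W1, b1, W2, b2)

params = tuple(rng.standard_normal(s) for s in
               [(d, d)] * 3 + [(d, d), (d,), (d, d), (d,)])
E = rng.standard_normal((N, d))       # item + position embeddings
out = block(block(E, params), params)  # two stacked blocks
```

Stacking the block preserves the (N, d) shape, so any number of blocks can be composed.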

Joint distribution modeling
For the joint distribution modeling, xᵤ and yᵤ are used to represent the interaction behaviour of user u in domains X and Y; they can be regarded as the row vectors of the matrices Rˣ and Rʸ. As shown in Figure 2, the outputs of xᵤ and yᵤ encoded by the encoders Eₓ and E_y are denoted as zₓ and z_y respectively. The vectors describing the user's interaction sequences are obtained by sequence modeling, and the new vectors produced by the fully connected layer are denoted as hₓ and h_y. zₓ and z_y are concatenated with the corresponding hₓ and h_y, and the resulting vectors are denoted as cₓ and c_y. When concatenating, we provide the learnable matrices Wₓ and W_y for the user:

cₓ = Wₓ[zₓ; hₓ],  c_y = W_y[z_y; h_y]    (4)

cₓ and c_y contain not only the interaction information that reflects user preferences, but also the sequential information that reflects sequential dependencies. The transfer of cₓ and c_y between the two domains [26] is defined as:

c_{x→y} = cₓM,  c_{y→x} = c_yMᵀ    (5)

where M ∈ ℝ^(d×d) represents the orthogonal mapping matrix. The decoder Dₓ is used to decode cₓ and c_{y→x} into x̂ᵤ, the reconstruction of xᵤ. Similarly, the decoder D_y is used to decode c_y and c_{x→y} into the reconstructed ŷᵤ. Eₓ, E_y, Dₓ and D_y are all two-layer MLPs, and ReLU is used as the activation function.
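The key property of the orthogonal mapping M in Eq. (5) is that its transpose is its inverse, so transferring a code to the other domain and back is lossless. A minimal sketch (here M is produced by a QR decomposition; in the model it is a learned parameter constrained to be orthogonal):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# QR decomposition of a random square matrix yields an orthogonal Q factor.
M, _ = np.linalg.qr(rng.standard_normal((d, d)))

z_x = rng.standard_normal(d)  # latent code of a user's behaviour in domain X
z_xy = z_x @ M                # transferred code: X -> Y
z_back = z_xy @ M.T           # the inverse transfer is simply M^T
```

Because M Mᵀ = I, z_back recovers z_x exactly, which is what makes the bi-directional transfer in Eq. (5) consistent.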
The interactions of the same user in the two domains are represented by a pair of samples (xᵤ, yᵤ), and p(xᵤ, yᵤ) represents their joint probability density function, which is modeled following [26].

Objective function and loss calculation
In this paper, the loss function is divided into three parts: the sequence modeling loss, the joint reconstruction loss and the prior regularization loss; they are introduced in detail below. For the first part, following [25], the binary cross-entropy loss is adopted as the objective function:

L_seq = − Σ_s Σ_{t=1}^{N} [ log σ(r_{s_t,t}) + log(1 − σ(r_{e,t})) ]    (6)

where r_{j,t} represents the relevance score of the j-th item being the next interacted item, calculated in the prediction layer according to the output of the FFN, and e represents a negative item randomly sampled at each time step in each sequence.
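The binary cross-entropy objective of Eq. (6) can be sketched as follows, taking the relevance scores of the positive next-items and the sampled negatives as inputs:

```python
import math

def bce_loss(pos_scores, neg_scores):
    """Binary cross-entropy over positive next-items and sampled negatives:
    -sum(log sigmoid(r_pos)) - sum(log(1 - sigmoid(r_neg)))."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    loss = 0.0
    for p in pos_scores:
        loss -= math.log(sig(p))
    for n in neg_scores:
        loss -= math.log(1.0 - sig(n))
    return loss
```

A score of 0 gives sigmoid 0.5 on both sides, so each term contributes log 2 to the loss; higher positive scores and lower negative scores reduce it.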
The objective for the joint distribution takes the form

L_joint = L_rec + L_prior    (7)

where the first term on the right side of the equation represents the joint reconstruction loss and the second term represents the prior regularization loss of zₓ and z_y. We make the following proposition: given zₓ, xᵤ and yᵤ are conditionally independent, and given z_y, xᵤ and yᵤ are conditionally independent. Therefore, referring to [26], the joint reconstruction loss is defined over the reconstructions of both xᵤ and yᵤ from each of zₓ and z_y, and the prior regularization loss is estimated with discriminators parameterized by Фₓ and Ф_y. The discriminators are two-layer MLPs with ReLU as the activation function, and Gaussian distributions are used as the priors.
The total loss is calculated as:

L = αL_seq + βL_rec + γL_prior    (11)

where α, β and γ are adjustable parameters that respectively represent the weight of each loss.
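The weighted combination of Eq. (11) reduces to a simple weighted sum; here the sequence losses of the two domains are assumed to be added together before weighting:

```python
def total_loss(l_seq_x, l_seq_y, l_rec, l_prior,
               alpha=1.0, beta=1.0, gamma=1.0):
    """L = alpha * L_seq + beta * L_rec + gamma * L_prior  (Eq. 11),
    with L_seq taken as the sum of the two domains' sequence losses."""
    return alpha * (l_seq_x + l_seq_y) + beta * l_rec + gamma * l_prior
```

During training, the three weights trade off sequence modeling against cross-domain reconstruction and prior regularization.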

Experiments
In this section, we introduce the details of the datasets and the experiments, and give some analysis.

Dataset
We use three large datasets from Amazon: Movies and TV (Movie), Books (Book), and CDs and Vinyl (Music). They are grouped in pairs to form three cross-domain datasets for experiment and evaluation: Movie & Book, Book & Music and Movie & Music. In this paper, implicit feedback is used, that is, the user's ratings are treated as interactions. Ratings are integers from 0 to 5, and interactions with ratings from 3 to 5 are set as positive samples. The overlapping users of the two domains in each group of datasets are selected, and the users who only interact with a single domain are filtered out. We also filter out the users and items with fewer than 5 interactions in one domain. Table 1 shows the statistics of the datasets. Density reflects the ratio of the actual number of interactions to the number of possible interactions that all users could have on all items. It can be seen that the data is relatively sparse.
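The preprocessing described above can be sketched as two steps: selecting positive samples by rating threshold, then iteratively dropping low-activity users and items (iterative, because removing an item can push a user below the threshold):

```python
from collections import Counter

def positives(ratings, threshold=3):
    """Treat ratings of 3-5 as positive implicit feedback."""
    return [(u, i) for u, i, r in ratings if r >= threshold]

def filter_min_count(pairs, k=5):
    """Iteratively drop users and items with fewer than k interactions."""
    while True:
        uc = Counter(u for u, _ in pairs)
        ic = Counter(i for _, i in pairs)
        kept = [(u, i) for u, i in pairs if uc[u] >= k and ic[i] >= k]
        if len(kept) == len(pairs):
            return kept
        pairs = kept
```

The density in Table 1 is then the number of kept interactions divided by (number of users × number of items).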

Baselines
To illustrate the effectiveness of our model, we compare it with several single-domain recommendation methods (CDAE, CFVAE, CMF) and cross-domain recommendation methods (CoNet, DARec, DDTCDR, ETL):
- CDAE [28]: proposes a collaborative denoising auto-encoder for recommendation.
- CFVAE [29]: proposes a generative model with a multinomial likelihood, using Bayesian inference to estimate the parameters.
- CMF [30]: uses matrix factorization to handle one entity corresponding to multiple relations, sharing parameters among relations.
- CoNet [7]: uses a modified cross neural network to transfer knowledge between different domains.
- DARec [31]: adopts a deep domain adaptation model to learn domain-invariant user representations; it can extract and transfer patterns from rating matrices.
- DDTCDR [32]: adopts a dual learning mechanism, constructs feature embeddings with autoencoders, and learns a cross-domain latent orthogonal mapping to extract user preferences in multiple domains.
- ETL [26]: uses equivalent transformation to capture the overlapping attributes between different domains and the specific attributes of each domain.

Experiments and Results
We adjust various parameters and conduct multiple experiments. We set the learning rate ∈ {0.0008, 0.001, 0.002} and the number of hidden units ∈ {100, 120. Referring to [26], we randomly select one interaction of each user to form the validation set and another to form the test set, and randomly select 99 non-interacted items as negative samples. We use three evaluation indicators: hit ratio (HR), normalized discounted cumulative gain (NDCG) and mean reciprocal rank (MRR), taking the top-5 and top-10 of the predicted rankings for evaluation. The top-5 and top-10 values of the same indicator are denoted @5 and @10 respectively; for example, HR@5 is the HR computed over the top-5 predicted rankings.
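Under this leave-one-out protocol, each metric depends only on the rank of the held-out item among the 1 positive and 99 sampled negatives; a minimal sketch (item names and scores are illustrative):

```python
import math

def rank_of(target, scores):
    """1-based rank of the target item when candidates are sorted by score."""
    order = sorted(scores, key=scores.get, reverse=True)
    return order.index(target) + 1

def hr(rank, k):
    """Hit ratio: 1 if the positive item appears in the top-k."""
    return 1.0 if rank <= k else 0.0

def ndcg(rank, k):
    """NDCG with a single relevant item: 1 / log2(rank + 1) inside top-k."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def mrr(rank, k):
    """Reciprocal rank, truncated at top-k."""
    return 1.0 / rank if rank <= k else 0.0
```

The reported HR@k, NDCG@k and MRR@k are these per-user values averaged over all test users.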
For each pair of datasets, we inspect the recommendation results for each domain separately. Figures 4, 5 and 6 show the results of our experiments on the three cross-domain datasets respectively. For the models marked with (*), the results are taken from [26]. In each figure, each polyline represents the results of one indicator across the eight models, with the results of our SATLR at the rightmost end of the polylines. According to all of the HR, NDCG and MRR measures, SATLR achieves the best performance on all three cross-domain datasets.
Through the experiments, the best results on several evaluation indicators are obtained when the learning rate is 0.001, the sequence length is 200, the number of blocks is 2, α=1.0 and γ=1.0. Book & Music shows the largest improvement in HR@5, in the book domain, at 5.12%. It also shows the largest improvement in NDCG@5 in the book domain, at 6.64%, with NDCG@10 also increasing by 6.00%, and the largest increase in MRR@5 in the book domain, at 7.67%, with MRR@10 also increasing by 5.37%.
In general, these results show that cross-domain recommendation is improved greatly by combining the information of sequential dependencies and user preferences.

Conclusion
In this paper, we introduce a cross-domain sequential recommendation model based on self-attention and transfer learning. We model the sequences of interactions, adding sequential dependencies to cross-domain recommendation, so as to address the previous shortcoming that most cross-domain recommendation models only consider user preferences. We combine user preferences and sequential dependencies to obtain the patterns within each domain, and make better recommendations based on these patterns together with the information learned from other domains.
In our current work, the constructed sequences only consider the order of the user's interactions, not the time intervals between them. In future work, time-interval information could be added to the sequence modeling to achieve better results. Moreover, the current experiments only cover two domains; in the future, recommendations could be made across three or more domains.