Two-stream neural network with different gradient update strategies

Deep neural networks are affected by various kinds of noise in different scenarios. Traditional deep neural networks typically update their parameter weights with gradient descent algorithms; once the gradient falls within a certain range, they easily become trapped in local optima. Although momentum and related methods can escape local optima in some settings, they still have limitations that can substantially degrade performance in practical applications. To address these problems, a two-stream neural network with different gradient update strategies is proposed. By incorporating a gradient ascent algorithm, the method alleviates the tendency of deep neural networks to fall into local optima and increases their robustness to a certain extent. Experimental results on the CIFAR10 dataset verify that the proposed method improves the accuracy of various gradient descent optimizers, such as SGD, Adagrad, RMSprop and Adam, by about 1%. Experimental results on the COCO dataset show that the proposed method also improves accuracy over the baseline models PAA and EfficientDet. The proposed method can be applied to a wide range of neural network architectures and has good practical significance and application prospects.


Introduction
Deep learning (DL) has found extensive applications in domains such as computer vision [1], object tracking and detection [2], natural language processing [3], speech recognition [4], bioinformatics [5] and computational mechanics [6]. In supervised learning, the effectiveness of DL methods relies not only on the quality of the training data but also, to a significant degree, on the choice of loss function and parameter update method. Most existing optimization algorithms optimize towards the lowest point of the loss function. For non-convex functions, however, it is challenging to find the global optimum with this approach. To address the issue of existing optimizers getting trapped in local minima due to their greedy nature, this paper proposes a new optimization algorithm. Inspired by the Dropout method [7], the proposed algorithm essentially introduces additional random noise into the network, helping it escape local minima. This paper breaks away from the conventional pattern of deep learning optimizer design, in which gradient ascent is typically avoided because it implies training towards increasing error and worsening model performance. By introducing a gradient ascent algorithm, the opposite of the regular gradient descent algorithm, for parameter updates, random noise is injected into the network. Because two different parameter update methods are used, this paper proposes a parallel two-stream network structure.

The main contributions of this paper are as follows:
(1) A novel neural network optimization algorithm is designed, which combines the gradient ascent and gradient descent algorithms to effectively avoid getting trapped in local optima.
(2) A two-stream neural network architecture is proposed to implement the new optimization algorithm. The network is divided into a feature extraction part and a discriminator part, with the feature extraction part duplicated to form the two streams. One stream updates its parameters using the conventional gradient descent algorithm, while the other randomly selects either the gradient ascent or the gradient descent algorithm with a certain probability. The features obtained by merging the two streams are then input to the discriminator for further processing.
(3) The effectiveness of the proposed method is demonstrated on the CIFAR10 and COCO datasets, showing improved accuracy compared to baseline methods.

Related work
Stochastic Gradient Descent (SGD) [8] has traditionally been the most popular optimization algorithm for minimizing the loss function. However, SGD suffers from two main issues. First, SGD tends to get trapped in local minima and often oscillates near saddle points and in regions with very small gradient values, so parameter updates become extremely slow. Second, selecting an appropriate learning rate for SGD is often challenging and requires careful tuning: a learning rate that is too small leads to slow convergence, while one that is too large may prevent convergence altogether.
To address these issues, researchers have proposed several techniques. One popular approach is Gradient Descent with Momentum (GDM) [9], which reinforces updates along relevant directions and dampens oscillations along irrelevant ones, thereby accelerating SGD training and enabling escape from local minimum regions. To address the learning rate issue, further optimization algorithms have emerged that automatically adjust the learning rate based on the gradients, including Adagrad [10], RMSprop [11] and Adam [12]. A common issue with gradient-descent-based optimization algorithms is that they tend to get stuck in local optima. Beyond gradient-based methods, genetic algorithms and simulated annealing are two important heuristic global optimization algorithms. Genetic algorithms use population-based search and evolve successive generations according to the principle of survival of the fittest to eventually obtain an optimal or near-optimal solution. Simulated annealing simulates the thermal equilibrium process in solid materials and exploits the analogy between annealing and random-search optimization to find globally optimal or approximately optimal solutions. Some researchers have also proposed using genetic algorithms to optimize neural networks [13], using simulated annealing to find optimal solutions [14], or even combining both approaches [15]. In practical applications, genetic algorithms are prone to premature convergence. Simulated annealing has limited knowledge of the overall search space, which makes it difficult for the search to reach the most promising regions, resulting in relatively low computational efficiency.

Optimization algorithm design
The commonly used Dropout regularization method [7] deactivates certain neuron activations with a certain probability during forward propagation. Its essence is to inject randomness into the model by weakening the joint adaptability between neuron nodes, thereby enhancing the network's generalization ability. Analogously, the proposed gradient ascent stream interferes with the gradient descent process with a certain probability, making it highly unlikely that a point retains zero gradient under this noise. A stable model must therefore sit in a relatively flat region that is lower than its surroundings, which encourages the model to keep trying to escape the current local optimum and discover better solutions.
The improvements observed in the experimental results reported below validate the effectiveness of introducing the gradient ascent stream as random noise in the network. The introduced random noise indeed enhances the network's generalization ability, and updating parameters in the direction opposite to the gradient helps the current optimization algorithm escape local optima. This also indicates that existing gradient descent algorithms often tend to get trapped in local optima.

Building two-stream network
Since the new optimization algorithm requires combining two different update mechanisms, a two-stream neural network must be constructed. The model structures of the two streams are identical and can, in theory, be fused at any layer; in this paper, fusion is performed at the last layer of the feature extraction network. Fusing at different layers has two main consequences. First, it affects the number of parameters: since fully connected layers contain a large number of parameters, fusing before them reduces the parameter count to some extent. Second, the information fused at different layers differs: fusing within the feature extraction network yields more robust feature maps, whereas fusing at the fully connected layers mainly combines the classification scores or prediction results of the two networks. It is therefore necessary to separate the original network into a feature extraction part and a discriminator part (classifier, object detector, etc.). Since the method only requires duplicating the feature extraction part, it is not limited to specific networks and can be combined with any existing network, as the sketch below illustrates.
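As an illustration of this separation, the following PyTorch sketch wraps an arbitrary feature extractor and discriminator into a two-stream model. The class name TwoStreamNet, the argument alpha and the use of a single shared input are assumptions made for clarity, not details taken from the paper.

```python
import copy
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Sketch of the two-stream structure: the feature extraction part is
    duplicated and the two feature maps are fused before the discriminator."""

    def __init__(self, feature_extractor: nn.Module, discriminator: nn.Module, alpha: float = 0.8):
        super().__init__()
        self.descent_stream = feature_extractor                 # updated by gradient descent
        self.ascent_stream = copy.deepcopy(feature_extractor)   # updated by descent or ascent
        self.discriminator = discriminator                      # classifier, detector head, etc.
        self.alpha = alpha                                      # weight of the descent stream in fusion

    def forward(self, x, use_two_stream: bool = True):
        f_descent = self.descent_stream(x)
        if not use_two_stream:
            # single-stream phase: only the descent-stream features reach the discriminator
            return self.discriminator(f_descent)
        f_ascent = self.ascent_stream(x)
        fused = self.alpha * f_descent + (1.0 - self.alpha) * f_ascent
        return self.discriminator(fused)
```

Because only the feature extraction part is duplicated, any backbone that exposes this part can be wrapped in the same way.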
The gradient descent stream and the gradient ascent stream are each equipped with their own optimizer. In the gradient descent stream, the parameters are updated with the gradient descent algorithm at every iteration. The gradient ascent stream, in contrast, selects either the gradient descent or the gradient ascent algorithm with a specified probability at each iteration. If the gradient ascent algorithm were chosen at every step, the two streams would form completely opposing flows, which easily prevents the loss from decreasing and causes the model to oscillate constantly.
The gradient descent update is given by Equation (1):

$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t)$  (1)

The gradient ascent update is given by Equation (2):

$\theta_{t+1} = \theta_t + \eta \nabla_{\theta} J(\theta_t)$  (2)

where $\theta_{t+1}$ represents the updated values of the parameters to be learned in the deep network, $\theta_t$ represents their current values, $\eta$ denotes the step size, and $\nabla_{\theta} J(\theta_t)$ represents the partial derivative of the loss function $J$ with respect to the current parameters, i.e., the gradient at the current position.
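For concreteness, the two update rules can be written as a single manual parameter step, shown below. Treating the ascent probability as an argument p_ascent is my own framing of the random selection mechanism described above, not the authors' implementation.

```python
import random
import torch

def update_stream(parameters, lr: float, p_ascent: float = 0.0):
    """Apply one parameter update to a stream.

    With probability p_ascent the step follows Equation (2) (gradient ascent);
    otherwise it follows Equation (1) (gradient descent). p_ascent = 0 gives the
    ordinary gradient descent stream.
    """
    direction = 1.0 if random.random() < p_ascent else -1.0  # +1: ascent, -1: descent
    with torch.no_grad():
        for param in parameters:
            if param.grad is not None:
                param.add_(param.grad, alpha=direction * lr)
```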

Two-stream network fusion
When the gradient descent network stabilizes, i.e., when it no longer converges further, training of the gradient ascent stream begins, as indicated by the arrow in Figure 1. Because the two streams have identical structures, and to save training time and resources, the initial parameter values of the gradient ascent stream at the start of its training are set to the current parameter values of the gradient descent stream. During single-stream training, only the feature maps extracted by the gradient descent stream are input into the discriminator; once two-stream training starts, the fused feature maps from both streams are input into the discriminator. Taking inspiration from the fusion method in [16], this paper adopts weighted summation as the feature fusion approach.
The feature-weighted summation is given by Equation (3):

$F = \alpha F_{d} + (1 - \alpha) F_{a}$  (3)

where $F_{d}$ and $F_{a}$ are the feature maps produced by the gradient descent stream and the gradient ascent stream respectively, and $\alpha$ represents the proportion of the gradient descent stream's result in the fusion. As $\alpha$ increases, the fusion result leans more toward the gradient descent stream.
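A possible training schedule combining the single-stream and two-stream phases is sketched below, reusing the TwoStreamNet and update_stream sketches above. The plateau test (no improvement for patience epochs) and the assumption that opt_descent covers only the descent stream and the discriminator are illustrative choices; the paper does not specify how stabilization is detected.

```python
def train(model, loader, criterion, opt_descent, epochs, lr_ascent=0.01, p_ascent=0.3, patience=5):
    """Phase 1: train the descent stream alone; Phase 2: once its loss stops
    improving, initialize and train the ascent stream and fuse per Equation (3)."""
    best_loss, stall, two_stream = float("inf"), 0, False
    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in loader:
            model.zero_grad()
            loss = criterion(model(x, use_two_stream=two_stream), y)
            loss.backward()
            opt_descent.step()  # descent stream and discriminator: ordinary descent
            if two_stream:
                # ascent stream: descent or ascent chosen at random per iteration
                update_stream(model.ascent_stream.parameters(), lr=lr_ascent, p_ascent=p_ascent)
            epoch_loss += loss.item()
        if not two_stream:
            if epoch_loss < best_loss:
                best_loss, stall = epoch_loss, 0
            else:
                stall += 1
            if stall >= patience:
                # descent stream has stabilized: start the ascent stream from its current weights
                model.ascent_stream.load_state_dict(model.descent_stream.state_dict())
                two_stream = True
```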

Experiment and result analysis
The purpose and contribution of this paper is to propose a general deep network training method that can further enhance existing networks, rather than to design a new network specifically for object classification or detection tasks. Therefore, the improvements obtained over existing networks demonstrate the effectiveness of the proposed method.

Image classification experiment based on CIFAR10
The CIFAR10 dataset [17] consists of 60,000 color images with a size of 32×32 pixels. It comprises a total of 10 different classes, each containing 6,000 images of real-world objects. The dataset is divided into a training set and a test set: the training set contains 50,000 images with 5,000 images per class, while the test set contains 10,000 images with 1,000 images per class.
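For reference, a standard way to obtain this split is via torchvision; the normalization statistics below are the commonly used CIFAR10 channel means and standard deviations, not values given in the paper.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),  # common CIFAR10 statistics
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)
```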
Experiments were conducted using ResNet34 [18] on the CIFAR10 dataset. ResNet is often chosen as the backbone for computer vision tasks because of its relatively small size and good performance. Figure 2 illustrates the improved structure of ResNet34, in which a two-stream architecture is formed by duplicating the feature extraction part. Each stream is trained on separate image data, and feature fusion is performed before the fully connected layers used for classification. The experiments were conducted independently, meaning each was trained from scratch rather than continuing from a previous experiment. The proposed approach sets a probability of 0.3 for using the gradient ascent algorithm to update parameters in the gradient ascent stream. When performing feature fusion, the weight of the gradient descent stream is set to 0.8 and the weight of the gradient ascent stream to 0.2. These parameters were determined using the method of controlled variables: from Table 1, it can be observed that, for the same probability of choosing the gradient ascent algorithm, the best results are achieved when the weight of the gradient descent stream is 0.8 and that of the gradient ascent stream is 0.2. The same procedure was used to select the ascent probability. In terms of parameter selection, the emphasis should lean towards the gradient descent algorithm, since the overall model is optimized primarily through gradient descent. An illustrative configuration with these settings is sketched below.
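As a usage illustration with the settings reported here (ascent probability 0.3, fusion weights 0.8/0.2), the sketches above could be assembled roughly as follows. The choice of torchvision's resnet34 as the feature extractor, the classifier head, and the SGD learning rate are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
import torchvision

# Feature extraction part: ResNet34 without its final fully connected layer.
resnet = torchvision.models.resnet34(num_classes=10)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())  # -> 512-d features
discriminator = nn.Linear(512, 10)  # CIFAR10 classifier head

model = TwoStreamNet(feature_extractor, discriminator, alpha=0.8)
opt_descent = torch.optim.SGD(
    list(model.descent_stream.parameters()) + list(model.discriminator.parameters()),
    lr=0.1, momentum=0.9,
)
train(model, train_loader, nn.CrossEntropyLoss(), opt_descent, epochs=100, p_ascent=0.3)
```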
Through the comparative experiments, it can be observed that in the 20-epoch and 200-epoch experiments, the proposed method converges in fewer iterations, reaching the inflection point of the accuracy curve earlier. In addition, the proposed method significantly improves the final classification accuracy. This further demonstrates that the method can overcome local optima and find better solutions. To validate the generalizability of the proposed algorithm, the original SGD optimizer was replaced with Adagrad, RMSprop and Adam, and the CIFAR10 dataset was trained for 100 epochs. Table 2 presents the accuracy. It can be observed that the proposed algorithm applies to any existing optimizer based on gradient descent, and all of them show performance improvements.

To exclude performance gains that merely stem from increased spatial cost, the proposed method was compared with two alternative schemes. Scheme 1 doubles the channel size of the original algorithm, matching the spatial cost of the proposed algorithm. Scheme 2 keeps the two-stream structure but replaces the gradient ascent algorithm in the gradient ascent stream with gradient descent, resulting in two gradient descent streams. After training for 100 epochs, Table 3 presents the comparison results. It can be observed that simply increasing the spatial cost does not improve performance, demonstrating that the core of the proposed algorithm lies in fusing the gradient ascent stream to escape local optima.

Object detection experiment based on PAA
COCO [19] is primarily composed of images captured from complex everyday scenes. The version used in this experiment is COCO2017, which contains 80 categories; the training set contains 118,287 images and the validation set contains 5,000 images. PAA [20] introduces an adaptive approach based on a Gaussian mixture model to assign positive and negative labels to anchors probabilistically, depending on the training state of the model. In this paper, ResNet50+FPN and ResNeXt101+FPN were used as backbones; the configuration files for these models are named "paa_R_50_FPN_1x" and "paa_dcnv2_X_101_32x8d_FPN_2x" respectively. Figure 4 shows how to construct a two-stream network when the network becomes more complex and the discriminator is not just a single fully connected layer. In the improved structure, the two streams are formed by duplicating the backbone part, each stream is trained on separate image data, and feature fusion is performed before candidate regions are extracted in the Region Proposal Network. The PAA algorithm uses the SGD optimizer. From Table 4, it can be observed that the proposed algorithm improves the accuracy of the original models to varying degrees.

Object detection experiment based on EfficientDet
EfficientNet [22] was designed for the image classification task, and EfficientDet [23] can be seen as its extension from classification to detection. The main contribution of EfficientDet is the introduction of a weighted bi-directional feature pyramid network (BiFPN), which allows simple and fast multi-scale feature fusion. In addition, a compound scaling method is proposed that uniformly scales the resolution, depth and width of the backbone, feature network and box/class prediction networks. The configurations selected for this experiment are EfficientDet D0 and D2, which use EfficientNet B0 and B2 as their backbones. Similar to the previous experiment, a two-stream network is created by replicating the backbone and BiFPN structures, and the fused features from both streams are input into the discriminator to obtain the classification loss and bounding box loss. The optimizer used in EfficientDet is AdamW [24]. According to Table 5, the proposed algorithm achieves at least a 1.2% accuracy improvement over the original model.

Conclusion
This paper proposes a new neural network optimization algorithm that departs from the conventional understanding that gradient ascent is not beneficial for neural network optimization. The paper argues that when the network reaches a bottleneck, specifically when it is trapped in a local optimum, the gradient descent algorithm can no longer optimize the network effectively; instead, triggering the gradient ascent algorithm to update parameters can help the network find better solutions. To combine the gradient ascent and descent algorithms, this paper designs a two-stream network that replicates the feature extraction part of the original network to form two parallel streams, with different backward parameter update methods implemented in each stream. Extensive experiments on image classification and object detection tasks demonstrate that the proposed method can enhance the performance of the original algorithms. The proposed algorithm is versatile and can be applied to any optimizer based on gradient descent.


Figure 1. Example of the two-stream fusion condition.

Figure 3. Classification results on CIFAR10 with ResNet34 under different schemes.

Table 2. Accuracy of the proposed algorithm with different optimizers (unit: %).

Table 3. Accuracy of different algorithms with the same spatial cost (unit: %).

Table 4. Experimental results of object detection based on PAA (unit: %).