Fractional Deep Neural Network via Constrained Optimization

This paper introduces a novel algorithmic framework for a deep neural network (DNN) which, in a mathematically rigorous manner, allows us to incorporate history (or memory) into the network -- it ensures that all layers are connected to one another. This DNN, called Fractional-DNN, can be viewed as a time-discretization of a fractional-in-time nonlinear ordinary differential equation (ODE). The learning problem is then a minimization problem with the fractional ODE as a constraint. We emphasize that an analogy between the existing DNN and ODEs with the standard time derivative is well-known by now; the focus of our work is the Fractional-DNN. Using the Lagrangian approach, we derive the backward propagation and the design equations. We test our network on several datasets for classification problems. Fractional-DNN offers various advantages over the existing DNN. The key benefits are a significant mitigation of the vanishing gradient issue, due to the memory effect, and better handling of nonsmooth data, due to the network's ability to approximate non-smooth functions.


Introduction
Deep learning has emerged as a potent area of research and has enabled remarkable progress in recent years, spanning domains like imaging science [26,3,50,30], biomedical applications [33,13,25], satellite imagery, remote sensing [47,51,10], etc. However, the mathematical foundations of many machine learning architectures are largely lacking [20,39,49,41,18]; their current success rests largely on empirical evidence. This lack of mathematical foundation makes it challenging to understand the detailed workings of such networks [22,35].
The overarching goal of machine learning algorithms is to learn a function using some known data. Deep Neural Networks (DNN), like Residual Neural Networks (RNN), are a popular family of deep learning architectures which have turned out to be groundbreaking in imaging science. An introductory example of an RNN is the ResNet [26], which has been successful for classification problems in imaging science. Compared to classical DNNs, the innovation of the RNN architecture comes from a simple addition of an identity map between each layer of the network. This ensures a continued flow of information from one layer to another. Despite their success, DNNs are prone to various challenges, such as vanishing and exploding gradients.

Preliminaries
The purpose of this section is to introduce some notations and definitions that we will use throughout the paper. We begin with Table 1 where we state the standard notations. In subsection 2.1 we describe the well-known softmax loss function. Subsection 2.2 is dedicated to the Caputo fractional time derivative.

Symbol : Description
n ∈ N : Number of distinct samples
n_f ∈ N : Number of sample features
n_c ∈ N : Number of classes
N ∈ N : Number of network layers (i.e. network depth)
Y ∈ R^{n_f × n} : Y = {y^(i)}_{i=1}^n is the collective feature set of n samples
C_obs ∈ R^{n_c × n} : C_obs = {c^(i)}_{i=1}^n are the true class labels of the input data
W ∈ R^{n_c × n_f} : Weights
K ∈ R^{n_f × n_f} : Linear operator (distinct for each layer)
b ∈ R : Bias (distinct for each layer)
P ∈ R^{n_f × n} : Lagrange multiplier
e_{n_c} ∈ R^{n_c} : A vector of ones
τ ∈ R : Time step-length
σ(·) : Activation function, acting pointwise
γ : Order of the fractional time derivative
(·)′ : Derivative w.r.t. the argument
tr(·) : Trace operator
(·)^⊺ : Matrix transpose
⊙ : Point-wise multiplication
m_1 : Max count for randomly selecting a mini-batch in training
m_2 : Max iteration count for the gradient-based optimization solver
α_train, α_test : Percentage of training and testing data correctly identified

2.1. Cross Entropy with Softmax Function. Given the collective feature matrix Y with true labels C_obs and the unknown weights W, the cross entropy loss function

E(W, Y, C_obs) = −(1/n) tr(C_obs^⊺ log(S(W, Y)))    (1)

measures the discrepancy between the true labels C_obs and the predicted class probabilities S(W, Y). Here S(W, Y), defined in (2), is the softmax classifier function, which gives normalized probabilities of the samples belonging to the classes.
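As an illustration, the loss (1) can be sketched in a few lines of NumPy. The helper names `softmax` and `cross_entropy` are ours, and the max-subtraction is a standard numerical-stability trick not stated in the text:

```python
import numpy as np

def softmax(W, Y):
    """Column-wise softmax of the class scores W @ Y (a sketch of S(W, Y))."""
    Z = W @ Y                      # (n_c, n) class scores
    Z = Z - Z.max(axis=0)          # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=0)       # each column sums to one

def cross_entropy(W, Y, C_obs):
    """E(W, Y, C_obs) = -(1/n) tr(C_obs^T log S(W, Y)), as in (1)."""
    n = Y.shape[1]
    S = softmax(W, Y)
    return -np.trace(C_obs.T @ np.log(S)) / n
```

A perfect, confident prediction drives the loss toward zero, while misclassification makes it large.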
Moreover, if γ = 1 and u ∈ C¹([0, T]), then one can show that d_t^γ u(t) = u′(t) = d_{T−t}^γ u(t). We note that the fractional derivatives in (3) and (4) are nonlocal operators. Indeed, the derivative of u at a point t depends on all the past and all the future events, respectively. This behavior is different from the classical case of γ = 1.
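For concreteness, the left Caputo derivative referred to as (3) takes the following standard form (a sketch; the right derivative (4) is defined analogously by an integral over [t, T] against (s − t)^{−γ}, with the sign convention chosen so that d_{T−t}^γ u = u′ when γ = 1):

\[
d_t^{\gamma} u(t) = \frac{1}{\Gamma(1-\gamma)} \int_0^t (t-s)^{-\gamma}\, u'(s)\, ds, \qquad \gamma \in (0,1).
\]

The weight (t − s)^{−γ} is what makes the operator nonlocal: every past value of u′ contributes to the derivative at t.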
The left and right Caputo fractional derivatives are linked by the fractional integration-by-parts formula [5, Lemma 3], which we state next. For γ ∈ (0, 1), let

Lemma 2.3 (Fractional Integration-by-Parts). For f ∈ L^γ and g ∈ R^γ, the integration-by-parts formula (5) holds, where I_t^{1−γ} w(t) and I_{T−t}^{1−γ} w(t) are the left and right Riemann-Liouville fractional integrals of order 1 − γ.

Continuous Fractional Deep Neural Network
After the above preparations, in this section we introduce the Fractional-DNN. First we briefly describe the classical RNN, and then extend it to develop the Fractional-DNN. We formulate the learning problem as a constrained optimization problem and subsequently use the Lagrangian approach to derive the optimality conditions.

3.1. Classical RNN. Our goal is to approximate a map F. A classical RNN helps approximate F for a known set of inputs and outputs. To construct an RNN, for each layer j we first consider a linear transformation of Y_{j−1}, where the pair (K_j, b_j) denotes an unknown linear operator and bias at the j-th layer. When N > 1, the network is considered "deep". Next we introduce nonlinearity using a nonlinear activation function σ (e.g. ReLU or tanh). The resulting RNN is (6), where τ > 0 is the time step. Finally, the RNN approximation of F is obtained with θ = (K_j, b_j) as the unknown parameters. In other words, the problem of approximating F using a classical RNN is, intrinsically, a problem of learning (K_j, b_j). Hence, for a given datum (Y_0, C), the learning problem reduces to minimizing a loss function J(θ, (Y_N, C)) subject to the constraint (6); this is the discrete learning problem (7).

Notice that the system (6) is the forward-Euler discretization of the continuous-in-time ODE (8), see [26,23,41]. The continuous learning problem then requires minimizing the loss function J at the final time T subject to the ODE constraint (8); this is problem (9). Notice that designing algorithms for the continuous-in-time problem (9), instead of the discrete-in-time problem (7), has several key advantages. In particular, it leads to algorithms which are independent of the neural network architecture, i.e., independent of the number of layers. In addition, the approach of (9) can help us determine the stability of the neural network (7), see [9,24].
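The residual layer update just described is one explicit Euler step per layer. A minimal sketch, assuming the per-layer operators K_j and biases b_j are given (the helper name `rnn_forward` is ours):

```python
import numpy as np

def rnn_forward(Y0, Ks, bs, tau, sigma=np.tanh):
    """Residual forward propagation: Y_{j+1} = Y_j + tau * sigma(K_j @ Y_j + b_j),
    i.e. the forward-Euler discretization of dY/dt = sigma(K(t) Y + b(t))."""
    Y = Y0
    for K, b in zip(Ks, bs):
        Y = Y + tau * sigma(K @ Y + b)   # identity map + activated linear step
    return Y
```

The added identity map ("Y +") is exactly what distinguishes a residual network from a plain feed-forward one.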
Moreover, for the neural network (7), it has been noted that as the information about the input or gradient passes through many layers, it can vanish and "wash out", or grow and "explode" exponentially [8]. There have been ad hoc attempts to address these concerns, see for instance [45,16,27], but a satisfactory mathematical explanation and model does not currently exist. One of the main goals of this paper is to introduce such a model.
Notice that solving (8), or its discrete version (6), involves many algorithmic processes such as linear solvers, preconditioners, nonlinear solvers, optimization solvers, etc. Furthermore, there are well-established numerical algorithms that re-use information from previous iterations to accelerate convergence, e.g. the BFGS method [37], Anderson acceleration [1], and variance reduction methods [40]. These methods account for the history Y_j, Y_{j−1}, Y_{j−2}, ..., Y_0 while choosing Y_{j+1}. Motivated by these observations, we introduce versions of (6) and (8) that can account for history (or memory) effects in a rigorous mathematical fashion.

3.2. Continuous Fractional-DNN. The fractional time derivative in (3) has a distinct ability to allow a memory effect, observed for instance in materials with hereditary properties [11]. The fractional time derivative can be derived from anomalous random walks, where the walker experiences delays between jumps [36]. In contrast, the standard time derivative naturally arises in the case of classical random walks. We use the fractional time derivative to enrich the constrained optimization problem (9), and subsequently (7), by replacing the standard time derivative d_t by the fractional time derivative d_t^γ of order γ ∈ (0, 1). Recall that for γ = 1 we recover the classical derivative d_t. Our new continuous-in-time model, the Fractional-DNN, is then obtained from (8) by this replacement, where d_t^γ is the Caputo fractional derivative defined in (3). The discrete formulation of the Fractional-DNN is discussed in the subsequent section.
The main reason for using the Caputo fractional time derivative over its counterparts, such as the Riemann-Liouville fractional derivative, is that the Caputo derivative of a constant function is zero, and one can impose the initial condition Y(0) = Y_0 in a classical manner [42]. Note that d_t^γ is a nonlocal operator in the sense that, in order to evaluate the fractional derivative of Y at a point t, we need the cumulative information of Y over the entire sub-interval [0, t). This is how the Fractional-DNN enables connectivity across all antecedent layers (hence the memory effect). As we shall illustrate with a numerical example in section 6, this feature can help overcome the vanishing gradient issue, since the cumulative effect of the gradients of the preceding layers is less likely to be zero.
Remark 3.1. Since γ ∈ (0, 1), the fractional derivative d_t^γ Y(t) is zero at t = 0.
Owing to Remark 3.1, we can better account for features Y which are non-smooth; as a result, the smoothness requirement on the unknown parameters θ can be weakened. This, in essence, can help with the exploding gradient issue in DNNs.
The generic learning problem with the Fractional-DNN as constraints can be expressed as problem (11). Note that the choice of J depends on the type of learning problem. We next consider a specific choice of J, given by the cross entropy loss functional defined in (1).

3.3. Continuous Fractional-DNN and Cross Entropy Loss Functional.
Supervised learning problems are a broad class of machine learning problems which use labeled data. They are further divided into two types, namely regression problems and classification problems, and the specific type dictates the choice of J in (11). Regression problems often occur in physics-informed models, e.g. sample-reconstruction inverse problems [3,25]. Classification problems occur, for instance, in computer vision [43,15]. In both cases, a neural network is used to learn the unknown parameters. In the discussion below we focus on classification problems; however, the entire discussion applies directly to regression-type problems.

Recall that the cross entropy loss functional E, defined in (1), measures the discrepancy between the actual and the predicted classes. Replacing J in (11) by E, together with a regularization term R(W, K(t), b(t)), we arrive at the learning problem (12). Note that, in this case, the unknown parameter is θ := (W, K, b), where K and b are, respectively, the linear operator and bias for each layer, and the weights W are a feature-to-class map. Furthermore, σ is a nonlinear activation function and (Y_0, C_obs) is the given data, with C_obs the true labels of Y_0.

To solve (12), we rewrite it as an unconstrained optimization problem via the Lagrangian functional and derive the optimality conditions. Let P denote the Lagrange multiplier; the Lagrangian functional is defined in terms of the L²-inner product ⟨·, ·⟩ := ∫_0^T ⟨·, ·⟩_F dt, where ⟨·, ·⟩_F is the Frobenius inner product. Using the fractional integration-by-parts formula (5), we obtain an equivalent form of the Lagrangian.

Let (Y, W, K, b; P) denote a stationary point. The first-order necessary optimality conditions are given by the following set of state, adjoint, and design equations.

(A) State Equation. The gradient of L with respect to P at (Y, W, K, b; P) yields the state equation ∇_P L(Y, W, K, b; P) = 0, equivalently (14), where d_t^γ denotes the left Caputo fractional derivative (3).
In (14) we solve forward in time for the state variable Y; therefore we call (14) the forward propagation.
(B) Adjoint Equation. Next, the gradient of L with respect to Y at (Y, W, K, b; P) yields the adjoint equation ∇_Y L(Y, W, K, b; P) = 0, equivalently (15), where d_{T−t}^γ denotes the right Caputo fractional derivative (4) and S is the softmax function defined in (2). Notice that the adjoint variable P in (15), with its terminal condition, is obtained by marching backward in time. As a result, equation (15) is called the backward propagation.
(C) Design Equations. Setting the gradients ∇_W L(Y, W, K, b; P), ∇_K L(Y, W, K, b; P), and ∇_b L(Y, W, K, b; P) to zero yields the design equations (16) (with (W, K, b) as the design variables), which hold for almost every t ∈ (0, T). In view of (A)-(C), we can use a gradient-based solver to find a stationary point of (12).
Remark 3.2. (Parametric Kernel K(ψ(t))). Throughout our discussion, we have assumed K(t) to be some unknown linear operator. We remark that a structure could also be prescribed to K(t), parameterized by a stencil ψ. Then, the kernel is K(ψ(t)), and the design variables now are θ = (W, ψ, b). Consequently, K(ψ(t)) can be thought of as a differential operator on the feature space, e.g. discrete Laplacian with a five point stencil. It then remains to compute the sensitivity of the Lagrangian functional w.r.t. ψ to get the design equation. Note that this approach can further reduce the number of unknowns.
Notice that so far the entire discussion has been at the continuous level and it has been independent of the number of network layers. Thus, it is expected that if we discretize (in time) the above optimality system, then the resulting gradient based solver is independent of the number of layers. We shall discretize the above optimality system in the next section.

Discrete Fractional Deep Neural Network
We shall adopt the optimize-then-discretize approach. Recall that the first order stationarity conditions for the continuous problem (12) are given in (14), (15), and (16). In order to discretize this system of equations, we shall first discuss the approximation of Caputo fractional derivative.

4.1. Approximation of Caputo Derivative. There exist various approaches to discretize the Caputo fractional derivative. We will use the L1-scheme [5,46] to discretize the left and right Caputo fractional derivatives d_t^γ u(t) and d_{T−t}^γ u(t) given in (3) and (4), respectively. Consider the fractional differential equation (17) involving the left Caputo fractional derivative, for 0 < γ < 1. We begin by discretizing the time interval [0, T] uniformly with step size τ. Using the L1-scheme, the discretization of (17) is given by (18), with the coefficients a_k given by (19). Next, consider the fractional differential equation (20) involving the right Caputo fractional operator, for 0 < γ < 1. Again using the L1-scheme, we obtain the discretization (21). The example below illustrates a numerical implementation of the L1-scheme (18).
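As a sketch of such an implementation, the following applies the standard L1 coefficients a_k = (k+1)^{1−γ} − k^{1−γ} to step a left-Caputo problem forward. The helper `l1_solve` is our own, and the exact discrete displays (18)-(19) may differ in minor details:

```python
import math
import numpy as np

def l1_solve(f, u0, T, N, gamma):
    """Sketch of the L1-scheme for the left-Caputo problem
       d_t^gamma u(t) = f(t, u(t)),  u(0) = u0,  0 < gamma < 1.
    The history sum couples step j to ALL previous steps (memory effect).
    f is evaluated at the new time t_j but the previous state (explicit in u)."""
    tau = T / N
    a = [(k + 1) ** (1 - gamma) - k ** (1 - gamma) for k in range(N)]
    c = tau ** gamma * math.gamma(2 - gamma)
    u = [u0]
    for j in range(1, N + 1):
        hist = sum(a[k] * (u[j - k] - u[j - k - 1]) for k in range(1, j))
        u.append(u[j - 1] - hist + c * f(j * tau, u[j - 1]))
    return np.array(u)
```

As a check, the Caputo derivative of u(t) = t is t^{1−γ}/Γ(2−γ), and the L1 approximation is exact for linear functions, so the scheme reproduces u(t) = t up to rounding.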

4.2. Discrete Optimality Conditions. Next, we discretize the optimality conditions given in (14)-(16). Notice that each time step corresponds to one layer of the neural network. To derive an expression for the gradient with respect to the design variables, it is necessary to do one forward propagation (state solve) and one backward propagation (adjoint solve).
(A) Discrete State Equation. We use the L1-scheme (18) to discretize the state equation (14) and arrive at (24).

(B) Discrete Adjoint Equation. We use the L1-scheme (21) to discretize the adjoint equation (15) and arrive at (25).

(C) Discrete Gradient w.r.t. Design Variables. For j = 0, ..., N − 1, the approximation of the gradient (16) with respect to the design variables is given by (26). We then use a gradient-based method to solve the optimality conditions (24)-(26). We reiterate that each computation of the gradient in (26) requires one state and one adjoint solve.
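Combining the L1-scheme with the layer map gives a forward propagation in which every new layer state depends on all earlier states. A hedged sketch (the helper `frac_forward` is ours and follows the standard L1 stepping, which may differ in minor details from the paper's exact display):

```python
import math
import numpy as np

def frac_forward(Y0, Ks, bs, tau, gamma, sigma=np.tanh):
    """Sketch of discrete fractional forward propagation: each layer state is a
    memory-weighted combination of ALL previous layer states plus an activated
    linear step (cf. the L1-discretized state equation)."""
    N = len(Ks)
    a = [(k + 1) ** (1 - gamma) - k ** (1 - gamma) for k in range(N + 1)]
    c = tau ** gamma * math.gamma(2 - gamma)
    Y = [Y0]
    for j in range(1, N + 1):
        hist = sum(a[k] * (Y[j - k] - Y[j - k - 1]) for k in range(1, j))
        Y.append(Y[j - 1] - hist + c * sigma(Ks[j - 1] @ Y[j - 1] + bs[j - 1]))
    return Y  # all layer states; Y[-1] feeds the classifier
```

Note that, unlike the residual step of the classical RNN, the `hist` term makes all antecedent layers visible at every step.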

Fractional-DNN Algorithm
Fractional-DNN is a supervised learning architecture, i.e. it comprises a training phase and a testing phase. During the training phase, labeled data is passed into the network and the unknown parameters are learnt. Those parameters then define the trained Fractional-DNN model for that type of data. Next, a testing dataset, which comprises data previously unseen by the network, is passed to the trained net, and a classification prediction is obtained. This stage is known as the testing phase. Here the true classification is not shown to the network when a prediction is being made, but it can later be used to assess the network's accuracy, as we have done in our numerics. The three important components of the algorithmic structure are the forward propagation, the backward propagation, and the gradient update. The forward and backward propagation structures are given in Algorithms 1 and 2. The gradient update is accomplished in the training phase, discussed in subsection 5.1. Lastly, the testing phase of the algorithm is discussed in subsection 5.2.

Algorithm 2 Backward Propagation in Fractional-DNN (L1-scheme)
Pass ∇_W L, ∇_K L, ∇_b L to a gradient-based solver with m_2 max iterations to update W, {K_j, b_j}_{j=0}^{N−1}.

5.1. Training Phase. The training phase of Fractional-DNN is shown in Algorithm 3. In each training iteration, the predicted labels Ĉ_train are compared to the true labels C_obs to compute the training accuracy α_train.

5.2. Testing Phase. The testing phase of Fractional-DNN is shown in Algorithm 4.

Numerical Experiments
In this section, we present several numerical experiments in which we use our proposed Fractional-DNN algorithm from section 5 to solve classification problems on two different datasets. We recall that the goal of classification problems, as the name suggests, is to classify objects into pre-defined class labels.

Algorithm 4 Testing Phase of Fractional-DNN
Compare Ĉ_test to C_obs,test to compute α_test.

First we prepare a training dataset and, along with its classification, pass it to the training phase of Fractional-DNN (Algorithm 3). This phase yields the optimal set of parameters learned from the training dataset. These parameters are then used to classify new data points from the testing dataset during the testing phase of Fractional-DNN (Algorithm 4). We compare the results of our Fractional-DNN with the classical RNN (9).
The rest of this section is organized as follows: First, we discuss some data preprocessing and implementation details. Then we describe the datasets being used, and finally we present the experimental results.

6.1. Implementation Details.
(i) Batch Normalization. During the training phase, we use the batch normalization (BN) technique [29]. At each iteration we randomly select a mini-batch Ŷ_0 ⊂ Y_0, comprising 50% of the training data, and normalize it to have zero mean and a standard deviation of one,
where µ is the mean and s is the standard deviation of the mini-batch. The normalized mini-batch is then used to train the network in that iteration. At the next iteration, a new mini-batch is randomly selected; this process is repeated m_1 times. Batch normalization prevents gradient blow-up, helps speed up the learning, and reduces the variation in the parameters being learned.
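A minimal sketch of this normalization step (the helper name is ours; `eps` is an added numerical safeguard not mentioned in the text):

```python
import numpy as np

def normalize_batch(Yhat, eps=1e-8):
    """Normalize a mini-batch feature-wise to zero mean and unit standard
    deviation: (Yhat - mu) / s, with eps guarding against division by zero."""
    mu = Yhat.mean(axis=1, keepdims=True)   # per-feature mean over the batch
    s = Yhat.std(axis=1, keepdims=True)     # per-feature standard deviation
    return (Yhat - mu) / (s + eps)
```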
Since the design variables are learnt on training data processed with BN, we also process the testing data with BN, in which case the mini-batch is the whole testing data.
(ii) Activation Function. In our experiments we use the hyperbolic tangent as the activation function: σ(x) = tanh(x), with σ′(x) = 1 − tanh²(x).
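A quick numerical sanity check of this derivative identity against a central finite difference:

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def dsigma(x):
    # derivative of tanh in the closed form used above
    return 1.0 - np.tanh(x) ** 2

# finite-difference check of sigma'(x) = 1 - tanh(x)^2
x = np.linspace(-3, 3, 7)
h = 1e-6
fd = (sigma(x + h) - sigma(x - h)) / (2 * h)
assert np.allclose(fd, dsigma(x), atol=1e-8)
```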
(iii) Regularization. In our experiments we use a regularization R(W, K(t), b(t)) in which (−Δ)_h is the discrete Laplacian, ξ_W, ξ_K, ξ_b are the scalar regularization strengths, and ‖·‖_F is the Frobenius norm.
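A plausible form of such a regularizer, consistent with the symbols above (a sketch only, not necessarily the exact expression used in the experiments), is:

\[
R(W, K, b) = \frac{\xi_W}{2}\,\|W\|_F^2
\;+\; \frac{\xi_K}{2} \int_0^T \big\langle (-\Delta)_h K(t),\, K(t) \big\rangle_F \, dt
\;+\; \frac{\xi_b}{2} \int_0^T |b(t)|^2 \, dt,
\]

where the middle term is what imposes the Laplacian smoothing on K discussed next.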
Notice that with the above regularization we are enforcing Laplacian smoothing on K. For more controlled smoothness, one could also use the fractional Laplacian regularization introduced in [2]; see also [6] and [3].
(iv) Order of Fractional Time Derivative. In our computations we have chosen γ heuristically. We remark that this fractional exponent on the time derivative could be learnt, in a manner similar to how the fractional exponent on the Laplacian was learnt in [3].
(v) Optimization Solver and Xavier Initialization. The optimization algorithm we use is the BFGS method with Armijo line search [31]. The stopping tolerance for the BFGS algorithm is 1e-6, up to a maximum of m_2 optimization iterations, whichever is reached first; in our experiments the latter is reached first in most cases. The design variables are initialized using Xavier initialization [20]: the biases b are initialized to 0, and the entries of W and K_j are drawn from the uniform distribution U[−a, a], with a chosen as in [20]. The training error is 1 − n_cor,train/n, and α_train = (n_cor,train/n) × 100, where n_cor,train is the number of correctly classified training samples.
The same procedure is used to compute Ĉ_test and α_test.
(viii) Gradient Test. To verify the gradients in (26), we perform a gradient test by comparing them to a finite-difference gradient approximation of (12). Figure 2 shows that the two agree, and we obtain the expected order of convergence for all the design variables.
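The same Taylor-remainder test can be sketched generically. The helper `gradient_check` and the sample objective below are illustrative, not the paper's code:

```python
import numpy as np

def gradient_check(E, gradE, b, D, hs=(1e-1, 1e-2, 1e-3)):
    """Taylor-remainder test: r(h) = |E(b + h D) - E(b) - h <D, gradE(b)>|
    should decay like O(h^2) when gradE is the correct gradient of E."""
    g = gradE(b)
    return [abs(E(b + h * D) - E(b) - h * np.vdot(D, g)) for h in hs]

# illustrative objective: E(b) = sum(exp(b)), with gradient exp(b)
rng = np.random.default_rng(0)
b = rng.standard_normal(5)       # point at which the gradient is checked
D = rng.standard_normal(5)       # random perturbation direction
r = gradient_check(lambda x: np.exp(x).sum(), np.exp, b, D)
```

Since successive h values shrink by a factor of 10, a correct gradient makes the remainders shrink by roughly a factor of 100, matching the O(h²) slope reported in Figure 2.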

Figure 2. Gradient check: the Taylor remainder E(b + hD) − E(b) − h tr(D^⊺ ∇E(b)) decays at the expected rate O(h²) for all the design variables (reference line with slope 2 shown).
(ix) Computational Platform. All the computations have been carried out in MATLAB R2015b on a laptop with an Intel Core i7-8550U processor.

6.2. Experimental Datasets.
We describe the datasets we have used to validate our proposed Fractional-DNN algorithm below.
• Dataset 1: Coordinate to Level Set (CLS). This data comprises a set of 2D coordinates, i.e. Y_0 := {(x_i, y_i) | i = 1, ..., n; (x_i, y_i) ∈ [0, 1]²}, together with a piecewise function v(x, y). The coordinates are the features in this case, hence n_f = 2. Further, we have n_c = 2 classes, which are the two level sets of v(x, y). Thus, for the i-th sample, c_obs^(i) ∈ R^{n_c} is a standard basis vector which represents the probability of belonging to each level set.
• Dataset 2: Perfume Data (PD) [17,19]. This dataset comprises odors of 20 different perfumes measured via a handheld odor meter (OMX-GR sensor) every second, for 28 seconds. For this data, Y_0 := {(x_i, y_i) | i = 1, ..., n; x_i, y_i ∈ Z_+}, thus n_f = 2. The classes, n_c = 20, pertain to the 20 different perfumes. We construct C_obs in the same manner as for Dataset 1.
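To make the CLS construction concrete, here is a sketch of building such a dataset with a hypothetical level-set function v (a disc indicator chosen for illustration; the paper's actual v is not reproduced here, and the helper name is ours):

```python
import numpy as np

def make_cls_data(n, seed=0):
    """Illustrative CLS-style dataset: n points in [0,1]^2 (n_f = 2) with
    n_c = 2 one-hot labels given by the level sets of a HYPOTHETICAL
    piecewise function v (here: inside/outside a disc at (0.5, 0.5))."""
    rng = np.random.default_rng(seed)
    Y0 = rng.uniform(0.0, 1.0, size=(2, n))           # features, shape (n_f, n)
    inside = ((Y0[0] - 0.5) ** 2 + (Y0[1] - 0.5) ** 2) < 0.15
    C_obs = np.zeros((2, n))
    C_obs[0, inside] = 1.0                            # first level set
    C_obs[1, ~inside] = 1.0                           # second level set
    return Y0, C_obs
```

Each column of C_obs is a standard basis vector, matching the label convention described above.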

6.3. Forward Propagation as a Dynamical System. In the introduction we mentioned the idea of representing a DNN as an optimization problem constrained by a dynamical system. This has turned out to be a strong tool in studying the underlying mathematics of DNNs. In Figure 3 we numerically demonstrate how this viewpoint enables a more efficient strategy for distinguishing between the classes. First we consider the perfume data, which has two features, namely the (x, y) coordinates, and let it flow, i.e. forward propagate. When this evolved data is presented to the classifier function (e.g. the softmax function in our case), spatially well-separated data is easier to classify. We plot the input data Y_0, represented as squares, as well as the evolved data Y_N after it has passed through N layers. The 20 different colors correspond to the 20 different classes, which help us visually track the evolution from Y_0 to Y_N. The evolution under the standard RNN is shown in the left plot, and that of the Fractional-DNN in the right plot. The configuration for these plots is the same as discussed in subsection 6.5 below and pertains to the trained models. Notice that at the bottom right corner of the RNN evolution plot, the purple, pink, and red data points overlap, which makes it challenging for the classifier to distinguish between those classes. In contrast, the Fractional-DNN has separated those points out quite well. We remark that this separation also gives a hint as to the number of layers needed in a network: we need enough layers to let the data evolve until it is easily separable. However, such visualization is restricted to n_f ≤ 3, so for data with n_f > 3 it may be challenging to get a sense of the number of layers needed to make the data sufficiently separable.
6.4. Vanishing Gradient Issue. In earlier sections we remarked that the Fractional-DNN handles the vanishing gradient issue in a better way. The vanishing gradient issue arises when the gradient of the design variables vanishes across the layers as the network undergoes backpropagation; see [21] and the references therein. As a consequence, feature extraction in the initial layers is severely affected, which in turn affects the learning ability of the whole network. We illustrate this phenomenon for the networks under discussion in Figure 4. In the left plot of Figure 4, we compare the ℓ²-norm of the gradient of the design variables θ = (K, b) against the optimization-solver (steepest descent in this case) iterations for the standard DNN (which does not have any skip connections) in magenta, the classical RNN (9) in black, and the Fractional-DNN with the L1-scheme approximation from Algorithm 3 in red. In the right plot of Figure 4 we omit the standard DNN to take a closer look at the other two. Observe that as the gradient information propagates backward, i.e. from layer N − 1 to 0, its magnitude is reduced by one order in the case of the standard RNN. This implies that not enough information is being passed to the initial layers relative to the last layer. In contrast, the Fractional-DNN carries significant information back to the initial layers while maintaining the relative magnitude. This improves the overall health of the network and improves learning. This test has been performed on the Perfume Data (Dataset 2) with 70 layers and regularization turned off.
6.5. Experimental Results. We now solve the classification problem (12) for the datasets described in subsection 6.2 via our proposed Fractional-DNN algorithm, presented in section 5.
We then compare it with the standard RNN architecture (9). The details and results of our experiments are given in Table 2. Note that the results obtained via Fractional-DNN are either comparable to (e.g. for CLS data) or significantly better than (e.g. for PD) the standard RNN architecture.
We remark that while the CLS data (Dataset 1) is a relatively simple problem to solve (two features and two classes), the Perfume Data (Dataset 2) is not. In the latter case, each sample comprises only two features, and there are 20 different classes. Furthermore, the number of available samples for training is small. In this sense, classification of this dataset is a challenging problem. There have been some results on classification of perfume data using only the training dataset (divided between training and testing) [19], but to the best of our knowledge, classification on the complete dataset using both the training and testing sets [17] is not available.
In our experiments, we have also observed that the Fractional-DNN algorithm needs fewer Armijo line-search iterations than the standard RNN. This directly reflects an improvement in the learning rate of the Fractional-DNN. We remark that, in theory, the Fractional-DNN should use memory more efficiently than other networks, as it encourages feature reuse in the network.

Discussion
There is a growing body of research which indicates that deep learning algorithms, e.g. residual neural networks, can be cast as optimization problems constrained by ODEs or PDEs. In addition, working with the continuous optimization problems can make the approaches machine/architecture independent. This opens a plethora of tools from constrained optimization theory which can be used to study, analyze, and enhance deep learning algorithms. Currently, the mathematical foundations of many machine learning models are largely lacking, and their success is mostly attributed to empirical evidence. Due to this lack of mathematical foundation, it becomes challenging to fix issues such as network instability, vanishing and exploding gradients, long training times, and the inability to approximate non-smooth functions when a network breaks down.
In this work we have developed a novel continuous model and stable discretization of deep neural networks that incorporate history. In particular, we have developed a fractional deep neural network (Fractional-DNN) which allows the network to admit memory across all the subsequent layers. We have established this via an optimal control problem formulation of a deep neural network bestowed with a fractional time Caputo derivative. We have then derived the optimality conditions using the Lagrangian formulation. We have also discussed discretization of the fractional time Caputo derivative using L 1 -scheme and presented the algorithmic framework for the discretization.
We expect that keeping track of history in this manner improves the vanishing gradient problem and can potentially strengthen feature propagation, encourage feature reuse, and reduce the number of unknown parameters. We have numerically illustrated the improvement in the vanishing gradient issue via our proposed Fractional-DNN. We have shown that the Fractional-DNN is better capable of passing information across the network layers, maintaining the relative gradient magnitude across the layers, compared to the standard DNN and standard RNN. This allows a more meaningful feature extraction to happen at each layer.
We have shown successful application of Fractional-DNN for classification problems using various datasets, namely the Coordinate to Level Set (CLS dataset) and Perfume Data. We have compared the results against the standard-RNN and have shown that the Fractional-DNN algorithm yields improved results.
We emphasize that our proposed Fractional-DNN architecture has a memory effect because it allows features to propagate in a cumulative manner, i.e. at each layer all the precedent layers are visible. Reusing the network features in this manner reduces the number of parameters that the network needs to learn in each subsequent layer. The Fractional-DNN has a rigorous mathematical foundation and algorithmic framework, which establishes a deeper understanding of deep neural networks with memory and enhances their applicability to scientific and engineering applications.
We remark that code optimization is part of our forthcoming work. This involves efficient Graphics Processing Unit (GPU) usage and parallel computing capabilities. We also intend to develop a Python version of the code and incorporate it into popular deep learning libraries such as TensorFlow and PyTorch. We are also interested in extending the efficiency of this algorithm to large-scale problems suitable for High Performance Computing.