Transformers in High-Frequency Trading

The Transformer is a deep learning model whose remarkable performance on many tasks has profoundly changed the mindset of the AI research community. In this paper, we apply a Transformer model to the 1-minute timescale of the EURUSD and GBPUSD instruments in forex trading. We use the classical Transformer architecture without the Decoder, since the Encoder is sufficient for our task. Moreover, the Exponential Moving Average (EMA) is applied to the input, and different values of its smoothing factor α are tested. With a cross-entropy training loss below 0.2, the results indicate that Transformers are a promising tool for profitable strategies in high-frequency trading.


Introduction
Humanity has anticipated the advent of Artificial Intelligence (AI) since antiquity: Ancient Greek myths such as Hephaestus' intelligent automaton Talos and Pygmalion's artificial beings Galatea and Pandora already incorporate these ideas [1]. Artificial intelligence is "the ability of a digital computer or computer-controlled robot to perform tasks commonly related to intelligent beings" [2].
Works in the area of AI keep appearing and are discussed to a large extent in the scientific community. These research studies present the theoretical features of AI technologies and their relevance in many fields of our society [1,2]. Machine learning (ML) is a class of methods commonly used in AI that allow new properties of data to be inferred from known properties learned from the training data. One specific field of ML is deep learning (DL) [1,2]. In recent years there has been increased attention to research in this field, as witnessed by the number of articles published in prestigious journals [1,2].
The biological neural networks that constitute animal brains, serving as the model for artificial neurons, played a key role in the first stages of this remarkable technological development. In the contemporary period, neural networks, as the core of Synthetic Intelligence (SI), have drastically influenced countless domains, especially STEM. The invention of the first artificial neuron, the Recurrent Neural Network, backpropagation, the Long Short-Term Memory network and the Attention Mechanism are milestones worth mentioning in chronological order.
More specifically, in 1943, Warren McCulloch and Walter Pitts developed a mathematical model imitating a biological neuron, which is considered to be the first artificial neural network [3].
Three decades later, Shun'ichi Amari introduced the first architecture of the Recurrent Neural Network (RNN), later called the "Hopfield Network", consisting of a layer of n fully connected recurrent neurons [4]. In 1986, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams published an article on backpropagation, an entirely novel learning procedure [5]. In 1997, Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) architecture to overcome some deficiencies of backpropagation, such as vanishing and exploding gradients and the inability to learn long-term dependencies [6].
The need to solve Natural Language Processing (NLP) tasks such as machine translation, dialogue management and question answering led to state-of-the-art research. The sequence-to-sequence (seq2seq) approach, which became the fundamental framework in NLP, is considered a turning point: it is composed of an Encoder that analyzes the input as a sequence and a Decoder that predicts the output based on the states of both the input sequence and the current output [7]. The attempt to overcome the seq2seq model's bottleneck, namely the requirement to capture all the information of the source sentence in a fixed-length vector, resulted in the invention of the Attention Mechanism [8], which in turn has proven to play an essential and decisive role in Transformers.
On the other hand, almost every field of application involves time-dependent phenomena whose values are represented by time series. There is a huge number of techniques and methods for studying time series, ranging from analytical methods [9] and simulation and heuristic techniques [10,11,12,13,14] to techniques that exploit the representational capabilities of neural networks.
In this work, we propose a transformer-inspired architecture that produces reliable signals for high-speed trading. Although the work is at a preliminary stage, useful conclusions have been obtained concerning both the need for a smart and inclusive representation scheme for the historical data and the architecture of the system.

Attention Mechanisms and Transformers
The Encoder-Decoder architecture for solving NLP problems had been very popular even before the advent of the Attention Mechanism. In particular, for neural machine translation the initial models used an Encoder to encode the source sentence and a Decoder to translate it into the target language. It turned out, however, that compressing the source into a fixed-length vector was a bottleneck. In 2015, Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio suggested an extension to the basic architecture by adding an attention mechanism: an automatic (soft-)search for the parts of the source sentence that are most relevant for translating a given target word. After training and validating several models, the BiLingual Evaluation Understudy (BLEU) score on the test set was remarkable [8]. In 2017, the paper "Attention is All you Need" proposed to abolish recurrence and convolution entirely and to rely exclusively on the Attention Mechanism; this is the initial Transformer architecture (figure 1) [15].

Attention Mechanism

First, we examine the general aspects of the Attention Mechanism; in the following subsection we focus on the Multi-Head Attention. All the strands of the Attention Mechanism revolve around the following triptych: Queries, Keys and Values. Understanding this triad gives a solid background for the comprehension of Transformers. So what exactly is this essential trio?
Consider, for example, a sentence of 14 words. From the trained matrix (the memory) we obtain 14 × 14 = 196 values: for example, the value for the pair {for, a} expresses how related the words "for" and "a" are, and the value for {man, victories} represents the association between the words "man" and "victories".
Values: For every one of the 14 words, we have 14 values that describe the relevance of this word to the others.
Simply put, the Attention Mechanism assigns high attention to related words and low attention to unrelated ones. It helps the model focus on the points that are important for the translation.
Mathematically, in the original Transformer model, the Attention Mechanism is called "Scaled Dot-Product Attention" and is given by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$

where Q, K and V are the Queries, Keys and Values and d_k is the dimension of the Queries and Keys.
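To make this concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention of formula 1; it is our own illustration, and the tensor shapes are an assumption rather than something taken from the paper's figures.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched inputs.

    Q, K, V: tensors of shape (batch, heads, seq_len, d_k).
    mask:    optional 0/1 tensor broadcastable to the score matrix.
    """
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention probabilities
    return torch.matmul(weights, V), weights
```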

Components of Transformer
We will analyze and explain all the components of the initial Transformer architecture (figure 1). The model as described here concerns the training part: supervised learning in which the inputs are sequences in a source language (such as sentences) and the outputs are the corresponding sequences in the target language of the translation. For example, the WMT collection includes datasets that can be used for this training [16]. An analysis of the components follows:
Inputs: Sequences of the source language.
Outputs: The corresponding sequences in the target language, i.e., the desired translations.
Input Embedding: Since neural networks learn from numbers, the input sequences have to be converted into vectors. Word2Vec is one algorithm for doing this.
Output Embedding: As with the input, the output sequences have to be converted into vectors.
Positional Encoding: Having neither recurrence nor convolution, Transformers do not handle the tokens of a sequence one by one. Sequential operations would prevent parallel computing, while self-attention by itself ignores the order of the sequence. Since the relative position of the words in a sentence is essential for understanding its meaning, this information has to be injected. For this reason, the positional encoding was proposed in the initial architecture:

$$pe(i, 2j) = \sin\!\left(\frac{i}{10000^{2j/d}}\right) \qquad (2)$$

$$pe(i, 2j+1) = \cos\!\left(\frac{i}{10000^{2j/d}}\right) \qquad (3)$$

where pe(i, 2j) and pe(i, 2j+1) are the elements in row i and columns 2j and 2j+1, respectively, of the positional encoding matrix P, which is added to the matrix containing the d-dimensional embeddings of the n tokens of a sequence.
Multi-Head Attention: The single Attention Mechanism described in the previous subsection, applied h times in parallel. This parallelization adds diversity to the learning: in most neural networks the learning is stochastic rather than deterministic, which means that different training runs produce different results. Applying the single Attention Mechanism h times therefore provides a more objective and less biased learning. In the initial Transformer model, h = 8 parallel attention layers were used. Finally, a concatenation is applied, combining all h heads.
Mathematically, the formula for the Multi-Head Attention is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \qquad (4)$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \qquad (5)$$

where h is the total number of heads and the matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ (and $W^{O}$) are the linear projections of the queries, keys and values (and of the concatenated output), one set for each of the h heads.
Masked Multi-Head Attention: The masking in the Multi-Head Attention of the Decoder hides the future, i.e. the subsequent words: each prediction should depend only on the output generated so far, so the auto-regressive property is preserved. From a practical, programmatic point of view, the mask simply specifies that each query may attend only to the positions up to its own (a small sketch is given after this list).
Add & Norm: "Add" is the residual connection for adding the input to the output.The idea first appeared in the ResNet deep learning model.[17] The term "Norm" refers to the Layer Normalization, a method for expediting and stabilizing the training process.[18] Feed Forward: Classical Feed Forward layers for reforming the attention vectors to the desired forms as inputs to the next layers.
Linear: Another feed-forward layer that classifies the outputs of the Decoder. Its output is called the logits vector.
Softmax: A layer that converts the logits to probability scores between 0 and 1. The predicted translation word is the one with the highest probability.
Encoder: This part will be used in our model. In the initial Transformer architecture the Encoder consists of 6 identical layers, each with 2 sublayers: the Multi-Head Attention and the Feed Forward. The Add & Norm layers improve the performance but do not play a key role. The Encoder can be thought of as a translator converting a human language into an "engine language".
Decoder: This part will not be used in our model. In the original model it also has 6 identical layers, each consisting of 3 sublayers: the Masked Multi-Head Attention, the Multi-Head Attention and the Feed Forward. The Decoder can be thought of as a translator converting an "engine language" into a human language.
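To illustrate the masking mentioned above for the Masked Multi-Head Attention, here is a minimal causal-mask sketch in PyTorch; it is our own example, not taken from the paper's code.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example: for a 4-token sequence the mask is
#   [[1, 0, 0, 0],
#    [1, 1, 0, 0],
#    [1, 1, 1, 0],
#    [1, 1, 1, 1]]
print(causal_mask(4).int())
```

Such a mask can be passed to the scaled dot-product attention sketched earlier, where the masked positions receive a very large negative score before the softmax.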

Our Architecture
Instead of translating between two human languages, our model predicts the trends of forex instruments based on historical data. The whole idea rests on the concept of focusing on the important points and neglecting the unimportant ones, exactly as the Attention Mechanism does. In this section we explain and analyze the dataset, the input, the output and, finally, the structure of our model.

Dataset
The two primary datasets are the tick data of the instruments EURUSD and GBPUSD from April 2023 to July 2023, but training is performed only on July 2023. For this month, we created a new dataset with one example per active forex minute; every minute has its own input and output.

Input
Every minute has as input a vector of 22 elements. The first one is the high bid minus its Exponential Moving Average with smoothing factor α = 0.70 and is called hb070. The second one is hb070 minus its Exponential Moving Average with α = 0.73 and is called hb073. Continuing in this way, we obtain the following set of 10 elements for the high bid: {hb070, hb073, hb076, hb079, hb082, hb085, hb088, hb091, hb094, hb097}. We then do exactly the same for the low ask, obtaining 10 more elements: {la070, la073, la076, la079, la082, la085, la088, la091, la094, la097}. The last 2 elements of the input vector derive from formulas combining c, the close price, l, the low price, and h, the high price, for both the bid (b) and the ask (a). This is the "language" we want to translate FROM.
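As an illustration only, the cascaded EMA features could be computed as in the following sketch; the EMA recursion used here (EMA_t = α·x_t + (1 - α)·EMA_{t-1}) and the function names are our own assumptions, not the paper's actual preprocessing code.

```python
import numpy as np

ALPHAS = [round(0.70 + 0.03 * k, 2) for k in range(10)]  # 0.70, 0.73, ..., 0.97

def ema(series: np.ndarray, alpha: float) -> np.ndarray:
    """Exponential moving average with smoothing factor alpha."""
    out = np.empty_like(series, dtype=float)
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = alpha * series[t] + (1.0 - alpha) * out[t - 1]
    return out

def cascaded_ema_features(prices: np.ndarray) -> np.ndarray:
    """Build the 10 cascaded residuals (e.g. hb070 ... hb097) for one price series."""
    features = []
    current = prices.astype(float)
    for alpha in ALPHAS:
        residual = current - ema(current, alpha)  # series minus its EMA
        features.append(residual)
        current = residual                        # the next residual is computed on this one
    return np.stack(features, axis=1)             # shape: (n_minutes, 10)
```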

Output
Every minute has as output a vector of 10 binary elements, corresponding to 10 positions opened at the end of that minute with the following profit/loss characteristics: earning 1 pip or losing 3 pips, earning 2 pips or losing 4 pips, earning 3 pips or losing 5 pips, earning 4 pips or losing 6 pips, and earning 5 pips or losing 7 pips, for both a buy and a sell position. The value 0 means the position loses and the value 1 means it earns. This is the "language" we want to translate TO.
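A hypothetical sketch of how one such binary label could be produced for a single position; the tick handling, the pip size and the function name are our own assumptions.

```python
def label_position(future_prices, entry, take_profit_pips, stop_loss_pips,
                   side="buy", pip=0.0001):
    """Return 1 if the take-profit is hit before the stop-loss, else 0.

    future_prices: iterable of prices observed after the position is opened.
    entry:         the opening price of the position.
    side:          "buy" or "sell" (for a sell, profit/loss directions are mirrored).
    """
    sign = 1.0 if side == "buy" else -1.0
    tp = entry + sign * take_profit_pips * pip
    sl = entry - sign * stop_loss_pips * pip
    for p in future_prices:
        if sign * (p - tp) >= 0:
            return 1  # take-profit reached first
        if sign * (p - sl) <= 0:
            return 0  # stop-loss reached first
    return 0          # neither level reached within the horizon
```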

Multi-Head Attention
First, we define the class for the Multi-Head Attention Mechanism: we initialize the dimension of the model, the number of heads and all the linear layers that transform the inputs. We then define a method for the Scaled Dot-Product Attention, which calculates the attention scores and probabilities. We also define two methods for splitting and combining the heads, since our attention mechanism is multi-head. Finally, in the classical forward method, we compute the Query, Key and Value matrices, apply the Scaled Dot-Product Attention and combine the heads, always in accordance with formulas 4 and 5. It is worth mentioning that our Multi-Head Attention class does not include masking, as it is not needed for the Encoder (figure 2).
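The paper's actual implementation is shown in figure 2; as a stand-in, the following is a minimal PyTorch sketch of such a Multi-Head Attention module, with layer names and shapes chosen by us.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        # Linear projections for queries, keys, values and the output (formulas 4 and 5)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        b, n, _ = x.size()
        return x.view(b, n, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        b, _, n, _ = x.size()
        return x.transpose(1, 2).contiguous().view(b, n, self.num_heads * self.d_k)

    def forward(self, q, k, v):
        q = self.split_heads(self.w_q(q))
        k = self.split_heads(self.w_k(k))
        v = self.split_heads(self.w_v(v))
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)  # scaled dot products
        attn = torch.softmax(scores, dim=-1)                                 # attention probabilities
        return self.w_o(self.combine_heads(torch.matmul(attn, v)))
```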

Feed Forward
We define the Feed Forward class, the second part of the Encoder, which has three layers: two linear layers and a ReLU activation (figure 3).
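A possible PyTorch sketch of this block; the hidden dimension d_ff is our own assumption.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # first linear layer
        self.relu = nn.ReLU()                 # non-linearity between the two linear layers
        self.fc2 = nn.Linear(d_ff, d_model)   # second linear layer, back to the model dimension

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))
```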

Positional Encoding
We define the Positional Encoding class. It uses formulas 2 and 3, and its output is added to the embedded input (figure 4).
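A minimal PyTorch sketch of such a module, implementing formulas 2 and 3; the maximum sequence length is our own assumption.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # formula 2: even columns
        pe[:, 1::2] = torch.cos(position * div_term)   # formula 3: odd columns
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[:, : x.size(1)]
```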

Encoder
We first instantiate the Multi-Head Attention Mechanism and the Feed Forward layers, using the classes already defined. We also define two Layer Normalization layers and one Dropout layer. Finally, in the classical forward method, we apply the sequence of layers as presented in the initial Transformer model: first the Multi-Head Attention Mechanism, then "Add & Norm" using the Layer Normalization, then the Feed Forward and finally "Add & Norm" again (figure 5).
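Reusing the MultiHeadAttention and FeedForward classes sketched above, an Encoder layer along these lines could look as follows; this is a sketch, not the paper's exact code from figure 5.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Multi-Head Attention followed by "Add & Norm"
        x = self.norm1(x + self.dropout(self.attention(x, x, x)))
        # Feed Forward followed by "Add & Norm"
        return self.norm2(x + self.dropout(self.feed_forward(x)))
```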

Our Model
Having defined all the necessary classes, we are ready to assemble our model. First, we define the Encoder embedding, the Positional Encoding, the Encoder layers, the penultimate Linear layer, the final Sigmoid layer and, lastly, a Dropout layer.
In the forward method, we first apply the Encoder embedding, the Positional Encoding and the Dropout. We then apply the Encoder itself, and the output of our model is the output of the Encoder after the Linear and then the Sigmoid layers (figure 6). It is worth mentioning that, as expected, our model has neither a Decoder nor a mask-generating method.
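Putting the sketched pieces together (and reusing the PositionalEncoding and EncoderLayer classes above), a minimal version of the overall model under our assumptions, with input dimension 22, output dimension 10 and hyperparameters chosen by us rather than taken from figure 6, could be:

```python
import torch.nn as nn

class ForexTransformer(nn.Module):
    def __init__(self, input_dim: int = 22, output_dim: int = 10,
                 d_model: int = 64, num_heads: int = 8, d_ff: int = 256,
                 num_layers: int = 6, dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Linear(input_dim, d_model)        # "Encoder embedding" of the 22 features
        self.pos_encoding = PositionalEncoding(d_model)
        self.dropout = nn.Dropout(dropout)
        self.encoder = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.linear = nn.Linear(d_model, output_dim)          # penultimate Linear layer
        self.sigmoid = nn.Sigmoid()                           # final Sigmoid for the 10 binary outputs

    def forward(self, x):
        x = self.dropout(self.pos_encoding(self.embedding(x)))
        for layer in self.encoder:
            x = layer(x)
        return self.sigmoid(self.linear(x))
```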

Results
The training was performed on the final dataset of 30,000 active minutes of July 2023; using a GPU, training took 3 minutes.
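A hypothetical training loop for such a model, using binary cross-entropy on the sigmoid outputs; the optimizer, learning rate, number of epochs and dummy tensors below are our own assumptions and stand in for the paper's actual training setup.

```python
import torch
import torch.nn as nn

# Dummy data standing in for the real dataset: 30,000 minutes, 22 input features, 10 binary labels.
X = torch.randn(30000, 1, 22)                  # a sequence length of 1 per minute is our assumption
y = torch.randint(0, 2, (30000, 1, 10)).float()

model = ForexTransformer()
criterion = nn.BCELoss()                       # binary cross-entropy on the sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(X), y)              # full-batch step for simplicity
    loss.backward()
    optimizer.step()
```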
With a cross-entropy training loss below 0.2, the results indicate that Transformers are a promising tool for profitable strategies in high-frequency trading. All experiments were compared with the naive random-investment technique. The choice of profit and loss margins already builds a bias into the system which, although it points in the direction of profit, in most cases leads to a loss because of the spread. The results are summarized in table 1.

Conclusion
Choosing to invest in high-speed trading offers the chance of high returns by exploiting fast price changes, regardless of whether or not there are trends. On the other hand, any such attempt faces high noise that can cancel any attempt to gain profit. Therefore, not only the architecture of the forecasting system but also the appropriate choice of the historical data representation technique is of great importance. Transformers, which incorporate the attention mechanism, can detect patterns that predict rapid changes in a stock's price with high probability. This can be done by representing the data on different time scales, so that support and resistance levels are present in the input data; in the present work this was achieved by encoding moving averages at different time scales. The results are quite optimistic and give us the impetus for further research, both on the embedding of the historical data and on further improvement of the architecture.

Table 1. Results compared with the naive random technique.