Machine Learning Approaches in Stock Price Prediction: A Systematic Review

Stock price prediction is one of the most researched topics and attracts interest from academia and industry alike. With the emergence of Artificial Intelligence, various algorithms have been employed to predict equity market movements. Combined applications of statistics and machine learning algorithms have been designed either to predict the opening price of a stock on the very next day or to understand long-term market behaviour. This paper explores the different techniques used in the prediction of share prices, from traditional machine learning and deep learning methods to neural networks and graph-based approaches. It presents a detailed analysis of these techniques and explores the associated challenges along with the future scope of work in the domain.


1. Introduction
The equity capital markets function as a platform for issuing and trading shares of listed companies. Stocks, or shares, represent fractional ownership in an organization, asset, or security; the stock market is thus a platform where investors can buy and sell ownership of investable assets or shares [1]. The individuals who participate in such an exchange of stocks and assets are termed market participants. These participants can be categorized as Domestic Retail Participants, NRIs and Overseas Citizens of India (OCIs), Domestic Institutions, Domestic Asset Management Companies (AMCs), or Foreign Investors.
One way for organizations to raise funds for expanding their business or repaying debt is to go public and issue stocks that are subsequently traded in the secondary markets, also known as stock exchanges. By offering shares instead of borrowing capital, the firm reduces its chances of incurring losses and debt and avoids paying interest. Shareholders, or investors, in turn can make profits [2] either from regular dividends paid on the stocks they hold or by selling their shares when the price rises above the price at which they were purchased. Predicting the market value of stocks is therefore of great interest to those who engage in the stock market.
Stock market prediction is the attempt to forecast the future value of a stock, a market sector, or even the entire market. It has drawn the attention of companies, traders, market participants, data analysts, and computer engineers working in Machine Learning (ML) and Artificial Intelligence (AI). Investing in the market is subject to various risks: the value of a company's shares depends heavily on the profits and performance of the organization in the marketplace and can vary with factors such as government policies, microeconomic indicators, and demand and supply. These variations are studied in order to develop software using techniques such as ML, Deep Learning, Neural Networks, and AI.
Such systems can help investors anticipate the situation of a company on the basis of past and present data and current market conditions, and can guide their decisions so that they limit losses and improve returns. One approach to predicting stock prices is the big data approach, which aims to derive insights from large amounts of publicly available data analyzed on platforms such as Hadoop [3]. The deep learning approach is based on computations performed by neural networks [4]. Long Short-Term Memory (LSTM) [5] is a special type of Recurrent Neural Network (RNN) designed to overcome the problem of long-term dependencies. Another way to forecast equity prices is to analyze sentiment in social media data [6] or news stories, which helps determine the general trend that the shares of a particular company or industry may take based on collective opinion. Stock values are often treated as a time series, so time series analysis [7] is another popular method for forecasting stock prices.
In this paper:
• We present an overview of the key technical approaches employed for the prediction of stock prices.
• We explore the challenges in each approach along with its future scope, and draw a comparative study of these approaches.
The paper is structured as follows. Section 1 introduces the stock market and lists the techniques employed for stock price prediction. Section 2, the Literature Review, gives an overview of research in the field, categorized under Traditional ML Techniques, Deep Learning and Neural Networks, Time Series Analysis, and Graph-Based Approaches. Section 3 analyzes the major contributions, while Section 4 explores the challenges in the existing mechanisms. Lastly, Section 5 concludes the paper and outlines the future scope of work.

2. Literature Review

2.1. Traditional Machine Learning Techniques
The authors of [8] studied the behavior of the stock market and sought the best-fitting model among several traditional machine learning algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbor (KNN), and Softmax. In their comparative study, several technical indicators were applied to data gathered from sources including Yahoo and NSE-India. Comparing the accuracy of each model showed that RF gave the best results for large datasets, whereas the Naive Bayesian classifier achieved the highest accuracy for small datasets. The authors also observed that accuracy decreased as the number of technical indicators was reduced.
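To make the notion of a technical indicator concrete, the sketch below computes two widely used ones, a simple moving average and a relative strength index, from a list of closing prices. It is an illustration only: the choice of indicators, the window lengths, the function names and the price values are assumptions and are not taken from [8]. The RSI variant shown uses plain averages over the window rather than Wilder's smoothing.

```python
def simple_moving_average(prices, window):
    """Return the SMA for each position that has a full window behind it."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]


def relative_strength_index(prices, window):
    """Simple-average RSI over the last `window` price changes (0 to 100)."""
    if len(prices) < window + 1:
        raise ValueError("need at least window + 1 prices")
    # Day-over-day changes for the most recent `window` days.
    changes = [b - a for a, b in zip(prices[-window - 1:-1], prices[-window:])]
    avg_gain = sum(c for c in changes if c > 0) / window
    avg_loss = sum(-c for c in changes if c < 0) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Hypothetical closing prices, for illustration only.
closes = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
features = {
    "sma_3": simple_moving_average(closes, 3)[-1],
    "rsi_5": relative_strength_index(closes, 5),
}
```

Each indicator becomes one input column for a classifier such as RF or SVM, so adding or removing indicators changes the dimensionality of the feature vector the models are trained on.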
The paper [9] used TF-IDF features to forecast next-day stock prices from data gathered from different news channels. The authors computed TF-IDF weights to score each word. A Hidden Markov Model (HMM) was then built to calculate the probability of a sequence; it contained the probabilities of switching between values. With this model the authors observed trends of positive and negative predictions that matched only partially, with an error of 0.2 to 4%. Increasing the size of the dataset, employing other machine learning algorithms, or increasing the number of technical indicators and input features could lead to higher accuracy.
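The TF-IDF weighting mentioned above can be sketched as follows. This is a minimal version under stated assumptions: term frequency is the count divided by document length, inverse document frequency is the unsmoothed logarithm log(N / df), and the tokenised headlines are hypothetical. The exact variant used in [9] is not specified in this review.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenised news headlines.
headlines = [
    ["shares", "rise", "after", "earnings"],
    ["shares", "fall", "after", "warning"],
    ["earnings", "beat", "forecast"],
]
scores = tf_idf(headlines)
```

Words that occur in every document receive a weight of zero under this variant, while words that are specific to a single headline, such as "rise" or "fall" here, receive the highest weights.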
Traditionally, only historical data was used for forecasting share prices. However, analysts now recognize that relying purely on historical data is insufficient because many other factors are key to determining the stock price. In [10] the authors study and apply different methods to predict stock prices, but a high rate of accuracy is still not achieved even after analyzing the major factors affecting the stock price. The authors review major techniques such as SVM, Regression, and Random Forest, and also analyze hybrid models that combine two or more techniques. According to the authors, some models work better with historical data than with sentiment data, and fusion algorithms yielded higher prediction accuracy.
The paper [11] by Kunal Pahwa et al. uses Linear Regression, a supervised learning approach, to predict stock prices. The work outlines the entire process of using a given ...
The paper [12] by Meghna Misra et al. concludes that predictions made using the Linear Regression model are more accurate after Principal Component Analysis (PCA) is applied to the data to pick out the most relevant components. The paper further reports that SVM demonstrates high accuracy on non-linear classification data; Linear Regression is preferred for linear data because of its high confidence value; Random Forest achieves a high accuracy rate on a binary classification model; and the Multilayer Perceptron (MLP) yields the least error when making predictions.
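As an illustration of the linear regression approach discussed for [11] and [12], the sketch below fits a straight line to a short price series by ordinary least squares and extrapolates one step ahead. It assumes a single input variable (the trading-day index) and hypothetical prices; the PCA step described in [12] is not included, and the function name is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical closing prices indexed by trading day.
days = [0, 1, 2, 3, 4]
closes = [100.0, 102.0, 104.0, 106.0, 108.0]
slope, intercept = fit_line(days, closes)
next_day_estimate = slope * 5 + intercept
```

A straight-line fit of this kind extrapolates the recent trend only, which is consistent with the observation that linear regression suits data that is approximately linear.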
Many of the aforementioned techniques are not limited to stock price prediction but can be used broadly in the financial markets, as the authors of [13] conclude after studying the application of machine learning models to analyze financial trading and design optimal trading strategies. After a quantitative analysis of different techniques, the authors recommend delving further into behavioural finance to evaluate market or investor psychology and so understand market fluctuations. They propose using text mining and machine learning methods to monitor public interaction on digital financial trading platforms.

2.2. Deep Learning and Neural Networks
Yoojeong Song and Jongwoo Lee of Sookmyung Women's University observed that, out of a large set of input features, only a few actually affect the stock price. They therefore studied these input features to determine which ones give the best prediction of stock value. Their paper [14] proposes three Artificial Neural Network models that use multiple-input features, binary features, and technical features respectively. Comparing the accuracy of the models showed that the model with binary features performed best, and the authors concluded that binary features are lightweight and the most suitable for stock prediction. A limitation of the study is that converting the features to binary discards some information relevant for prediction.
Specific techniques such as the Multi-Layer Perceptron (MLP), Sequential Minimal Optimization, and the Partial Least Square (PLS) classifier were studied and applied to Stock Exchange of Thailand data in the paper 'Stock Closing Price Prediction Using Machine Learning' [15] by Pawee Werawithayaset, which used 12 months of data on SET100 stocks. Although the paper does not address long-term investment decisions, it presents evidence that, for this dataset, the Partial Least Square method yielded the lowest error, followed by Sequential Minimal Optimization, while the Multilayer Perceptron showed the highest error of the three algorithms.
The work in [16] focuses on the effect of market indices on stock price prediction. The model identifies the variables and the relationships between the indices, overcomes the limitations of the traditional linear model, and uses LSTM to capture the dynamics of the S&P 500 Index. The paper also analyzes the sensitivity of the LSTM model's internal memory. However, the study has a limitation: the difference between the predicted and actual values becomes large after a certain point, so the model cannot be used to build a profitable trading strategy.
The work in [17] proposes a system that recommends stock purchases to buyers. The authors' approach uses LSTM to combine predictions from historical and real-time data. In the RNN model, the latest trading data and technical indicators are fed to the first layer, followed by the LSTM layer and a dense layer; the output layer then gives the predicted value. These predicted values are integrated with summarized data collected from news analytics to generate a report showing the percentage change.

2.3. Time Series Analysis
The paper "Share Price Prediction using Machine Learning Technique" [18] represented the stock price as a time series to avoid complications in the training process. It used normalised data and a Recurrent Neural Network model, whose predictions were very close to the actual values; the authors therefore considered machine learning algorithms the best choice for forecasting stock prices.
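The data preparation implied by this setup can be sketched as follows: prices are scaled to a common range and then cut into fixed-length input windows, each paired with the value that follows it. The min-max scaling, the lookback of three steps, the function names and the prices are all assumptions for illustration; [18] is described here only as using normalised data with an RNN.

```python
def min_max_normalise(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series carries no spread
    return [(v - lo) / (hi - lo) for v in values]


def sliding_windows(series, lookback):
    """Pair each run of `lookback` values with the value that follows it."""
    return [(series[i:i + lookback], series[i + lookback])
            for i in range(len(series) - lookback)]


# Hypothetical closing prices.
closes = [50.0, 55.0, 60.0, 58.0, 70.0]
scaled = min_max_normalise(closes)
samples = sliding_windows(scaled, lookback=3)
```

Each `(window, target)` pair is one training example for a sequence model. In practice the minimum and maximum would normally be taken from the training portion only, so that information from the test period does not leak into the inputs.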
The authors of [19] noticed that the daily sentiment scores of various companies have an impact on their stock prices: information or news posted on social media platforms by or about an organisation can influence investors to buy or sell the company's stock, thus affecting its value. The authors therefore proposed a stock market prediction model that employs sentiment analysis as one of its indicators.
The algorithm used data collected from online platforms such as Yahoo Finance, together with positive, negative, and neutral tweets, as features for the prediction, and computed the stock price movement from the opening and closing prices of the respective company's stock. The authors also noted the effect of holidays, seasonality, trends, and non-periodic data, and designed a curve time series model that takes all these components into account. This led the authors to employ a Generalised Additive Model to maximize prediction quality and accommodate new components. Finally, Multiple Linear Regression was used to train the model and predict stock prices for the next 10 days.
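A feature row of the kind described above, combining the price movement with tweet sentiment, might be assembled as in the sketch below. The definition of movement as the relative change from open to close, the use of class shares rather than raw counts, and all names and values are assumptions made for illustration and are not taken from [19].

```python
def price_movement(open_price, close_price):
    """Relative intraday change: positive when the stock closed higher."""
    if open_price <= 0:
        raise ValueError("open_price must be positive")
    return (close_price - open_price) / open_price


def feature_row(open_price, close_price, tweet_labels):
    """Combine the price movement with the share of each sentiment class."""
    total = len(tweet_labels)
    shares = {
        label: (tweet_labels.count(label) / total if total else 0.0)
        for label in ("positive", "negative", "neutral")
    }
    return {"movement": price_movement(open_price, close_price), **shares}


# Hypothetical prices and tweet labels for one company on one day.
row = feature_row(200.0, 205.0,
                  ["positive", "positive", "neutral", "negative"])
```

Rows like this, one per company per day, could then serve as the inputs to a multiple linear regression.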

2.4. Graph-Based Approach
Pratik Patil et al. [20] model the stock market as a graph network. The authors capture both correlation and causation by using historical price data together with sentiment analysis, which helps account for the different factors that determine the stock price. The Graph Convolutional Network (GCN) model proposed in the paper is vulnerable to the exploding gradient problem, because nodes with a higher degree have larger values in their convolved feature representation, while nodes with a lower degree have smaller values. A solution to this issue could reduce the complexity of model training. It would also be interesting to examine the performance of GCNs on more conventional time series forecasting problems.
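The degree effect noted above can be made concrete with a single neighbourhood aggregation step. The sketch below compares plain summation over neighbours with a widely used symmetric degree normalisation, in which each edge (i, j) is scaled by 1 / sqrt(deg_i * deg_j). It is an assumption-laden illustration, not the model of [20]: features are scalars, there are no learned weights or non-linearity, and the star graph is hypothetical.

```python
import math


def aggregate(adjacency, features, normalise):
    """One neighbourhood aggregation step with self-loops, without weights.

    With normalise=False each node sums its neighbours' features, so the
    result grows with node degree. With normalise=True each edge (i, j) is
    scaled by 1 / sqrt(deg_i * deg_j).
    """
    n = len(adjacency)
    # Add self-loops so that a node keeps its own feature.
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            weight = a_hat[i][j]
            if normalise and weight:
                weight /= math.sqrt(degree[i] * degree[j])
            total += weight * features[j]
        out.append(total)
    return out


# Hypothetical star graph: node 0 is linked to nodes 1, 2 and 3.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
features = [1.0, 1.0, 1.0, 1.0]
raw = aggregate(adjacency, features, normalise=False)
scaled = aggregate(adjacency, features, normalise=True)
```

Without normalisation the hub node's value is twice that of a leaf node in this example; with normalisation the gap narrows, which is why degree scaling is a common way to keep feature magnitudes comparable across nodes.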
Raehyun Kim et al [21] proposed a Hierarchical Attention Network for Stock Prediction (HATS) to forecast share prices and stock index market movement by applying the concept of Graph Theory and Graph Neural Networks. The authors proposed this new method to selectively cluster the available data on the different relations and add that information to the representation. The Hierarchical Attention Network is key to improving the performance and is used to assign different weight values for selection of information based on its importance and relevance.
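The idea of weighting information by relevance can be illustrated with a generic attention step, in which relevance scores are turned into weights with a softmax and used to form a weighted sum of neighbour representations. This is a simplified sketch of attention in general, not the HATS architecture: the scores are given rather than learned, and the vectors and names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, scores):
    """Weighted sum of equal-length vectors using softmax(scores)."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(dim)]


# Hypothetical embeddings of three related stocks and their relevance scores.
neighbours = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevance = [2.0, 0.0, 0.0]
summary = attend(neighbours, relevance)
```

Because the first neighbour has the highest score, its embedding dominates the summary vector; in a hierarchical design the same mechanism would be applied first within each relation type and then across relation types.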
Another important work in this direction is by Yang Lieu et al. [22], who use the information characteristics of tuples to build a knowledge graph that is later used for feature selection. The authors use a CNN to extract features and build the semantic information of news related to the stock. The combination of deep learning and knowledge graphs has proven useful for effective feature extraction that retains semantics. However, because training sets of financial information are limited, knowledge graph extraction remains challenging.

3. Analysis of Major Contributions
The methods applied to share price prediction are broadly divided into four categories:
• Traditional Machine Learning Methods: includes traditional methods such as linear regression analysis and logistic regression analysis.
• Deep Learning and Neural Networks: many of these techniques make use of RNNs and of LSTMs, which are a special type of RNN.

Traditional ML
Advantages: Traditional ML algorithms, particularly SVM, yield relatively higher accuracy as they work well with datasets of high dimensionality.
Limitations: These algorithms exhibit high sensitivity to outliers.

Deep-Learning
Advantages: RNNs and LSTMs are the go-to Deep Learning algorithms employed for the task. RNNs offer an advantage as they capture the context of the data while training. LSTMs perform reasonably well as they can correlate the non-linear time series data in the delay state. [26]
Limitations: Requires high training time and large memory.

... well for linear data and provide reliable stock price forecasts for the short term. [5]
... term forecasts of stock prices.

Graph-Based
Advantages: Focuses on forming relationships based on correlation and causation among the nodes, which is useful for exploring previously hidden insights and aids informed decision making.
Limitations: Traditional ML algorithms still outperform graph-based approaches in terms of accuracy.

4. Challenges in Existing Mechanisms
This section lists some of the limitations observed in the reviewed research on stock market prediction.
Although the paper [8] considered 12 technical indicators to identify patterns in the stock market, the accuracy level lies between 50% and 70%; a larger number of technical indicators could be used to increase it. In [14], the authors used binary features, and converting the features to binary values resulted in the loss of some relevant data. The dataset considered in [9] was not large enough, and more data points are needed for better results. The paper [15] does not address long-term investment decisions based on the stock price.
The research in [11] was conducted using a dataset of a single company over 14 years. The stock market includes companies from many different sectors, and the shares of each sector may display a slightly different trend from the others. Despite the unique application of graph theory to stock prediction in [20], SVMs still achieve higher accuracy. In [10], a high rate of accuracy was still not achieved even after analyzing the major factors affecting the stock price, although the authors reviewed major techniques such as SVM, Regression, and Random Forest and also analysed hybrid models combining two or more techniques. From [8] it was observed that the accuracy of the algorithms falls as the number of technical indicators decreases, and that the RF algorithm delivers the best performance for large datasets while the Naive Bayesian classifier is best for small datasets. The paper [14] proposes binary features as ideal for stock price prediction because they are lightweight and each implies some kind of event. The paper, however, used only ANNs to implement the models, whereas other neural network models could also be used to obtain a comparative study.
Despite the dataset size in [9], the experiment showed satisfactory results, with a minimum error of 0.006% and a maximum of 3.9% in the predictions; a larger dataset could nevertheless be employed for better accuracy. The model proposed in [18] delivered predictions that were very close to the actual values, and its authors concluded that ML algorithms are the best approach for forecasting stock market prices.
The Partial Least Square method in [15] yielded the lowest error, followed by Sequential Minimal Optimization, while the Multilayer Perceptron showed the highest error of the three algorithms chosen. However, other indicators such as the RSI or the stochastic oscillator could be used to test the models further. Since this model focuses on predicting the closing price for the very next day, it needs further development and modification to be helpful for long-term investment decisions.
The authors of [11] conclude that the application of machine learning to stock price prediction can be further enhanced by delving into deep learning and neural networks. They suggest continuously optimizing the algorithm for better results by shifting to Support Vector Machines (SVM) and by testing different models as well as new and improved features. The authors of [10] concluded that some models work better with historical data than with sentiment data, and that fusion algorithms yielded higher prediction accuracy. The best performing model among those created in [20] is the GCN-based graph modelled from news co-mentions; a possible reason is that this graph is causation-based rather than correlation-based.
The authors of [12] find that when Principal Component Analysis (PCA) is used to choose the most relevant components from the data, predictions generated using the Linear Regression model have a higher accuracy rate. SVM exhibits high accuracy on non-linear classification data; Linear Regression is recommended for linear data as it has a high confidence value; Random Forest shows a high accuracy rate on a binary classification model; and the Multilayer Perceptron gives the least prediction error. The article thus offers a clearer understanding of which model to employ for a given kind of data. The authors of [13] conclude that a variety of techniques are available for using ML in the financial markets, with a wide range of accuracies and efficiencies. They recommend delving further into behavioural finance to evaluate market or investor psychology and understand market fluctuations, and propose using text mining and machine learning methods to monitor public interaction on digital financial trading platforms.

5. Conclusion and Future Scope
The stock market is tightly bound to a country's economic growth, attracts large investments, and issues equities in the public interest; forecasting the movement of stock prices and of the market is therefore essential for preventing large losses and making informed decisions. In this paper, we presented a comparative study of various algorithms for forecasting the prices of different stocks. The study extends from traditional ML algorithms such as RF, KNN, SVM, and Naive Bayes to Deep Learning and Neural Network models such as Convolutional Neural Networks, Artificial Neural Networks, and Long Short-Term Memory. It also covers other approaches such as sentiment analysis, time series analysis, and graph-based algorithms, and compares the results of these algorithms in predicting the stock prices of various companies.
In the future, researchers can focus on combining sentiment analysis of stock-related information with the numeric historical values of stocks when predicting stock prices. Effective stock recommendation systems can also be built by leveraging both sources of information. Deep learning based approaches can be further exploited for better and more efficient feature extraction. Knowledge graph approaches are promising solutions for building stock prediction engines; however, research should address the complexity and gradient issues of graphs with a large number of nodes. Our survey gives insights into future research directions in this area.