Identifying key products to trigger new exports: an explainable machine learning approach

Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability to export a target product in the near future. The explainers naturally identify a low dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers.


Introduction
The mechanisms underlying economic development [1] have been among the most studied topics in economics since the work of Adam Smith [2]. However, the identification of its determinants remains an open problem, despite the flourishing of different models and interpretations [3]; in particular, standard theories, based on aggregated measures of production inputs, have limited capacity to predict growth and to recommend specific industrial policies [4]. The line of research based on the works of [5][6][7] starts from the presence of the so-called capabilities, the set of endowments that countries have and that permit their industrialization and development. Capabilities are, in practice, hard both to define and to measure, since in principle they could span from human capital to infrastructure, government and so on. The solution proposed in [8] is to infer them from the export baskets, i.e. the diversification structure provided by the set of products exported by the country under investigation. This idea opened up the possibility of applying techniques and methodologies borrowed from physics and network science, which go under the name of economic complexity [9][10][11][12]. In particular, the approach discussed by Tacchella et al [11] aims at building a synthetic measure of the Fitness of a country, which is able to forecast GDP growth with a precision higher than state-of-the-art methodologies [13]. However, this approach provides a global picture of the country, while a more detailed analysis is often needed in order to provide specific industrial recommendations [14,15]. In this perspective, a number of papers built networks whose nodes are products and whose links are given by their similarity, proxied by their co-occurrences in the export baskets of countries [10,16,17]. In this way, two products can be defined as close in the sense that they share many of the capabilities needed to export them competitively. Co-occurrence-based approaches, however, have a low predictive performance, and this fact favors machine learning approaches as better tools to measure relatedness both at country [18][19][20] and firm level [21,22].
In [16,23,24], the authors proposed approaches to explicitly model the relationship among products, capabilities, and development. These frameworks naturally lead to the concepts of product progression [16,19,25] and arrow of development [26]: the relationship between products is often not undirected, or symmetric, as in the product space [10], but directed: countries start their development from simple products and gradually enter more sophisticated markets, following well-defined paths of development [16]. Clearly, the identification of the specific products enabling countries to competitively export a given target product is a key element in designing industrial policies and strategic patterns of development. Despite the importance of this investigation, a specific analysis was missing because of the lack of suitable tools and algorithms able to successfully forecast the exports of countries. However, thanks to the introduction of machine learning in economic complexity analysis [18], the available tools have reached a maturity such that this investigation can start providing concrete and scientifically validated results. This is the aim of the present paper: to provide an algorithmic approach based on a highly predictive machine learning method to measure the importance of single products and sectors for a country to export a specific target product within a given number of years. The link between starting and target products will be quantified by using the feature importance, a key tool of supervised machine learning algorithms that allows a clear interpretation of the outputs. Recently, the computer science community felt the necessity to provide tools (such as Shapley values) to increase the interpretability of machine learning models; this led to a number of theoretical results and practical investigations [27][28][29]. For an application of Shapley values to economic complexity-related issues, see [30,31].

The predictive framework
Our aim is to understand which products enable a country to export a given target product. To do so, we investigate the mechanisms underlying a machine learning based prediction approach [18]. This approach uses the competitiveness level of each country's export of each product as features [19]: some products will be dominant in the forecast exercise, while others will be practically irrelevant. The feature importance [19,32,33] will be our statistically validated measure of the ability of a product to activate another product. It is therefore of key importance to adopt a framework with excellent forecasting power. The approach discussed here, based on the Random Forest (RF) algorithm [34], outperforms the networks of co-occurrences [18] as well as other supervised machine learning algorithms [19], also when other data typologies are considered [18,21,22]. Here we briefly summarize the predictive framework. Full details are provided in the methods section.
The predictive task is represented by the out-of-sample forecast of the appearance of new links in the country-product temporal network [19]. At a given year y, the network represents whether country c exports product p in a competitive way or not. Mathematically, it is identified by the adjacency matrix M whose elements are

$$M_{cp}(y) = \begin{cases} 1 & \text{if } \mathrm{RCA}_{cp}(y) \geq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where c is a country, p is a product, and y is the selected year. RCA_cp(y) is the Revealed Comparative Advantage [35], and it quantifies the relative advantage in year y of country c in exporting product p (more details are provided in section 5). Given the knowledge of the network in a certain time interval, the RF algorithm can be trained in an appropriate cross-validated framework to make out-of-sample predictions on the matrix M_cp(y + δ), starting from the knowledge of M_cp(y), that is, which country exports which product.
In particular, a different RF model is trained for each product, and the other products are used as inputs, or features; in such a way, the RF learns from the past which products are usually associated with the target product.
In the present study we cover the time span 1996-2018, and choose a time interval δ = 5 years: the algorithm is trained on years 1996-2013 to make predictions on 2018. The number of countries is 169, while products are classified according to the Harmonized System (HS) 1992, which has a hierarchical structure: products can be aggregated in 97 different sectors (2-digit code level), or split into 5040 detailed products (6-digit code level). The size of the matrix M will change accordingly. In section 5 we provide more details about the data and the construction of the predictive model.

Feature importances
The interpretation of the predictions provided by the RF starts with the quantification of the importance the algorithm assigns to each feature (i.e. product) during the training procedure. In our setting, the goal is to forecast whether a product p at the 6-digit level will be exported by a country in year y + δ, knowing in which of the 97 2-digit sectors the country is active in year y (always in the RCA sense). The 2-digit sectors are hence used as binary features, whose value is 1 if the RCA of the country on the 2-digit sector is greater than 1, where the RCA is computed using the sum of the export volumes of the 6-digit products that belong to the sector. The importance of a feature is a measure of how informative the activity of a country in a sector (i.e. export or non-export) is in determining whether it will export the 6-digit product p after δ years. The decision to use 2-digit sectors as features is due to the computational time needed for the construction of the model, which would have been prohibitive if we had used all 5040 products.
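The aggregation step that produces the sector-level export volumes can be illustrated with a minimal sketch. It assumes that the export data are available as a dictionary keyed by country and 6-digit HS code (a hypothetical layout chosen for this example), and relies only on the fact that the first two digits of a 6-digit code identify its sector. The volumes and the code "850999" are invented for illustration.

```python
from collections import defaultdict

def aggregate_to_sectors(exports_6digit):
    """Sum 6-digit export volumes into 2-digit sectors.

    exports_6digit: dict mapping (country, hs6_code) -> export volume,
    where hs6_code is a 6-character string such as "850910".
    Returns a dict mapping (country, hs2_code) -> summed volume.
    """
    sector_exports = defaultdict(float)
    for (country, hs6_code), volume in exports_6digit.items():
        # The first two digits of the code identify the sector.
        sector_exports[(country, hs6_code[:2])] += volume
    return dict(sector_exports)

# Toy data: volumes are invented, "850999" is a hypothetical code.
example = {
    ("ITA", "850910"): 120.0,
    ("ITA", "850999"): 30.0,
    ("ITA", "640411"): 50.0,
}
sectors = aggregate_to_sectors(example)
```

The sector-level RCA, and hence the binary feature, would then be computed on these summed volumes.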
The quantification of feature importances is obtained using the Gini importance [36] (or mean impurity decrease), a Random Forest-specific measure assigning to the features importance values summing up to 1. The mathematical definition of Gini impurity is provided in section 5. Starting from the raw values, we performed a suitable statistical validation procedure, computing the corresponding p-values and imposing a validation threshold of 95%: such validation is based on the computation of the null importances, i.e. the importance values the algorithm assigns to each variable after its association with the target vector is broken (see materials and methods for a detailed description of the procedure) [37]. Only the statistically validated importances are kept, while the others are set to zero. Hence we obtain, for each of the 5040 predicted products, a vector containing the validated importance measures for the 97 aggregate productive sectors. We call the products retaining a significant importance value explainers: these products enhance the probability of a country to competitively export the target product as they signal the presence of the capabilities needed for it. In figure 1 we report the barplots of the feature importances for the products 'Tobacco (not stemmed or stripped)' (code 240110), 'Sports footwear' (code 640411) and 'Vacuum cleaners' (code 850910), showing the 10 most important and the 5 least important sectors. The colors represent whether the feature importance has been statistically validated (blue), or not (red). In all three cases we can notice how the explainers can be intuitively related to the products: e.g. the 2-digit sectors to which the 6-digit products belong are correctly recovered among the explainers (respectively, 'Tobacco and tobacco substitutes', code 24, 'Footwear; gaiters and the like', code 64, and 'Electrical machinery and equipment', code 85). This represents a first qualitative test of the ability of the implemented methodology to recover significant correlations between productive sectors and products, as learned by Random Forest in its training procedure.
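The validation step can be sketched as follows, under assumptions that go beyond the text: the null importances are taken to come from repeated trainings on a shuffled target vector, the p-value is taken to be the one-sided empirical share of null runs that reach the observed importance, and the 95% threshold is read as p < 0.05. The function name and the toy numbers are hypothetical.

```python
def validate_importances(observed, null_samples, alpha=0.05):
    """Keep only importances that are significant against their null distribution.

    observed: list of importances, one per feature.
    null_samples: list of lists; null_samples[r][j] is the importance of
        feature j in the r-th model trained on a shuffled target vector.
    Returns (validated, p_values); non-significant importances are set to 0.
    """
    n_runs = len(null_samples)
    validated, p_values = [], []
    for j, value in enumerate(observed):
        # Empirical one-sided p-value: share of null runs at least as large.
        exceed = sum(1 for run in null_samples if run[j] >= value)
        p = exceed / n_runs
        p_values.append(p)
        validated.append(value if p < alpha else 0.0)
    return validated, p_values

# Toy example with 3 features and 20 null runs (invented numbers).
observed = [0.60, 0.30, 0.10]
null_samples = [[0.30 + 0.01 * r, 0.35, 0.05 * (r % 5)] for r in range(20)]
validated, p_values = validate_importances(observed, null_samples)
```

In this toy case only the first feature survives, because its observed importance exceeds every null value; the other two are set to zero.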

Feature importance product space
The Gini importance vectors can be interpreted as high-dimensional representations of the products, like word embeddings [38] in natural language processing [39]. Indeed, they contain information about the productive background that the Random Forest algorithm recognizes as necessary or highly predictive for their future export. Hence, the distance between such vectors can be used as a proxy for products' similarity: two products whose Gini importance vectors are close need a similar presence/absence pattern of capabilities in order to be competitively exported.
To test this hypothesis we projected the 97-dimensional vectors on a 2-dimensional continuous space, using t-SNE [40], a popular dimensionality reduction technique used for visualizing high-dimensional data in a lower-dimensional space. In short, it models the pairwise similarities between high-dimensional data points and maps them to a lower-dimensional space, preserving local structure and revealing meaningful patterns or clusters. The result, which we call Feature Importance Product Space (FIPS), is reported in figure 2. Here, each dot represents a 6-digit product, and the colors correspond to ten aggregate macro-categories (see supplementary information section S3). The structure of the FIPS is heterogeneous, with clusters of products belonging to single categories, as for Agrifood and Textiles (left side of the plot) and regions with the superposition of different product categories, as in the right side of the plot, where there is a mixing of Machinery, Vehicles, Chemicals and Instruments. This differentiation can be traced back to the complexity of products making up different sectors: less sophisticated sectors tend to be more distinguishable, as they need few capabilities, and therefore share similarity patterns with a smaller set of other products (see Materials and methods and supplementary information section S1 for further analyses). On the contrary, high-complexity products share large portions of the respective production lines and supply chains [41]. To highlight the ability of the space to identify the similarity of products even if they originally belong to different productive categories, we pinpointed two small clusters: the first (upper-left side of the figure) groups products related to fur manufacture; the second (lower-right side of the figure) puts together different typologies of products, all related to the spacecraft industry.

Predicting products' appearances with FIPS
The ability to make out-of-sample forecasts on the country-product network represents the natural field to test and compare the validity of relatedness measures [19]. Therefore, in order to quantify the goodness of the FIPS reconstruction and the amount of information it brings, we use it to predict the appearance of new products in M_cp(2018), employing a density-based approach [10]. In other words, we predict that countries will become competitive in new products which are close in the FIPS space to other products in which the country is already competitive. In practice, predictions on a single product, for every country, are based on the set of already exported products, each weighted by its link with the target product. In table 1 we report the predictive performance of the FIPS, together with the performance of the Random Forest from which the FIPS was built, and with the temporal auto-correlation baseline represented by RCA_cp(2013), both at the 6-digit and at the 2-digit aggregation level. The RCA baseline involves utilizing the country's revealed comparative advantage value for the product in 2013 as an estimate of the probability that the country will export the product in 2018: for the 2-digit baseline, the country's comparative advantage in sector s is attributed to all the 6-digit products p belonging to s. The rationale behind this approach is that it is more likely for a country to export a product in the future if it already has a positive RCA for that product.

Figure 2. The FIPS is a low-dimensional representation of the feature importance vectors: each dot is a product, identified by its explainers. The colors correspond to ten productive macro-categories. The structure of the space is heterogeneous, with some relatively isolated categories of products (e.g. Textiles and Agrifood), and areas occupied by a mixing of different kinds of products (e.g. the right side of the figure, where there is a mixing of Machinery, Vehicles, Chemicals and Instruments). The two insets are zooms which testify to the ability of the FIPS to group together similar products: the first (upper-left side) is composed of four different kinds of products, all related to furskins; the second (lower-right side) groups together products belonging to the spacecraft industry.

Table 1. Comparison of prediction performances of FIPS, Random Forest and RCA baseline. The values of the performance metrics show that the FIPS performance is overall comparable with the Random Forest, showing higher values of Best F1 and mP@10, but lower values of AUC-ROC and AUC-PR: this result is very important, as it guarantees that the FIPS not only provides a fully interpretable predictive model in terms of products' similarity relationships, but retains the predictive power of the Random Forest it was built from. The RCA baseline has the highest Best F1 Score when built using the full 6-digit level data, while it provides the worst overall performance at the 2-digit level.

The adopted performance metrics are (see Materials and methods for a detailed discussion):
• Best F1 Score: the F1-score [42], i.e. the harmonic mean of Precision and Recall [43], computed for the decision threshold that maximizes its value;
• AUC-ROC [44]: the area under the Receiver Operating Characteristic curve;
• mP@10: the average, over the countries, of the Precision score on the top 10 predicted products.
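Two of the metrics listed above, Best F1 Score and mP@10, can be illustrated with a short sketch. The function names are hypothetical, and scanning the observed scores as candidate thresholds is an assumption of this example rather than a description of the evaluation code actually used.

```python
def best_f1(y_true, y_score):
    """Maximum F1 over all decision thresholds taken from the scores."""
    best = 0.0
    for threshold in sorted(set(y_score)):
        predicted = [s >= threshold for s in y_score]
        tp = sum(1 for t, p in zip(y_true, predicted) if t and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if not t and p)
        fn = sum(1 for t, p in zip(y_true, predicted) if t and not p)
        if tp > 0:
            # F1 is the harmonic mean of precision and recall.
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best

def mean_precision_at_k(true_by_country, score_by_country, k=10):
    """Average over countries of the precision on the k highest-scored products."""
    precisions = []
    for truth, scores in zip(true_by_country, score_by_country):
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        precisions.append(sum(truth[i] for i in top) / len(top))
    return sum(precisions) / len(precisions)

# Toy example: one country, four candidate products (invented scores).
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
f1 = best_f1(y_true, y_score)
mp2 = mean_precision_at_k([y_true], [y_score], k=2)
```

In the toy example the best threshold is 0.7, which gives an F1 of 0.8, while the precision on the top 2 products is 0.5.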
The scores show that the FIPS performs better than the original RF for both Best F1 Score and mP@10, while achieving a lower value of AUC-ROC and AUC-PR: this result is extremely relevant, as it implies that the FIPS has a forecasting power comparable to the Random Forest it was built from, while providing a clear interpretability of its predictions in terms of similarity relationships between products. Moreover, we stress that the AUC-ROC metric is the least reliable, due to the strong class imbalance of the dataset (see [19] and Materials and methods). The RCA baseline at the 6-digit level, while trailing both FIPS and Random Forest in AUC-ROC, mP@10 and AUC-PR, outperforms both models in Best F1 Score. This is due to the different granularity of the inputs, i.e. the 6-digit products for the former, and the 2-digit aggregated sectors for FIPS and RF, as confirmed by the superior performance of the latter models with respect to the 2-digit RCA baseline, which provides the worst overall performance (note that when the RF is trained at 6-digit it easily outperforms the 6-digit RCA baseline [18,19]). However, the 6-digit RCA represents an important benchmark as it has been shown to perform substantially better than co-occurrence based approaches [18]. We expect, however, that the FIPS is uncovering fundamental capability-based explanations that are appreciably different from the autocorrelation signal expressed by the RCA, and this cannot immediately be seen from the forecasting performance scores. In order to assess the additional information carried by the FIPS with respect to the temporal auto-correlation of the network, we decided to plug both the prediction score on M_cp(2018) by FIPS and the 6-digit RCA_cp(2013) as variables into a logistic regression whose dependent variable is the possible activation of a product. The logit model is trained on the activations (RCA_cp(y) < 0.25 for y ∈ [1996, 2013]) in an appropriate cross-validated setting, to make out-of-sample predictions on M_cp(2018) (see Materials and methods). The results, reported in table 2, confirm the validity of the information carried by the FIPS as complementary with respect to the network auto-correlation in two ways. First of all, the logit model trained on both FIPS and RCA has the highest value of Pseudo R^2; secondly, this model displays a better predictive performance with respect to both logit models trained on RCA_cp(2013) and FIPS alone. We further compare it with the prediction accuracy provided by RCA_cp(2013), RF, and FIPS alone (i.e. without being used as variables in a logistic regression), showing that the FIPS + RCA logit model has the highest Best F1 Score, and it trails only Random Forest for the AUC-ROC score.

Table 2. [...] M_cp(2018). The values of Pseudo R^2 show that the information carried by the FIPS and the RCA baseline are complementary, as the logistic regression trained on both models shows the highest value. This is confirmed by the performance metrics, as the latter shows a performance higher than both FIPS and RCA, when used individually (both directly and in a logistic regression setting) to make predictions; Random Forest anyway retains the highest AUC-ROC value. All performances are computed on the new products activations defined by RCA_cp(y) < 0.25.
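The density-based FIPS prediction evaluated in this section can be sketched as follows. The score of a target product for a country is the similarity-weighted share of products the country already exports, in the spirit of the density of [10]. How the similarity is derived from the FIPS is not specified here, so the 1 / (1 + distance) kernel on 2-dimensional coordinates is purely an assumption of this example, as are the function names and the toy coordinates.

```python
import math

def similarity_from_points(points):
    """Turn 2-D product coordinates into a symmetric similarity matrix.

    The 1 / (1 + distance) kernel is an assumption made for this sketch.
    """
    n = len(points)
    return [[0.0 if i == j else 1.0 / (1.0 + math.dist(points[i], points[j]))
             for j in range(n)] for i in range(n)]

def density_scores(basket, similarity):
    """Score each product by the similarity-weighted share of products already exported."""
    n = len(basket)
    scores = []
    for target in range(n):
        total = sum(similarity[other][target] for other in range(n))
        active = sum(similarity[other][target] for other in range(n) if basket[other])
        scores.append(active / total if total > 0 else 0.0)
    return scores

# Toy space: products 0 and 1 are neighbours, products 2 and 3 lie far away.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
basket = [1, 0, 0, 0]  # the country currently exports product 0 only
scores = density_scores(basket, similarity_from_points(points))
```

In this toy space the product next to the one already exported receives a much higher score than the two distant products, which is the behaviour a density-based prediction is meant to capture.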

Feature importance and products' complexity
Another key assessment of this study is the unveiling of a connection between the feature importance vector of a product and its complexity. The complexity of a product, defined applying the Economic Fitness and Complexity algorithm [11,46] to the bipartite network country-product, is a non-monetary indicator related to the level of industrial sophistication needed to competitively export it on the global market. As such, we expect it to be connected to the nature of the explainers obtained for a product, as they represent the productive sectors recognized by our model as necessary for the future export of the product: the more complex a product, the more complex we expect the corresponding explainers to be. To measure the complexity of the features we applied the fitness and complexity algorithm to the bipartite network that connects countries with the 97 2-digit sectors. Since we train our Random Forest models using data in the time span 1996-2013, the complexities of products were computed as the average of the annual (log-) complexities in the same interval. The visualization of the average complexity of the validated features versus the complexity of the corresponding target products (figure 3) confirms this idea: more complex products need, on average, more complex features in order to be competitively exported. This finding confirms that the production lines of highly sophisticated products are deeply entangled among themselves [41,47].
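For readers unfamiliar with the Fitness and Complexity algorithm, the following is a minimal sketch of the iteration as it is commonly stated in the literature: the fitness of a country is the sum of the complexities of its products, the complexity of a product is penalized by the presence of low-fitness exporters, and both are normalized by their mean at each step. The fixed number of iterations and the toy matrix are assumptions of this example, not settings taken from the present study.

```python
def fitness_complexity(M, n_iter=20):
    """Iterate the Fitness and Complexity map on a binary country-product matrix M."""
    n_c, n_p = len(M), len(M[0])
    fitness = [1.0] * n_c
    complexity = [1.0] * n_p
    for _ in range(n_iter):
        # Fitness: sum of the complexities of the exported products.
        new_f = [sum(complexity[p] for p in range(n_p) if M[c][p]) for c in range(n_c)]
        # Complexity: bounded by the inverse fitness of the exporters.
        new_q = []
        for p in range(n_p):
            denom = sum(1.0 / fitness[c] for c in range(n_c) if M[c][p])
            new_q.append(1.0 / denom if denom > 0 else 0.0)
        # Normalize both quantities by their mean at every step.
        mean_f = sum(new_f) / n_c
        mean_q = sum(new_q) / n_p
        fitness = [f / mean_f for f in new_f]
        complexity = [q / mean_q for q in new_q]
    return fitness, complexity

# Toy nested matrix: country 0 exports everything, country 2 only product 0.
M = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]
fitness, complexity = fitness_complexity(M)
```

On this nested toy matrix the most diversified country obtains the highest fitness and the product exported only by that country obtains the highest complexity.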

Discussion
Relatedness [48] is a central topic of the economic complexity approach and a key element for investment decisions and policy makers [14,15]. The idea is to empirically measure how close a country is to exporting a new product, that is, to assess the feasibility of such a strategy. By comparing the predictive performance of different methodologies, recent studies [18,19] showed that machine learning algorithms such as RF provide the state-of-the-art assessment of relatedness; here the features of this supervised machine learning approach are the products which are present or absent in the export basket of countries. The cost of a better prediction and relatedness assessment is, however, a reduced interpretability of the results, at least with respect to the traditional, network-based approaches [10,16]. Nevertheless, having a visual representation of the diversification dynamics of countries, as well as knowing which products are the most relevant to activate (or to explain) the export of a new product, is essential in order to inform industrial policies and to understand the different patterns of economic development. In this study, we address the problem of the black box nature of the RF algorithm by proposing a methodology to extract information on the relevance of each input feature (a 2-digit sector) as a predictor of the future export of each of the 5000 possible target products at 6 digits. The starting point is the construction of a predictive model for the possible future export of each target product, based on the training of a RF algorithm. We then apply a procedure to statistically validate the Gini Importance of the single input features; in this way we are able to identify the explainers, the key products needed by a country to competitively export a target product in the near future. The importance the algorithm assigns to each input feature for every target product can be arranged in a 97-dimensional feature importance vector, which represents a high-dimensional embedding of the about 5000 target products. By means of the t-SNE algorithm [49], we project such vectors on a 2-dimensional continuous space we call Feature Importance Product Space (FIPS). Here each point represents a product, and the closeness between points indicates that the corresponding products are similar in the sense that they share most of the explainers needed for their export. As such, this approach is closer to the theoretical approach discussed in the seminal papers by Teece et al [6,50], in which the capability overlap between products is detected a posteriori by counting their co-occurrences, an approach known in the complexity field as the Product Space [10]. Here, instead, the proximity is assessed by comparing which input sectors are needed for the target products; similar explainers clearly imply similar capabilities.

Conclusions
The density-based approach employed in the Feature Importance Product Space (FIPS) has demonstrated its capability to forecast future exports, providing predictions that are comparable to those of the Random Forest algorithm from which it is derived, and better in terms of Best F1 Score and mP@10. Additionally, the integration of FIPS into a logistic regression model, alongside the Revealed Comparative Advantage (RCA) of countries on products, has not only affirmed the significance of FIPS as a predictor of future exports but also displayed superior performance over the strong benchmark model given by RCA itself. These results confirm the validity of our approach, highlighting that the FIPS not only retains the predictive power of the black-box algorithm it is based on but also enhances interpretability, a notable advantage over low performing network-based approaches. Importantly, the FIPS adeptly captures information about the complexity of products, successfully identifying the most sophisticated sectors within dense clusters and isolating less complex products. This understanding is crucial for characterizing the capabilities needed to be competitive in the export of complex products. In conclusion, our study acknowledges the limitation of using 97 2-digit aggregated sectors as features, which slightly reduces the predictive power of the RF compared to using the more granular 5040 6-digit products. This limitation was a pragmatic choice due to computational constraints. However, future works will aim to optimize the model for more detailed feature analysis at the 4- and 6-digit levels, further enhancing the model's accuracy and applicability in the field of economic complexity and trade prediction.

Data
The starting data used in this study are gathered by UN-COMTRADE and available upon subscription on the website https://comtrade.un.org. UN-COMTRADE provides the annual bilateral export flows between countries at the 6-digit product level. Products are classified according to the Harmonized Commodity Description and Coding System, in its 1992 version (HS-1992): each product is identified by a 6-digit code, where each pair of digits refers to a different aggregation level. The total number of products ranges from 97 at the 2-digit level (aggregated sectors), to 5040 at the 6-digit level (detailed products).
Since importers' and exporters' declarations do not always coincide, a Bayesian reconciliation procedure [13] is performed on the data, leading to the definition of the annual export matrices E_cp(y). Each element corresponds to the export volume realized by country c, for product p, in year y. The total number of countries is 169, and the covered time span is 1996-2018.
Following the standard procedure in the economic complexity literature [10,11], we compute the Revealed Comparative Advantage [35]:

$$\mathrm{RCA}_{cp}(y) = \frac{E_{cp}(y) \,/\, \sum_{p'} E_{cp'}(y)}{\sum_{c'} E_{c'p}(y) \,/\, \sum_{c'p'} E_{c'p'}(y)}$$

This economic indicator measures the ratio between the weight that the export of a product p has for country c and the weight it has on the global market. In this way, we can filter out the size effects of both countries and industrial sectors. Finally, imposing a threshold equal to RCA_cp = 1, distinguishing whether country c is a competitive exporter of product p in year y, we obtain the binary adjacency matrices M_cp(y), as described in equation (1). The dimension of both the RCA and the M matrix is the number of countries on the rows (169) and the number of products on the columns (5040 at the 6-digit level and 97 at the 2-digit level).
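The computation of the RCA and of the binary matrix M can be sketched on a toy export matrix. The formula used is the standard ratio described above (the share of a product in a country's exports divided by its share in world exports); the function names, the treatment of the boundary case RCA = 1 as competitive, and the numbers are assumptions of this example.

```python
def rca_matrix(E):
    """Revealed Comparative Advantage for an export matrix E[country][product]."""
    n_c, n_p = len(E), len(E[0])
    country_tot = [sum(row) for row in E]
    product_tot = [sum(E[c][p] for c in range(n_c)) for p in range(n_p)]
    world_tot = sum(country_tot)
    rca = [[0.0] * n_p for _ in range(n_c)]
    for c in range(n_c):
        for p in range(n_p):
            # Skip empty rows or columns to avoid dividing by zero.
            if country_tot[c] > 0 and product_tot[p] > 0:
                rca[c][p] = (E[c][p] / country_tot[c]) / (product_tot[p] / world_tot)
    return rca

def binarize(rca, threshold=1.0):
    """Binary matrix: 1 where the RCA reaches the threshold, 0 elsewhere."""
    return [[1 if value >= threshold else 0 for value in row] for row in rca]

# Toy example: two countries, two products (invented volumes).
E = [[90.0, 10.0],
     [10.0, 90.0]]
rca = rca_matrix(E)
M = binarize(rca)
```

Each toy country is specialized in one product, so each has an RCA of 1.8 on its own product and 0.2 on the other, and M is the identity matrix.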

Random Forest
In order to forecast the export of countries, we train a supervised machine learning algorithm. In particular, we train one model for each target product; since the answer is binary, we adopt a classification algorithm. RF [34] is an ensemble method based on the aggregation of several decision trees [51]: the final prediction of the algorithm is given by the average of the predictions made by the single trees.
The Random Forest has been shown [19] to be the top performing algorithm, together with XGBoost [52], for our predictive task, which is discussed in detail in the next section. We point out that XGBoost is practically unfeasible for the specific investigation discussed here because of the needed computational effort. Moreover, the extraction of the feature importances is much more direct in the case of Random Forest.
In this study we made use of the Python implementation provided by the library scikit-learn, which makes use of the CART version of the algorithm [36]. The hyperparameters [32] were set to their default values, a usual choice given the relative stability of the predictive performance [53][54][55].
For a more detailed description of the use of the Random Forest for predicting countries' exports (training, overfitting, prediction power), we refer to [19,21]. In particular, the supplementary information of [19] shows that using the default values for the hyperparameters does not involve a significant worsening of the RF's performance.

Predictive model
The aim of the application of the Random Forest algorithm [34] to the country-product network is to build a predictive model able to forecast the export baskets of countries after δ years, given the knowledge of their present export baskets. This means predicting the structure of the network M_cp(y + δ) starting from M_cp(y). This is realized through the construction of a single model for each target product p′, performing a binary classification task. Given the knowledge of the network in the time span [y_0, y_f], such model is trained on the set:

$$X_{\mathrm{train}} = M(y), \qquad y_{\mathrm{train}} = M_{p'}(y + \delta), \qquad y \in [y_0,\; y_f - 2\delta]$$

and, in this process, learns which export baskets in X_train are associated to the countries exporting or not exporting p′ (y_train). The test set is defined in a similar way:

$$X_{\mathrm{test}} = M(y_f - \delta), \qquad y_{\mathrm{test}} = M_{p'}(y_f)$$

In this way we make sure the test is performed on completely unforeseen data, and prevent the algorithm from having any information about the structure of the network in years y > y_f − δ during the learning phase. The data relative to different years are stacked together vertically: in this perspective each country in each year represents an observation, the export baskets for all products its features, and the possible export of p′, δ years later, the corresponding class. Putting together the predictions provided for all products, we recover the full matrix of predictions, whose elements S_cp(y_f) can be tested against M_cp(y_f). It is to be noted that the prediction on a single element S_cp(y_f) is a probability value between 0 and 1, to be binarized with the choice of a threshold in order to be compared to the empirical element M_cp(y_f). So, for each product p′, the model is trained to associate its possible future export from every country in year y + δ to the information about the respective export baskets of all products in year y. The rationale is that the algorithm will base its predictions upon learning the similarity patterns between different products, using different countries as different observations. To further explore the functioning of our predictive model based on Random Forest, we refer to [19], where the only difference is that, in the present study, we set the input data X at the 2-digit aggregation level: hence, for each of the 5040 6-digit target products p′, the input is represented by the export data about the 97 2-digit aggregated productive sectors.
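The vertical stacking of the observations can be sketched as follows. The choice of feature years running from y_0 to y_f − 2δ is an assumption of this example, made so that the latest training label refers to year y_f − δ and no information about later years enters the learning phase. The function name and the dictionary-based data layout are hypothetical.

```python
def build_training_set(M_features, M_target, target, y0, yf, delta):
    """Stack (country, year) observations for one target product.

    M_features[year][country] is the binary feature row of that country
    (e.g. its sector indicators); M_target[year][country][target] is the
    binary export indicator of the target product.
    Feature years run from y0 to yf - 2 * delta (an assumption of this
    sketch), so the latest training label refers to year yf - delta.
    """
    X_train, y_train = [], []
    for year in range(y0, yf - 2 * delta + 1):
        for country, row in enumerate(M_features[year]):
            X_train.append(list(row))
            y_train.append(M_target[year + delta][country][target])
    return X_train, y_train

# Toy data: 2 countries, 2 features, 1 target product, 5 years, delta = 1.
years = range(2000, 2005)
M_features = {y: [[1, 0], [0, 1]] for y in years}
M_target = {y: [[1], [0]] for y in years}
X_train, y_train = build_training_set(M_features, M_target, 0, 2000, 2004, 1)
```

With the study's values (y_0 = 1996, y_f = 2018, δ = 5) this convention would use features from 1996 to 2008 and labels from 2001 to 2013, leaving the pair 2013 → 2018 for the test.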

Cross-validation
Given the strong temporal auto-correlation of the network [19], the knowledge of the present export basket of a country is very informative about its future export basket. So, in order to make sure that the predictions provided by the model are based solely on its learning of the correlations between products, rather than on its ability to recognize the country, we perform a 13-fold cross-validation procedure. The 169 countries are divided into 13 groups $\{C_k\}_{k=1}^{13}$ of 13 countries each. For each product, we then build 13 different models, where each one is trained on data about the 156 countries $c \notin C_k$ and is then used to make predictions for countries $c \in C_k$. In this way the predictions for every country are provided by a model that did not receive any information about the country itself. The supplementary information provides a schematic illustration to aid the reader in visualizing the delineation of the training and testing datasets, as well as the cross-validation process.
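A country-grouped split of this kind can be sketched as below. The text does not say how the 13 groups were formed, so the random assignment, the seed and the helper names `country_folds` and `fold_masks` are assumptions of this example.

```python
import numpy as np

def country_folds(n_countries=169, n_folds=13, seed=0):
    """Randomly partition country indices into equally sized groups C_k."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_countries), n_folds)

def fold_masks(country_of_row, fold):
    """Boolean masks over stacked (country, year) rows.

    Rows of countries in `fold` form the test set; all remaining rows
    form the training set, so no year of a test country is seen in training.
    """
    test_mask = np.isin(country_of_row, fold)
    return ~test_mask, test_mask
```

Because the split is made on countries rather than on rows, all yearly observations of a held-out country are excluded from training together.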

Feature importance
The weight of the directed link from a product $p$, whose presence (or absence) enhances the likelihood that a generic country also exports the target product $p'$, is given by the feature importance, i.e. the relevance the RF algorithm attributes to each feature $p$ in its predictive task. The construction of each decision tree in the forest is based on the recursive split of the observations that compose the training set, in terms of the corresponding values of the features [34]: starting from the root node (containing all the observations), each node considers a feature and, depending on the binary value of this feature, the observations are divided into two child nodes. The choice of the feature for each node is meant to maximize the decrease in Gini impurity, a metric measuring the impurity of a node as the co-presence of observations belonging to both classes (i.e. 1 and 0), given by [51]:

$$G_m(j) = 1 - \sum_{i \in \{0,1\}} p_{m,i}^2$$

where $m$ is the node, $j$ the corresponding feature and $p_{m,i}$ is the empirical frequency of observations in the node belonging to class $i$. The decrease in impurity realized by feature $j$ on node $m$ is then:

$$\Delta G_m(j) = G_m(j) - f_1\, G_{m_1}(j) - f_2\, G_{m_2}(j)$$

where the subscripts 1 and 2 indicate the two child nodes built in the split, and $f$ the corresponding fractions of observations they receive. On a single tree $t$, being $N(j)$ the number of nodes to which feature $j$ is attached, and $V$ the total number of features, the decrease in impurity realized by feature $j$, i.e. its Gini importance, is equal to:

$$GI_t(j) = \frac{\sum_{m=1}^{N(j)} \Delta G_m(j)}{\sum_{k=1}^{V} \sum_{m=1}^{N(k)} \Delta G_m(k)}.$$

The Gini importance of a feature is then given by the average decrease in Gini impurity the feature realizes over the whole forest [36]:

$$GI(j) = \frac{1}{T} \sum_{t=1}^{T} GI_t(j)$$

where $T$ is the total number of trees.
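As a small worked illustration of the quantities above, the following sketch computes the Gini impurity of a node and the impurity decrease produced by a single binary split. The function names are hypothetical, and the code handles one split only; it is not the tree-building routine of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - p_0^2 - p_1^2 of a node with binary labels."""
    labels = np.asarray(labels, dtype=float)
    if labels.size == 0:
        return 0.0
    p1 = labels.mean()
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(feature, labels):
    """Impurity decrease obtained by splitting a node on a binary feature."""
    feature = np.asarray(feature).astype(bool)
    labels = np.asarray(labels)
    child_1, child_2 = labels[~feature], labels[feature]
    f1 = child_1.size / labels.size      # fraction of observations in child 1
    f2 = child_2.size / labels.size      # fraction of observations in child 2
    return gini(labels) - f1 * gini(child_1) - f2 * gini(child_2)

def normalise(decreases):
    """Divide per-feature decreases by their total so that they sum to 1."""
    decreases = np.asarray(decreases, dtype=float)
    total = decreases.sum()
    return decreases / total if total > 0 else decreases
```

For a balanced node with labels `[0, 0, 1, 1]` the impurity is 0.5; a feature that separates the two classes perfectly removes all of it, while a feature unrelated to the labels leaves it unchanged.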

Statistical validation procedure
Given the feature importance values, it is important to distinguish which features are actually informative for the algorithm, and which got a non-zero value because of spurious correlations in the dataset. We therefore implemented a statistical validation procedure similar to the one described in [37], in order to compute, for each feature importance, its corresponding p-value. The method is based on the reconstruction, for each feature importance, of the corresponding null distribution, i.e. the distribution of the importance values a feature is given by the algorithm under the hypothesis of independence between the feature itself and the response vector $y_{train}$. The procedure works as follows:

1. For every product $p'$, we train the Random Forest 50 times and compute the Gini importance, obtaining 50 vectors of feature importance $gi_n(p')$, $n = 1, \dots, 50$.
2. We then permute the response vector $y_{train}$ 500 times, breaking its association with the features, and recompute the Gini importance after every permutation. In this way we obtain 500 vectors of null importance $ni_m(p')$, $m = 1, \dots, 500$.
3. For each feature, we compare each of the 50 values of Gini importance with the 500 values of null importance: the corresponding p-value is computed as the fraction of the 500 null importance values bigger than the Gini importance value. We then obtain, for each product, 50 vectors of p-values $pv_n(p')$, $n = 1, \dots, 50$.
4. We take the average vector of Gini importance:

$$\overline{gi}(p') = \frac{1}{50} \sum_{n=1}^{50} gi_n(p')$$

and we keep only the importance values of the features for which more than 95% of the p-values (i.e. at least 48 out of 50) are within the 95% significance threshold (i.e. $p < 0.05$), setting the others to 0.

In this way we obtain, for each product, a vector containing the 97 values of statistically validated feature importance, one for each of the 97 features. The choice of the number of repetitions and permutations is a compromise dictated by the heavy computational cost involved. Note that steps 1 and 2 are carried out separately on each of the 13 folds of the cross-validation setting, and the corresponding values are averaged. This computation required approximately 180 hours on a server with 20 cores. The method has been extensively tested on both low-dimensional [56] and high-dimensional [57] datasets, showing a great ability to filter out the non-informative features.
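The four steps above can be sketched as follows. To keep the example self-contained, the Random Forest is replaced by a deliberately simple stand-in importance (`mean_gap_importance`, the absolute difference in the mean response between observations with and without the feature); this stand-in, the function names and the `importance_fn(X, y, rng)` interface are assumptions of the sketch. In practice `importance_fn` could wrap a forest fit and return its Gini importances. The per-fold repetition mentioned in the text is omitted.

```python
import numpy as np

def mean_gap_importance(X, y, rng):
    """Stand-in importance (NOT Gini importance): |mean(y | x=1) - mean(y | x=0)|."""
    X = np.asarray(X).astype(bool)
    y = np.asarray(y, dtype=float)
    out = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        on, off = y[X[:, j]], y[~X[:, j]]
        if on.size and off.size:
            out[j] = abs(on.mean() - off.mean())
    return out

def validate_importances(X, y, importance_fn, n_rep=50, n_perm=500,
                         alpha=0.05, min_frac=0.95, seed=0):
    """Permutation-based validation of feature importances."""
    rng = np.random.default_rng(seed)
    # Step 1: repeated fits on the true response -> (n_rep, n_features).
    gi = np.array([importance_fn(X, y, rng) for _ in range(n_rep)])
    # Step 2: fits on permuted responses -> (n_perm, n_features).
    ni = np.array([importance_fn(X, rng.permutation(y), rng)
                   for _ in range(n_perm)])
    # Step 3: p-value = fraction of null values strictly above each
    # observed importance -> (n_rep, n_features).
    pv = (ni[None, :, :] > gi[:, None, :]).mean(axis=1)
    # Step 4: keep the mean importance only where more than `min_frac`
    # of the p-values are below `alpha`; set the rest to 0.
    keep = (pv < alpha).mean(axis=0) > min_frac
    return np.where(keep, gi.mean(axis=0), 0.0), pv
```

With the default arguments the counts match those in the text (50 repetitions, 500 permutations, more than 95% of p-values below 0.05); smaller values are convenient for quick checks.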

Feature importance product space
The feature importance vectors contain information about the productive sectors recognized by the Random Forest as important in order to competitively export the corresponding product 5 years later. Therefore, the distance between the vectors relative to two products can be seen as a natural proxy for their similarity, i.e. for the overlap of the capabilities needed for their export. Then, in line with the Continuous Projection Space proposed in [18], we project these vectors onto a 2-dimensional space via the dimensionality reduction algorithm t-SNE [40]: on such a space, which we call Feature Importance Product Space (FIPS), the distance between products is related to the distance between their original 97-dimensional vectors.
At this point, we can use the space to make out-of-sample forecasts on the activation of new exports after 5 years, by adopting the density-based approach explained in the following. This is a natural way to validate the FIPS idea and its construction procedure.
We first compute the matrix $D$ of Euclidean distances between products in the FIPS. Then we transform these distances into a similarity matrix $B$, where the similarity of two products $p$ and $p'$ is computed as:

$$B_{pp'} = \exp\left(-\frac{D_{pp'}^2}{2\sigma^2}\right) \tag{3}$$

where $\sigma$ is a free parameter. Given this similarity matrix, following the economic complexity literature [10], we perform the prediction on $M_{cp}(y+\delta)$ by relating the likelihood of an activation to the scores defined by:

$$S_{cp}(y+\delta) = \sum_{p'} M_{cp'}(y)\, B_{p'p} \tag{4}$$

i.e. for each country $c$ the prediction on its future export of a product $p$ is given by the sum over the products it already exports, weighted by their similarity with $p$.
We build the FIPS on the Random Forest trained on data in years 1996-2013, and then use it to make out-of-sample forecasts of $M_{cp}(2018)$. We use the Python implementation of the t-SNE algorithm provided by the library scikit-learn.
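Given 2-dimensional product coordinates, the Gaussian similarity and the resulting scores can be sketched as below. The coordinates are taken as given here (for instance the output of a t-SNE embedding); the function name `fips_scores`, the unnormalized Gaussian kernel and the plain weighted sum are assumptions consistent with the verbal description, not a confirmed reproduction of the authors' implementation.

```python
import numpy as np

def fips_scores(coords, M, sigma):
    """Similarity-weighted scores from 2-d product coordinates.

    coords: array (n_products, 2), product positions in the embedding.
    M:      array (n_countries, n_products), binary export matrix at year y.
    sigma:  width of the Gaussian similarity kernel.
    """
    # Pairwise Euclidean distances D between products.
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Gaussian similarity: 1 on the diagonal, decaying with distance.
    B = np.exp(-D ** 2 / (2.0 * sigma ** 2))
    # Score of product p for country c: sum of the similarities between p
    # and the products that c already exports.
    S = M @ B
    return D, B, S
```

A country that exports many products lying close to $p$ in the embedding receives a high score for $p$, which is the density-based reading of the forecast.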

Optimization of the parameters
The predictions provided by the FIPS (equation (4)) depend on two parameters: the perplexity value set for t-SNE and the standard deviation $\sigma$ chosen for the Gaussian weights (equation (3)). The former is a hyperparameter of the t-SNE algorithm, fixing the expected number of elements that will be grouped into each cluster [40]. The latter fixes the width of the Gaussian distribution centered on each product, used to attribute the similarity weights to all the other products. The two parameters are therefore connected: increasing the perplexity value results in a denser FIPS, and hence even for small values of $\sigma$ many products will get a high similarity score.
We therefore opt to combine them into a single parameter, which we call average nearest neighbors, computed empirically as the average number of neighbouring products contained within a circle of radius $3\sigma$ centered on each product in the space. In practice, given a fixed value of perplexity, we look for the values of $\sigma$ corresponding to integer values of average nearest neighbors, and then evaluate the performance of the FIPS, measured by Best F1 Score and Mean Precision at 10, as a function of this number. In figures 4 and 5 we show the trends of the two metrics for seven different values of perplexity (P = 5, 10, 15, 25, 35, 50, 100): in both cases the curves corresponding to different perplexity values are quite close, with a peak for a value of average nearest neighbors around 70. We chose to set perplexity = 10, and the corresponding value $\sigma = 4.58$. The performance values reported in section 2 are computed for these values of the parameters.
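The average nearest neighbors count, and the search for the $\sigma$ that yields a given count, can be sketched as follows. Excluding each product from its own neighbourhood and using a bisection search are assumptions of this example; the text does not specify how the matching value of $\sigma$ was found.

```python
import numpy as np

def average_nearest_neighbors(coords, sigma):
    """Mean number of other products within a circle of radius 3*sigma."""
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    within = D <= 3.0 * sigma
    np.fill_diagonal(within, False)          # a product is not its own neighbour
    return within.sum(axis=1).mean()

def sigma_for_target(coords, target, n_iter=60):
    """Smallest sigma (up to bisection precision) whose count reaches `target`."""
    span = coords.max(axis=0) - coords.min(axis=0)
    lo, hi = 0.0, np.sqrt((span ** 2).sum())  # radius 3*hi covers every pair
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        # The count is non-decreasing in sigma, so bisection applies.
        if average_nearest_neighbors(coords, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

If `target` exceeds the number of other products, no $\sigma$ can reach it and the function simply returns its initial upper bound.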

Logit model
To assess the additional information carried by the FIPS with respect to the temporal auto-correlation of the $M$ matrices, we use the predictions provided by the FIPS and RCA(2013) as independent variables in a logistic regression for the probability of products' appearances in 2018. The model is trained only on the activations (defined by $RCA_{cp}(y) < 0.25$ for $y \in [1996, 2013]$, see [18]). In order to test the out-of-sample performance, we divide the training set into 13 subsets, following a cross-validation procedure: the predictions $\vec{S}_k(2018)$ for each group $k$ ($k = 1, \dots, 13$) are provided by a model trained on the remaining 12, and so can be tested against the corresponding elements $\vec{M}_k(2018)$.
The logistic regression was carried out using the Logit algorithm provided by the Python library statsmodels.
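For orientation, a logistic regression with these two regressors has the generic form below. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $S^{FIPS}_{cp}$, as well as the presence of an intercept, are assumptions of this sketch and are not taken from the original specification.

```latex
P\bigl(M_{cp}(2018) = 1\bigr) =
  \frac{1}{1 + \exp\bigl[-\bigl(\beta_0 + \beta_1\, S^{FIPS}_{cp}
                               + \beta_2\, RCA_{cp}(2013)\bigr)\bigr]}
```

Under this form, a positive and significant $\beta_1$ would indicate that the FIPS score carries information beyond what is already contained in the 2013 RCA value.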

Performance metrics
To evaluate the predictive performance of the models, we made use of a series of evaluation metrics commonly used in Machine Learning. As already mentioned, the predictions $S_{cp}(2018)$ are probability values, to be binarized in order to compare them with the answers given by the matrix elements $M_{cp}(2018)$. In order to avoid the introduction of an arbitrary binarization threshold $t$, we opted for the use of 'threshold-free' metrics, assessing the overall predictive performance of the models. Moreover, given the strong class imbalance of the dataset (the fraction of positive elements in the $M_{cp}$ matrices oscillates around 10% of the total elements in the covered time span, see [19]), we avoided metrics such as accuracy, which reward the correct identification of true negatives (i.e. the correct classification of elements $M_{cp}(2018) = 0$, which is often trivial). The chosen metrics are:

• AUC-ROC. The AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic) [44,58] measures the area under the ROC curve, i.e. the curve in the TPR(t) vs FPR(t) plane (respectively True Positive Rate and False Positive Rate, see [43]) obtained by varying the value of the binarization threshold $t$. Its value, ranging from 0 to 1, represents the probability that the classifier attributes a higher score to a positive element than to a negative one: AUC-ROC = 1 represents a perfect classifier, while AUC-ROC = 0.5 corresponds to a totally random classifier. It has been shown [45] that the AUC-ROC is not fully reliable when the classifier is applied to an imbalanced dataset, which is our case (see [19]), as it tends to overestimate the actual accuracy of the predictions.
• Mean Precision at k. The precision is defined as the ratio between the true positives (i.e. the positively classified elements that are actually positive) and all the positively classified elements [43]. We can define the Precision at k as the precision of the classifier on the k top-ranked elements, i.e. we classify the k elements with the highest prediction scores as positives and then compute the corresponding precision. The mean Precision at k is obtained by computing the Precision at k for every country individually, and then taking the average over all countries. Since the most diversified countries tend to activate more products than the low and medium income ones, the averaging procedure filters out this effect, retaining an overall estimate of the classifier's performance. The value of k was set to 10.
• Best F1 Score. The F1 Score is defined as the harmonic mean of precision and recall [43]. Therefore it provides an estimate of the overall quality of the classifier, as it takes a high value only if both precision and recall are high. Since these two quantities rely on the choice of a binarization threshold $t$, we adopted the Best F1 Score, i.e. the F1 Score computed for the value of $t$ that maximizes it.
• AUC-PR. The AUC-PR measures the area under the curve drawn in the precision(t) vs recall(t) plane (see [43] for details) by varying the binarization threshold $t$. As such, it assesses the overall ability of the model to correctly classify positive elements. Unlike the AUC-ROC, the AUC-PR has been shown not to be affected by class imbalance [45].
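Two of the metrics above, Mean Precision at k and Best F1 Score, can be sketched directly from their definitions. The function names and the convention that rows are countries and columns are products are assumptions of this example; whether the evaluation is restricted to a subset of matrix elements is left to the caller.

```python
import numpy as np

def mean_precision_at_k(scores, truth, k=10):
    """Average over countries (rows) of the precision on the k top-scored products."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth)
    top = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    hits = np.take_along_axis(truth, top, axis=1)
    return hits.mean(axis=1).mean()

def best_f1(scores, truth):
    """Maximum F1 over all binarization thresholds, on the flattened matrices."""
    s = np.asarray(scores, dtype=float).ravel()
    t = np.asarray(truth).ravel().astype(bool)
    order = np.argsort(-s, kind="stable")
    s_sorted, t_sorted = s[order], t[order]
    tp = np.cumsum(t_sorted)                 # true positives among the top n
    n_pred = np.arange(1, s.size + 1)        # number of predicted positives
    # F1 = 2 TP / (2 TP + FP + FN) = 2 TP / (predicted + actual positives).
    f1 = 2.0 * tp / (n_pred + t.sum())
    # A threshold can only cut between two distinct score values.
    valid = np.append(s_sorted[1:] != s_sorted[:-1], True)
    return f1[valid].max()
```

Sorting once and sweeping the cut position evaluates every possible threshold, so no threshold has to be chosen in advance.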

Figure 1. Feature importance barplots. Three examples of feature importance barplots; from top to bottom, 'Tobacco', 'Sports footwear', and 'Vacuum cleaners'. We report the top 10 and bottom 5 features: for all three products, the method validates a set of very reasonable features. Notably, in all three cases, the aggregated sectors to which the products belong are present among the explainers: 'Tobacco and tobacco substitutes' (code 24) for 'Tobacco (not stemmed or stripped)' (code 240110), 'Footwear; gaiters and the like' (code 64) for 'Sports footwear' (code 640411) and 'Electrical machinery and equipment' (code 85) for 'Vacuum cleaners' (code 850910).

Figure 2. Feature Importance Product Space (FIPS). The FIPS is a low-dimensional representation of the feature importance vectors: each dot is a product, identified by its explainers. The colors correspond to ten productive macro-categories. The structure of the space is heterogeneous, with some relatively isolated categories of products (e.g. Textiles and Agrifood), and areas occupied by a mixture of different kinds of products (e.g. the right side of the figure, where Machinery, Vehicles, Chemicals and Instruments are mixed). The two insets are zooms that illustrate the ability of the FIPS to group together similar products: the first (upper-left side) is composed of four different kinds of products, all related to furskins; the second (lower-right side) groups together products belonging to the spacecraft industry.

Figure 3. Mean complexity of explainers versus the mean complexity of target products. Non-parametric regression of the mean complexity of features with respect to the complexity of the target products. The values of complexity are computed as averages of the corresponding logarithms in the time span 1996-2013. Target products are sorted by increasing complexity and then grouped into 20 bins of 252 products each, for which we compute average complexity and standard deviation.

Figure 4. Best F1 Score versus average nearest neighbors for the FIPS. Curves corresponding to different values of perplexity are very close and show the same trend, with a clear peak for a number of average nearest neighbors equal to 70. Given the perplexity and the average nearest neighbors, the value of $\sigma$ is fixed.

Figure 5. Mean Precision at 10 versus average nearest neighbors for the FIPS. In this case too, curves corresponding to different values of perplexity are very close and show a peak around 70 average nearest neighbors, but the trends are noisier than for Best F1.
The highest values of each indicator are in bold.

Table 2. Logistic regression carried out with FIPS and $RCA_{cp}(2013)$ to predict product appearances in 2018, trained on the activations defined by $RCA_{cp}(y) < 0.25$ for $y \in [1996, 2013]$. The asterisks indicate that all the coefficients are statistically validated within a 99.9% significance threshold.