Machine-learning accelerated identification of exfoliable two-dimensional materials

Mohammad Tohidi Vahdat; Kumar Varoon Agrawal; Giovanni Pizzi

doi:10.1088/2632-2153/ac9bca

1. Introduction

Two-dimensional (2D) materials have been extensively explored for various applications across several disciplines [1–7]. In part, the scientific excitement about 2D materials originates from their ultimate thinness, making them exceptionally promising for applications, e.g. in electronics [8–10], in catalysis [11–13], or in molecular separations [14–16]. Moreover, the physical properties of monolayers are often distinct from those of their parent three-dimensional (3D) materials, since charge and heat transports are restricted to a plane [17]. Therefore, this makes them attractive not only for applications, but also for fundamental science [18–21]. Furthermore, the possibility of realizing stacks of 2D layers into van-der-Waals (vdW) heterostructures opens the way to a combinatorially large number of novel properties to be explored [22–26].

Decades ago, it was demonstrated that layered vdW materials, such as metal dichalcogenides, can be mechanically and chemically exfoliated into a few or even single layers [27]. Since the experimental demonstration of the exfoliation of graphene [28], the interest in these materials and in the experimental approaches to exfoliate them has sparked [29–32]. However, experimentally synthesizing these materials is time-consuming, only a fraction of them have ever been synthesized, resulting in a limited diversity of the 2D materials that are investigated experimentally. First-principle simulations, and in particular density-functional theory (DFT) calculations, have been thus recently employed to determine the binding-energy strength between layers and consequently assess whether a structure can be exfoliated, resulting in the creation of databases of potential 2D materials [33–39]. However, such calculations are quite expensive and require deep understanding of the simulation methods to obtain accurate results, as well as the capability of using a workflow infrastructure to manage the resulting large number of calculations [40–43]. Therefore, first-principles simulations might be a barrier when needing a fast, yet accurate, prediction of the exfoliability of candidate materials.

Recently, different approaches based on machine learning have been employed to optimize the synthesis process of 2D materials [44–50]. Here, we develop a combined geometrical and machine-learning (ML) model that can provide accurate and fast predictions on whether a bulk 3D material is formed by layers held together by a weak binding energy and, thus, is potentially exfoliable into 2D layers. Our multi-step procedure, illustrated in figure 1, starts by pre-screening layered structures based on geometrical criteria requiring only the atomic positions of the atoms in the structure. The resulting filtered structures are featurized, and finally a ML model based on a random forest classifier is applied to assess whether the material can be exfoliated or, instead, has a high binding energy (HBE).

**Figure 1.** Schematic illustrating the steps of our algorithm to efficiently determine the exfoliability of 3D materials. From left to right: 1 the procedure initiates from the unit cell of a bulk crystal structure. 2 A $3\times 3\times 3$ supercell of the bulk primitive cell is constructed, and chemical bonds are identified by comparing the bond distances with the atomic vdW radii. 3 Low-dimensional subunits are determined based the rank of the lattice vectors embedded into chemically bonded units, and a 2D primitive cell is built (*LowDimFinder* geometrical pre-filtering step). 4 The crystal structures passing the previous steps are represented with numeric features, used as input of a random forest classifier (step 5) trained and tested on a set of over 2000 binding energies calculated with DFT. Our ML model predicts if the material has a high binding energy (HBE) and should be discarded, thus filtering out potential 2D structures for further analysis (step 6).
Download figure:
Standard image High-resolution image

**Figure 1.** Schematic illustrating the steps of our algorithm to efficiently determine the exfoliability of 3D materials. From left to right: 1 the procedure initiates from the unit cell of a bulk crystal structure. 2 A $3\times 3\times 3$ supercell of the bulk primitive cell is constructed, and chemical bonds are identified by comparing the bond distances with the atomic vdW radii. 3 Low-dimensional subunits are determined based the rank of the lattice vectors embedded into chemically bonded units, and a 2D primitive cell is built (*LowDimFinder* geometrical pre-filtering step). 4 The crystal structures passing the previous steps are represented with numeric features, used as input of a random forest classifier (step 5) trained and tested on a set of over 2000 binding energies calculated with DFT. Our ML model predicts if the material has a high binding energy (HBE) and should be discarded, thus filtering out potential 2D structures for further analysis (step 6).
Download figure:
Standard image High-resolution image

We train and then evaluate the performance of the model using a set of 2392 structures from [37], obtained from the preliminary geometrical step applied on a large initial set of 108 423 unique 3D structures originating from the Inorganic Crystal Structure Database (ICSD) [51] and Crystallography Open Database (COD) [52]. To capture the underlying structural patterns that correlate with the exfoliability of a material, we first reduce the number of descriptors needed by the model, and then use a SHapely Additive exPlanations (SHAP) analysis to explain the most relevant features. We also compare our best-performing ML model with a deep neural network. Finally, in order to make our methodology and code immediately available to all researchers without the need of any software installation, we implement and publish an online web tool on the Materials Cloud web platform [53], available at http://ml-layer-finder.materialscloud.io and working directly in the web browser. The tool provides the opportunity, even for non-experts of simulations, to just upload a crystal structure in a variety of file formats and rapidly evaluate if it can be exfoliated into 2D layers or not.

2. Methods

2.1. Description of the crystal structures and the method used to identify layered compounds

In the following, we will indicate exfoliable 2D materials as a positive hit of a model, while a HBE material as a negative hit. Therefore, false positives indicate HBE systems that are instead predicted to be exfoliable (2D), while false negatives indicate exfoliable systems that the model suggests to discard as HBE.

Our screening procedure (steps 1 and 2 of figure 1) starts from the comprehensive initial set of 108 423 unique bulk 3D crystals extracted from ICSD and COD, used in the study by Mounet et al [37, 54]. In [37], the preliminary geometric step (that we will refer to as the LowDimFinder) identifies chemically-bonded subunits by comparing interatomic distances d_AB between pairs of atoms A and B in the structure with the corresponding atomic vdW radii r_A and r_B (radii are taken from [55]). We consider two atoms as bonded if $d_{AB} \leqslant r_A + r_B - \Delta$ , where Δ is a tunable parameter (see below). The dimensionality of each connected manifold is then obtained from the rank of the set of 3D lattice vectors embedded into the manifold (see [37] for more details). One main advantage of this approach is identifying subunits of any dimensionality (i.e. 2D layers, 1D chains, and 0D clusters). Moreover, in the case of 2D layers, no specific orientation is assumed for the 2D plane with respect to the Cartesian axes.

For the purposes of preliminary filtering, we wish to have a very low rate of false negatives (i.e. a very high recall rate) so as not to miss important materials, while at the same time we can accept a moderate amount of false positives, because these can be further filtered later (first by the ML model discussed here, and possibly later by accurate DFT simulations following the application of our ML model). Therefore, also considering that this first filtering is only aware of distances between atoms and not of their chemistry (except for the atomic vdW radii), we consider five values for the tunable parameter, namely $\Delta = 1.1, 1.2,$ $1.3, 1.4, 1.5$ Å. Any crystal structure recognized as 2D for at least one of the five values of Δ is then handed to the ML model in the following steps.

2.2. Choice of descriptors for 3D compounds

A crucial preliminary step before implementing and training a ML model is to represent crystal structures in terms of numerical features representing the material chemistry and geometry.

Different strategies have been developed in the literature [56–59]. In this work, all the descriptors are chosen from those available and implemented in the matminer package [60]. Choosing the right descriptors requires care and testing, and is dependent on the details of the descriptors and the properties of interest. As such, there is not a single unique correct way to represent crystal structures. We start from a set of potentially relevant features and, as a first step, we identify which of these are most relevant (see supplementary information section 1 for details). Our final set of features are based on those proposed by Ward et al [61], where space is partitioned into Wigner–Seitz cells using Voronoi tessellation. These cells (including all points in space closer to a central atom than to any other atom) are then used to assign neighbors and then construct local descriptors of the environment that are only weakly sensitive to small changes that might occur during a geometry relaxation. Derived from the Voronoi tessellation, several attributes are proposed in [61]. In the following we will restrict to three different sets of features that we selected for our final model: chemical ordering, maximum packing efficiency, and local environment attributes. The first one, chemical ordering, describes how much the placement of species in a structure deviates from random, using Warren–Cowley-like ordering parameters [62]. The second one, maximum packing efficiency, measures the largest sphere that fits inside each Voronoi cell. The last, local environment attributes, are based on comparing elemental properties (such as the electronegativity and 21 other elemental properties) of the central atom to its neighbors, weighted by the surface area of the faces of the Wigner–Seitz cells. More technical details can be found in [61].

2.3. Model selection and training

To achieve a ML model that can classify potentially exfoliable materials, we train the model on binding energies for the 2392 structures prefiltered by the LowDimFinder, that were computed using DFT in [37]. We first center the features at zero and rescale them using their mean and standard deviation, respectively. In order to determine the quality of the model on an unseen data set, train-test splitting is performed randomly, and the size of the train set is 70%. All the statistics reported later are computed by averaging over ten different random seeds used for train-test splitting.

We tested various learning approaches, including deep learning architectures. Although some of the architectures perform well (see supplementary information section 2), we decided to employ and discuss further in the following a random forest classifier ML model [63], as it is relatively simple and, at the same time, it demonstrates an excellent performance for our goal. The random forest classifier is trained with a number of trees of 600 and a maximum depth of trees of 17.

3. Discussion and results

Mounet et al [37] performed binding energy calculations on 2662 prospective layered structures and identified those that are held together by weak interactions and are ready for exfoliation. We consider 2392 structures out of 2662 (two different vdW DFT functionals are used in [37]; here, we restrict only to those for which binding energies are calculated using the vdW-DF2-C09 functional [64–66], so as to have a consistent calculation method of binding energies for all structures in the dataset). In particular, our portfolio contains 1813 structures that are determined to be exfoliable into 2D layers by DFT simulations including vdW interactions (vdW-DFT), defined as those whose binding energy is lower than 130 meV Å⁻², while 579 structures have HBE (although the first LowDimFinder preliminary filtering had not discarded them).

Figure 2 illustrates the combined performance of the preliminary LowDimFinder geometrical filtering followed by our ML model, with respect to the accurate vdW-DFT calculations, resolved with respect to the set of dimensionalities of the sub-units identified by LowDimFinder (the performance of LowDimFinder alone can be better inspected in supplementary figure S1). In particular, in figure 2 we see that systems that are tagged by LowDimFinder as only having 2D subcomponents for any value of Δ (labeled as [2]) are those with highest probability of displaying low binding energy, and thus promising for exfoliation. Moreover, this set also has a very low rate of false positives. On the other hand, structures marked also displaying a zero-dimensional subunit for at least one value of Δ (and in particular [0,2] and [0,2],[3]) include a relatively large fraction of systems that are confirmed to have HBE by vdW-DFT. Nevertheless, these results demonstrate that, while the dimensionalities found by LowDimFinder alone can provide an indication of the probability of finding false positives, they are not sufficient (alone) to provide a reliable method to reduce their number without also increasing significantly the rate of false negatives (we remind that our design strategy is to have the smallest amount possible of false negatives, but avoiding to miss important 2D candidates).

Therefore, we train our ML model (based on the random forest classifier) to improve the classification of structures in 2D and HBE, still at a negligible fraction of the cost of a full DFT simulation. As visible in figure 2, the ML model is able to correctly label most HBE systems that had passed the LowDimFinder prefilter (true negative systems, in blue). Detailed results on the performance of our ML model is reported in table 1. In the test dataset, we have 664 structures, including 139 HBE structures and 525 exfoliable structures. Our model managed to correctly predict 513 structures as true positives (i.e. a recall of 98%, very close to our ideal goal of 100%), while at the same time reducing the number of false positives of the LowDimFinder by about 50%. Therefore, our model reaches our goal of significantly filtering the number of candidates that can go to an ultimate screening step using DFT, thus reducing the associated computational cost.

Table 1. Performance of our ML model (based on a random forest classifier) on the test data set. Our model captures 513 out of 525 structures correctly (true positives) resulting in a very high recall rate of 98% (i.e. the model has very few false negatives). Additionally, it reduces the number of false positives by ≈ 50% with respect to the initial LowDimFinder geometrical screening.

				Total number
Label	Precision	Recall	F1-score	of structures
HBE	85%	49%	62%	139
2D	88%	98%	93%	525

To better understand the distribution of false positives and false negatives of our ML model, we first show in figure 3 the binding energies for all compounds included in our training set, as a function of the change in the interlayer distance with two DFT functionals (including or not vdW interactions, respectively, as an approximate proxy for the strength of vdW interactions in the material). This figure mirrors figure 2(c) of [37], while adding information on the performance of our ML model. The figure visualizes graphically the model performance, and shows that false positives and negatives do not have a specific distribution with respect to the two variables of the scatter plot.

We then use the SHAP technique [67] to gain a deeper understanding of what the model learned in the training process. SHAP can disclose how the features used by the model affect the predictions. Figure 4 shows the five most important variables of our model: maximum packing efficiency, maximum number of unfilled s states, chemical ordering (first neighbor), mean of periodic table row, and chemical ordering (third neighbor). The importance here is defined as the mean absolute SHAP value of all the points in the dataset. The colors in the figure denote the value of the input variable, where red indicates high and blue indicates low values. Consequently, red on the right-hand (left-hand) side of the plot implies a positive (negative) correlation with exfoliation.

Figure 4 shows that maximum packing efficiency is the most important feature and has a negative correlation with exfoliation, consistent with what one can expect as this feature indicates the largest sphere that fits inside each Voronoi cell, and structures with high packing efficiency will be most probably not layered (packing ratio was already used as a filtering criterion in one of the early DFT studies to predict 2D materials [33]). The second most important feature is the maximum number of unfilled s states. Indeed, as the number of unfilled s states increases, the material will tend to form stronger bonds and will be consequently less prone to break them and be exfoliated, justifying its negative correlation. Another important variable is chemical ordering (both for first and third neighbors). This variable is associated with the arrangement of the crystal structure, where structures with ordered arrangements will have a larger value (closer to 1), while random arrangements will have a value closer to zero. Also in this case, the negative correlation visible in figure 4 can be expected, because structures very ordered configurations are typically strongly bonded 3D materials that are not composed by exfoliable layers. Finally, another important feature is the mean of the periodic table row, indicating a tendency of lighter elements (i.e. small value for the periodic table row) to form exfoliable materials.

Finally, since our aim is to reduce as much as possible the number of false negatives, we report explicitly the 12 exfoliable structures that are considered as HBE by our ML model. These are schematically represented in figure 5 (the complete list is shown in supplementary information table S2). We highlight in particular that 6 structures out of 12 contain halogen elements, and 3 contain sulfur. This result can be used as a guide for further improvements of the model, e.g. to include further descriptors that can help in correctly identifying structures with low binding energy (we note, however, that other structures including the same chemical elements exist, where the model works correctly: therefore their sole presence cannot be used as a simple criterion to identify false negatives).

**Figure 5.** Summary of the 12 false negatives of our ML model (out of the 525 exfoliable structures in our test set). In addition to LiMnSe₂, six structures include halogens, three include sulfur and two include bismuth.
Download figure:
Standard image High-resolution image

4. Online tool

In order to give easy access to our model and its predictions without the need of preliminary expertise, we also implement an online web tool that we publish on the Materials Cloud web platform [53] at the URL http://ml-layer-finder.materialscloud.io. The tool works directly in the browser and does not require any installation.

In the first selection page, the user can upload a bulk crystal structure. Several common formats are supported thanks to the parsers implemented in the Atomic Simulation Environment (ASE) [68] and pymatgen [69] libraries. Once the bulk structure is uploaded (or selected from already available examples), the tool performs the needed computations to first determine if the material is layered, and then to predict if it is exfoliable. In particular, the tool (following the workflow of figure 1) first computes the bonds and then identifies the dimensionality of each connected sub-manifold using the LowDimFinder. Any subunit that is a 2D layer is displayed on the output web page. The tool then generates feature vectors for these structures, and then it uses the pre-trained ML model to predict if the crystal structure can be exfoliated or has a HBE. The result is provided on output, together with a table summarizing the SHAP value of the most important features for the uploaded structure (see figure 6).

**Figure 6.** Screenshot of the online tool implementing the algorithms of this paper, available on the Materials Cloud [53] Work/Tools section. Once a material is uploaded, the output page displays the prediction of exfoliability by our ML model. Here, a screenshot of a part of the tool results page is shown, displaying the ML model prediction in the case of graphite. The output page of the tool includes more information, such as the visualization of the crystal structure and its decomposition into 2D layer components.
Download figure:
Standard image High-resolution image

5. Conclusion

We presented an efficient yet accurate method combining a preliminary geometrical screening followed by a random-forest-classifier ML model to identify bulk compounds that can be exfoliated into 2D materials. The procedure that we developed provides an effective and straightforward prefilter to assess whether any bulk structure has the potential to be exfoliated: thanks to its high recall of 98%, our method has an extremely low rate of false negatives, i.e. of exfoliable materials that would be discarded by the model and thus be missed in a search for interesting candidates. Yet, it helps in significantly reducing the number of candidates that need to be sent to more accurate but expensive filtering steps (e.g. by computing binding energies using DFT), thus significantly lowering the total workflow computational cost. By means of a SHAP analysis, we also provide an explanation of the most important features used by our model to determine exfoliability. These include the maximum packing efficiency, the maximum number of unfilled s sates, the chemical ordering and the mean of the periodic table row. Finally, in order to enable even non-expert fellows to run our model and predict exfoliability of new materials, we implemented and deployed it as an online web tool published on the Materials Cloud web platform, only requiring to provide the parent crystal structure as input.

Acknowledgments

This research was supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 205602). The authors gratefully acknowledge Nicola Marzari and Seyed Mohamad Moosavi for fruitful discussions, and Nicolas Mounet and Philippe Schwaller for the initial implementation of the LowDimFinder code used in [37].

Data availability statement

Supplementary information is available for this paper. Data that were used and are needed to reproduce this study are deposited on the Materials Cloud Archive [53] and are available at https://doi.org/10.24435/materialscloud:m4-7f.

Code availability

The code for parsing, featurization of structures and data analysis is available free of charge on Github at https://github.com/Mohammad-vahdat/Machine_learning_accelerated_identification_of_exfoliable_two_dimensional_materials.

Machine-learning accelerated identification of exfoliable two-dimensional materials

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Methods

2.1. Description of the crystal structures and the method used to identify layered compounds

2.2. Choice of descriptors for 3D compounds

2.3. Model selection and training

3. Discussion and results

4. Online tool

5. Conclusion

Acknowledgments

Data availability statement

Code availability

Machine-learning accelerated identification of exfoliable two-dimensional materials

Article metrics

Submit

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Methods

2.1. Description of the crystal structures and the method used to identify layered compounds

2.2. Choice of descriptors for 3D compounds

2.3. Model selection and training

3. Discussion and results

4. Online tool

5. Conclusion

Acknowledgments

Data availability statement

Code availability