
Crystalline structure and grain boundary identification in nanocrystalline aluminum using K-means clustering

Published 22 July 2020 © 2020 IOP Publishing Ltd
Citation: Nicolás Amigo 2020 Modelling Simul. Mater. Sci. Eng. 28 065009. DOI: 10.1088/1361-651X/ab9dd9


Abstract

K-means clustering was carried out to identify the atomic structure of nanocrystalline aluminum. For this purpose, per-atom physical quantities, such as the potential energy, stress components, and atomic volume, were calculated by means of molecular dynamics simulations. Statistical analysis revealed that the potential energy, atomic volume, and von Mises stress were relevant parameters to distinguish between face-centered cubic (fcc) atoms and grain boundary (GB) atoms. These three parameters were employed with the K-means algorithm to establish two clusters, one corresponding to fcc atoms and another to GB atoms. When comparing the K-means classification performance with that of the common neighbor analysis (CNA), an F-1 score of 0.969 and a Matthews correlation coefficient of 0.859 were achieved. This approach differs from traditional methods in that the quantities employed here do not require input settings such as the number of nearest neighbors or a cut-off value. Therefore, K-means clustering could eventually be used to inspect the atomic structure of more complex systems.


1. Introduction

Machine learning (ML) is a branch of artificial intelligence that has received increasing attention in the past decade. Its capability to work with huge data sets and recognize patterns in them has allowed researchers to predict outcomes for a given problem based on known properties [1]. For example, ML has been applied in medical image analysis to distinguish between healthy and malignant tissue [2–4], modeling of cancer progression [5–7], rupture risk assessment of cerebral aneurysms [8–10], and risk management in the credit card industry [11–13], among many others.

The materials science community has recently adopted ML techniques for the prediction of properties, and several works can be found in the literature. For example, Sun et al [14] employed support vector machines to predict the glass forming ability of bulk metallic glasses, Ward et al [15] evaluated the performance of different ML algorithms to predict new candidates for bulk metallic glasses, and Kikugawa et al [16] analyzed the relationships of the thermophysical properties among different liquid substances using artificial neural networks. ML techniques have also been applied to develop interatomic potentials from ab initio data, as described in the works of Zong et al [17], Purja Pun et al [18], and, more generally, in the report of Deringer et al [19]. Despite the increasing popularity of ML in materials science, only a few studies have employed it to investigate crystalline structures. A remarkable example is the work of Chowdhury et al [20], where support vector machines and random forests, among others, were applied to identify dendritic morphologies in micrograph images. Another is the work of DeCost et al [21], where convolutional neural networks were used to investigate microstructural trends and their relationship to processing conditions in carbon steel microstructures. More recently, Sharp et al [22] applied support vector machines to determine the atomic dynamics in grain boundaries using molecular dynamics simulations. Clustering has also been employed to evaluate the energy of atomic structures at the level of first-principles calculations. For example, Meldgaard et al [23] studied the global optimization of molecular structures, proposing a scheme of ML techniques in which the K-means algorithm was applied to group atoms according to the features of their local environment.
Another work is that of Chiriki et al [24], in which the authors proposed a method to generate low-energy configurations by classifying atoms during the optimization process. All in all, the application of ML techniques to atomic systems has received increasing interest over the last years, but much of their capabilities are yet to be unveiled.

In this work, statistical analysis and K-means clustering were performed to identify the crystalline structure and grain boundaries in nanocrystalline aluminum (NC–Al). This system was chosen due to its high stacking fault energy, which hinders the formation of stacking faults during the annealing process. Several per-atom physical quantities, such as the potential energy, stress, and atomic volume, hereafter called parameters, were calculated. Statistics was used as a tool to identify the relevant parameters that distinguish between face-centered cubic (fcc) atoms and grain boundary (GB) atoms, in order to simplify the clustering analysis. The relevant parameters were then employed with the K-means clustering algorithm to identify the atomic structure of each atom in the NC–Al sample. This method of structure identification has the advantage over the traditional common neighbor analysis and centrosymmetry parameter tools in that it requires no input settings such as the number of nearest neighbors or a cut-off value. This paper is organized as follows: section 2 explains the simulation procedure and the tools employed for the analysis, section 3 details the results, section 4 presents the discussion, and section 5 draws the conclusions.

2. Simulation details

The nanocrystalline aluminum sample, hereafter called NC–Al, was constructed using the Voronoi tessellation method as implemented in the code Atomsk [25]. The dimensions were set to 30 × 30 × 39 nm³, with a total of 1 625 754 atoms distributed among 15 grains. Periodic boundary conditions were applied along the three directions. GB relaxation was conducted by heating the sample up to 1000 K for 100 ns, and then quenching it to 10 K for another 100 ns. Al–Al interatomic forces were calculated using the potential developed by Mishin et al [26] following the embedded atom method (EAM) [27]. All MD simulations were carried out with the large-scale atomic/molecular massively parallel simulator (LAMMPS) [28], and the results were visualized using the Open Visualization Tool (OVITO) [29].

For each atom, the potential energy (U), the stress components (σij), the von Mises stress (σvm), and the atomic volume (V) [30] were calculated. These quantities are called parameters hereafter. Since LAMMPS delivers the stress components in units of pressure × volume, the σij values were divided by the atomic volume V in order to obtain a stress in units of pressure. In addition, for comparison purposes between different quantities, each parameter was scaled to span a range from 0 to 10. This procedure was performed by means of the MinMaxScaler preprocessing utility available in the scikit-learn library for the Python programming language. All subsequent calculations and analyses were carried out using scaled parameters unless specified otherwise.
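As an illustration, this preprocessing step can be sketched with scikit-learn's MinMaxScaler. The arrays below are synthetic placeholders for the per-atom LAMMPS output, not the paper's actual data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic per-atom quantities standing in for a LAMMPS dump (illustrative
# values only): potential energy U, six stress*volume components, volume V.
rng = np.random.default_rng(0)
n_atoms = 1000
U = rng.normal(-3.3, 0.1, n_atoms)               # eV
stress_pv = rng.normal(0.0, 50.0, (n_atoms, 6))  # pressure * volume units
V = rng.normal(16.6, 0.5, n_atoms)               # cubic angstroms

# LAMMPS reports stress in pressure*volume units; divide by the atomic
# volume to obtain a stress in units of pressure.
stress = stress_pv / V[:, None]

# Scale every parameter to the common 0-10 range used in the paper.
features = np.column_stack([U, stress, V])
scaled = MinMaxScaler(feature_range=(0, 10)).fit_transform(features)
```

After scaling, each column spans exactly 0 to 10, so parameters with very different natural units (eV, GPa, Å³) become directly comparable.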

The clustering analysis to be delivered by the K-means algorithm was simplified a priori by identifying the relevant parameters, that is, those that can distinguish between different structure types. These parameters were identified using statistical analyses. To this aim, two groups of atoms were set up using the common neighbor analysis (CNA) tool [31, 32]: one consisting of atoms that exhibit fcc structure, and another consisting of atoms that belong to GBs. Atoms exhibiting other structure types were excluded from the study. For both groups, the medians of U, σij, σvm, and V were calculated, and the difference was determined as

\[
\Delta P\,(\%) = \frac{\bigl|\tilde{P}_{\mathrm{fcc}} - \tilde{P}_{\mathrm{GB}}\bigr|}{\bigl(\tilde{P}_{\mathrm{fcc}} + \tilde{P}_{\mathrm{GB}}\bigr)/2} \times 100, \tag{1}
\]

where P stands for U, σij, σvm, or V, and \(\tilde{P}\) denotes the median of P within each group. In this work, parameters with a high difference (above 10%) were identified as relevant. Other techniques are available to identify relevant parameters, such as non-parametric hypothesis tests; however, they were not used in this work since P-values are known to quickly go to zero in large samples [33].
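The median-difference criterion and its 10% relevance threshold amount to a few lines of code. The sketch below uses synthetic stand-ins for the scaled parameters, one well separated and one overlapping; the names and distributions are illustrative only:

```python
import numpy as np

def median_difference(p_fcc, p_gb):
    """Percentage difference between the group medians, taken relative
    to their average (the symmetric percentage difference)."""
    m_fcc, m_gb = np.median(p_fcc), np.median(p_gb)
    return abs(m_fcc - m_gb) / ((m_fcc + m_gb) / 2) * 100

rng = np.random.default_rng(1)
params = {
    # Clearly separated medians, like U in table 1.
    "U": (rng.normal(1.8, 0.3, 5000), rng.normal(3.2, 0.8, 5000)),
    # Nearly identical medians, like the shear stress components.
    "Sxy": (rng.normal(5.5, 0.5, 5000), rng.normal(5.5, 0.5, 5000)),
}

relevant = [name for name, (fcc, gb) in params.items()
            if median_difference(fcc, gb) > 10.0]
print(relevant)  # only the well-separated parameter survives the threshold
```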

Clustering was performed with the unsupervised learning method known as the K-means algorithm in order to identify the structure type of each atom. To simplify the clustering analysis, only the relevant parameters were considered in this step. The main goal of the K-means algorithm is to obtain k cluster centers, where each center corresponds to a cluster [34, 35]. At the beginning of the process, k cluster centers, called centroids, are selected at random, the distances between samples and centroids are calculated, and each sample is assigned to the nearest center. Then, all centroids are updated to the average of the samples in their cluster. This process is repeated until convergence or up to a given number of iterations. If an initial centroid is initialized far off, its cluster may end up with just a few points. To overcome this drawback, the K-means++ initialization algorithm was employed: the first centroid is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest previously selected centroid. The optimal number of clusters to be used in the K-means algorithm was determined by means of the elbow method. A total of two clusters were obtained, one corresponding to the fcc structure and another to the GBs. The overall performance of the algorithm in identifying atomic structures was inspected by comparing the results with the CNA method using diagnostic tools such as the confusion matrix, the F-1 score, and the Matthews correlation coefficient (MCC). All statistical analyses and ML methods were performed using the scikit-learn library for the Python programming language.
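A minimal sketch of this clustering step, using scikit-learn's KMeans with k-means++ initialization and an inertia curve for the elbow check. The two blobs are synthetic and only mimic the tight fcc and broad GB distributions; all values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 3D feature vectors standing in for the scaled (U, sigma_vm, V)
# parameters: a tight "fcc-like" blob and a broader "GB-like" blob.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1.8, 0.4, 2.6], 0.2, size=(9000, 3)),
               rng.normal([3.2, 2.4, 3.7], 0.6, size=(1000, 3))])

# Elbow method: within-cluster sum of squares (inertia) versus k.
inertias = {k: KMeans(n_clusters=k, init="k-means++", n_init=10,
                      random_state=0).fit(X).inertia_
            for k in range(1, 6)}

# Final model with the number of clusters chosen at the elbow (k = 2 here).
model = KMeans(n_clusters=2, init="k-means++", n_init=10,
               random_state=0).fit(X)
print(model.cluster_centers_.round(2))
```

The elbow is located where the inertia stops dropping sharply as k increases; for well-separated data like this, the kink at k = 2 is unambiguous.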

3. Results

3.1. Atomic structure characterization

Prior to carrying out statistical analyses and K-means clustering, it is necessary to inspect the atomic structure of the sample. With this aim, CNA was performed, revealing the structure type of each atom as shown in figure 1(a), where blue represents fcc structure, red GBs, and light blue hcp structure. By direct inspection, 87.5% of the sample corresponds to fcc structure, whereas 12.0% corresponds to GBs. Although the initial sample was composed only of fcc atoms and GBs, hcp structure nucleated due to the heating and quenching process, even forming stacking faults as shown in figure 1(b). Nevertheless, hcp atoms correspond to only 0.4% of the sample and were not considered in this study. A comparison between the fractions of atoms for each structure type is presented in figure 2.
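Counting the fraction of atoms per structure type from a per-atom label array (e.g. the CNA output exported from OVITO) is straightforward. The label list below is a small hypothetical stand-in that merely mirrors the reported percentages:

```python
from collections import Counter

# Hypothetical per-atom structure labels (the real sample has ~1.6 million
# atoms; these 1000 labels only mirror the reported fractions).
labels = ["fcc"] * 875 + ["gb"] * 120 + ["hcp"] * 4 + ["other"] * 1

counts = Counter(labels)
fractions = {s: n / len(labels) for s, n in counts.items()}
for structure in sorted(fractions):
    print(f"{structure}: {fractions[structure]:.1%}")
```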


Figure 1. Structure identification of the NC–Al sample following the CNA method. Blue atoms correspond to fcc structure, red to GB, and light blue to stacking faults.


Figure 2. Fraction of atoms for each atomic structure.


3.2. Identification of relevant parameters

The U, σij, σvm, and V parameters were calculated for each atom and scaled from 0 to 10 for direct comparison. Atoms were then divided into two groups, one corresponding to fcc structure and another to GBs. For each group, the medians were calculated, obtaining the values presented in table 1. Overall, the medians are similar for both groups, except for U, σvm, and V, which differ by ∼53%, ∼141%, and ∼34%, respectively. These results are supported by the distributions in figure 3, since little overlap between the groups is observed in each case. It is worth noting that the data set for the GB group is widely distributed due to the different arrangements that atoms can exhibit in the GBs. Thus, from the differences observed in the medians, U, σvm, and V are recognized as relevant parameters to identify the atomic structure of the NC–Al sample.

Table 1. Medians for each parameter corresponding to the fcc and GB groups.

Parameter   fcc median   GB median   Difference (%)
U              1.827        3.172        53.82
σxx            6.459        6.694         3.555
σyy            6.464        6.704         3.655
σzz            6.113        6.379         4.263
σxy            5.547        5.528         0.340
σxz            5.029        5.026         0.077
σyz            5.107        5.122         0.283
σvm            0.411        2.403       141.6
V              2.644        3.741        34.36

Figure 3. Distributions of the scaled (a) U, (b) σvm, and (c) V parameters.


3.3. K-means clustering

The K-means algorithm was applied to identify the atomic structure of the NC–Al sample. To this aim, the relevant parameters, that is, U, σvm, and V, were employed in conjunction with the algorithm. Two clusters of atoms were identified, one exhibiting fcc structure and another corresponding to GBs. The resulting clusters for each non-scaled parameter are shown in figure 4, where only 90% of the data is plotted for visualization purposes. As expected, the dispersion in the GB clusters is pronounced compared to the well-defined fcc clusters. The cluster centers are summarized in table 2.


Figure 4. Clusters obtained from the K-means algorithm. Non-scaled parameters are shown.


Table 2. Cluster centers for each parameter.

Parameter    fcc mean   GB mean
σvm (GPa)      1.294      6.568
U (eV)        −3.356     −3.291
V (Å³)        16.60      17.39

Although the above results give insight into the algorithm's capability to identify atomic structure, they do not indicate whether the structure of each atom was correctly identified. To this end, the confusion matrix was calculated using the CNA as the reference method, obtaining the results presented in figure 5. The top-left and bottom-right cells correspond to the numbers of fcc atoms and GB atoms correctly classified, respectively. In contrast, the top-right cell shows the number of GB atoms incorrectly classified as fcc atoms, whereas the bottom-left cell shows the number of fcc atoms incorrectly classified as GB atoms. A quick view reveals that a total of 1 568 968 atoms (top-left + bottom-right cells) were correctly identified, in contrast to the 50 068 misclassified atoms (top-right + bottom-left cells). Since the data set is highly imbalanced, the accuracy of the algorithm was measured using the F-1 score and the MCC. The former was 0.969 and the latter 0.859, indicating a good performance of the test.
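This diagnostic step can be sketched with scikit-learn's metrics. The label arrays below are synthetic, with a class imbalance and a small misclassified fraction that only loosely imitate the paper's numbers:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

# Synthetic labels: CNA as the reference, K-means as the prediction
# (0 = fcc, 1 = GB). 300 of 10 200 atoms are flipped to mimic
# misclassifications at the fcc-GB interface.
rng = np.random.default_rng(3)
cna = np.r_[np.zeros(9000, dtype=int), np.ones(1200, dtype=int)]
kmeans = cna.copy()
flip = rng.choice(len(cna), size=300, replace=False)
kmeans[flip] = 1 - kmeans[flip]

cm = confusion_matrix(cna, kmeans)  # rows: true class, columns: predicted
print(cm)
print(f"F-1 = {f1_score(cna, kmeans):.3f}")
print(f"MCC = {matthews_corrcoef(cna, kmeans):.3f}")
```

For imbalanced data like this, the MCC is the stricter of the two scores, since it penalizes errors in the minority class that plain accuracy would hide.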


Figure 5. Confusion matrix obtained from the K-means algorithm.


A slab of the NC–Al system is shown in figure 6, where (a) displays the atomic structure colored following the CNA, whereas (b) shows the atomic structure according to the clustering procedure. Overall, both cases exhibit almost the same results, except for minor differences in the GBs. This observation is highlighted in figure 6(c), where only the misclassified atoms are shown (top-right and bottom-left cells in figure 5). Interestingly, these atoms correspond to the fcc–GB interface, a zone that is hard to define.


Figure 6. Slab of the NC–Al system colored according to the (a) CNA and (b) predicted structure from K-means algorithm. (c) Misclassified atoms from the clustering analysis.


3.4. K-means initialization test

A well-known feature of the K-means algorithm is that the resulting clusters depend on the initial centroids. In order to assess the impact of initialization on the resulting clusters, the algorithm was run 10 times using different seeds, repeating the analysis performed in section 3.3. Interestingly, all runs delivered clusters with identical means. Table 3 shows the results for three different initializations, called I1, I2, and I3. Differences were observed only from the fifth decimal place onward (not shown in the table). The confusion matrix for each case was also calculated, obtaining only small differences in the structure type classification. For example, the matrices corresponding to the I1, I2, and I3 initializations (see figure 7) exhibited differences of up to ∼30 atoms, which is negligible compared to the total number of atoms in the system. Finally, the computed F-1 scores and MCCs in all cases were 0.969 and 0.859, respectively.
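The seed-stability check can be reproduced in outline as follows. The data is synthetic and well separated, so, as in the paper, all runs should converge to essentially identical centers:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, well-separated stand-ins for the scaled (U, sigma_vm, V) features.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([1.8, 0.4, 2.6], 0.2, size=(9000, 3)),
               rng.normal([3.2, 2.4, 3.7], 0.6, size=(1000, 3))])

centers = []
for seed in range(10):
    km = KMeans(n_clusters=2, init="k-means++", n_init=10,
                random_state=seed).fit(X)
    # Sort centers by their first coordinate so that runs are comparable
    # regardless of the arbitrary cluster label ordering.
    order = np.argsort(km.cluster_centers_[:, 0])
    centers.append(km.cluster_centers_[order])

# Largest spread of any center coordinate across the 10 seeds.
spread = np.ptp(np.stack(centers), axis=0).max()
print(f"max center spread across 10 seeds: {spread:.2e}")
```

Sorting the centers before comparing is necessary because K-means assigns cluster labels arbitrarily from run to run.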

Table 3. Cluster centers for each initialization.

              I1                    I2                    I3
              fcc mean   GB mean    fcc mean   GB mean    fcc mean   GB mean
σvm (GPa)      1.295      6.568      1.295      6.568      1.294      6.568
U (eV)        −3.356     −3.291     −3.356     −3.291     −3.356     −3.291
V (Å³)        16.60      17.39      16.60      17.39      16.60      17.39

Figure 7. Confusion matrix for initialization (a) I1, (b) I2, and (c) I3.


4. Discussion

K-means clustering showed that the per-atom U, σvm, and V were relevant parameters to distinguish between fcc structure and GBs. A total of 50 068 atoms were incorrectly classified. Interestingly, all these atoms belong to the fcc–GB interface, a zone that is hard to define and where even the CNA struggles. Therefore, the obtained misclassifications can be regarded as noise, which could eventually be reduced by including other quantities, such as the local geometry around a given atom, or by applying other clustering algorithms. Both topics will be explored in future works to determine whether the results yielded by unsupervised learning are enhanced or not.

The method presented here depends on locally computed properties available in molecular dynamics simulations and does not require additional input parameters. By contrast, the centrosymmetry parameter requires the number of nearest neighbors as an input setting, and the CNA requires a cut-off value according to the lattice parameter of the crystalline structure. Moreover, most traditional tools have been developed to study crystalline materials, diminishing their performance when applied to other systems. Nevertheless, despite the fact that no additional input parameters are involved in the K-means method, a cut-off must indeed be set for interatomic potentials in classical MD simulations. Thus, all parameters considered here, such as the potential energy and stress components, depend on the cut-off value used in the interatomic potential. This issue can be avoided by performing first-principles calculations, but a well-known drawback is their high computational cost for systems involving a large number of atoms.

A relevant matter concerning the K-means algorithm is that the resulting clusters depend on the initial selection of centroids. This subject was addressed by running the algorithm several times, obtaining the same means in all cases. Furthermore, the structure classifications obtained from different runs differed by just a few atoms. Hence, the K-means method arguably delivers a consistent classification of the structure type despite its randomness. However, a thorough investigation is required in order to further support this observation.

It is important to remark that the CNA was used here for two tasks. In the first, it was employed to identify the atoms that belong to the crystalline structure and those that belong to the grain boundaries. A set of parameters, such as the potential energy and stress components, was then computed for each group, allowing the identification of the relevant parameters to be used in the K-means algorithm. Another possible approach is not to identify such quantities, but to employ all of them. This would have led to the cumbersome analysis of several clusters such as those shown in figure 4, most of which would have shown indistinguishable fcc and GB atoms. In the second task, the CNA was used as a comparison tool to construct the confusion matrix of the results delivered by the K-means method, allowing the computation of the F-1 score and the MCC. In summary, in both tasks the CNA was employed as a diagnostic tool for analysis purposes.

All in all, the K-means algorithm, in conjunction with the per-atom potential energy, von Mises stress, and atomic volume, has the potential to be applied to the structure identification of more complex systems. For example, it could eventually be employed to identify the local structure of impurities [36, 37], martensitic transformations [38, 39], amorphous structures [40, 41], and voids [42, 43], among others.

5. Conclusions

The purpose of this work was to obtain relevant, yet simple, physical quantities that can be employed to identify the atomic structure of NC–Al. Statistical analysis and K-means clustering were performed to this aim, revealing that the per-atom potential energy, von Mises stress, and atomic volume discriminate between fcc structure and GBs. The method presented in this work has the advantage over traditional ones, such as the centrosymmetry parameter and common neighbor analysis, in that neither the number of nearest neighbors nor a cut-off value is required as an input setting. Hence, the approach presented here could be extended to identify more complex structures, such as those of binary alloys, amorphous materials, dislocations, and vacancies, among others, where traditional tools fail.

Acknowledgments

Powered@NLHPC: this research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02).
