Using convolutional neural networks for stereological characterization of 3D hetero-aggregates based on synthetic STEM data

The 3D nano/microstructure of materials can significantly influence their macroscopic properties. In order to enable a better understanding of such structure-property relationships, 3D microscopy techniques can be deployed, which are however often expensive in both time and costs. Often 2D imaging techniques are more accessible, yet they have the disadvantage that the 3D nano/microstructure of materials cannot be directly retrieved from such measurements. The motivation of this work is to overcome the issues of characterizing 3D structures from 2D measurements for hetero-aggregate materials. For this purpose, a method is presented that relies on machine learning combined with methods of spatial stochastic modeling for characterizing the 3D nano/microstructure of materials from 2D data. More precisely, a stochastic model is utilized for the generation of synthetic training data. This kind of training data has the advantage that time-consuming experiments for the synthesis of differently structured materials followed by their 3D imaging can be avoided. More precisely, a parametric stochastic 3D model is presented, from which a wide spectrum of virtual hetero-aggregates can be generated. Additionally, the virtual structures are passed to a physics-based simulation tool in order to generate virtual scanning transmission electron microscopy (STEM) images. The preset parameters of the 3D model together with the simulated STEM images serve as a database for the training of convolutional neural networks, which can be used to determine the parameters of the underlying 3D model and, consequently, to predict 3D structures of hetero-aggregates from 2D STEM images. Furthermore, an error analysis is performed with respect to structural descriptors, e.g. the hetero-coordination number. The proposed method is applied to image data of TiO2-WO3 hetero-aggregates, which are highly relevant in photocatalysis processes. 
However, the proposed method can be transferred to other types of aggregates and to different 2D microscopy techniques. Consequently, the method is relevant for industrial or laboratory setups in which product quality is to be quantified by means of inexpensive 2D image acquisition.


Introduction
The properties of many functional materials depend to a large extent on their structure and chemical composition. Hence, measuring both is mandatory in order to understand and improve their effective properties. An important class of materials are hetero-aggregates, which are compositions of at least two dissimilar classes of primary particles, called particles from now on for the sake of simplicity. Properties of hetero-aggregates can be quite different in comparison to aggregates that consist of monodisperse particles. A prominent example in applications concerned with photocatalysis are hetero-aggregates made of titanium dioxide (TiO2) and tungsten trioxide (WO3) [33,40,43]. The combination of both materials leads to aggregates with hetero-junctions, i.e., points at which two particles made from different materials touch. At such junctions, photogenerated electron-hole pairs are spatially separated, hindering their direct recombination, which results in a higher photocatalytic activity compared to pure TiO2 [25].
In order to accurately investigate the properties of hetero-aggregates with imaging techniques, it is essential to resolve the individual particles within the structure. A suitable tool for the characterization of hetero-aggregates consisting of particles with radii of a few nanometers is (scanning) transmission electron microscopy, (S)TEM. With a spatial resolution in the sub-nanometer regime, even the atomic structure can be investigated. However, in conventional STEM only two-dimensional (2D) projection images of the aggregates can be acquired, while information about the third dimension is lost. This problem can be overcome using STEM tomography, where the sample is tilted with respect to the electron beam such that a series of projection images under various projection angles is acquired, see [28]. From this series of STEM projection images, the three-dimensional structure can be reconstructed, e.g., with iterative reconstruction techniques [12]. The major disadvantage of STEM tomography is that the acquisition of a single tilt series can take several hours, and thus this method hardly allows for the investigation of a large number of aggregates. Furthermore, many samples do not allow for such a long measurement, as hetero-aggregates and nanoparticles can change their structure and arrangement during extensive exposure to the electron beam, hindering the reconstruction.
As opposed to STEM tomography, 2D STEM images can be acquired within a few seconds, allowing for the acquisition of several images of various aggregates in a reasonable amount of time. For this reason, it is desirable to use 2D STEM images in order to characterize the 3D morphology of aggregates. This can be achieved by training neural networks to predict structural properties of 3D hetero-aggregates from 2D STEM images. However, the training of neural networks requires a broad database of pairs of differently structured hetero-aggregates and corresponding 2D STEM images. The experimental acquisition of such a database, i.e., the synthesis of differently structured aggregates and their imaging, would be expensive in both time and resources. Alternatively, simulated image data can be used for training purposes, see [10] for a similar approach.
In the present paper, a stochastic 3D model for the generation of virtual aggregates and a physics-based STEM model for the simulation of corresponding 2D STEM images are combined in order to provide training data. In other words, methods of stochastic geometry [4] are utilized to derive a parametric model for the generation of a wide spectrum of virtual, but realistic aggregates. Additionally, the virtual structures are passed to a physics-based simulation tool in order to generate virtual scanning transmission electron microscopy (STEM) images. The preset parameters of the 3D model together with the simulated STEM images serve as a database for the training of convolutional neural networks, which can be used to predict the parameters of the underlying 3D model and, consequently, to predict 3D structures of hetero-aggregates from 2D STEM images. In the literature, there are already CNN-based approaches that generate stochastically equivalent 3D structures from 2D images without using stochastic geometry models [19]. In contrast, the presented approach aims to generate such stochastically equivalent structures, called digital shadows in the following, by combining a well-established parametric stochastic 3D model and a CNN-based approach. In order to use such a parametric stochastic 3D model to generate digital shadows of hetero-aggregates, appropriate values of the model parameters must be chosen. The focus of the present paper is on this calibration procedure, also called model fitting.
More specifically, it is investigated how convolutional neural networks (CNNs) [13,16] can be used to determine the parameters of the stochastic 3D model and, consequently, to generate digital shadows of 3D aggregates from 2D STEM images. CNNs are a type of artificial neural network commonly used in image analysis and recognition tasks, see e.g. [22]. They consist of multiple layers of neurons that learn to recognize patterns and features in the input data through a calibration process, called training.
In conventional spatial stochastic modeling of complex 3D morphologies, the process of model fitting typically involves several steps, see for example [31,42]. First, image data has to be acquired, preprocessed, and segmented. Subsequently, an appropriate model type is chosen, and its model parameters are adjusted accordingly using descriptive statistics of the segmented image data. However, the approach considered in the present paper differs from the classical one. On the one hand, the image data does not have to be segmented, which is advantageous since image segmentation can be a time-consuming and complex task. On the other hand, the model parameters are predicted by the neural networks directly, meaning that the descriptive statistics are not chosen by hand. This allows for the use of stochastic 3D models with parameters which cannot easily be estimated from the image data.
In order to evaluate the performance of such a CNN-based approach, structural descriptors of aggregates drawn from the stochastic 3D model with preset parameter values are compared with structural descriptors of aggregates drawn from the 3D model with parameter values predicted by the CNN-based approach.
However, the structural similarity of the measured image data of aggregates and image data drawn from the fitted 3D model strongly depends on two factors: (i) the suitability of the chosen model type for the given data, and (ii) the ability of the selected CNN approach to determine the parameters of the stochastic 3D model from 2D STEM image data. More specifically, when analyzing measured image data of experimentally synthesized hetero-aggregates, there might not be any configuration of model parameters that results in a high-quality fit. In this case, the dissimilarities between the original image data and its digital shadows, generated by the fitted model, cannot necessarily be attributed to the fitting procedure, but rather to the inadequate choice of the model type. Thus, in the present paper, to be able to attribute these dissimilarities to an inadequate CNN approach, including data preprocessing, model architecture and learning procedure, the same stochastic 3D model is used as both the generator for the training data and the model to be fitted.
For an adequately designed CNN approach and an adequately chosen type of stochastic 3D model, the digital shadows drawn from the fitted 3D model should be statistically equivalent to experimentally synthesized aggregates in terms of their 3D structure and chemical composition. Then, these digital shadows can be used as geometry input for (spatially resolved) numerical modeling and simulation, to determine their functional properties, see e.g. [30,34]. In this way, 3D imaging techniques like STEM tomography of the aggregates can be avoided when deriving quantitative process-structure or structure-property relationships for hetero-aggregates. Note that by means of such relationships, optimized specifications of process parameters can be deduced, which lead to hetero-aggregates with desired structures and properties. Digital shadows used for structure-property optimization are also referred to as digital twins. Their implementation will be the subject of a forthcoming study.
The present work is organized as follows: In Section 2, the CNN-based approach for predicting the 3D structure of hetero-aggregates from 2D STEM images is described. In particular, the generation of the synthetic training data used for the prediction of model parameters is explained. Then, in Section 3, the results are presented which have been obtained for various aspects of model parameter prediction. Section 4 compares the methods developed in the present paper with analysis tools considered in the literature. Section 5 concludes.

Methods
This section provides details on how the presented CNN-based approach for predicting the 3D structure of hetero-aggregates from 2D STEM images is built. It comprises two main steps. First, virtual but realistic STEM images are generated from simulated 3D image data. More specifically, synthetic aggregates are drawn from a stochastic 3D model with preset model parameters, where the latter describe the aggregation procedure simulated by the model and therefore influence structural properties of the generated aggregates. These aggregates are then used to generate corresponding STEM images by means of a physics-based simulation tool, see Figure 1a. Systematically varying the parameters of the stochastic 3D model provides a wide range of differently structured aggregates and their STEM images. In a second step, visualized in Figure 1b, the parameters of the stochastic 3D model together with the simulated STEM images serve as a database for the training of CNNs, in order to learn how to reconstruct the parameters of the stochastic 3D model from STEM images. For the reconstruction, a CNN initially extracts features from STEM images which characterize the depicted structure of aggregates in an informative but not necessarily interpretable manner. Then, these features are utilized to predict our interpretable predefined model parameters. For more details, see Section 2.3. This approach is designed to allow for quick and accurate prediction of model parameters for real hetero-aggregates from measured STEM images and, consequently, to predict the 3D morphology of hetero-aggregates from 2D STEM images.
The quality of the predictor is evaluated with respect to the similarity between predefined and predicted model parameters. Recall that interpretable model parameters describe the aggregation procedure simulated by the stochastic model. Thus, a good match between predefined and predicted model parameters can already be an indication of a good structural match between aggregates generated by the model with predefined/predicted parameters. Nevertheless, some structural descriptors (i.e., quantities which characterize the structure of aggregates, like the hetero-coordination number) may be sensitive with respect to changes in the model parameters.
Therefore, the quality of the predictor is further evaluated by comparing structural descriptors of aggregates drawn from stochastic 3D models with predefined and predicted parameters, respectively, see Figure 1c,d. The structural descriptors considered in this paper, which are chosen due to their relevance in process engineering, are displayed in Table 1. They are complementary to the features utilized in the model parameter prediction. Furthermore, these descriptors are interpretable and characterize the 3D structure of the aggregates (whereas the features describe the structure observed in 2D images).

Generation of synthetic training data
The use of synthetic training data requires careful attention to ensure that the artificially generated data accurately reflects particularities of experimentally measured data, such that a regression model (e.g., a CNN) trained on synthetic data can be extended to new, real-world data. More precisely, if the generation of realistic data is successful, a network trained on this data can be applied to real-world data, thus reducing the amount of experimentally measured and labeled training data that is needed. In the present study, synthetic training data was generated through a three-step process. First, virtual hetero-aggregates were generated using a stochastic 3D model. Then, using a physics-based simulation tool, STEM intensities were determined based on the material and thickness of the aggregates. Finally, virtual but realistic STEM images were computed by adding noise and other sources of variability to the previously determined STEM intensities. In the following, the stochastic 3D model is introduced, and then more details about each of the data generation steps mentioned above are provided.
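The model parameters $\theta_{D_f}$, $\theta_\rho$ and $(\theta_0, \theta_1)$ of the stochastic 3D model introduced below control the fractal dimension, the mixing ratio, and the clustering properties of the hetero-aggregates, respectively.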
Throughout this paper, a spherical particle is defined as a triplet $p = (x, r, l)$ of particle position $x \in \mathbb{R}^3$, radius $r \in \mathbb{R}_+ = (0, \infty)$ and label $l \in \{0, 1\}$. Moreover, a hetero-aggregate $A$, consisting of $N$ particles for some fixed $N \in \mathbb{N}$, is a set of connected and non-overlapping spherical particles, i.e.,
$$A = \{p_i = (x_i, r_i, l_i) : 1 \le i \le N\}. \tag{1}$$
In this context, two particles $p, p' \in A$ are said to be connected if for some $j \in \{2, \dots, N\}$ there is a set of indices $\{i_1, \dots, i_j\} \subset \{1, \dots, N\}$ with $p = p_{i_1}$ and $p' = p_{i_j}$, such that
$$\|x_{i_k} - x_{i_{k+1}}\| \le 1.01\,(r_{i_k} + r_{i_{k+1}}) \quad \text{for all } k \in \{1, \dots, j-1\}, \tag{2}$$
where $\|y\| = \sqrt{\sum_{k=1}^{3} y_k^2}$ denotes the Euclidean norm of $y = (y_1, y_2, y_3) \in \mathbb{R}^3$. The prefactor 1.01 in Eq. (2) sets the maximum distance at which two particles are still considered to be in contact: the gap between their surfaces may be at most 1% of the sum of their radii. Moreover, two particles $p = (x, r, l)$, $p' = (x', r', l')$ are said to be overlapping if the distance of their centers is smaller than the sum of their radii, i.e., $\|x - x'\| < r + r'$. The label $l$ of a particle $p = (x, r, l)$ determines its material. More precisely, in our case, a particle with label $l = 0$ consists of WO3, whereas a particle with label $l = 1$ consists of TiO2.
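The two defining properties of an aggregate, connectedness and absence of overlap, can be checked directly on a list of particle triplets. The following is a minimal sketch, assuming that two particles count as being in contact when the distance of their centers is at most 1.01 times the sum of their radii (as suggested by the prefactor discussed above); the function names and the tuple layout `((x, y, z), r, l)` are illustrative choices, not taken from the original work.

```python
import math
from itertools import combinations

# A particle is a triplet (x, r, l): position tuple, radius, label (0 = WO3, 1 = TiO2).

def in_contact(p, q, prefactor=1.01):
    """True if the centre distance is at most prefactor * (sum of radii)."""
    return math.dist(p[0], q[0]) <= prefactor * (p[1] + q[1])

def overlapping(p, q):
    """True if the centre distance is strictly smaller than the sum of radii."""
    return math.dist(p[0], q[0]) < p[1] + q[1]

def is_connected(aggregate, prefactor=1.01):
    """Depth-first search over the contact graph, starting at particle 0."""
    if not aggregate:
        return False
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(len(aggregate)):
            if j not in seen and in_contact(aggregate[i], aggregate[j], prefactor):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(aggregate)

def is_valid_aggregate(aggregate, prefactor=1.01):
    """Connected and pairwise non-overlapping."""
    no_overlap = not any(overlapping(p, q) for p, q in combinations(aggregate, 2))
    return no_overlap and is_connected(aggregate, prefactor)
```

Any two particles are connected in the chain sense exactly when they lie in the same connected component of the contact graph, which is why a single graph traversal suffices.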
The mixing ratio $\rho$ of an aggregate $A$ is defined as its fraction of particles with label $l = 0$, i.e.,
$$\rho = \frac{\#\{(x, r, l) \in A : l = 0\}}{\#A}, \tag{3}$$
where $\#$ denotes cardinality. Notice the distinction in notation between $\theta_\rho$ and $\rho$, since these values are not necessarily equal. More precisely, the model parameter $\theta_\rho$ can be set to an arbitrary value in the interval $[0, 1]$, and it primarily influences the distribution of the structural descriptor $\rho$ of aggregates generated with $\theta_\rho$, as explained in more detail later on. Furthermore, the radius of gyration $R_g > 0$ of an aggregate $A$ is given by
$$R_g = \sqrt{\frac{\sum_{i=1}^{N} m_i \|x_i - c_0\|^2}{\sum_{i=1}^{N} m_i}}, \tag{4}$$
where $m_1, \dots, m_N > 0$ denote the particle masses and $c_0$ is the aggregate's center of mass. The stochastic 3D model described below is motivated by the idea that hetero-aggregates have a fractal-like structure [9,27]. This fractal-like structure of an aggregate $A$ can be quantified by the so-called fractal dimension $D_f$, given by
$$D_f = \frac{\log(N / k_f)}{\log(R_g / a)}, \quad \text{i.e.,} \quad N = k_f \left(\frac{R_g}{a}\right)^{D_f}, \tag{5}$$
where $k_f > 0$ is a fractal prefactor, which is set to 1.3, and $a = \frac{1}{N} \sum_{i=1}^{N} r_i$ is the mean radius of the particles. For example, aggregates with a fractal dimension $D_f$ close to 1 are arranged in a nearly straight line, whereas those with a fractal dimension $D_f$ close to 3 are composed of densely packed particles. Thus, realistic hetero-aggregates have values for $D_f$ within the interval $(1, 3)$, see e.g. [3,6,7,8,9].
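To make the three descriptors concrete, the sketch below computes the mixing ratio, the radius of gyration and the fractal dimension of a given aggregate. It assumes the usual mass-weighted definition of the radius of gyration and the scaling law $N = k_f (R_g / a)^{D_f}$ solved for $D_f$; equal particle masses are used by default, which is an assumption made here for simplicity. Function names are illustrative.

```python
import math

# A particle is a triplet (x, r, l): position tuple, radius, label.

def mixing_ratio(aggregate):
    """Fraction of particles with label 0."""
    return sum(1 for (_, _, label) in aggregate if label == 0) / len(aggregate)

def radius_of_gyration(aggregate, masses=None):
    """Mass-weighted root-mean-square distance of particle centres to the centre of mass."""
    if masses is None:
        masses = [1.0] * len(aggregate)  # assumption: equal masses
    total = sum(masses)
    c0 = tuple(
        sum(m * p[0][k] for m, p in zip(masses, aggregate)) / total for k in range(3)
    )
    second_moment = sum(m * math.dist(p[0], c0) ** 2 for m, p in zip(masses, aggregate))
    return math.sqrt(second_moment / total)

def fractal_dimension(aggregate, k_f=1.3, masses=None):
    """Solve N = k_f * (R_g / a)**D_f for D_f, with a the mean particle radius."""
    n = len(aggregate)
    a = sum(p[1] for p in aggregate) / n
    r_g = radius_of_gyration(aggregate, masses)
    return math.log(n / k_f) / math.log(r_g / a)
```

For a straight chain of touching, equally sized spheres the resulting value lies only slightly above 1, in line with the interpretation of small fractal dimensions given above.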
Note that the hetero-aggregate model presented in this paper is based on cluster-cluster aggregation, which involves a two-stage process for aggregate formation.In the first stage, primary particles aggregate to form small, homogeneous primary clusters.These primary clusters then undergo a second aggregation stage, leading to larger hetero-aggregates.
If an aggregate is homogeneous, i.e., all its particles share the same material, the labels $\{l_i\}_{i=1}^{N}$ will be neglected, and therefore the description of the aggregate $A$ can be compressed to
$$A = \{(x_i, r_i) : 1 \le i \le N\}.$$
In this case, the primary cluster model is introduced as a random set $\Phi_N = \{P_i : 1 \le i \le N\} \subset \mathbb{R}^3 \times \mathbb{R}_+$ which models the geometry of small homogeneous clusters of size $N$ for some fixed $N \in \mathbb{N}$, compare [1] for earlier work. Here, $P_i = (X_i, R_i)$, where $X_i$ is a random vector and $R_i$ is a non-negative random variable describing the position and radius of a particle, respectively, for each $i \in \{1, \dots, N\}$.
The random variables $R_1, \dots, R_N$ are independent and log-normally distributed with parameters $\mu = 12$ nm and $\sigma = 3$ nm. However, the random vectors $X_1, \dots, X_N$, which describe the particle positions, are recursively defined due to the dependency of $X_i$ on $X_1, \dots, X_{i-1}$ and $R_1, \dots, R_i$ for all $1 < i \le N$. This approach ensures that every realization of $\Phi_N$ is a set of connected and non-overlapping particles with a predetermined fractal dimension $D_f$. Note that for technical reasons the random vector $X_i$ can take not only values from $\mathbb{R}^3$, but also the fictitious value $\infty$. The latter value is used to model invalid particle positions.
More precisely, $X_1 = (0, 0, 0)$ and, under the condition that the values $x_1, \dots, x_i$ and $r_1, \dots, r_{i+1}$ of $X_1, \dots, X_i$ and $R_1, \dots, R_{i+1}$ are given for some $i \in \{1, \dots, N-1\}$, the random vector $X_{i+1}$ is uniformly distributed on some set $L(A, r_{i+1}) \subset \mathbb{R}^3$, provided that $(\infty, r) \notin A$ for all $r \in \mathbb{R}_+$ and $L(A, r_{i+1}) \neq \emptyset$; otherwise $X_{i+1} = \infty$. Here, $A = \{(x_1, r_1), \dots, (x_i, r_i)\}$ and $L(A, r_{i+1}) \subset \mathbb{R}^3$ is the set of all permissible particle positions $x \in \mathbb{R}^3$ such that the set $A \cup \{(x, r_{i+1})\}$ describes a cluster of connected and non-overlapping particles with fractal dimension $D_f$ being equal to some preset value $\theta_{D_f} \in (1, 3)$. In other words, $L(A, r_{i+1})$ is the set of positions where a particle of radius $r_{i+1}$ can be added to the cluster $A$ without violating the equation $D_f = \theta_{D_f}$. If no such position exists, $X_{i+1}$ will be assigned $\infty$, indicating that the cluster $A$ cannot be extended.
To draw a sample from the random set $\Phi_N = \{P_i : 1 \le i \le N\} \subset \mathbb{R}^3 \times \mathbb{R}_+$, the procedure described above is repeated until $X_i \neq \infty$ for all $i = 1, \dots, N$.
The primary clusters generated in this way then undergo a second aggregation stage, leading to larger hetero-aggregates which consist of $N'$ primary clusters for some integer $N' \in \mathbb{N}$.
More formally, for some sequence of primary cluster sizes $N_1, \dots, N_{N'}$, $N'$ independent random sets $\Phi^{(1)}_{N_1}, \dots, \Phi^{(N')}_{N_{N'}}$ are considered, where $\Phi^{(k)}_{N_k}$ is assigned a random position $C_k$ in $\mathbb{R}^3 \cup \{\infty\}$ for each $k \in \{1, \dots, N'\}$, ensuring that realizations of the resulting hetero-aggregates are union sets of connected and non-overlapping spheres which adhere to a preset fractal dimension $D_f = \theta_{D_f}$. In the following, the cluster $\Phi^{(k)}_{N_k}$ is shifted by a (random) displacement vector $C_k$ and assigned a (random) label $L_k$ which can be equal to 0 or 1, determining whether the cluster consists of WO3 or TiO2. The clusters of label 0 have a size of $\theta_0$, whereas the clusters of label 1 have a size of $\theta_1$. These cluster sizes $\theta_0, \theta_1 \in \{1, \dots, 6\}$ are further model parameters. The hetero-aggregate model $\Psi_{N'}$ consists of the shifted and labeled primary clusters. Here, the random variables $L_1, \dots, L_{N'}$, modeling the labels of the primary clusters, are independent and Bernoulli-distributed with a success probability $P(L_k = 1)$ that is the same for each $k \in \{1, \dots, N'\}$. Note that the label of a primary cluster does not only determine its material but also its size. Specifically, the size $N_k$ of the $k$-th primary cluster is given by $N_k = \theta_0 + L_k(\theta_1 - \theta_0)$ for each $k \in \{1, \dots, N'\}$, i.e., a cluster has a size of $\theta_0$ if its label is equal to 0, and $\theta_1$ otherwise. For sufficiently large $N' \in \mathbb{N}$, according to the law of large numbers, these definitions of $L_1, \dots, L_{N'}$ and $N_1, \dots, N_{N'}$ ensure that the mixing ratios $\rho$ of hetero-aggregates drawn from the stochastic 3D model $\Psi_{N'}$ are approximately equal to the preset value $\theta_\rho$.
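The law-of-large-numbers statement above fixes which label probability is compatible with a given target mixing ratio. As a sketch, assume that the mixing ratio counts particles and write $q = P(L_k = 1)$ (the symbol $q$ is introduced here for illustration). For many clusters, the fraction of label-0 particles tends to $(1-q)\theta_0 / ((1-q)\theta_0 + q\theta_1)$; setting this equal to $\theta_\rho$ and solving for $q$ gives the expression implemented below. This is a derivation under the stated assumption, not necessarily the form used in the original work.

```python
import random

def label_probability(theta_rho, theta_0, theta_1):
    """Probability q = P(L_k = 1) for which the limiting particle fraction
    of label 0 equals theta_rho (derived under the stated assumption)."""
    weight_0 = theta_0 * (1.0 - theta_rho)
    return weight_0 / (weight_0 + theta_1 * theta_rho)

def simulate_mixing_ratio(theta_rho, theta_0, theta_1, n_clusters, rng):
    """Draw cluster labels and return the fraction of particles with label 0.
    Only labels and cluster sizes are simulated, no geometry."""
    q = label_probability(theta_rho, theta_0, theta_1)
    n_label_0 = 0
    n_label_1 = 0
    for _ in range(n_clusters):
        if rng.random() < q:
            n_label_1 += theta_1  # cluster of label 1 contributes theta_1 particles
        else:
            n_label_0 += theta_0  # cluster of label 0 contributes theta_0 particles
    return n_label_0 / (n_label_0 + n_label_1)
```

For equal cluster sizes the expression reduces to $q = 1 - \theta_\rho$, as one would expect when every cluster contributes the same number of particles.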
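The random displacement vectors $C_1, \dots, C_{N'}$ that describe the positions of primary clusters in the hetero-aggregate model $\Psi_{N'}$ are again defined recursively to ensure that the particles of the random hetero-aggregate are connected and non-overlapping and that the fractal dimension $\theta_{D_f}$ is maintained. More precisely, $C_1$ is put to $(0, 0, 0)$ and, analogously to the primary cluster model, each further cluster is placed at a position drawn uniformly from a set $L(A_1, A_2) \subset \mathbb{R}^3$, where $A_1$ denotes the aggregate formed so far and $A_2$ the primary cluster to be added. Here, $L(A_1, A_2)$ is the set of all cluster positions $c \in \mathbb{R}^3$ for which the set $A_1 \cup (A_2 + c)$ represents a hetero-aggregate of connected and non-overlapping particles with fractal dimension $D_f$ being equal to the preset value $\theta_{D_f} \in (1, 3)$.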
The resulting hetero-aggregate model $\Psi_{N'}$, which is described by the model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$, can now be used to generate virtual aggregates. These aggregates consist of $N'$ primary clusters with a fractal dimension of $\theta_{D_f}$ and an expected mixing ratio $\theta_\rho$, and they have label-dependent clustering properties mainly influenced by $\theta_0$ and $\theta_1$. Moreover, the model parameters have a multivariate influence on further structural descriptors, e.g. the hetero-coordination number, see Section 3.4. In theory, virtual aggregates can be generated by drawing samples from $\Psi_{N'}$ under the condition that $(\infty, r) \notin \Psi_{N'}$ for all $r \in \mathbb{R}_+$. However, due to computational limitations, this procedure can only be performed in an approximate sense. In the following section, this will be explained in detail.

Generation of virtual hetero-aggregates
The recursively defined models of primary clusters and hetero-aggregates described above can be used to construct algorithms for drawing samples from these models. More precisely, the simulation starts by selecting an initial particle (or cluster), to which particles (or clusters) are added sequentially. Each additional particle (or cluster) is assigned a random radius (or label) and placed at a uniformly sampled random position in $L(A, r_{i+1})$ (or $L(A_1, A_2)$), to be added to the existing cluster (or aggregate). This procedure is iterated until a desired cluster (or aggregate) size is reached. In the following, the desired size of each aggregate is independently and uniformly selected from the range $\{20, \dots, 80\}$.
The sets $L(A, r_{i+1}), L(A_1, A_2) \subset \mathbb{R}^3$ in the stochastic 3D model, from which particle (or cluster) positions are uniformly sampled in order to generate aggregates with a given fractal dimension, are only implicitly defined. Therefore, uniform sampling on these sets is computationally expensive.
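The sequential structure of this procedure (start with one particle, add further particles one at a time, give up if no valid position is found) can be sketched in a few lines. Note that the following is a deliberately simplified stand-in: a new particle is attached to a randomly chosen existing particle in a random direction, and only connectedness and non-overlap are enforced. The fractal-dimension constraint of the model is ignored, and the resulting positions are not uniformly distributed on the set of permissible positions. All names and the retry limit are illustrative.

```python
import math
import random

def random_unit_vector(rng):
    """Uniform direction on the unit sphere via rejection sampling in the cube."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        norm_sq = sum(c * c for c in v)
        if 1e-12 < norm_sq <= 1.0:
            norm = math.sqrt(norm_sq)
            return tuple(c / norm for c in v)

def grow_cluster(radii, rng, max_tries=10000):
    """Sequentially attach particles with the given radii.
    Returns a list of (position, radius) pairs, or None if some particle
    could not be placed (None plays the role of the fictitious value infinity)."""
    cluster = [((0.0, 0.0, 0.0), radii[0])]
    for r in radii[1:]:
        for _ in range(max_tries):
            j = rng.randrange(len(cluster))      # particle to attach to
            x_j, r_j = cluster[j]
            u = random_unit_vector(rng)
            x = tuple(x_j[k] + (r_j + r) * u[k] for k in range(3))  # touching position
            free = all(
                i == j or math.dist(x, x_i) >= r_i + r
                for i, (x_i, r_i) in enumerate(cluster)
            )
            if free:
                cluster.append((x, r))
                break
        else:
            return None  # no valid position found within max_tries
    return cluster
```

In the spirit of the procedure described above, a caller would repeat `grow_cluster` until it returns a valid cluster rather than `None`.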
To enable efficient uniform sampling from both $L(A, r_{i+1})$ and $L(A_1, A_2)$, the radii of the particles in the sets $A \cup \{p_{i+1}\}$ and $A_1 \cup A_2$ are temporarily replaced by their respective arithmetic mean. Note that this replacement is used exclusively when calculating the fractal dimension $D_f$ within the definitions of these two sets. Thus, all permissible positions for the center of mass of the added particle (or cluster) are located on the surface of a sphere around the center of mass of the cluster (or aggregate), the radius $d$ of which is given by
$$d = \sqrt{\frac{a^2 (N_A + N_C)^2}{N_A N_C} \left(\frac{N_A + N_C}{k_f}\right)^{2/\theta_{D_f}} - \frac{N_A + N_C}{N_C} R_A^2 - \frac{N_A + N_C}{N_A} R_C^2},$$
where $N_A$ and $N_C$ denote the number of particles in the aggregate and the cluster to be added, respectively, $R_A$ and $R_C$ are their respective radii of gyration, introduced in Eq. (4), and $a$ and $k_f$ are the quantities used in the definition of $D_f$ given in Eq. (5), see also [8]. Since uniform sampling on the sphere surface can be performed efficiently by means of rejection sampling, uniform sampling from the modified sets can be done much faster. This procedure results in aggregates with fractal dimensions randomly distributed around the target value $\theta_{D_f}$. For further details on the distribution of the fractal dimension $D_f(A)$ of an aggregate $A$ generated by this model, see Section 2.2.1.
For data acquisition, the four model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ of the stochastic hetero-aggregate model are systematically varied, namely the parameter regarding the fractal dimension $\theta_{D_f}$, the intended mixing ratio $\theta_\rho$ and the primary cluster sizes $\theta_0$ and $\theta_1$ of the two materials. In this manner, a broad spectrum of aggregates is obtained, which differ not only in the preset model parameters used for their generation but also in structural descriptors like the ones listed in Table 1. The fractal dimension of TiO2-WO3 hetero-aggregates, which form by diffusion-limited cluster-cluster aggregation, is expected to approach the value of $D_f = 1.5$ for particles with dispersed sizes [6], and $D_f = 1.78$ for monodispersed particles [18]. Furthermore, the fractal dimension is expected to increase when particles start to sinter at their contact points [7]. In order to create a large database of differently structured virtual hetero-aggregates and their corresponding STEM images, the model parameter $\theta_{D_f}$ was varied in the present work from $\theta_{D_f} = 1.5$ to 2.5 in steps of 0.1. The intended mixing ratio $\theta_\rho$ was varied from $\theta_\rho = 0.1$ to 0.9 in steps of 0.1, and the primary cluster sizes $\theta_0$ and $\theta_1$ were chosen between one and six in steps of one for both materials, see also Table 2. Some examples of virtual hetero-aggregates for various values of the model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ are visualized in Figure 2.
Figure 2: Virtual hetero-aggregates for various values of the model parameters. The particle sizes are drawn from a log-normal distribution with parameters $\mu$ and $\sigma$ as defined above. Some primary clusters, as defined in Section 2.1.1, are highlighted.
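The radius $d$ of the sphere of permissible positions can be made tangible with a short numerical check. The sketch below assumes equal particle masses (consistent with replacing all radii by their mean) and obtains $d$ by requiring that the merged aggregate satisfies the scaling law $N = k_f (R_g / a)^{D_f}$; the combination rule for the radius of gyration of two rigid bodies whose centers of mass are a distance $d$ apart is the parallel axis theorem. Function names are illustrative.

```python
import math

def target_radius_of_gyration(n, a, d_f, k_f=1.3):
    """Radius of gyration that n particles of mean radius a must have so that
    N = k_f * (R_g / a)**d_f holds."""
    return a * (n / k_f) ** (1.0 / d_f)

def attachment_distance(n_a, n_c, r_a, r_c, a, d_f, k_f=1.3):
    """Distance d between the centres of mass of aggregate (n_a particles, radius
    of gyration r_a) and cluster (n_c, r_c) such that the merged aggregate hits
    the target fractal dimension. Returns None if no such distance exists."""
    n = n_a + n_c
    d_sq = (
        a * a * n * n / (n_a * n_c) * (n / k_f) ** (2.0 / d_f)
        - n / n_c * r_a ** 2
        - n / n_a * r_c ** 2
    )
    return math.sqrt(d_sq) if d_sq > 0 else None

def merged_radius_of_gyration(n_a, n_c, r_a, r_c, d):
    """Radius of gyration of the union of two equal-mass particle sets whose
    centres of mass are a distance d apart (parallel axis theorem)."""
    n = n_a + n_c
    return math.sqrt((n_a * r_a ** 2 + n_c * r_c ** 2) / n + n_a * n_c * d * d / (n * n))
```

Plugging the distance returned by `attachment_distance` into `merged_radius_of_gyration` reproduces the target radius of gyration of the merged aggregate, which is the consistency property the sphere radius is meant to guarantee.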

Simulation of STEM intensities
After generating virtual hetero-aggregates, reference simulations were conducted to calculate the high-angle annular dark-field (HAADF)-STEM intensity of TiO2 and WO3 as a function of the sample thickness and material density. For that purpose, multi-slice simulations in the frozen-lattice approach [41] were performed with the STEMSIM software [35]. Simulations were done for the rutile and anatase phases of TiO2 as well as for the gamma and delta phases of WO3. Crystal parameters and Debye-Waller factors were taken from [5,14,24,38], and elastic atomic scattering amplitudes from [21] were used. The HAADF-STEM intensity was simulated for microscope parameters equal to those one would use in experiments with a ThermoFisher 60/300 Spectra microscope. This machine is equipped with a Cs-corrector for the probe forming system, an X-FEG and SuperXG2 EDXS detectors. A semi-convergence angle of β = 21.1 mrad and an acceleration voltage of 300 kV were set. The simulated HAADF-STEM intensity was obtained by integration of electrons scattered into the annular range between 55 mrad and 250 mrad after application of a detector-specific sensitivity curve [35].
The HAADF-STEM intensity further depends on the orientation of the crystal with respect to the electron beam. To account for this effect, various orientations for each material and phase of the crystal were simulated. To this end, the crystal was systematically tilted in nine equal steps from a [100]- towards a [010]-viewing direction. In addition, a random tilt was simulated. The final result is a data set with the HAADF-STEM intensity as a function of the sample thickness for TiO2 and WO3, each in two different crystal phases, each with ten orientations of the crystal with respect to the electron beam.

Generation of realistic STEM images
The third step combines the HAADF-STEM reference simulations described above with the virtual 3D hetero-aggregates. STEM images show 2D projections of the aggregates, see Figure 3. Therefore, the projections of the individual particles along one direction are computed. As usual in electron microscopy, the electron beam direction, and hence the projection direction, is referred to as the z-direction. This results in thickness maps for the individual particles. Using the reference simulations, these thickness maps are translated into maps of the HAADF-STEM intensities. To this end, for each particle, the reference simulation of the respective material was chosen in a random phase and a random orientation of the crystal with respect to the electron beam.
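For a spherical particle, the thickness map along the z-direction has a closed form: at lateral distance $s$ from the particle center, the beam traverses a chord of length $2\sqrt{r^2 - s^2}$ if $s < r$, and zero otherwise. A minimal sketch of such a thickness map on a pixel grid is given below; the pixel size, the grid layout (pixel centers at half-integer multiples of the pixel size) and the function name are assumptions made for illustration.

```python
import math

def thickness_map(center_xy, radius, pixel_size, shape):
    """Projected thickness (same length unit as the inputs) of a sphere along z.
    center_xy: lateral position of the sphere centre
    shape:     (rows, cols) of the pixel grid; pixel (i, j) is centred at
               ((j + 0.5) * pixel_size, (i + 0.5) * pixel_size)."""
    rows, cols = shape
    cx, cy = center_xy
    result = []
    for i in range(rows):
        y = (i + 0.5) * pixel_size
        row = []
        for j in range(cols):
            x = (j + 0.5) * pixel_size
            s_sq = radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2
            # chord length through the sphere, zero outside the projected disc
            row.append(2.0 * math.sqrt(s_sq) if s_sq > 0 else 0.0)
        result.append(row)
    return result
```

Translating such a map into HAADF-STEM intensities would then amount to a per-pixel lookup in the thickness-dependent reference simulation of the chosen material, phase and orientation, which is not reproduced here.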
In an aggregate which extends several tens of nanometers in z-direction, not all particles appear in focus. Only particles with centers located at height z = 0 nm are in focus, as the electron beam is focused on this plane. To account for this effect, each HAADF-STEM map of the individual particles is convolved with a Gaussian kernel. More precisely, for a particle located at height z, the standard deviation σ_STEM of the Gaussian kernel with which the corresponding HAADF-STEM map is convolved is chosen as σ_STEM = |z| · tan(β), where β = 21.1 mrad is the semi-convergence angle, assuming a conical beam shape. Then, the blurred HAADF-STEM maps of the individual particles are summed up to obtain the artificial HAADF-STEM image of the hetero-aggregate. Finally, shot noise according to a typical electron dose of 149 electrons/Å² [23] and scan noise according to a typical beam displacement of 0.01 nm [17] were applied.
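The defocus blur and the dose scaling described above reduce to two short formulas, sketched here for illustration. The blur width follows σ_STEM = |z| · tan(β); the expected number of electrons per pixel is the dose multiplied by the pixel area. The pixel size and the truncation of the Gaussian kernel at three standard deviations are assumptions of this sketch and are not taken from the original work.

```python
import math

BETA_RAD = 21.1e-3  # semi-convergence angle of 21.1 mrad, in radians

def defocus_sigma_nm(z_nm, beta_rad=BETA_RAD):
    """Standard deviation (nm) of the Gaussian blur for a particle at height z (nm)."""
    return abs(z_nm) * math.tan(beta_rad)

def gaussian_kernel_1d(sigma_px):
    """Normalised 1D Gaussian kernel, truncated at 3 sigma (assumption).
    A separable 2D blur applies this kernel along rows and then along columns."""
    if sigma_px <= 0:
        return [1.0]  # particle in focus: no blur
    half_width = max(1, math.ceil(3.0 * sigma_px))
    weights = [math.exp(-0.5 * (k / sigma_px) ** 2) for k in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_electrons_per_pixel(dose_per_square_angstrom, pixel_size_nm):
    """Mean electron count per pixel; shot noise would be Poisson with this mean
    (scaled by the local normalised intensity)."""
    pixel_area_square_angstrom = (pixel_size_nm * 10.0) ** 2  # 1 nm = 10 angstrom
    return dose_per_square_angstrom * pixel_area_square_angstrom
```

For example, with a hypothetical pixel size of 0.1 nm, a particle 30 nm above or below the focal plane is blurred by a kernel whose standard deviation spans several pixels, whereas a particle at z = 0 nm is left unchanged.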

Statistical analysis and processing of simulated data
In this section, the need for the usage of neural networks is explained, addressing some problems connected with the reconstruction of preset values of the model parameters θ D f , θ ρ , θ 0 , θ 1 , based on virtual aggregates drawn from the stochastic 3D model. Further complications in this reconstruction task arise when using 2D STEM data instead of the full 3D geometry of the aggregates, through a loss of information. Therefore, it is explained how image processing methods can be used to simplify the extraction of information from STEM data.

Estimating the parameters of the stochastic 3D model
One of the challenges associated with predicting the parameters of the stochastic 3D model from STEM images is that some model parameters cannot be exactly recovered even from the 3D structure of the virtual aggregate from which the corresponding STEM image is determined. This is due to the simplifying assumptions made within the simulation process of the 3D model and its stochastic nature, see Section 2.1.2, resulting in empirical values of the model parameters slightly differing from the preset ones.
For example, the fractal dimension $D_f$ computed by means of Eq. (5) for a virtual hetero-aggregate $A$ might differ from the preset value of the model parameter $\theta_{D_f}$. More precisely, in the simulation of hetero-aggregates, the radii $r_1, \dots, r_N$ of particles considered in Eq. (5) are replaced by their arithmetic mean $(r_1 + \dots + r_N)/N$, see Section 2.1.2. Also, the mixing ratio $\rho$ of an aggregate $A$ computed by means of Eq. (3) can deviate from the model parameter $\theta_\rho$, due to the randomly chosen labels of primary clusters, modeled by the Bernoulli-distributed random variables $L_1, \dots, L_{N'}$. For example, the first aggregate in Figure 2 has an expected (preset) mixing ratio of $\theta_\rho = 0.3$, but the actual mixing ratio computed from Eq. (3) is $\rho = 9/41 \approx 0.22$. Recall that, in order to distinguish between these quantities, the vector of model parameters used to generate $A$ is referred to as $\theta = (\theta_{D_f}, \theta_\rho, \theta_0, \theta_1)$, while $D_f(A)$ and $\rho(A)$ describe the empirical fractal dimension and mixing ratio of the aggregate $A$, computed from Eqs. (5) and (3), respectively.
Figure 4 illustrates the discrepancy between preset model parameters and the empirical fractal dimension and mixing ratio of virtual aggregates, computed from Eqs. (5) and (3), respectively. Nevertheless, Figure 4 indicates that, on average, the structural descriptors $D_f$ and $\rho$ coincide well with the preset model parameters $\theta_{D_f}$ and $\theta_\rho$. Therefore, rather than attempting to determine the model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ from a single aggregate $A$, a family $B = \{A_1, \ldots, A_\nu\}$ of $\nu > 1$ aggregates is used instead, called a batch in the following. It is expected that a larger batch size yields more accurate results, but at an increased cost.

Figure 4: Empirical fractal dimension (a) and mixing ratio (b) of virtual aggregates in dependence on the preset model parameters $\theta_{D_f}$ and $\theta_\rho$, computed from Eqs. (5) and (3), respectively. The other model parameters were chosen at random from their respective ranges, as introduced in Section 2.1.2. To improve clarity, kernel density estimation was used to assign colors to the scatter points computed for 19 440 realizations of the stochastic 3D model.
We were not able to find any scalar features that can be used to predict the model parameters $\theta_0$ and $\theta_1$ associated with the cluster sizes used in the generation of virtual aggregates, i.e., in the cluster-cluster model introduced in Section 2.1.1. For example, in order to predict the model parameter $\theta_0$, an obvious choice for such a scalar feature would be the average size of observable WO3 clusters, where an observable cluster is an inclusion-maximal homogeneous subset $C \subset A$ of an aggregate $A$, i.e., there is no larger homogeneous subset $C' \subset A$ such that $C \subset C'$. These clusters can differ from the primary clusters used in the construction algorithm described in Section 2.1.1. Specifically, the observable clusters are formed by unions of primary clusters and, in contrast to the latter, are recognizable in the 3D data, see Figure 2 for a visualization.
However, this average (observable) cluster size cannot be used to predict $\theta_0$. Figure 5 shows that there are various specifications of model parameters that differ in $\theta_0$ and, nevertheless, yield similar average cluster sizes of WO3 particles. The prediction of the model parameter vector $\theta$ is further complicated by the fact that only the 2D STEM image can be used, which does not fully capture the 3D morphology of $A$.
To predict the preset vector of model parameters $\theta$ from a family $B$ of aggregates using only their simulated STEM images, CNNs are first used to extract relevant features from these images. These features are subsequently used to predict the preset model parameters, see the schematic description of this workflow in Figure 1b. While the process of extracting features from the STEM images remains largely the same for all model parameters, the computation of the estimators for $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ differs considerably, see Sections 2.3.2-2.3.5 below. For instance, when estimating the model parameters $\theta_{D_f}$ and $\theta_\rho$, the features computed from a STEM image $I$ of an aggregate $A$ are scalar values that approximate $D_f(A)$ and $\rho(A)$. Then, the arithmetic means of the respective image-wise features of a family $B = \{A_1, \ldots, A_\nu\}$ of aggregates are used as estimators $\hat\theta_{D_f}$ and $\hat\theta_\rho$ for $\theta_{D_f}$ and $\theta_\rho$. In contrast, when predicting the model parameters $\theta_0$ and $\theta_1$, a neural network is employed to identify high-dimensional features from which the estimators $\hat\theta_0$ and $\hat\theta_1$ for $\theta_0$ and $\theta_1$ are computed, see Section 2.3 below.

Data processing and augmentation
Various common image processing methods are used to simplify the extraction of information from STEM images. In particular, the pixel intensity values of STEM images are linearly scaled to the entire range of [−0.5, 0.5] and rounded to 256 equidistant values in order to achieve faster convergence to a lower error during the training process. More specifically, the scaling centers the pixel intensity values around zero [32], whereas the rounding reduces the noise of the images. Note that this procedure is performed on all STEM images, even if not explicitly mentioned, whereas the subsequent preprocessing steps are only applied during training.
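The scaling and rounding step described above can be sketched as follows. This is a minimal pure-Python illustration operating on a nested list; the function name and the handling of constant images are assumptions, not part of the original workflow.

```python
def scale_and_quantize(image, levels=256):
    """Linearly scale pixel values to [-0.5, 0.5] and round them to `levels` equidistant values."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: a constant image has no defined scaling, so it is rejected.
        raise ValueError("constant image cannot be scaled to the full range")
    result = []
    for row in image:
        new_row = []
        for p in row:
            k = round((p - lo) / (hi - lo) * (levels - 1))  # grid index 0..levels-1
            new_row.append(k / (levels - 1) - 0.5)
        result.append(new_row)
    return result


processed = scale_and_quantize([[0.0, 5.0], [10.0, 2.5]])
```

The smallest and largest input intensities are mapped to −0.5 and 0.5, and every other pixel lands on one of the 256 grid values in between.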
Overfitting is a common problem where neural networks achieve good results on training data but perform poorly on previously unseen data. It can occur when the model learns irrelevant information within the dataset, fits too closely to the training set and, as a result, fails to generalize to new data. To address this issue, augmentation of training data is used. In the context of the present paper, this means that the input data is randomly modified in each training step, such that the network is provided with input data which differs from the input data of previous steps. Therefore, a significantly larger number of training steps can be conducted while still providing the neural network with novel training data in each step, which reduces the risk of overfitting.
Note that there is a wide variety of methods for modifying input data which are commonly used in training data augmentation, e.g., rotation, reflection, radial transformation, elastic distortion [37] and random erasing [36]. However, in order to preserve certain structural descriptors of aggregates observed in image data, like shape and size descriptors of particles, only random rotations, reflections and small displacements are used for training data augmentation.
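A minimal sketch of such a descriptor-preserving augmentation is given below. It is restricted to rotations by multiples of 90 degrees and to cyclic pixel shifts so that it needs no interpolation; both restrictions, as well as the function names and the maximum shift, are assumptions made for this illustration and may differ from the augmentation actually used.

```python
import random


def rot90(image):
    """Rotate a 2D list-of-lists image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def flip_horizontal(image):
    """Reflect the image along its vertical axis."""
    return [row[::-1] for row in image]


def shift(image, dy, dx):
    """Displace the image cyclically by dy rows and dx columns."""
    height = len(image)
    rows = [image[(i - dy) % height] for i in range(height)]
    return [[row[(j - dx) % len(row)] for j in range(len(row))] for row in rows]


def augment(image, rng, max_shift=2):
    """Apply a random rotation, an optional reflection and a small displacement."""
    out = [list(row) for row in image]
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return shift(out, dy, dx)


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
augmented = augment(image, random.Random(0))
```

All three operations only rearrange pixels, so particle shapes and sizes in the image are left unchanged, in contrast to elastic distortions or random erasing.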

CNN-based approach for the prediction of model parameters
The goal of this section is to introduce the CNN-based methodology for predicting the model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ of the hetero-aggregate model from (simulated) STEM images. Due to computational constraints, it was not feasible to generate the required number of aggregates for each possible preset of the model parameters. Therefore, in order to ensure robust training, the focus was on generating 100 aggregates for each triple $(\theta_\rho, \theta_0, \theta_1) \in \{0.1, \ldots, 0.9\} \times \{1, \ldots, 6\} \times \{1, \ldots, 6\}$, as these parameters exhibited interactive effects that were crucial for our study. More specifically, for each such triple, two values of $\theta_{D_f}$ were chosen at random from $\{1.5, \ldots, 2.5\}$ and each resulting model parameter preset $(\theta_{D_f}, \theta_\rho, \theta_0, \theta_1)$ was used to generate 50 aggregates. After applying the STEM simulation described in Section 2.1, this results in a set $G = \{(A_i, I_i, \theta_i) : 1 \le i \le 32\,400\}$ of 32 400 triplets of 3D aggregates $A_i$, corresponding STEM images $I_i$, and vectors of preset model parameters $\theta_i = (\theta_{D_f,i}, \theta_{\rho,i}, \theta_{0,i}, \theta_{1,i})$. The set $G$ is thereafter split into two datasets, one for training and one for evaluation.
For both training and evaluation, batches

\[ B = \{(A_{i_1}, I_{i_1}, \theta_{i_1}), \ldots, (A_{i_\nu}, I_{i_\nu}, \theta_{i_\nu})\} \tag{9} \]

will be used for some $\nu > 1$, whose elements are generated by the same preset of model parameters, i.e., $\theta_{i_1} = \ldots = \theta_{i_\nu}$. To ensure the availability of such batches, the split of $G$ is done such that every model parameter configuration occurs at least 20 times both in the data used for training and in the data used for evaluation. These two datasets will be referred to by their respective index sets $T$ (for training) and $E$ (for evaluation), where $T \cup E = \{1, \ldots, 32\,400\}$ with $\#T = 19\,440$ and $\#E = 12\,960$.
In the following, it is explained how the triplets $(A_i, I_i, \theta_i)$ are used to generate pairs of image data and ground truth labels, which will be utilized for the training of the neural networks. First, general aspects of network architecture and training are presented and, then, some specifics regarding the prediction of each of the four model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ are given.
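The dataset sizes quoted above can be checked with a few lines of Python. The grid of triples and the numbers of aggregates are taken from the text; the per-preset split of 30 training and 20 evaluation aggregates is only one hypothetical split that is consistent with the stated totals and with the minimum of 20 occurrences per configuration, and it assumes that the two randomly chosen fractal dimension values of a triple differ.

```python
from itertools import product

rho_values = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, ..., 0.9
theta0_values = range(1, 7)                              # 1, ..., 6
theta1_values = range(1, 7)                              # 1, ..., 6

triples = list(product(rho_values, theta0_values, theta1_values))
n_triples = len(triples)                 # 9 * 6 * 6 triples
presets_per_triple = 2                   # two random values of the fractal dimension
aggregates_per_preset = 50

n_total = n_triples * presets_per_triple * aggregates_per_preset

# Hypothetical per-preset split (assumption, not stated in the text).
train_per_preset, eval_per_preset = 30, 20
n_train = n_triples * presets_per_triple * train_per_preset
n_eval = n_triples * presets_per_triple * eval_per_preset
```

Under these assumptions the totals reproduce the 32 400 triplets as well as the sizes 19 440 and 12 960 of the training and evaluation sets.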

Network architecture and training
The networks used to extract features are all based on the same basic network architecture, regardless of the model parameter being predicted. This network architecture consists of stacked convolutional layers with a kernel size of 3 × 3, batch normalization layers [15], the ReLU activation function, given by

\[ \mathrm{ReLU}(x) = \max\{0, x\} \quad \text{for } x \in \mathbb{R}, \]

and max pooling layers with a kernel size of 2 × 2, followed by fully connected layers. The basic architecture of the convolutional neural networks considered in the following has the form

\[ \mathrm{CNN} = g \circ f, \]

i.e., it is represented as the composition of two subnetworks, $f$ and $g$. The subnetwork $f$ consists of the convolutional part of the basic network architecture, a flatten layer and two dense layers with a final output dimension of 112. The subnetwork $g$ consists of two dense layers with a final output dimension of 1. A schematic representation of the network architecture is given in Figure 6 (left), whereas details regarding this architecture are provided in Table 3.
To achieve a high prediction quality, the parameters of the neural networks have to be fitted. This is done by supervised learning. More precisely, the dissimilarity between the ground truth, denoted as $y = (y_1, \ldots, y_n)$, and the network output $\hat y = (\hat y_1, \ldots, \hat y_n)$, i.e., $\hat y_1 = \mathrm{CNN}(x_1), \ldots, \hat y_n = \mathrm{CNN}(x_n)$ for some input $x = (x_1, \ldots, x_n)$, $n > 1$, is minimized. For example, when predicting the fractal dimension parameter $\theta_{D_f}$, the input $x$ of the network consists of STEM images $I_1, \ldots, I_n$ and the ground truth is given by the vector of fractal dimensions of the respective aggregates $A_1, \ldots, A_n$, i.e., $y = (D_f(A_1), \ldots, D_f(A_n))$. The comparison between the ground truth and the prediction is done in terms of the mean squared error (MSE), given by

\[ \mathrm{MSE}(y, \hat y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2. \tag{12} \]

The resulting loss $\mathrm{MSE}(y, \hat y)$ is minimized by a gradient descent method using the Adam optimizer [13,20] with a learning rate of 0.0001, where the value of $n$ in Eq. (12) determines the number of network evaluations before a step of the gradient descent method is applied. These evaluations are done on the training data, given by the index set $T$, where $n = 16$ when predicting $\theta_{D_f}$ or $\theta_\rho$, and $n = 8$ otherwise. The general network architecture described above and the prediction procedure will be slightly adapted for each of the four model parameters $\theta_{D_f}, \theta_\rho, \theta_0$ and $\theta_1$. In the following, detailed explanations will be provided regarding these parameter-specific adaptations.
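For illustration, the ReLU activation and the mean squared error used as training loss can be written in a few lines of plain Python. The example values below are hypothetical fractal dimensions and are not taken from the paper.

```python
def relu(x: float) -> float:
    """Rectified linear unit: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def mse(y_true, y_pred) -> float:
    """Mean squared error between ground truth values and network outputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground truth and predictions for n = 2 network evaluations.
loss = mse([1.8, 2.0], [1.7, 2.2])
```

Here the two squared deviations are 0.01 and 0.04, so the loss evaluates to 0.025 up to floating-point rounding.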

Fractal dimension
The fractal dimensions $D_f(A_{i_1}), \ldots, D_f(A_{i_\nu})$ of the aggregates $A_{i_1}, \ldots, A_{i_\nu}$ in a batch $B$, as introduced in Eq. (9), are typically symmetrically distributed around the preset value of $\theta_{D_f}$, which will be denoted by $\theta_{D_f}(B)$ in the following, see Figure 4a. Thus, the arithmetic mean

\[ \frac{1}{\nu} \sum_{j=1}^{\nu} D_f(A_{i_j}) \]

could be used as an estimator for $\theta_{D_f}(B)$. However, since the fractal dimensions $D_f(A_{i_1}), \ldots, D_f(A_{i_\nu})$ cannot be directly determined from the STEM images $I_{i_1}, \ldots, I_{i_\nu}$, approximations $\hat D_f(I_{i_1}), \ldots, \hat D_f(I_{i_\nu})$ are used instead. These approximations are computed by a convolutional neural network $\mathrm{CNN}_{D_f}$, where the STEM images $I_{i_1}, \ldots, I_{i_\nu}$ are used as input. Thus, finally, the estimator

\[ \hat\theta_{D_f}(B) = \frac{1}{\nu} \sum_{j=1}^{\nu} \hat D_f(I_{i_j}) = \frac{1}{\nu} \sum_{j=1}^{\nu} \mathrm{CNN}_{D_f}(I_{i_j}) \tag{13} \]

is obtained. The architecture of the neural network $\mathrm{CNN}_{D_f}$ coincides with the one described in Section 2.3.1. The activation function of the output layer is a scaled sigmoid function, which is a standard choice for neural networks with bounded outputs. More precisely, the activation function is given by $\gamma(x) = \frac{\alpha}{1 + e^{-x}} + \beta$ for $x \in \mathbb{R}$, where $\alpha = 1.4$ and $\beta = 1.3$ are selected to ensure that the network can represent the expected range of values for $D_f$, with added tolerances on each side of the expected range, see Figure 4a. Note that the input of the network during training consists of augmented versions $a(I_i)$ of the STEM images $I_i$ for $i \in T$, i.e., images that arise from $I_i$ by reflecting, rotating and displacing, as described in Section 2.2.2. The corresponding supervisory signal consists of the fractal dimensions of the corresponding aggregates. Hence, the network training is conducted using pairs $(a(I_i), D_f(A_i))$ for $i \in T$.
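The following sketch illustrates the scaled sigmoid output activation with the stated values 1.4 and 1.3, which restricts the network output to the open interval (1.3, 2.7), together with the batch-wise estimator formed as the arithmetic mean of image-wise predictions. The function names and the example predictions are hypothetical.

```python
import math


def scaled_sigmoid(x: float, alpha: float = 1.4, beta: float = 1.3) -> float:
    """Scaled sigmoid with values in (beta, beta + alpha), evaluated in a numerically stable way."""
    if x >= 0.0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return alpha * s + beta


def batch_estimate(image_wise_predictions) -> float:
    """Arithmetic mean of the image-wise predictions of one batch."""
    return sum(image_wise_predictions) / len(image_wise_predictions)


# Hypothetical image-wise predictions of the fractal dimension for a batch of size 4.
estimate = batch_estimate([1.9, 2.1, 2.0, 2.2])
```

With these values of the scaling parameters, the preset range from 1.5 to 2.5 of the fractal dimension is covered with a tolerance of 0.2 on each side.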

Mixing ratio
In Figure 4b, the distribution of the mixing ratio of aggregates in dependence on the model parameter $\theta_\rho$ is visualized. It shows that the mixing ratios $\rho(A_{i_1}), \ldots, \rho(A_{i_\nu})$ of aggregates $A_{i_1}, \ldots, A_{i_\nu}$ within a batch $B$, generated by the 3D model with a preset value of $\theta_\rho(B)$, follow a distribution whose mean is approximately equal to $\theta_\rho(B)$.
This suggests using a similar approach as described above in Section 2.3.2. However, note that there are some aggregates with a mixing ratio $\rho(A_i)$ equal to 0 or 1. A neural network with the architecture of $\mathrm{CNN}_{D_f}$ does not reflect these discrete values properly. Therefore, the prediction procedure for the mixing ratio is slightly modified by first classifying whether an image $I_i$ depicts an aggregate with a mixing ratio of exactly 0 or 1, using a classification network $\mathrm{CNN}_{\rho,\mathrm{class}}$, and afterwards predicting the mixing ratio of the corresponding aggregate, using a regression network $\mathrm{CNN}_{\rho,\mathrm{reg}}$. For this purpose, the networks $\mathrm{CNN}_{\rho,\mathrm{class}}$ and $\mathrm{CNN}_{\rho,\mathrm{reg}}$, having the same basic network architecture as described in Section 2.3.1 and a commonly used [39] unscaled sigmoid function $\gamma(x) = \frac{1}{1 + e^{-x}}$ for $x \in \mathbb{R}$ as activation function in the output layer, are trained for the respective tasks. The training of the regression network $\mathrm{CNN}_{\rho,\mathrm{reg}}$ is done on pairs $(a(I_i), \rho(A_i))$, $i \in T$, of augmented STEM images and corresponding ground truth mixing ratios, whereas the training of the classification network $\mathrm{CNN}_{\rho,\mathrm{class}}$ is done on pairs of augmented STEM images and corresponding binary class labels, where a class label of 0 or 1 identifies the corresponding aggregate as heterogeneous or homogeneous, respectively.
However, it is a well-known problem that classes of strongly different sizes can lead to poorly performing classifiers, since classifiers tend to neglect the underrepresented classes, which is known as the imbalance problem [29]. To address this issue, the augmented STEM images of homogeneous aggregates, which account for about 10% of all images, were oversampled in the training procedure of the classifier to achieve balanced classes.
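Finally, to predict the mixing ratio of an aggregate via its STEM image, the outputs of $\mathrm{CNN}_{\rho,\mathrm{class}}$, which identifies homogeneous aggregates, and $\mathrm{CNN}_{\rho,\mathrm{reg}}$, which determines the mixing ratio, are combined. More specifically, for a STEM image $I$, the predicted mixing ratio $\hat\rho(I)$ of the corresponding aggregate is given by

\[ \hat\rho(I) = \begin{cases} \eta\bigl(\mathrm{CNN}_{\rho,\mathrm{reg}}(I)\bigr) & \text{if } \eta\bigl(\mathrm{CNN}_{\rho,\mathrm{class}}(I)\bigr) = 1, \\ \mathrm{CNN}_{\rho,\mathrm{reg}}(I) & \text{otherwise,} \end{cases} \]

where $\eta : [0, 1] \to \{0, 1\}$ is the function that rounds a number $x \in [0, 1]$ to its closest integer $\eta(x) \in \{0, 1\}$. This results in the estimator $\hat\theta_\rho(B)$ for the preset model parameter $\theta_\rho(B)$ of a batch $B = \{(A_{i_1}, I_{i_1}, \theta_{i_1}), \ldots, (A_{i_\nu}, I_{i_\nu}, \theta_{i_\nu})\}$, given by

\[ \hat\theta_\rho(B) = \frac{1}{\nu} \sum_{j=1}^{\nu} \hat\rho(I_{i_j}). \tag{14} \]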
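One plausible reading of this combination of classifier and regressor is sketched below: if the rounded classifier output flags an image as homogeneous, the regression output is rounded to 0 or 1, and otherwise it is used unchanged. The function names, the tie-breaking of the rounding at 0.5 and the example network outputs are assumptions for illustration.

```python
def eta(x: float) -> int:
    """Round a number in [0, 1] to the closest integer in {0, 1} (ties go to 1 by assumption)."""
    return 1 if x >= 0.5 else 0


def predict_mixing_ratio(class_output: float, reg_output: float) -> float:
    """Combine classifier output (1 = homogeneous) and regression output for one image."""
    if eta(class_output) == 1:
        return float(eta(reg_output))  # homogeneous aggregate: mixing ratio is 0 or 1
    return reg_output                  # heterogeneous aggregate: keep regression value


def batch_mixing_ratio(outputs) -> float:
    """Mean of the image-wise predictions over a batch of (class_output, reg_output) pairs."""
    predictions = [predict_mixing_ratio(c, r) for c, r in outputs]
    return sum(predictions) / len(predictions)


# Hypothetical network outputs for a batch of three images.
batch_prediction = batch_mixing_ratio([(0.9, 0.03), (0.2, 0.35), (0.1, 0.25)])
```

Under this reading, a heterogeneous aggregate that is wrongly classified as homogeneous receives a prediction of exactly 0 or 1, so its error depends linearly on its true mixing ratio.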

Size of primary WO3 clusters
In the procedures for predicting the model parameters $\theta_{D_f}$ and $\theta_\rho$ described above, the process of determining an estimator involved the identification of a scalar feature that describes an aggregate property, namely the fractal dimension $D_f$ or the mixing ratio $\rho$, which is predominantly influenced by the corresponding model parameter. This scalar feature can be directly computed from the virtual 3D aggregates, and thus it can serve as ground truth for predicting it from the corresponding 2D STEM images. Consequently, using this scalar feature, formulas for estimating the model parameter from this property have been derived, see Eqs. (13) and (14). Since the model parameter $\theta_0$ is designed to control the cluster sizes of WO3 particles in the cluster-cluster aggregation model introduced in Section 2.1.1, such a property should relate to the number of connected WO3 particles. However, the sizes of observable clusters are not only influenced by $\theta_0$ but also by $\theta_\rho$. On the one hand, larger values of $\theta_0$ lead to larger primary clusters of label 0 and thus to larger observable clusters. On the other hand, lower values of $\theta_\rho$ lead to larger proportions of primary clusters of label 0. Therefore, it is more likely that two primary clusters that are in contact share the material label 0, and thus the expected size of observable clusters of label 0 increases, see Figure 5.
This makes the average observable cluster size on its own an unsuitable property for estimating the model parameter $\theta_0$. Therefore, one has to search for another feature that is functionally related to $\theta_0$. Additionally, a functional relationship that suitably maps features derived from STEM images to an estimator of $\theta_0$ may not be captured solely by an average, necessitating the search for another suitable function. However, these two steps can be quite complex and time-consuming if done heuristically. To address this, a data-driven approach utilizing a neural network is adopted. This approach makes it possible to determine both the feature vectors and the formula that relates them to the corresponding model parameter $\theta_0$. More specifically, the identification of relevant features is conducted by means of part $f$ of the basic network architecture described in Section 2.3.1. The subnetwork $f$ is applied to all images in a batch individually, and the concatenated results are then used as input of part $g$ of the basic network architecture, which is in charge of determining the relationship between the feature vectors determined by $f$ and the model parameter $\theta_0$. In detail, this results in a modified network, denoted as $\mathrm{CNN}_0$, which is given by

\[ \mathrm{CNN}_0(I(B)) = g\bigl(f(I_{i_1}), \ldots, f(I_{i_\nu})\bigr), \tag{15} \]

where $I(B) = \{I_{i_1}, \ldots, I_{i_\nu}\}$ denotes the STEM images corresponding to the aggregates $A_{i_1}, \ldots, A_{i_\nu}$ in a batch $B$. Referring to Table 3, the feature vectors of STEM images up to the output of layer 17 are computed as before. Then, the feature vectors of a batch are concatenated and used as input of layer 18.
The final output layer uses a ReLU transfer function. The modified network architecture is illustrated on the right-hand side of Figure 6. Note that the approach described above differs from the commonly used technique where a network, denoted as $f'$, takes multi-channel input data, i.e., in our case $f'(I_{i_1}, \ldots, I_{i_\nu})$. Such an approach allows the network to detect spatially resolved interdependencies among the images. In contrast, the approach considered in Eq. (15) employs identical CNNs $f$ for dimensionality reduction and feature extraction on each input channel individually. This ensures uniform feature extraction for every input image while also reducing the number of trainable parameters in the CNN. The choice of this approach is motivated by the fact that each image within a batch a priori contains the same information regarding the underlying model parameters, and that there is no spatial interdependence between the images which would be relevant for the prediction of model parameters.
Due to the problem-specific architecture of the network $\mathrm{CNN}_0$, the training data no longer consists of pairs of individual images and corresponding ground truths. Instead, for each batch $B$, the training pair $(\{a(I_{i_1}), \ldots, a(I_{i_\nu})\}, \theta_0(B))$ consists of a corresponding batch of augmented images and the underlying model parameter $\theta_0(B)$.
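The structure of this batch-wise network, in which one shared feature extractor is applied to each image and the concatenated feature vectors are passed to a common head, can be sketched independently of any deep learning framework. In the snippet below, `toy_f` and `toy_g` are hypothetical stand-ins for the trained subnetworks and only serve to make the data flow explicit.

```python
def batch_network(f, g, images):
    """Apply the shared extractor f to every image, concatenate the features, then apply g."""
    features = []
    for image in images:
        features.extend(f(image))  # the same f (shared weights) is used for each image
    return g(features)


def toy_f(image):
    """Hypothetical stand-in for the convolutional part: mean and maximum intensity."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]


def toy_g(features):
    """Hypothetical stand-in for the dense head with a ReLU output."""
    return max(0.0, sum(features) / len(features))


batch = [[[0.0, 1.0], [1.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
prediction = batch_network(toy_f, toy_g, batch)
```

Since the extractor is shared, its number of trainable parameters does not grow with the batch size; only the input dimension of the head scales with the number of images.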

Size of primary TiO2 clusters
The method used to predict the model parameter $\theta_1$ for the cluster size of TiO2 particles is similar to the approach described in the previous section. However, since the pixel intensity values of TiO2 particles in the STEM images closely resemble the background and are significantly lower than those of WO3 particles, they are considerably more difficult to differentiate by visual inspection. Consequently, it is plausible that a neural network also encounters difficulties in tasks which depend on the identification of TiO2 particles. As shown in Figure 7, the neural network $\mathrm{CNN}_1$ achieves unsatisfactory results when using unadjusted image data, which may be due to the difficulty mentioned above. To address this issue, the intensity value $p > 0$ of non-background pixels in the STEM images is replaced by its multiplicative inverse $p_{\mathrm{modified}}$, i.e., for some threshold $t > 0$ the modified pixel value is given by

\[ p_{\mathrm{modified}} = \frac{1}{p} \quad \text{for pixels with } p > t. \]

This procedure is applied to all STEM images used in the prediction of $\theta_1$ before the preprocessing steps described in Section 2.2.2 are applied. The highlighting effect of this adjustment of pixel intensity values is shown in Figure 8.
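A minimal sketch of this intensity adjustment is given below. It assumes that pixels with intensity at most equal to the threshold are background and are left unchanged, which the text does not state explicitly; the function name and the example values are hypothetical as well.

```python
def invert_foreground(image, t):
    """Replace each non-background intensity p > t by 1/p; other pixels are kept (assumption)."""
    if t <= 0:
        raise ValueError("threshold t must be positive")
    return [[1.0 / p if p > t else p for p in row] for row in image]


# Hypothetical image: bright pixel (2.0), dim pixel (0.5) and two background pixels.
adjusted = invert_foreground([[0.0, 0.5], [2.0, 0.01]], t=0.1)
```

After the inversion, the dim pixel has a higher value than the bright one, which mirrors the intended highlighting of weakly scattering particles.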

Results
In this section, the results of the analysis on various aspects of model parameter prediction are presented. To ensure that these results accurately represent the generalization capability of the trained neural networks, all evaluations were conducted on data not used during training. More specifically, recall that the data corresponding to the index set $T$ is used for training, whereas the data corresponding to the index set $E$ is used to evaluate results, see Section 2.3 for details on the training-test split. As a prelude to the main findings, the impact of batch size on prediction quality is first assessed for all four model parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$. For that purpose, Figure 9 illustrates how the batch size affects the quality of the predictions with respect to the mean absolute error (MAE), defined as

\[ \mathrm{MAE}(y, \hat y) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat y_i|, \tag{16} \]

where $n > 0$ is the number of predictions and $\hat y = (\hat y_i)_{i=1,\ldots,n}$ are the predictions of the ground truth values $y = (y_i)_{i=1,\ldots,n}$. Note that the mean absolute error given in Eq. (16) is more robust to outliers and yields more easily interpretable values compared to the mean squared error considered in Section 2.3. As expected, it can be observed that larger batch sizes lead to better predictions. However, no significant improvement is observed for batch sizes exceeding 10. Thus, the results presented below, which were computed with a fixed batch size of $\nu = 12$, can be considered representative for the presented methodology.

Fractal dimension
The accuracy of the estimator $\hat\theta_{D_f}$ for $\theta_{D_f}$ depends on two key properties. First, the error of the single STEM image predictions $\mathrm{CNN}_{D_f}(I)$ should be centered around zero, since otherwise a bias would be propagated through the averaging procedure and therefore bias the estimator $\hat\theta_{D_f}$, see Eq. (13). Second, the variance of the single image prediction error should be low, so that a low-variance estimator can be achieved even with a small batch size $\nu$.
In Figure 10a, the error of the predicted fractal dimension $\hat D_f$ is shown. As desired, the error of the network output exhibits a small absolute bias and a low variance, as indicated by a mean value of −0.006 and an interquartile range of 0.118. As the network output is a suitable basis for predicting the model parameter $\theta_{D_f}$, the estimator $\hat\theta_{D_f}$ achieves an MAE of 0.041, see Figure 10b. The network tends to slightly overestimate the fractal dimension of the depicted aggregates for small preset values of $\theta_{D_f}$ and to underestimate it for large ones. This behavior is even more pronounced in the estimator $\hat\theta_{D_f}$. For a possible explanation of this trend, see Section 4 below.

Mixing ratio
To evaluate the accuracy of the estimator for the model parameter $\theta_\rho$, the straightforward image-wise case is considered first, where the output $\mathrm{CNN}_{\rho,\mathrm{reg}}(I)$ of the regression network is used as an estimator for the mixing ratio $\rho(A)$, without considering the classification network. As shown in Figure 11a, the output $\mathrm{CNN}_{\rho,\mathrm{reg}}(I)$ of the regression network exhibits a relatively high bias for aggregates $A$ with $\rho(A) \in [0, 0.1]$ or $\rho(A) \in [0.9, 1]$, with biases of about 0.04 and −0.1, respectively.
To address this issue, a procedure which utilizes an additional classification network $\mathrm{CNN}_{\rho,\mathrm{class}}$ is presented in Section 2.3.3. In Figure 11b, the resulting image-wise error of $\hat\rho$ using this procedure is shown. It is evident that the error of $\hat\rho$ is significantly reduced for homogeneous aggregates. More precisely, the bias of $\hat\rho(I)$ for aggregates $A$ with $\rho(A) \in [0, 0.1]$ or $\rho(A) \in [0.9, 1]$ decreases to about 0.009 and 0.02, respectively. Incorporating the additional network, the MAE of the image-wise predicted mixing ratio $\hat\rho(I)$ of an aggregate $A$ decreases from 0.059 to 0.053. Consequently, the MAE of the batch-wise prediction $\hat\theta_\rho$ of $\theta_\rho$ improves significantly, from 0.027 to 0.017.
Note that the diagonally arranged points in Figure 11b are due to a small number of falsely classified heterogeneous aggregates, whereas the significantly thinned vertical lines are due to correctly classified homogeneous aggregates.

Table 4: Confusion matrix of the homogeneous-heterogeneous classification task. About 95% of the aggregates are correctly classified.

Sizes of primary WO3 clusters and primary TiO2 clusters
Figure 12a shows the difference between the network output $\mathrm{CNN}_0(I(B))$ for the STEM images $I(B) = \{I_{i_1}, \ldots, I_{i_\nu}\}$ corresponding to the aggregates $A_{i_1}, \ldots, A_{i_\nu}$ in a batch $B$ (given in Eq. (15), i.e., prior to rounding of the output, which would result in the estimator $\hat\theta_0$) and the preset value $\theta_0(B)$ of the model parameter $\theta_0$. Figure 12b shows the error distribution of $\hat\theta_0$ after rounding, where in about 48% of all cases the value of $\hat\theta_0$ coincides with $\theta_0$. Additionally, in more than 92% of the cases, the error of $\hat\theta_0$ is less than or equal to 1. Although the largest mean absolute error occurs in the case of $\theta_0 = 6$, the resulting inaccuracy corresponds to an average relative error of about 20%. The quality of the estimator $\hat\theta_1$ introduced in Section 2.3.5 is similar to that of $\hat\theta_0$, see Figure 13. After rounding the output of the network $\mathrm{CNN}_1$, 32% of the predictions coincided with the preset values of $\theta_1$. In about 82% of the cases, an error less than or equal to 1 occurred. The mean absolute error for $\theta_1 = 6$ is equal to 1.44, where the resulting inaccuracy corresponds to an average relative error of about 24%.

Further structural descriptors of hetero-aggregates
Recall that the goal of the method presented in this paper is to generate realistic digital shadows of hetero-aggregates in 3D, solely from observations provided by 2D STEM images of the aggregates. For that purpose, the parameters $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ of the stochastic 3D model introduced in Section 2.1.1 are predicted in order to specify the model configuration with which to generate digital shadows. However, so far, only the accuracy of the predictors $\hat\theta_{D_f}, \hat\theta_\rho, \hat\theta_0, \hat\theta_1$ for $\theta_{D_f}, \theta_\rho, \theta_0, \theta_1$ was evaluated, without investigating further structural descriptors of hetero-aggregates in order to evaluate the structural similarity between the resulting digital shadows and the original hetero-aggregates, i.e., the aggregates which were used for predicting the model parameters. Moreover, many structural properties of the digital shadows are influenced by multiple model parameters, and thus, evaluating the quality of the four predictors $\hat\theta_{D_f}, \hat\theta_\rho, \hat\theta_0, \hat\theta_1$ separately is not sufficient. Therefore, three further structural descriptors, which characterize the 3D morphology of hetero-aggregates and have not yet been considered in this paper, are investigated in order to assess the similarity between original aggregates and corresponding digital shadows, see also Figure 1c-d.

Average cluster size and coordination numbers
The average cluster size $S_{\mathrm{TiO_2}}(A)$ of TiO2 particles of an aggregate $A = \{p_i = (x_i, r_i, l_i) : 1 \le i \le N\}$ describes the average cardinality of clusters of connected TiO2 particles in $A$. It is given by

\[ S_{\mathrm{TiO_2}}(A) = \frac{1}{\# C_{\mathrm{TiO_2}}(A)} \sum_{C \in C_{\mathrm{TiO_2}}(A)} \# C, \]

where $C_{\mathrm{TiO_2}}(A)$ denotes the set of all TiO2 clusters in $A$. While the value of $S_{\mathrm{TiO_2}}(A)$ is primarily influenced by the preset values of $\theta_1$ and $\theta_\rho$, the value of $\theta_0$ also has some (minor) influence on $S_{\mathrm{TiO_2}}(A)$ through its appearance in the definition of the Bernoulli-distributed labels $L_k$ of the stochastic 3D model, see Section 2.1.1. Furthermore, the so-called average hetero-coordination number $Z_{\mathrm{hetero}}(A)$ of an aggregate $A$ is considered, which is given by

\[ Z_{\mathrm{hetero}}(A) = \frac{1}{\# A} \sum_{p_i \in A} \# \{ p_j \in A : p_j \text{ is in contact with } p_i,\ l_j \neq l_i \}, \]

where $\# A \,(= N)$ is the total number of particles in $A$. Thus, $Z_{\mathrm{hetero}}(A)$ is the average number of contacts of particles in $A$ with particles of the other material. Finally, the average coordination number $Z_{\mathrm{total}}(A)$, given by

\[ Z_{\mathrm{total}}(A) = \frac{1}{\# A} \sum_{p_i \in A} \# \{ p_j \in A \setminus \{p_i\} : p_j \text{ is in contact with } p_i \}, \tag{19} \]

is considered, which is the average number of contacts of particles in $A$ with other particles, regardless of their material. Since the number of contacts of a particle within an aggregate $A$ strongly depends on the shape of $A$, the model parameter $\theta_{D_f}$ significantly influences the values of the descriptors $Z_{\mathrm{hetero}}(A)$ and $Z_{\mathrm{total}}(A)$. Further, $Z_{\mathrm{hetero}}(A)$ tends to increase as $\theta_\rho$ approaches 0.5 and as the primary cluster sizes determined by $\theta_0$ and $\theta_1$ decrease.
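The three descriptors can be computed directly from a list of particles, as the following sketch shows. Each particle is represented as a tuple of center, radius and material label; the contact criterion (center distance not larger than the sum of the radii, up to a small tolerance), the use of label 1 for TiO2 and all function names are assumptions made for this illustration.

```python
import math
from itertools import combinations


def contacts(particles, tol=1e-9):
    """Index pairs (i, j) of touching spheres; the contact criterion is an assumption."""
    pairs = []
    for (i, (xi, ri, _)), (j, (xj, rj, _)) in combinations(enumerate(particles), 2):
        if math.dist(xi, xj) <= ri + rj + tol:
            pairs.append((i, j))
    return pairs


def coordination_numbers(particles):
    """Return (Z_hetero, Z_total); every contact is counted once for each of its two particles."""
    pairs = contacts(particles)
    n = len(particles)
    hetero = sum(1 for i, j in pairs if particles[i][2] != particles[j][2])
    return 2 * hetero / n, 2 * len(pairs) / n


def average_cluster_size(particles, label):
    """Average size of connected clusters of particles carrying the given label."""
    members = [i for i, p in enumerate(particles) if p[2] == label]
    parent = {i: i for i in members}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in contacts(particles):
        if i in parent and j in parent:
            parent[find(i)] = find(j)
    sizes = {}
    for i in members:
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(sizes.values()) / len(sizes) if sizes else 0.0


# Chain of four touching unit spheres with labels 1, 1, 0, 1 (1 = TiO2 by assumption).
chain = [((0.0, 0.0, 0.0), 1.0, 1), ((2.0, 0.0, 0.0), 1.0, 1),
         ((4.0, 0.0, 0.0), 1.0, 0), ((6.0, 0.0, 0.0), 1.0, 1)]
z_hetero, z_total = coordination_numbers(chain)
s_tio2 = average_cluster_size(chain, label=1)
```

In this toy chain there are three contacts, two of which connect different materials, and the particles with label 1 form one cluster of size two and one of size one.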

Comparison of original hetero-aggregates and their digital shadows
To evaluate the quality of the predictor $\hat\theta = (\hat\theta_{D_f}, \hat\theta_\rho, \hat\theta_0, \hat\theta_1)$ in terms of the structural descriptors introduced in Section 3.4.1, 50 configurations of $\theta = (\theta_{D_f}, \theta_\rho, \theta_0, \theta_1)$ were selected at random from the evaluation data given by the index set $E$. For each of these numerical specifications of $\theta$, 800 new aggregates $A_1, \ldots, A_{800}$ were drawn from the corresponding stochastic 3D model, and their structural descriptors $S_{\mathrm{TiO_2}}(A_i)$, $Z_{\mathrm{hetero}}(A_i)$ and $Z_{\mathrm{total}}(A_i)$ for $i \in \{1, \ldots, 800\}$ were computed. Furthermore, for each case, the (preset) ground-truth parameter vector $\theta$ has been estimated using the methods explained in Section 2.3. Then, for each of the 50 resulting predictions $\hat\theta$, 800 additional aggregates $A'_1, \ldots, A'_{800}$ were generated and their structural descriptors $S_{\mathrm{TiO_2}}(A'_i)$, $Z_{\mathrm{hetero}}(A'_i)$ and $Z_{\mathrm{total}}(A'_i)$ for $i \in \{1, \ldots, 800\}$ were computed. Figure 14 visualizes the distributions of these structural descriptors for four numerical specifications of $\theta$, where the aggregates $A_1, \ldots, A_{800}$ and $A'_1, \ldots, A'_{800}$ were generated using either the preset parameter vector $\theta$ (blue) or its prediction $\hat\theta$ (orange), respectively.

Figure 14: Distribution of the structural descriptors $S_{\mathrm{TiO_2}}(A_i)$ and $S_{\mathrm{TiO_2}}(A'_i)$ (left column), $Z_{\mathrm{hetero}}(A_i)$ and $Z_{\mathrm{hetero}}(A'_i)$ (middle column), as well as $Z_{\mathrm{total}}(A_i)$ and $Z_{\mathrm{total}}(A'_i)$ (right column) of the original aggregates $A_i$ (blue) and their digital shadows $A'_i$ (orange), for four numerical specifications of $\theta$ and their predictions $\hat\theta$. For computing the histograms, 20 equidistant bins have been employed which span the entire range of respective values on the x-axis.
Note that the gaps in the histograms of the average coordination numbers $Z_{\mathrm{total}}(A_i)$ and $Z_{\mathrm{total}}(A'_i)$ (right column) are due to the limited size of the considered aggregates, see Section 2.1.2. More specifically, the average coordination numbers $Z_{\mathrm{total}}(A_i)$ and $Z_{\mathrm{total}}(A'_i)$ given in Eq. (19), of aggregates $A_i, A'_i$ with sizes smaller than or equal to 80, can only take values in the set

\[ H = \left\{ \frac{2 q_1}{q_2} : q_1 \in \{0, 1, 2, \ldots\},\ q_2 \in \{1, \ldots, 80\} \right\}, \tag{20} \]

where $H \cap (1.975, 2) = \emptyset$ because of the limited denominator $q_2$ on the right-hand side of Eq. (20).
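The gap between 1.975 and 2 can be verified by exact enumeration. The sketch below assumes that the average coordination number of an aggregate with N particles and K contacts equals 2K/N, since every contact is counted for both of its particles; the function name is hypothetical.

```python
from fractions import Fraction


def values_in_gap(max_size=80):
    """List all (K, N) with N <= max_size for which 2K/N lies strictly between 1.975 and 2."""
    lower, upper = Fraction(79, 40), Fraction(2)  # 79/40 = 1.975 exactly
    hits = []
    for n in range(1, max_size + 1):
        # K >= N yields values of at least 2, so only K <= N needs to be checked.
        for k in range(0, n + 1):
            if lower < Fraction(2 * k, n) < upper:
                hits.append((k, n))
    return hits


gap_for_80 = values_in_gap(80)
gap_for_81 = values_in_gap(81)
```

For aggregates with at most 80 particles no attainable value falls into the gap, whereas allowing 81 particles would already produce the value 160/81, which is approximately 1.9753.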
Furthermore, note that the predictor θ for θ displayed in the top row of Figure 14 has a much smaller mean absolute error than the one displayed in the second row.Nevertheless, the latter (blue and orange) histograms show a higher agreement than those in the top row of Figure 14.Meaning that a high degree of similarity (in terms of MAE) of θ and θ does not necessarily imply a high degree of similarity of the resulting descriptor distributions.
We quantitatively analyzed this discrepancy between the distributions of the structural aggregate descriptors resulting from the preset configuration of model parameters and their prediction. For that purpose, the absolute difference of the means of these pairs of distributions was computed. For example, the mean values of $S_{\mathrm{TiO_2}}(A_i)$ and $S_{\mathrm{TiO_2}}(A'_i)$ (vertical lines) in the top row of Figure 14 are equal to 5.38 and 11.00 for the preset parameter vector $\theta$ and its prediction $\hat{\theta}$, respectively. This results in an absolute error of 5.62. Over all 50 pairs of $\theta$ and $\hat{\theta}$, an MAE of 2.165 is achieved, see also Table 5, where the MAEs for all three structural descriptors considered in this section are given, as well as the corresponding coefficient of determination $R^2$ defined as

$$R^2 = 1 - \frac{\sum_{j=1}^{50} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{50} (y_j - \bar{y})^2}.$$

Here, the vectors $y = (y_1, \ldots, y_{50}), \hat{y} = (\hat{y}_1, \ldots, \hat{y}_{50}) \in \mathbb{R}^{50}$ consist of the mean values of the distributions of the given aggregate descriptor computed for the 50 preset specifications of $\theta$ and their predictions $\hat{\theta}$. More precisely, for $j \in \{1, \ldots, 50\}$,

$$y_j = \frac{1}{800} \sum_{i=1}^{800} \gamma(A_{ij}) \quad \text{and} \quad \hat{y}_j = \frac{1}{800} \sum_{i=1}^{800} \gamma(A'_{ij}),$$

where $\gamma$ stands for either $S_{\mathrm{TiO_2}}$, $Z_{\mathrm{hetero}}$ or $Z_{\mathrm{total}}$, and $A_{ij}$, $A'_{ij}$ denote the $i$-th aggregate drawn from the $j$-th specification of $\theta$ and its prediction $\hat{\theta}$, respectively. Furthermore, $\bar{y} = \frac{1}{50} \sum_{j=1}^{50} y_j$.
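The two discrepancy measures can be computed from the per-specification means of a descriptor as in the following minimal sketch. It assumes the standard definition of the coefficient of determination (one minus the ratio of residual to total sum of squares). All function names and the toy values are hypothetical; they are not the data of the paper.

```python
from statistics import fmean

def descriptor_means(values_per_spec):
    """Mean descriptor value for each parameter specification."""
    return [fmean(values) for values in values_per_spec]

def mean_absolute_error(y, y_hat):
    """Mean absolute difference between paired means."""
    return fmean([abs(a - b) for a, b in zip(y, y_hat)])

def r_squared(y, y_hat):
    """Coefficient of determination (standard definition, assumed here)."""
    y_bar = fmean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Toy descriptor values: 4 specifications with 2 aggregates each
# (the paper uses 50 specifications with 800 aggregates each).
preset = [[1.0, 1.0], [2.0, 2.0], [2.0, 4.0], [4.0, 4.0]]
predicted = [[1.5, 1.5], [2.0, 2.0], [2.0, 3.0], [4.0, 4.0]]

y = descriptor_means(preset)
y_hat = descriptor_means(predicted)
mae = mean_absolute_error(y, y_hat)
r2 = r_squared(y, y_hat)
print(mae, r2)
```

For a single pair of means, the absolute error is simply the absolute difference, e.g. |5.38 - 11.00| = 5.62 for the example from the top row of Figure 14.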

Discussion
The analysis of image data in order to determine the fractal dimension of finite aggregates has been a popular approach for some time. Two commonly used methods for this purpose are the box-counting and sandbox methods, which are relatively simple image analysis tools [2]. These methods can provide meaningful structural information, but the quality of the results is highly dependent on the quality of the images. Specifically, high contrast and resolution are necessary to obtain clear STEM images from which accurate structural information can be extracted. However, in cases where a high fractal dimension is present, i.e., $D_f(A) > 2$, these classical methods have to be adapted to avoid problems with geometric opacity. There are attempts to solve this problem under certain conditions, see [26]. Although this difficulty can be observed in the slightly decreasing accuracy for values of $D_f > 2.2$ obtained by the CNN-based approach proposed in the present paper, a satisfactory accuracy was achieved even for high fractal dimensions, as shown in Figure 10a. Furthermore, the CNN approach works well independently of the aggregate size, see Figure 15a.
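For illustration, the box-counting idea mentioned above can be sketched as follows: the image is covered with boxes of decreasing side length, the occupied boxes are counted, and the dimension is estimated as the slope of log(count) against log(1/size). This is a generic textbook sketch, not the implementation used in [2]; the function names and the toy pixel sets are hypothetical.

```python
import math

def box_count(pixels, box_size):
    """Number of boxes of side box_size containing at least one foreground pixel."""
    return len({(x // box_size, y // box_size) for x, y in pixels})

def box_counting_dimension(pixels, box_sizes):
    """Least-squares slope of log(box count) versus log(1 / box size)."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(pixels, s)) for s in box_sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Toy foreground sets given as pixel coordinates
filled_square = {(x, y) for x in range(64) for y in range(64)}
straight_line = {(x, 0) for x in range(64)}
sizes = [1, 2, 4, 8, 16]

dim_square = box_counting_dimension(filled_square, sizes)
dim_line = box_counting_dimension(straight_line, sizes)
print(dim_square, dim_line)
```

Since a 2D projection is analyzed, such an estimate cannot exceed 2, which is one way to see why aggregates with $D_f(A) > 2$ require adapted methods.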
Probably the most comparable conventional method for determining the mixing ratio of an aggregate from its 2D STEM image is based on determining the particle label of each pixel using threshold values. More specifically, depending on the pixel intensity, each pixel is classified as TiO2, WO3 or background, and then, using the a priori known particle size distributions, a mixing ratio can be predicted. However, since thick TiO2 particles or many overlapping TiO2 particles can produce the same pixel intensity values as thin WO3 particles, this threshold approach is prone to large errors [11]. The best-performing thresholds were determined by a "brute force" search on a representative data set. This results in an MAE of 0.078 per aggregate when estimating the mixing ratio. Compared to the MAE of 0.053 of the CNN approach described in the present paper, see Section 3.2, the error of the thresholding method is about 47% larger. This is likely due to the increased values of pixel intensity which are caused by overlapping particles (see Figure 8), where these pixels with increased intensity values tend to be classified as WO3. Therefore, conventional threshold methods become increasingly inaccurate with an increasing number of overlapping particles, contrary to the behavior of the CNN approach proposed in the present paper, see Figure 15b. Regarding the prediction of the remaining two model parameters $\theta_0$ and $\theta_1$, as far as we know, there is no comparable conventional method based on 2D image data. Such methods, if they do not consider depth information, would not be able to recognize whether overlapping particles are touching or not, and thus, it is unlikely that they can accurately predict the values of $\theta_0$ and $\theta_1$.
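The thresholding baseline can be sketched as follows. This is a simplified illustration only: the share of WO3-labeled pixels among all particle pixels is used as a crude proxy for the mixing ratio, whereas the method described above additionally uses the a priori known particle size distributions, which are not modeled here. All names, the candidate thresholds and the toy images are hypothetical.

```python
from itertools import product

BACKGROUND, TIO2, WO3 = 0, 1, 2

def classify_pixel(intensity, t_low, t_high):
    """Label a pixel by intensity: dark = background, medium = TiO2, bright = WO3."""
    if intensity < t_low:
        return BACKGROUND
    if intensity < t_high:
        return TIO2
    return WO3

def estimate_ratio(image, t_low, t_high):
    """Crude proxy (assumption): share of WO3 pixels among all particle pixels."""
    labels = [classify_pixel(v, t_low, t_high) for row in image for v in row]
    n_tio2, n_wo3 = labels.count(TIO2), labels.count(WO3)
    total = n_tio2 + n_wo3
    return n_wo3 / total if total else 0.0

def brute_force_thresholds(images, true_ratios, candidates):
    """Try all threshold pairs and keep the pair with the smallest MAE."""
    best = None
    for t_low, t_high in product(candidates, repeat=2):
        if t_low >= t_high:
            continue
        errors = [abs(estimate_ratio(img, t_low, t_high) - r)
                  for img, r in zip(images, true_ratios)]
        mae = sum(errors) / len(errors)
        if best is None or mae < best[0]:
            best = (mae, t_low, t_high)
    return best

# Toy images: 0.0 = background, 0.3 = TiO2-like, 0.8 = WO3-like intensity
images = [[[0.0, 0.3, 0.3, 0.8]], [[0.0, 0.8, 0.8, 0.3]]]
true_ratios = [1 / 3, 2 / 3]
best_mae, best_low, best_high = brute_force_thresholds(images, true_ratios, [0.1, 0.5, 0.9])
print(best_mae, best_low, best_high)
```

In this toy setting the intensities are perfectly separable. Overlapping TiO2 particles would raise the local intensity and push pixels across the upper threshold, which is the failure mode discussed above.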
Recall that the objective of the present paper is to generate digital shadows that are stochastically equivalent to the ground-truth aggregates used for model fitting. These digital shadows, which have a known 3D structure, can then be employed to predict the structural properties of the ground-truth aggregates at significantly reduced costs. Therefore, rather than just evaluating the accuracy of the predicted model parameters $\hat{\theta} = (\hat{\theta}_{D_f}, \hat{\theta}_{\rho}, \hat{\theta}_0, \hat{\theta}_1)$, the morphological similarities of the resulting digital shadows and their ground truth were also investigated in terms of further structural descriptors, i.e., average cluster sizes and coordination numbers. As already mentioned in Section 3.4.2, the MAE of $\hat{\theta}$ is not an appropriate tool to evaluate the similarity of digital shadows and their ground-truth aggregates. For instance, an extreme mixing ratio leads to a situation where the precision of either $\hat{\theta}_0$ or $\hat{\theta}_1$ has only a negligible impact on the structure of the resulting aggregates, because the corresponding material occurs very rarely. Moreover, the structural similarity of the resulting digital shadows is more strongly affected by small errors and rounding of $\mathrm{CNN}_0(I(B))$ and $\mathrm{CNN}_1(I(B))$ when the values of $\theta_0$ and $\theta_1$ are small than when they are large. In particular, errors in the prediction of ground truths for small values of $\theta_0$ and $\theta_1$ result in higher relative errors. In such cases, large relative errors seem to have a greater impact on the structural discrepancies observed between aggregates generated for predicted and preset model parameters, see Figure 16. This effect can be further exacerbated by the application of subsequent rounding operations. Although the predictor $\hat{\theta} = (\hat{\theta}_{D_f}, \hat{\theta}_{\rho}, \hat{\theta}_0, \hat{\theta}_1)$ proposed in this paper shows only minor discrepancies across all descriptors listed in Table 1, adapting the model and training process to address the issues mentioned above could enhance the similarity of digital shadows and original aggregates even further. For example, expanding the possible values of $\theta_0$ and $\theta_1$ to the interval $[1, 6]$, rather than just considering the discrete set $\{1, 2, \ldots, 6\}$, would result in a greater diversity of aggregates, while also avoiding rounding errors that can arise in the prediction of $\theta_0$ and $\theta_1$. More specifically, this could be achieved by modifying the aggregation model $\Psi_{N'}$ introduced in Eq. (7) such that the sizes of the primary clusters are randomly distributed, instead of choosing a constant cluster size. This would achieve a more detailed coverage of possible aggregate structures, especially for small values of $\theta_0$ and $\theta_1$. The training of CNNs could benefit from an adapted cost function that takes the values of other model parameters into account and assigns weights to errors based on the importance of the ground truth to be predicted.
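One conceivable form of such an adapted cost function is sketched below. It is purely illustrative and not a method proposed or tested in this paper: the absolute error of a predicted cluster size is divided by the preset value (so that errors at small values weigh more) and multiplied by the share of particles of the corresponding material (so that errors for a rarely occurring material weigh less). The function names and the weighting scheme are hypothetical.

```python
def weighted_relative_error(prediction, target, share):
    """Relative error of a cluster-size prediction, weighted by material share.

    share: fraction of particles belonging to the corresponding material
           (hypothetical weight derived from the mixing ratio).
    """
    if target <= 0:
        raise ValueError("target must be positive")
    return share * abs(prediction - target) / target

def weighted_loss(predictions, targets, shares):
    """Average weighted relative error over a set of predictions."""
    terms = [weighted_relative_error(p, t, s)
             for p, t, s in zip(predictions, targets, shares)]
    return sum(terms) / len(terms)

# The same absolute error of 0.5 is penalized differently:
error_small_target = weighted_relative_error(1.5, 1, 0.5)  # small preset value
error_large_target = weighted_relative_error(6.5, 6, 0.5)  # large preset value
error_rare_material = weighted_relative_error(1.5, 1, 0.0)  # material absent

loss = weighted_loss([1.5, 6.5], [1, 6], [0.5, 0.5])
print(error_small_target, error_large_target, error_rare_material, loss)
```

Whether such a weighting actually improves the similarity of digital shadows and original aggregates would have to be checked empirically.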

Conclusion
A method has been developed in order to determine the parameters of a stochastic 3D model for synthetic TiO2-WO3 hetero-aggregates, based on their 2D STEM images. The method relies on convolutional neural networks that utilize distinct problem-specific architectures. If such an appropriately calibrated stochastic 3D model is available, the neural network approach bypasses the need for using traditional microstructure analysis and modeling techniques, which are expensive in time and costs, such as tomographic STEM imaging as well as complex image processing and segmentation. The networks were capable of predicting model parameters that describe fractal dimension, number-wise mixing ratio, and the sizes of primary clusters. The aggregates drawn from the stochastic 3D model with predicted model parameters exhibited almost the same coordination numbers and average cluster sizes as those generated by the model with the original (preset) parameters.
In the present paper, synthetic TiO2-WO3 hetero-aggregates are used as a model system, because these two materials show a good material contrast in STEM images. However, only spherical particles were used for the generation of synthetic 3D aggregates. It would be interesting to investigate the effectiveness of the proposed method for hetero-aggregates composed of particles that feature similar STEM intensities but differ significantly in their shape or size. Moreover, since experimentally measured aggregates feature more varied cluster sizes than synthetically generated ones, it can be presumed that a larger variability in cluster sizes would require more comprehensive data sets in order to make accurate predictions; this effect remains to be investigated systematically. Finally, in a forthcoming study, the presented method will be experimentally validated. More precisely, experimentally acquired 3D STEM image data of hetero-aggregates will be analyzed to investigate how well the stochastic 3D model proposed in the present paper can describe real aggregates.

Figure 1: Workflow of the training and evaluation procedure.

Figure 2: Examples of virtual hetero-aggregates. The labels correspond to the values of the vector $\theta = (\theta_{D_f}, \theta_{\rho}, \theta_0, \theta_1)$ of their model parameters. The WO3 particles are displayed in blue, while the TiO2 particles are displayed in orange. The particle sizes are drawn from a log-normal distribution with parameters $\mu$ and $\sigma$ as defined above. Some primary clusters, as defined in Section 2.1.1, are highlighted.

Figure 3: Schematic representation of the 3D structure of a virtual hetero-aggregate (left) and its respective STEM image (right). The WO3 particles are colored blue and correspond to the bright particles in the STEM image. The TiO2 particles are colored orange and correspond to the dark particles.

Figure 4: Visualization of empirical probability densities of the fractal dimension $D_f(A)$ (left) and mixing ratio $\rho(A)$ (right) of virtual hetero-aggregates, depending on the model parameters $\theta_{D_f}$ and $\theta_{\rho}$, where the values of $D_f(A)$ and $\rho(A)$ are computed by means of Eqs. (5) and (3), respectively. The other model parameters were chosen at random from their respective ranges, as introduced in Section 2.1.2. To improve clarity, kernel density estimation was used to assign colors to the scatter points computed for 19 440 realizations of the stochastic 3D model.

Figure 5: Average cluster size of WO3 particles. Each scatter point is computed over a batch $B = \{A_1, \ldots, A_{\nu}\}$ with $\nu = 12$ for six different specifications of the model parameters $\theta_{D_f}, \theta_{\rho}, \theta_0, \theta_1$. For three of these specifications, the values of $\theta = (\theta_{D_f}, \theta_{\rho}, \theta_0, \theta_1)$ and the average cluster sizes, corresponding to the red scatter points, are displayed.

Figure 6: The basic network architecture represented as a simple composition of subnetworks f and g (left), and an adjusted multi-image input network (right), where the subnetwork f is applied to multiple input images and the outputs are concatenated as a new input for subnetwork g (see also Section 2.3.4 below).

Table 3: Details of network architecture. The value of $b$ is set to $b = 1$ for the prediction of the model parameters $\theta_{D_f}$ and $\theta_{\rho}$, and to $b = \nu$ for the prediction of $\theta_1$ and $\theta_0$. Since padding is omitted, each convolutional layer reduces the size of the feature map by two in both dimensions.

Figure 7: Prediction error of the network $\mathrm{CNN}_1$ for unmodified input images, as a function of the preset value of the model parameter $\theta_1$. The prediction quality is comparable to that of a constant prediction, where $\mathrm{CNN}_1(I(B)) = 3.5$. The results have been obtained on a randomly chosen subset of the training data set.

Figure 8: Effect of pixel intensity modification. The image on the right-hand side is obtained after replacing the intensity values of non-background pixels (shown on the left-hand side) by their multiplicative inverse, where a threshold of $t = 0.001$ is used.

Figure 9: Quality of the estimators for $\theta_{D_f}, \theta_{\rho}, \theta_0, \theta_1$, as a function of the batch size $\nu$. Due to different orders of magnitude, the error curves for $\theta_{D_f}, \theta_{\rho}$ and $\theta_0, \theta_1$ are shown separately. The MAEs, defined by Eq. (16), are computed over all available evaluation data, indexed by the set $E$.

Figure 10: Estimation error of $D_f$ (left) and $\theta_{D_f}$ (right). In Subfigure (a), the quality of the prediction of the fractal dimension per aggregate is visualized, where the colors are computed by means of a Gaussian kernel density estimator. In Subfigure (b), the error regarding the prediction of the model parameter $\theta_{D_f}$ is shown, where batches of size $\nu = 12$ are used for the computation of $\theta_{D_f}$.

Figure 11: Estimation error of $\rho$ (left, center) and $\theta_{\rho}$ (right). In Subfigure (a), the error of the unrounded output of the regression network is displayed per image. The black line, visualizing the mean error, is computed via a sliding window. The error for the output of the modified network is shown in Subfigure (b). In Subfigure (c), the error regarding the prediction of the model parameter $\theta_{\rho}$ is displayed, where the modified values of $\rho$ are used.

Figure 12: Estimation error of $\theta_0$. In Subfigure (a), the differences between the network output $\mathrm{CNN}_0(I(B))$ and the preset values $\theta_0(B)$ of the model parameter $\theta_0$ are shown. The prediction error of $\theta_0$ after applying the rounding operation is displayed in Subfigure (b).

Figure 13: Estimation error of $\theta_1$. In Subfigure (a), the differences between the network output $\mathrm{CNN}_1(I(B))$ and the preset values $\theta_1(B)$ of the model parameter $\theta_1$ are shown. The prediction error of $\theta_1$ after applying the rounding operation is displayed in Subfigure (b).

Figure 15: Prediction error for fractal dimension (left) and mixing ratio (right), depending on the number of particles per aggregate, for aggregates taken from the evaluation data given by the index set E. The color of dots is chosen according to a 1D Gaussian kernel density estimator along the y-axis.

Figure 16: Distribution of the average heterogeneous coordination number $Z_{\mathrm{hetero}}(A)$ for three different values of $\theta_0$. The dashed vertical lines show the mean value of $Z_{\mathrm{hetero}}(A)$ for each of the three specifications of $\theta_0$, computed for a total of 2400 simulated aggregates $A$.

Table 1: Structural descriptors used for evaluating the model parameter prediction. For formal definitions of the descriptors, see Section 3.4.

Table 5: Discrepancy between the mean values of the distributions of $S_{\mathrm{TiO_2}}$, $Z_{\mathrm{hetero}}$, and $Z_{\mathrm{total}}$, with respect to the mean absolute error (MAE) and the coefficient of determination $R^2$, computed for the 50 preset specifications of $\theta$ and their predictions $\hat{\theta}$.