Warwick Electron Microscopy Datasets

Large, carefully partitioned datasets are essential to train neural networks and standardize performance benchmarks. As a result, we have set up a new dataserver to make University of Warwick electron microscopy datasets available to the wider community. There are three main datasets containing 19769 scanning transmission electron micrographs, 17266 transmission electron micrographs, and 98340 simulated exit wavefunctions, with multiple variants of each dataset for different applications. Each dataset is visualized by t-distributed stochastic neighbour embedding, and we have created interactive visualization tools. Our datasets are supplemented by source code for analysis, data collection and visualization.


Introduction
We have set up a new dataserver 1 to make our large new electron microscopy datasets available to both electron microscopists and the wider community. There are three main datasets containing 19769 experimental scanning transmission electron microscopy 2 (STEM) images, 17266 experimental transmission electron microscopy 2 (TEM) images and 98340 simulated TEM exit wavefunctions 3 . Experimental datasets represent general research and were collected by dozens of University of Warwick scientists working on hundreds of projects between January 2010 and June 2018. We have been using our datasets to train artificial neural networks (ANNs) for electron microscopy [3][4][5][6][7] , where standardizing results with common test sets has been essential for comparison. This paper provides details of and visualizations for datasets and their variants, and is supplemented by source code for analysis, data collection, and both static and interactive visualizations 8 .
Machine learning is increasingly being applied to materials science 9, 10 , including to electron microscopy 11 . Encouragingly for scientists, ANNs are universal approximators 12 that can leverage an understanding of physics to represent 13 the best way to perform a task with arbitrary accuracy. In theory, this means that ANNs can always match or surpass the performance of contemporary methods. However, training, validation and testing require large, carefully partitioned datasets to ensure that ANNs are robust 14,15 in general use. To this end, our datasets are partitioned so that each subset has different characteristics: TEM and STEM images are partitioned so that each subset was collected by different scientists, and exit wavefunction subsets were simulated with Crystallography Information Files 16 (CIFs) for materials published in different journals.
Most areas of science are facing a reproducibility crisis 17 , including artificial intelligence 18 . Adding to this crisis, natural scientists do not always benchmark ANNs against standardized public domain test sets; making results difficult or impossible to compare. In electron microscopy, we believe this is a symptom of most datasets being small, esoteric or not having default partitions for machine learning. For example, most datasets in the Electron Microscopy Public Image Archive 19, 20 are for specific materials and are not partitioned. In contrast, standard machine learning datasets such as CIFAR-10 21, 22 , MNIST 23 , and ImageNet 24 have default partitions for machine learning and contain tens of thousands or millions of examples. By publishing our large, carefully partitioned machine learning datasets, and setting an example by using them to standardize our research, we aim to encourage higher standardization of machine learning research in the electron microscopy community.
There are many popular algorithms for high-dimensional data visualization [25][26][27][28][29][30][31][32] . We use the scikit-learn 33 implementation of t-distributed stochastic neighbour embedding 34,35 (tSNE) as it is popular in the machine learning community. To reduce tSNE computation and data noise, we first apply probabilistic 36,37 principal component analysis 38 (PCA) to reduce the number of features in each image. This approach is used in the tSNE paper 34 and works well in practice. Minka's algorithm 39 could be used to obtain the optimal number of principal components; however, it would require 33 increased computation for singular value decomposition 40 .

Table 1. Examples and descriptions of STEM images in our datasets. References put some images into context to make them more tangible to unfamiliar readers.

Scanning Transmission Electron Micrographs
We curated 19769 STEM images from University of Warwick electron microscopy dataservers to train ANNs for compressed sensing 5,7 . Atom columns are visible in roughly two-thirds of images, and similar proportions are bright and dark field. In addition, most signals are noisy 48 and are imaged at several times their Nyquist rates 49 . To reduce data transfer times for large images, we also created a variant containing 161069 non-overlapping 512×512 crops from full images. For rapid development, we have also created new variants containing 96×96 images downsampled or cropped from full images. In this section we give details of each STEM dataset, referring to them using their names on our dataserver. The distribution of STEM images is shown in figure 1 for STEM images downsampled to 96×96, and the distribution of structure in 96×96 crops from STEM images is shown in figure 2. Both visualizations were created by embedding the first 50 principal components of images in two dimensions with tSNE. We used a perplexity of 127.4, 10000 iterations, and scikit-learn defaults for other parameters. Interactive visualizations that display images when you hover over map points are also available 8 . This paper is aimed at a general audience, so readers may not be familiar with STEM. Accordingly, example images are tabulated with references and descriptions in table 1 to make them more tangible.
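The visualization pipeline above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not our full figure-preparation code: the random array stands in for flattened 96×96 images, and the iteration count is left at the scikit-learn default rather than the 10000 iterations used for the published figures.

```python
# Sketch of the visualization pipeline: reduce each image to its first
# 50 principal components, then embed the features in 2D with tSNE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the dataset: (num_images, 96 * 96) array of flattened images.
rng = np.random.default_rng(0)
images = rng.normal(size=(500, 96 * 96))

# Reduce to 50 features per image to denoise and speed up tSNE.
features = PCA(n_components=50).fit_transform(images)

# Embed in two dimensions. The published figures use a tuned perplexity
# and 10000 iterations; other parameters are scikit-learn defaults.
embedding = TSNE(n_components=2, perplexity=127.4).fit_transform(features)
print(embedding.shape)  # one (x, y) map point per image
```

Each row of `embedding` is then plotted as one point in the tSNE map, optionally coloured or replaced by a thumbnail of the corresponding image.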

Transmission Electron Micrographs
We curated 17266 2048×2048 TEM images from University of Warwick electron microscopy dataservers to train neural networks to improve signal-to-noise ratios 4 . However, our dataset was only available upon request. It is now available on our new dataserver 1 . For convenience, we have also created a new variant containing 96×96 images that can be used for rapid ANN development. In this section we give details of each TEM dataset, referring to them using their names on our dataserver. To show the distribution of TEM images in figure 3, we embedded the first 50 principal components of 96×96 images in two dimensions with tSNE. We used a perplexity of 131.4, 10000 iterations, and scikit-learn defaults for other parameters. An interactive visualization that displays images when you hover over map points is also available 8 . This paper is aimed at a general audience, so readers may not be familiar with TEM. Accordingly, example images are tabulated with references and descriptions in table 2 to make them more tangible.

Exit Wavefunctions
We simulated 98340 TEM exit wavefunctions to train ANNs to reconstruct phases from amplitudes 3 . Half of the wavefunction information is undetected by conventional TEM as only the amplitude, and not the phase, of an image is recorded. Wavefunctions were simulated at 512×512 then centre-cropped to 320×320 to remove simulation edge artefacts. Wavefunctions have been simulated for real physics, where Kirkland potentials 59 for each atom are summed from n = 3 terms, and for an alternative universe where atoms have different potentials, by truncating the Kirkland potential summations to n = 1. Wavefunctions simulated for the alternative universe can be used to test ANN robustness to simulation physics. For rapid development, we also downsampled n = 3 wavefunctions from 320×320 to 96×96. In this section we give details of each exit wavefunction dataset, referring to them using their names on our dataserver. All series were created with a common, quadratically increasing 65 defocus series. However, spatial scales vary and must be fitted as part of reconstruction.
In detail, exit wavefunctions for a large range of physical hyperparameters were simulated with clTEM 66, 67 for acceleration voltages in {80, 200, 300} kV, material depths uniformly distributed in [5, 100) nm, material widths in [5, 10) nm, and crystallographic zone axes (h, k, l) with h, k, l ∈ {0, 1, 2}. Materials were padded on all sides with vacuum 0.8 nm wide and 0.3 nm deep to reduce simulation artefacts. Finally, crystal tilts were perturbed by zero-centred Gaussian random variates with a standard deviation of 0.1°. We used default values for other clTEM hyperparameters. Simulations for a small range of physical hyperparameters used lower upper bounds that reduced hyperparameter ranges by factors close to 1/4. All wavefunctions are linearly transformed to have a mean amplitude of 1.
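The hyperparameter distributions above can be sketched as a sampling routine. This is illustrative only: clTEM is configured separately, the function name is ours, and the distributions are restated from the text rather than taken from clTEM's interface.

```python
# Sketch of sampling the large-range simulation hyperparameters described
# in the text. clTEM itself consumes these values elsewhere.
import random

def sample_simulation_hyperparameters(seed=None):
    rng = random.Random(seed)
    zone_axis = (0, 0, 0)
    while zone_axis == (0, 0, 0):  # (0, 0, 0) is not a valid zone axis
        zone_axis = tuple(rng.choice([0, 1, 2]) for _ in range(3))
    return {
        "voltage_kV": rng.choice([80, 200, 300]),
        "depth_nm": rng.uniform(5, 100),   # material depth in [5, 100) nm
        "width_nm": rng.uniform(5, 10),    # material width in [5, 10) nm
        "zone_axis": zone_axis,            # (h, k, l) with h, k, l in {0, 1, 2}
        "tilt_perturbation_deg": rng.gauss(0.0, 0.1),  # zero-centred, sigma 0.1 deg
    }

params = sample_simulation_hyperparameters(seed=0)
```

The small-range datasets use the same scheme with upper bounds reduced by factors close to 1/4.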
Visualization of complex exit wavefunctions is complicated by the display of their real and imaginary components. However, real and imaginary components are related 3 and can be visualized in the same image by plotting them in red and blue colour channels, respectively. Distributions of 96×96 simulated wavefunctions are shown in figure 4, figure 5, and figure 6 for a large range of materials and physical hyperparameters, a large range of materials and a small range of physical hyperparameters, and In1.7K2Se8Sn2.28 and a large range of physical hyperparameters, respectively. Visualizations were created by embedding the first 50 principal components of amplitudes in two dimensions with tSNE. We used perplexities of 190.6, 108.9 and 69.5, respectively, 10000 iterations, and scikit-learn defaults for other parameters. Interactive visualizations that display amplitudes when you hover over map points are also available 8 . All amplitude images show atom columns.
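The red/blue display described above is straightforward to implement. The sketch below (with a toy plane-wave stand-in for a simulated wavefunction) rescales the real part into the red channel and the imaginary part into the blue channel; the rescaling convention is an assumption for illustration.

```python
# Map a complex 2D exit wavefunction to an RGB image:
# red = real component, blue = imaginary component, green unused.
import numpy as np

def wavefunction_to_rgb(psi):
    """Visualize a complex wavefunction with real/imag in red/blue channels."""
    def rescale(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    rgb = np.zeros(psi.shape + (3,))
    rgb[..., 0] = rescale(psi.real)  # red channel: real component
    rgb[..., 2] = rescale(psi.imag)  # blue channel: imaginary component
    return rgb

# Toy example: a 96x96 tilted plane wave.
y, x = np.mgrid[0:96, 0:96]
psi = np.exp(1j * 0.2 * (x + y))
image = wavefunction_to_rgb(psi)
```

Because the two components share one image, related real and imaginary structure appears as correlated red and blue intensity.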

Discussion
The best dataset variant varies for different applications. Full-sized datasets can always be used as other dataset variants are derived from them. However, loading and processing full-sized examples may bottleneck training, and it is often unnecessary. Instead, smaller 512×512 crops, which can be loaded more quickly than full-sized images, can often be used to train ANNs to be applied convolutionally 68 to or tiled across 4 full-sized inputs. In addition, 96×96 dataset variants can be used in the early stages of development to rapidly train small ANNs. However, subtle application- and dataset-specific considerations may also influence the best dataset choice.
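Tiling an ANN trained on crops across a full-sized input can be sketched as follows; the function and crop handling are illustrative rather than our training code.

```python
# Split a full-sized image into non-overlapping crops, so a network
# trained on 512x512 crops can be tiled across full-sized inputs.
import numpy as np

def tile(image, size=512):
    """Return non-overlapping size x size crops of a 2D image."""
    h, w = image.shape
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

crops = tile(np.zeros((2048, 2048)))
print(len(crops))  # 16 crops of 512x512 cover a 2048x2048 image
```

A fully convolutional network can instead consume the full-sized input directly, avoiding boundary artefacts at crop edges.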
In practice, electron microscopists image most STEM and TEM signals at several times their Nyquist rates 49 . This eases visual inspection, decreases sub-Nyquist aliasing 69 , improves display on computer monitors, and is easier than carefully tuning sampling rates to capture the minimum data needed to resolve signals. High sampling may also reveal additional high-frequency information when images are inspected after an experiment. However, this complicates ANN development as it means that information per pixel is often higher in downsampled images. For example, partial scans 5 across STEM images that have been downsampled to 96×96 require higher coverages than scans across 96×96 crops for ANNs to learn to complete images with equal performance. It also complicates the comparison of different approaches to compressed sensing. For example, we suggested that sampling 512×512 crops at a regular grid of probing locations outperforms sampling along spiral paths as a subsampling grid can still access most information 5 .
Test set performance should be calculated for a standardized dataset partition to ease comparison with other methods. Nevertheless, training and validation partitions can be varied to investigate validation variance for partitions with different characteristics. Default training and validation sets for STEM and TEM datasets contain contributions from different scientists that have been concatenated or numbered in order, so new validation partitions can be selected by concatenating training and validation partitions and moving the window used to select the validation set. Similarly, exit wavefunctions were simulated with CIFs from different journals that were concatenated or numbered sequentially. There is leakage 70,71 between training, validation and test sets due to overlap between materials published in different journals and between different scientists' work. However, further leakage can be minimized by selecting dataset partitions before any shuffling and, for wavefunctions, by ensuring that wavefunctions simulated for each journal are not split between partitions.
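The repartitioning scheme above amounts to sliding a window over the ordered, concatenated training and validation examples. A minimal sketch, with illustrative contributor labels standing in for the ordered file lists:

```python
# Select a new validation partition by moving a window over ordered
# examples (ordered by contributing scientist or source journal),
# keeping the test set fixed.

def repartition(examples, val_size, offset):
    """Return (train, validation) with validation taken from a moved window."""
    val = examples[offset:offset + val_size]
    train = examples[:offset] + examples[offset + val_size:]
    return train, val

examples = [f"scientist_{i}" for i in range(10)]  # ordered by contributor
train, val = repartition(examples, val_size=2, offset=4)
print(val)  # ['scientist_4', 'scientist_5']
```

Selecting the window before any shuffling, and keeping each journal's wavefunctions within a single partition, limits leakage between partitions.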
Experimental STEM and TEM image quality is variable. Images were taken by scientists with all levels of experience, and TEM images were taken on multiple microscopes. This means that our datasets contain images that might be omitted from other datasets. For example, the tSNE visualization for STEM in figure 1 revealed some images where scans are incomplete, ∼50 blank images, and images that only contain noise. Similarly, the tSNE visualization for TEM in figure 3 revealed some images where apertures block electrons, and that there are a small number of unprocessed standard diffraction and convergent beam electron diffraction 72 patterns. Although these conventionally low-quality images would not normally be published, they are important to ensure that ANNs are robust for live applications. We encourage readers to try our interactive tSNE visualizations 8 for detailed inspection of our datasets.
Features extracted with PCA result in tSNE visualizations where image positions are related to total intensities. For example, in figure 1 dark and bright field STEM images are separated, and clusters at the top and bottom contain images with inverted intensity structure. Other approaches to image feature extraction could base visualizations on different image characteristics. For example, image features could be extracted with deep learning by using the latent space of an autoencoder 73,74 or variational autoencoder 75 , or by using features before logits in a classification ANN 76 . We have posted autoencoders for electron microscopy with pre-trained models 77, 78 that could be improved. Alternatively, image features could be extracted from a histogram of oriented gradients 79 , with speeded-up robust features 80 , from local binary patterns 81 , by wavelet decomposition 82 or with other methods 83 . The best features to extract for a visualization depend on its purpose.
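As one intensity-insensitive alternative to PCA features, a global histogram of gradient orientations (a crude, NumPy-only cousin of the histogram of oriented gradients mentioned above) could feed the tSNE embedding instead. The bin count below is an arbitrary choice for illustration.

```python
# Extract an intensity-scale-invariant feature vector: a magnitude-weighted
# histogram of gradient orientations, normalized to sum to one.
import numpy as np

def orientation_histogram(image, bins=9):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # angles in (-pi, pi]
    hist, _ = np.histogram(orientation, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist  # divide out intensity scale

features = orientation_histogram(np.random.default_rng(0).normal(size=(96, 96)))
```

Because the histogram is normalized, globally rescaling an image's intensity leaves the features unchanged, so map positions would reflect orientation structure rather than total intensity.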

Conclusion
We have provided details and visualizations for large new electron microscopy datasets available on our new dataserver. Datasets have been carefully partitioned into training, validation, and test sets for machine learning. In addition to full-sized datasets, we have provided variants containing 512×512 crops to reduce data loading times, and examples downsampled to 96×96 for rapid development. Source code and interactive dataset visualizations are provided in a supplementary repository to help users become familiar with our datasets. By making our datasets available, we aim to encourage standardization of performance benchmarks in electron microscopy and increase participation of the wider computer science community in electron microscopy research.

Data Availability
The data that support the findings of this study are openly available. Large new TEM, STEM, and exit wavefunction datasets are on our new dataserver 1 . Source code for data collection, figure preparation, and interactive visualizations is in a GitHub repository 8 . For additional information contact the corresponding author (J.M.E.).
We do not have plans to host additional datasets from external users. However, we are open to hosting and encourage inquiry. We have funding for our public dataserver until at least 2030 and expect data to be moved to another archive if ours is no longer maintained. Datasets are accessed via hyperlinks on a main page so that we can change the physical locations of data without affecting users.