Deep particle diffusometry: convolutional neural networks for particle diffusometry in the presence of flow and thermal gradients

Diffusion coefficient measurement is a helpful tool in revealing various properties of a fluid, such as viscosity and temperature. However, determining the diffusion coefficient often requires specialized equipment. Particle-based techniques allow the use of conventional cameras to determine flow properties without any specialized measurement devices. However, the performance of existing methods such as single-particle and correlation-based measurements degrades drastically in real-world scenarios such as flow and thermal gradients. This work introduces a new method of estimating the diffusion coefficient in the presence of flow and thermal gradients, named deep particle diffusometry (DPD). The technique uses temporally averaged particle images as inputs and convolutional neural networks to predict the underlying diffusion coefficient. The results show that a high fit coefficient (R²) value of 0.99 was achieved under no-flow or known-flow conditions, and an R² value of 0.95 was achieved when the fluid had an arbitrary flow. Next, the generalization ability of the network was shown by training the DPD models on datasets without a gradient and testing on datasets with a diffusion coefficient gradient; the networks maintained comparably high R² values of 0.96. The DPD models were then tested against three conventional methods on various simulated datasets, showing their superior performance in situations where an arbitrary flow was present along with diffusion. Finally, the networks were tested on experimental data, and their predictions were compared with those of conventional methods, resulting in R² values of 0.97 under the no-flow condition. The results show that the proposed method performs similarly to existing methods on datasets with no flow or a known flow and can surpass them on datasets with an arbitrary flow.


Introduction
Diffusion is a process caused by the random motion of molecules in a fluid. The rate of diffusion, given by the diffusion coefficient (D), depends on physical factors such as the shape and size of the diffusing particles, the temperature, and the viscosity of the fluid. A high value of the diffusion coefficient can indicate a high degree of random motion. Diffusion coefficient measurement has various applications in fields such as drug discovery [1], disease screening and detection [2], and bio-inspired fabrication [3]. There are various ways by which the diffusion coefficient can be measured. These include spectroscopy-based methods (such as fluorescence correlation spectroscopy [4], nuclear magnetic resonance spectroscopy [5], and dynamic light scattering [6]), microscopy-based methods (such as differential dynamic microscopy [7] and fluorescence recovery after photobleaching [8]), and particle-based techniques (such as single-particle tracking (SPT) [9, 10] and correlation methods [11-13]).
In particle-based methods, an otherwise featureless fluid is seeded with tracer particles that can be imaged to determine flow properties. These particles are chosen such that they provide contrast in the image and neutral buoyancy with respect to the fluid [14]. The various particle-based methods differ in how the images are processed. In SPT, individual particles are tracked over multiple frames, and the mean square displacement (MSD) of the particles is calculated and used to fit a power law, which provides the value of the diffusion coefficient [10, 15]. A limitation of single-particle tracking is that it depends on particle identification, which is not a trivial task when particles are out of focus and do not appear Gaussian [16]. Alternatively, the MSDs can be extracted via particle image velocimetry [14, 17] without knowing the individual particle locations and can be used in the aforementioned workflow. Finally, in [13], the widths of the autocorrelation and cross-correlation of the images are calculated, and the difference between them is used to determine the diffusion coefficient. Even though these methods provide reliable performance in the absence of flow or thermal gradients, their performance degrades significantly in the presence of these real-world conditions. Hence, there is a need for alternative methods of processing particle-based diffusometry data that can predict diffusion coefficients in such scenarios.
Diffusion is a stochastic process. When particles are added to visualize the underlying fluid, one can only control the number of particles per unit volume, not the positions where the particles will appear in any given frame. This gives rise to spatial stochasticity in the analysis. The particle positions in a given frame depend on their positions in the previous frame, the underlying flow, and thermal diffusion (causing a random motion). Hence, the particle locations have temporal stochasticity. Therefore, any newly proposed method needs to account for a complicated representation space while processing the particle data for diffusion coefficient measurement. This motivates the use of deep learning algorithms for particle motion data analysis.
Recently, neural networks (a commonly used unit for deep learning algorithms) have found various applications in predicting the properties of micro-scale [18] to macro-scale [19] fluids. Specifically for particle-based techniques, neural networks have been used in both SPT [20, 21] and correlation-based [22-25] applications. Even though there are existing works that use neural networks to predict diffusion coefficients from SPT data [26, 27], the authors found no existing works that use neural networks to process the images directly for diffusion coefficient measurement.
The current work uses a convolutional neural network (CNN) based architecture trained in a supervised manner. The networks were trained to solve both classification and regression problems. Predicting the diffusion coefficient from particle images is a regression task, as the label lies on a continuous number line. Still, the classification setup was important, as it allowed a better understanding of the effect of the spatiotemporal stochasticity present in the dataset on the CNN's performance. The networks were trained on simulated datasets but were tested on both simulated and experimental datasets. The initial training and testing were done on simulated datasets with no flow to test the classification and regression setups and to reduce the spatiotemporal stochasticity in the predictions. The results also show the impact of the number of frames on performance in a no-flow setting. Next, the CNNs were trained and tested in known and arbitrary flow settings to overcome the limitation of conventional methods that no flow can be present while processing the datasets. The networks were then tested for generalization on datasets with a temperature gradient across the frame. Additionally, the networks were benchmarked against conventional methods and applied to real data.
The rest of the paper is organized as follows. Section 2 describes the procedure and considerations for creating the simulated datasets used in the study; it also contains information about the experimental datasets used for testing the algorithms. Section 3 describes the novel method of deep particle diffusometry (DPD), the kinds of networks used, and how they were trained. Section 4 discusses various classification and regression results; it also includes tests of the generalization ability of the networks, benchmarking against other methods, performance on experimental data, and two sanity checks for the CNN training. Finally, section 5 concludes the work and discusses future directions.

Creating datasets
Deep learning is a data-driven loss minimization process. Depending on the training process, the architecture used, and the type and complexity of the task, deep learning algorithms can require thousands [28] to billions [29] of data points during training. For particle-based diffusion coefficient measurement techniques, a data point is a sequence of images. For supervised learning, in addition to the sequence of images, one needs the actual values (labels) of the diffusion coefficient. Experimentally acquiring thousands of videos is an expensive and time-consuming process and was not a focus of this work. So, in this work, various simulated datasets were created for training and evaluating the CNNs. Since testing the algorithms does not require such large datasets, testing could be done on both simulated and experimental datasets.
In-house MATLAB scripts were used to generate the simulated datasets. Since color does not carry any information for the current task, grayscale images were created. Each image had a size of 1024 px × 1024 px and contained around 600 randomly placed particles, each with a diameter of 6 px. This leads to less than 1% of the image having non-background pixels; hence, the images have a high degree of sparsity. The images had no noise, and there was no out-of-plane motion of the particles between frames. The diffusion process was modeled as a normal random number with zero mean and unit standard deviation, multiplied by a constant depending on the diffusion coefficient, the time difference between frames, and the scaling factor. Equation (1) gives the particle location under pure diffusion in one dimension:

$$X_{\mathrm{new}} = X_{\mathrm{old}} + \sqrt{2D\Delta t}\,N(0,1), \tag{1}$$

where X_new is the new position of the particle, X_old is the old position of the particle, D is the diffusion coefficient, Δt is the time between two frames, and N(0, 1) represents a normal random number with a mean of 0 and a standard deviation of 1. The same process is used to obtain the new Y locations. Each sequence had 101 frames to provide temporal diversity in the dataset. Additionally, multiple repeats of each simulation configuration (i.e. multiple simulation batches) were made with random initial particle locations to provide spatial diversity. The networks were trained on one or more simulation batches to expose them to the various spatiotemporal differences that are possible. Network training, validation, and testing were always done on mutually exclusive datasets, so the datasets used for training, validation, and testing had their own spatiotemporal differences. This was done to avoid over-fitting on the training batch and any shortcut learning [30]. In this work, three different types of simulated datasets were created: no flow, arbitrary flow, and gradient with no flow. Details about each of these datasets are given in the following subsections.
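For concreteness, the random-walk update in equation (1) can be sketched in a few lines of NumPy. The authors used in-house MATLAB scripts, so the snippet below is only an illustrative Python equivalent; the frame interval `dt` is an assumed placeholder, and the 15 px µm⁻¹ scale follows from rendering 400 nm particles at 6 px.

```python
import numpy as np

def simulate_diffusion(n_particles=600, n_frames=101, size=1024,
                       D=1.0, dt=1 / 15, px_per_um=15.0, rng=None):
    """Simulate 2D Brownian motion of point particles per equation (1).

    D is in um^2/s; positions are kept in pixels via the px_per_um
    scaling factor. Particles start at uniformly random locations.
    """
    rng = np.random.default_rng() if rng is None else rng
    # random initial locations, one row per particle: (x, y) in px
    pos = rng.uniform(0, size, size=(n_particles, 2))
    frames = [pos.copy()]
    # per-axis displacement std in px: sqrt(2 D dt), scaled to pixels
    sigma_px = np.sqrt(2.0 * D * dt) * px_per_um
    for _ in range(n_frames - 1):
        pos = pos + rng.normal(0.0, sigma_px, size=pos.shape)
        frames.append(pos.copy())
    return frames  # list of (n_particles, 2) position arrays
```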

No flow dataset
This dataset was the pure diffusion case: there was no flow affecting the positions of the particles and no diffusion gradient across the image. The dataset had 271 different diffusion coefficients ranging from 0.3 µm² s⁻¹ to 3 µm² s⁻¹ with a step size of 0.01 µm² s⁻¹. Five simulation batches were created for this dataset. In all the experiments, the first three simulation batches were used for network training, the fourth for validation, and the fifth for testing.

Arbitrary flow
The next family of datasets was created by adding a flow to the pure diffusion case. As there is advection in addition to diffusion, this leads to Peclet numbers (Pe) greater than 0. Pe is defined in equation (2):

$$\mathrm{Pe} = \frac{Lv}{D}, \tag{2}$$

where L is the characteristic length, v is the velocity, and D is the diffusion coefficient. In this study, three different types of flow were used: uniform flow, Couette flow, and Poiseuille flow. The arbitrary flow dataset can have any of these three flows or no flow at all. Flow directions for all of these were from left to right, with a maximum velocity of 2 m s⁻¹. As above, each simulation batch had 271 different diffusion coefficients ranging from 0.3 µm² s⁻¹ to 3 µm² s⁻¹, which leads to a Pe from 0 to 2.67 × 10⁶. For each flow type, five simulation batches were made with the same train-validation-test splits.

Gradient with no flow
The final dataset had no flow, but there was a diffusion coefficient gradient across the image. The main aim of this dataset was to further test the generalization ability of the networks trained on the no flow dataset. For this dataset, three simulation batches were created and used for testing. The diffusion coefficient gradient was created using a temperature gradient across the image. This temperature gradient can be obtained using the heat equation. Equation (3) shows the heat equation in the Cartesian coordinate system:

$$\frac{\partial T}{\partial t} = \alpha\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + \frac{\partial^2 T}{\partial z^2}\right), \tag{3}$$
where T is the temperature, t is the time, α is the thermal diffusivity, and x, y, and z are the Cartesian coordinates. The image has no depth and can be considered infinitely wide, which simplifies the above equation to one dimension. To simplify the equation further, there was no flow, and the simulation was done for a case where the system had reached a steady state. The simulations mimic a situation where the topmost row of pixels is at a higher temperature (T₂) than the temperature (T₁) at the bottom-most row. Let the length of the image be L. Then the temperature at an arbitrary height x is given by equation (4):

$$T(x) = T_1 + \Delta T\,\frac{x}{L}, \tag{4}$$

where ΔT is the temperature difference (T₂ − T₁). This gives a linear temperature gradient in the setup. Next, it was assumed that the fluid in the simulation was water. Hence, the viscosity can be calculated with the four-parameter dynamic viscosity correlation [31] given in equation (5):

$$\mu = a\,\exp\!\left(\frac{b}{T} + cT + dT^2\right), \tag{5}$$

where T is the temperature, a = 1.856 × 10⁻¹¹ mPa s, b = 4209 K, c = 0.04527 K⁻¹, and d = −3.376 × 10⁻⁵ K⁻². The viscosity is maximum at the bottom of the image. Given the temperature and viscosity values, the diffusion coefficient can be calculated from the Stokes-Einstein relation given in equation (6):

$$D = \frac{k_B T}{3\pi\mu\,d_p}, \tag{6}$$

where k_B (the Boltzmann constant) = 1.38 × 10⁻²³ J K⁻¹ and d_p is the particle diameter, which was selected to be 400 nm to give a particle size of 6 px after conversion. The diffusion coefficient was maximum at the top of the image and decreased non-linearly toward the bottom. Figure 1 shows the change in viscosity and diffusion coefficient for a single frame in the presence of a temperature gradient and no flow.
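The temperature-to-diffusion-coefficient mapping of equations (4)-(6) is straightforward to compute. The sketch below assumes the four-parameter correlation values given above and returns the per-row diffusion coefficient profile; the function names are illustrative, not from the authors' code.

```python
import numpy as np

KB = 1.38e-23  # Boltzmann constant, J/K

def water_viscosity(T):
    """Four-parameter dynamic viscosity of water, equation (5).

    T in kelvin; returns viscosity in Pa s (correlation yields mPa s,
    converted at the end).
    """
    a, b, c, d = 1.856e-11, 4209.0, 0.04527, -3.376e-5
    return a * np.exp(b / T + c * T + d * T**2) * 1e-3

def diffusion_profile(T1, T2, n_rows=1024, diameter=400e-9):
    """Linear temperature profile (eq. (4)) -> Stokes-Einstein D (eq. (6))."""
    T = np.linspace(T1, T2, n_rows)             # bottom (T1) to top (T2)
    mu = water_viscosity(T)
    D = KB * T / (3.0 * np.pi * mu * diameter)  # m^2/s
    return T, mu, D * 1e12                      # D in um^2/s
```

As a check on the stated range, this sketch gives roughly 0.56 µm² s⁻¹ at 273.15 K and 3.0 µm² s⁻¹ at 343.15 K, in line with the values quoted below.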
As this dataset was only used for testing and the networks were trained on the no flow dataset, the temperature was restricted to between 273.15 K (the freezing point of water) and 343.15 K to obtain diffusion coefficients from 0.58 µm² s⁻¹ to 3.05 µm² s⁻¹ (slightly above the prediction limit of the trained networks). Additionally, a temperature gradient of 10 K was used for any given simulation. This leads to seven possible temperature ranges between 273.15 K and 343.15 K.

Experimental dataset
In addition to the various simulated data, the networks were also tested on experimental datasets. The experiments contained 500 nm particles recorded at 40× magnification [32]. The image sequences were acquired at 15 frames per second, and the data from the first 3 min were used, as they had a stable diffusion coefficient. The original diffusion coefficients in these experiments were smaller than the range used for the CNN training, so the sequences were subsampled to obtain diffusion coefficients in the right range. In this subsampling, a given number of frames were skipped to obtain an effective diffusion coefficient in the range of 0.3 µm² s⁻¹-3 µm² s⁻¹. This resulted in multiple diffusion coefficients for each acquired video.
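A minimal sketch of this subsampling idea is given below. It assumes the apparent diffusion coefficient scales linearly with the effective frame interval (since the displacement variance grows as 2DΔt, keeping every k-th frame while treating the stack as if it were still acquired at the original frame rate scales the apparent D by k); the authors' exact subsampling code is not given, so this helper is purely illustrative.

```python
def subsample_for_target_d(frames, d_measured, d_target):
    """Skip frames so the effective diffusion coefficient lands near d_target.

    frames: sequence of images at the original frame rate.
    d_measured: diffusion coefficient of the original sequence (um^2/s).
    Returns the subsampled stack and the resulting effective D.
    """
    k = max(1, round(d_target / d_measured))  # frames kept: every k-th
    return frames[::k], d_measured * k
```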

Deep particle diffusometry
DPD is a convolutional neural network-based approach to predicting diffusion coefficients using regions of particle images (instead of single-particle tracking).The networks were trained in a supervised fashion to solve two types of problems: multi-class classification and regression.
The information relevant to finding the diffusion coefficient lies in a spatiotemporal domain. A single frame alone is only a random placement of particles and does not contain any information about how the particles diffuse over time. So, multiple frames must be processed to find the diffusion coefficient. There are various ways of processing multiple frames, such as using LSTMs with CNNs [33] (which can be difficult to train and parallelize [34]) or CNNs with 3D convolutions (which have a higher number of parameters compared to 2D convolutions). These architectural elements could be an acceptable choice, but for the current problem, the input space can be simplified further.
As the current particle images are very sparse and the background pixels contain no information, processing those pixels would not make good use of the computation. As the particles are expected to move in every frame, a spatial location that was empty in one frame can contain a particle in the next one. Hence, the information lying in the spatiotemporal domain can be projected onto the spatial domain. This not only reduces the number of frames that need to be processed but also creates meaningful task representations in the spatial domain, making it possible to use CNNs with 2D convolutions for the current task. In this work, this projection was obtained by temporally averaging either two, four, or eight frames, but the methodology can be extended to a higher number of frames (see the sketch below). Figure 2 shows an eight-frame average for different diffusion coefficients from the no flow dataset. As the diffusion coefficient increases, the particles move further away from their initial positions, and the temporally averaged images start having more prominent 'cloudy' structures.
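The projection itself is just a windowed temporal mean. A minimal NumPy sketch, assuming the rendered frames are stacked into a single array, could look like this:

```python
import numpy as np

def temporal_averages(stack, n=8, stride=None):
    """Average consecutive n-frame windows of an image stack.

    stack: array of shape (n_frames, H, W). stride defaults to n so
    consecutive windows do not overlap, matching the temporal stride
    used for testing in this work.
    """
    stride = n if stride is None else stride
    starts = range(0, stack.shape[0] - n + 1, stride)
    return np.stack([stack[s:s + n].mean(axis=0) for s in starts])
```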

Network used
In this study, the architectures were inspired by ResNets [35]; alternatively, other CNN architectures [36-38] could also be used. The modified ResNet contains a first convolution layer followed by n residual blocks. Each residual block contains two convolution layers with a skip connection bypassing them. This leads to 9-27 convolution layers depending on the hyperparameter choice (section 3.3). Finally, the network contains a fully connected layer (fc) to map the features of the final layer to predictions. The network has batch normalization layers and ReLU activations after the convolution layers. It also has a max-pool layer before the first residual block and an average pool after the last residual block to reduce the size of the feature space. Figure 3 gives an overview of the ResNet-inspired architecture and the residual block.
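A hedged PyTorch sketch of such a ResNet-inspired network is shown below. The layer width and default hyperparameters are our assumptions; the paper specifies only the overall structure (first convolution, n residual blocks with skip connections, pooling, and a final fc layer).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection (figure 3(b))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection bypassing both convs

class DPDNet(nn.Module):
    """ResNet-inspired DPD network (figure 3(a)).

    n_blocks and first_kernel were among the tuned hyperparameters;
    n_out is the number of classes for classification or 1 for
    regression. width=64 is an assumed channel count.
    """
    def __init__(self, n_blocks=4, first_kernel=7, width=64, n_out=1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, width, first_kernel, stride=2,
                      padding=first_kernel // 2, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))
        self.blocks = nn.Sequential(*[ResidualBlock(width)
                                      for _ in range(n_blocks)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(width, n_out)

    def forward(self, x):  # x: (B, 1, 256, 256) averaged-image crops
        x = self.pool(self.blocks(self.stem(x)))
        return self.fc(torch.flatten(x, 1))
```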

Network training and testing
The network training (and validation) was done on random crops (256 px × 256 px) taken from the averaged images. These random crops not only reduce the input image size but also provide stochasticity in the input space, which can help avoid local minima during training and promote generalization. The crop size was selected so that each crop contained at least some particles. It was empirically observed that 128 px × 128 px led to some crops with no particles in them, but there were no empty crops when the size was increased to 256 px × 256 px. The crop size could be reduced by increasing the number of particles or by enforcing that each crop contains at least one particle.
Once the networks were trained, they were tested on multiple frames to reduce the spatiotemporal bias of individual crops. For the current setup, each 1024 px × 1024 px averaged frame was divided into sixteen 256 px × 256 px crops. Each of these crops was individually fed to the network, and the resulting per-crop diffusion coefficient values were averaged to get the final diffusion coefficient.
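A minimal sketch of this tiled evaluation, assuming a trained regression model and a 1024 px × 1024 px averaged frame stored as a single-channel tensor:

```python
import torch

def predict_full_frame(model, avg_frame, crop=256):
    """Tile an averaged frame into sixteen 256x256 crops and average
    the per-crop regression outputs into one diffusion coefficient.

    avg_frame: tensor of shape (1, 1024, 1024).
    """
    model.eval()
    preds = []
    with torch.no_grad():
        for i in range(0, avg_frame.shape[-2], crop):
            for j in range(0, avg_frame.shape[-1], crop):
                patch = avg_frame[..., i:i + crop, j:j + crop]
                preds.append(model(patch.unsqueeze(0)).item())
    return sum(preds) / len(preds)  # final diffusion coefficient
```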
While using multiple averaged frames, the starting frames were separated by a temporal stride equal to the number of frames used in the average. Figure 4 shows an overview of the training and testing data loaders.
During training, various data augmentations were applied to increase the diversity of the data and further promote generalization. These included a random horizontal flip with a probability of 0.5, a random vertical flip with a probability of 0.5, and a random rotation between −180° and 180°. Additionally, augmentations that scale the images (such as resizing) were strictly avoided because they can change the pixel distance between the particles in two frames even when the diffusion coefficient is the same.
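With torchvision, the training-time pipeline described above (random crop plus the three scale-preserving augmentations) could be assembled as follows; the exact implementation used by the authors is not given.

```python
from torchvision import transforms

# Flips and rotations preserve inter-particle pixel distances, so they
# are safe; resizing/scaling transforms are deliberately excluded
# because they would alter apparent displacements and hence the
# implied diffusion coefficient.
train_augment = transforms.Compose([
    transforms.RandomCrop(256),                 # random 256x256 crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=180),     # rotate in (-180, 180)
])
```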
The networks were trained to optimize one of two tasks: classification or regression. For both tasks, the network architectures were almost the same; the only difference is that for classification, the number of neurons in the last fully connected layer equals the number of classes, whereas for regression, the layer has only one neuron.
For classification, the diffusion coefficient range was broken down into ten classes of equal length. The networks were trained to minimize the cross-entropy loss between the predicted and actual labels. This is a relatively difficult setting for the network because this loss does not account for how wrong the predictions are, which makes it difficult for the networks to learn any distance relations between the classes. This difficult setting can be used to observe the spatiotemporal stochasticity in the dataset.
The next setting was regression, where the CNNs predict the numeric value of the diffusion coefficient. Here, the networks were trained to minimize the mean square error (MSE) between the predicted and actual labels. Hence, the training process accounts for how far the prediction is from the actual value, which fits the nature of the current problem.

Hyperparameter optimization
In addition to the problem setup, dataset, and architecture, a neural network's performance is highly dependent on a set of user-defined hyperparameters. There is no fixed set of optimal hyperparameters, and they must be tuned each time any change is made to the setup (such as the number of images used). Manual tuning of hyperparameters is a time-consuming process and is prone to human biases [39]. To automate this process, the hyperparameters were tuned using Microsoft NNI [40]. Figure A1 provides an overview of the training process and the roles of the outer and inner loops.
The role of the outer loop was to maximize the validation metrics: accuracy for classification and the fit coefficient R² for regression. Accuracy is defined as the number of points where the predicted class is the same as the actual class divided by the total number of data points. R² is obtained from Pearson's r, which is formulated in equation (7):

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \tag{7}$$

where r is Pearson's r, n is the total number of data points, i is the index of a given data point, x_i is the true value of a given data point, y_i is the predicted value of a given data point, x̄ is the mean of all the true values, and ȳ is the mean of all the predicted values in the dataset. R² describes how well the true and predicted values are correlated. It ranges from 0 to 1, where a higher value indicates a higher degree of correlation, and hence a better fit of the model for the task.
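Equation (7) translates directly into a few lines of NumPy; the helper below squares Pearson's r to obtain R²:

```python
import numpy as np

def r_squared(x, y):
    """R^2 as the square of Pearson's r (equation (7)).

    x: true diffusion coefficients, y: predicted values.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()          # centered values
    r = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())
    return r**2
```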
To achieve this, four hyperparameters were tuned using the Tree-structured Parzen Estimator (TPE) approach [41]: the batch size, the learning rate, the size of the convolution filter in the first layer, and the number of residual blocks. For each run, the networks were trained from scratch on an NVIDIA A100 GPU. The maximum experiment duration was 12 h, with a maximum of 200 trials and a concurrency of four trials. During training, the ten best-performing models on the validation set were saved, and the performance of the top three models on the test sets is reported in the results.
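A sketch of how such a trial can be wired into NNI is shown below. The search-space ranges are illustrative (the paper does not list them), and `train_and_validate` is a hypothetical stub standing in for the full training loop; `nni.get_next_parameter()` and `nni.report_final_result()` are the standard NNI trial APIs.

```python
import nni

# Illustrative search space for the four tuned hyperparameters; the
# exact ranges used in the paper are assumptions here.
search_space = {
    "batch_size":   {"_type": "choice", "_value": [16, 32, 64, 128]},
    "lr":           {"_type": "loguniform", "_value": [1e-5, 1e-2]},
    "first_kernel": {"_type": "choice", "_value": [3, 5, 7, 9]},
    "n_blocks":     {"_type": "choice", "_value": [4, 7, 10, 13]},
}

def train_and_validate(params):
    """Hypothetical stub: train DPDNet with `params`, return validation R^2."""
    ...

def trial():
    params = nni.get_next_parameter()   # TPE-suggested combination
    val_r2 = train_and_validate(params)
    nni.report_final_result(val_r2)     # metric the outer loop maximizes
```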

Results and discussion
The biggest consideration while testing the trained networks was reducing the effect of spatiotemporal stochasticity in the dataset. The target was to find a low-stochasticity testing strategy that could be used to compare different trained CNNs while keeping the computational requirements relatively low. As classification (with cross-entropy loss) was a harder task, it produced bigger performance differences and hence made it easier to compare testing strategies. Next, in the regression setting, the CNNs were tested under various conditions, namely no flow, uniform flow, and arbitrary flow. The network was also tested for generalization in the regression setting. Next, two sanity checks were performed to ensure that no shortcut learning was involved and to highlight the effect of flow when testing networks trained on pure diffusion datasets. Additionally, the performance of the proposed method was benchmarked against three other methods on the simulated data, as the true values of the diffusion coefficients were known there. Finally, the diffusion coefficients obtained from the CNN models were compared against those from other methods on experimental data, and their MSEs and R² are reported.

Reducing spatiotemporal stochasticity
Even though the networks were trained on random crops of the averaged images to promote data diversity during training, random cropping is not a good strategy for testing, as it introduces unwanted spatial stochasticity. While testing the network trained on eight-frame averaged images, single random crops produced a maximum accuracy difference of more than 7% in the top three CNN models. The run was only made twice, and this number could go higher with more runs. This spatial variation can be eliminated by deterministically cropping the entire image and averaging the predictions from each crop. This not only removes any spatial stochasticity but was also observed to improve the evaluation performance. Figure 5 shows the difference in the accuracy of the top three CNN models evaluated on single crops vs. the entire image.
It can be seen that using the entire image is a more stable way of testing on the averaged image than using single random crops. Next, the temporal stochasticity can be observed when the starting frame of the image averages is changed. The starting frame is the first frame of the sequence used for the averaged images. As the motion of each particle is stochastic and independent of the others, the temporally averaged images will have different shapes each time the underlying frames differ. So, the predictions made by the network depend on the underlying frames. Figure 6 shows the change in accuracy of the top three CNN models with different starting frames for the averaged images. Eight images were averaged to train these CNN models.
As the figure shows, the accuracy of the network varies as the starting frame of the averaged image changes. This indicates that some information lies in the temporal domain, and multiple averaged frames are needed to account for the stochasticity it produces.

Optimizing classification
The best-performing classification models were obtained by empirically selecting good ranges of hyperparameters and by using multiple averaged images for testing. The best classification model had an accuracy of 83.03% when evaluated on five sets of eight-frame averaged images. The starting frames of consecutive sets of averaged images had an offset of eight frames, i.e. the temporal stride was 8 frames. Figure 7 shows the confusion matrix produced on the test set by the best classification model.
It can be seen from the figure that most of the predictions lie on the diagonal, representing the overlap of true and predicted classes. This means that CNNs can be used to represent diffusion coefficient prediction as a classification task. The predictions became more off-diagonal when the class number was higher, i.e. when the diffusion coefficient was higher. This is because particles under diffusion have a random motion: even for high diffusion coefficients, the particle movement can still be small, and a single crop can then indicate a lower diffusion coefficient. Even when using multiple frames, the crops that indicate a lower diffusion coefficient can reduce the average prediction, and hence a lower class is obtained. One way of solving this is to use a greater number of frames for the averaged image to reduce the temporal stochasticity of the images.

Regression and reducing segments
Regression models (trained to optimize the MSE) are a better fit for the current problem. The regression models showed a test R² greater than 0.98 when trained with eight-frame averaged images. Next, the networks were trained with a reduced number of frames (four and two) used for the averaged image. Figure 8 shows the impact of reducing the number of frames used for the averaged image on the performance of the network (quantified using R² on a single averaged image).
From the figure, the performance of the top three regression models is very similar. Also, even though the performance degrades as the number of frames used for the averaged image is reduced, the degradation is not drastic. Next, similar to the classification models, the performance of the regression models can be improved by using multiple sets of n-frame averages. Figure 9 compares the R² of the top three regression models evaluated using a single two-frame averaged image vs. five sets of two-frame averaged images.
From the figure, the testing performance can be improved by using more sets of averaged images. This is because increasing the number of sets reduces the performance degradation caused by the temporal stochasticity present in a single averaged image. Next, the best-performing regression models using two-frame averaged images were obtained by empirically selecting good ranges of hyperparameters and by using multiple sets of two-frame averaged images for testing. The best regression model had an R² of 0.989 when evaluated on five sets of two-frame averaged images. Even though the testing performance could be increased further by using more sets of averaged images, a higher number of sets was not evaluated in this study to limit the computational overhead.

Performance on uniform flow dataset
Next, the CNNs were trained only on the uniform flow dataset for the regression task. Even though there was a flow, the target of the CNN was to predict the diffusion coefficient. The best regression model using two-frame averaged images had an R² of 0.991 when evaluated on five sets of averaged images. Figure 10 shows the performance of the best regression models obtained when the CNNs were trained individually on the no flow and uniform flow datasets. From the figure, when the networks are trained on an individual flow, they can still learn to identify the diffusion coefficient, as it is a stochastic motion of the particles, in contrast to the flow, which causes a deterministic particle motion.

Effect of the number of training batches
The performance of a deep learning algorithm can also change with the amount (and diversity) of data used for training the network, so for new use cases this remains a point to explore. The current datasets were created in batches, and three simulation batches were used in the previous experiments to train the networks. In this experiment, the number of simulation batches was reduced. Figure 11 shows the performance of the best regression models obtained when the CNNs were trained on one, two, and three simulation batches.
From the figure, the performance of the network improves when multiple simulation batches are used for training. This is because the initial particle locations were randomly assigned and the particle locations in consecutive frames depend on the initial locations, diffusion, and any flow present; some particle location configurations that are not covered in a single simulation batch can be covered in additional batches. Additionally, even though the performance improved with a greater number of training batches, the improvement was not very significant. So, even with a single batch, competitive performance was achieved when image crops were used.

Networks trained jointly with arbitrary flows
Even though the networks showed good performance when trained on datasets with an individual flow, a drop in performance was observed when the networks were trained jointly on multiple flows using two-frame averaged images. As the current networks were trained on averaged images, the only information they had was the particle locations in the given frames. The locations in the first frame were randomly assigned; in the consecutive frames, the change in location is determined by the superposition of the displacements caused by diffusion and by the flow. Even though it is possible to distinguish between a position change caused by diffusion and one caused by a deterministic flow with three or more frames, this is a difficult task for two-frame averaged images. Hence, in the arbitrary flow setup, the networks were trained and tested with four-frame averaged images, which increases the amount of temporal information present in the input image. This requires no change in the structure of the CNNs; it only requires retraining the networks in the four-frame averaged image setting. Figure 12 shows the performance of a CNN trained on four-frame averaged images with arbitrary flows. From the figure, the CNNs maintain a high performance (R² of 0.954).

Testing generalization
The next experiment was done to test the generalization ability of the networks. Here, the networks trained on the no flow dataset were tested on the gradient with no flow dataset. As the gradient with no flow dataset has a 1D gradient of the diffusion coefficient, any crop of the images will contain a range of diffusion coefficient values. In this experiment, a stride of 128 px was used on the axis with the diffusion coefficient gradient, which led to seven crop locations with different central diffusion coefficient values. On the axis with no gradient, a stride equal to the crop size (256 px) was used, which led to four crop locations with the same diffusion coefficient values per averaged image. To tackle the spatiotemporal stochasticity, the predictions from these four crop locations in five sets of two-frame averaged images were averaged to give the final prediction. In this experiment, three simulation batches were individually used for testing. Figure 13 shows the mean and standard deviation of the predictions when training was done on the no flow dataset and testing was done on three batches of the gradient with no flow dataset.
The figure shows that, even though the network was not trained on the gradient datasets, the predicted values follow the actual values very closely (R² of 0.961). As in the previous results, the predicted values were slightly lower than the actual values when the diffusion coefficient was high.

Sanity checks
In addition to the previous experimentation, two sanity checks were made. The first check was to ensure that the information relevant to the task lies in the spatiotemporal domain and that the network was not learning any shortcuts [30] from single frames. To test this, the networks were trained on crops of single frames from the no flow dataset. The best-performing regression model had an R² of 0.002, and it always predicted approximately 1.65 µm² s⁻¹. This is the middle value of the diffusion coefficient range, and it provides the minimum mean square loss. Hence, to minimize the loss, the network learned to predict the middle value of the dataset and thus did not learn anything useful from the single images. Figure A2 shows the predicted vs. actual values when the network was trained on single frames.
The second sanity check tested how the network performs when trained on a dataset with no flow but tested on one with flow. In the datasets with flows, there was an inherent displacement of the particles even when the diffusion coefficient was low. As the network input was only the averaged images, a network trained on no flow is expected to predict higher diffusion coefficient values. Figure A3 shows the network predictions when trained on the no flow dataset and tested on datasets with and without flow; here, five sets of two-frame averaged images were used. As the uniform flow dataset had the largest velocity everywhere, the predictions for images from this dataset were the largest. This flow was so large that the predictions were out of the range of diffusion coefficients on which the CNN was trained. For Couette flow and Poiseuille flow, there was a flow gradient ranging from no flow to the maximum flow (2 m s⁻¹), so the impact of the flow was not as drastic as in the uniform flow case, and hence the network predictions were relatively closer to the actual values.

Benchmarking against other methods
The proposed deep learning-based method was benchmarked against three conventional methods. The first method was single-particle-tracking-based and was implemented using TrackPy [15]. The next method uses correlations to find the displacements and the ensemble MSD to find the diffusion coefficients; we refer to this as correlation-EMSD in this work. The final method is based on the widths of the auto-correlation and cross-correlation [13]; we refer to this as Olsen-Adrian in this work. All four algorithms were benchmarked on both the no flow and arbitrary flow datasets using 10 and 20 frames, respectively. Figure 14 shows the (a) R² and (b) MSE for the DPD, TrackPy, correlation-EMSD, and Olsen-Adrian methods.
From the figure, the R² values for all the methods are quite comparable when there is no flow. However, the performance of TrackPy and correlation-EMSD decreases in the presence of flow. The MSEs were plotted on a logarithmic y-axis; comparing them, it can be seen that DPD has lower MSEs, especially when there is no prior knowledge of the flow.
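For reference, the TrackPy baseline can be reproduced roughly as below, assuming a 2D MSD that grows as 4Dt; the feature diameter, search range, and pixel calibration are assumptions matched to the simulated data here, not values reported by the authors.

```python
import numpy as np
import trackpy as tp

def trackpy_diffusion(frames, mpp=1 / 15, fps=15, diameter=7):
    """Estimate D via single-particle tracking (the TrackPy baseline).

    frames: sequence of 2D grayscale images. mpp is microns per pixel
    (1/15 matches the 6 px / 400 nm simulation scaling) and diameter
    is the odd feature size in px; both are assumptions.
    """
    feats = tp.batch(frames, diameter)      # locate particles per frame
    traj = tp.link(feats, search_range=5)   # link into trajectories
    em = tp.emsd(traj, mpp=mpp, fps=fps)    # ensemble MSD vs lag time
    # in 2D, MSD = 4*D*t, so D is a quarter of the fitted slope
    slope = np.polyfit(em.index.values, em.values, 1)[0]
    return slope / 4.0
```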

Evaluation on experimental data
The final evaluation was done on experimental data. As the CNNs were trained on Gaussian particles, they should be tested on similar shapes to avoid any unwanted dataset shift [42]. To achieve this transformation, TrackPy was used on the experimental images to extract the particle locations. Next, Gaussian particles were rendered at these locations before feeding the images to the CNN for testing.
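A minimal sketch of this re-rendering step, assuming detected (x, y) locations in pixels and an illustrative Gaussian width tied to the 6 px particle diameter:

```python
import numpy as np

def render_gaussians(locations, size=1024, diameter=6, amplitude=255):
    """Re-render detected particle locations as 2D Gaussian spots so
    the experimental images match the simulated training distribution.

    locations: iterable of (x, y) pixel coordinates from TrackPy.
    The sigma choice (diameter / 4) is an assumption, and the full-image
    loop is unoptimized; it is only meant to illustrate the idea.
    """
    img = np.zeros((size, size), dtype=np.float32)
    sigma = diameter / 4.0
    yy, xx = np.mgrid[0:size, 0:size]
    for x, y in locations:
        img += amplitude * np.exp(-((xx - x)**2 + (yy - y)**2)
                                  / (2.0 * sigma**2))
    return np.clip(img, 0, 255).astype(np.uint8)
```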
As the real data do not have true values like the simulated data, the outputs of the CNNs were compared against the outputs of the TrackPy and correlation-EMSD methods. These conventional methods are sensitive to flow; hence, the CNN used here was trained only on the no flow dataset. The comparisons were done as the MSEs and R² between the predictions from the conventional methods and the predictions from the DPD models. The tests were done at three different initiator concentrations. Table 1 shows the R² and MSEs between the conventional methods' and DPD's predictions.
From the table, the DPD predictions have high R² values (approx. 0.97) with respect to both TrackPy and correlation-EMSD. Additionally, the MSEs between DPD and correlation-EMSD (approx. 0.05) were lower than the MSEs between DPD and TrackPy (approx. 0.1), both of which are quite low.

Conclusion and future work
Predicting diffusion coefficients from particle images in experimental scenarios is challenging and causes high errors with existing methods. These scenarios include thermal gradients and the presence of flow in addition to diffusion. The current work provided a deep learning-based workflow for predicting diffusion coefficients in these scenarios. In this work, convolutional neural networks were trained on simulated datasets containing different types of flow in addition to diffusion. These networks were tested on both simulated and experimental datasets.
The proposed networks were trained with different amounts of temporal information obtained by averaging multiple frames (n > 1). The networks showed an R² value of around 0.99 in the presence of no flow or a known uniform flow using two-frame averaged images. A slightly lower R² (0.95) was obtained when the underlying flow was unknown, using four-frame averaged images. Additionally, the networks were tested on datasets with a diffusion coefficient gradient after being trained on datasets without any gradient; the networks maintained a high R² of 0.96. Next, the network was benchmarked against conventional methods on simulated data. It was found that the proposed method maintains higher R² values and lower MSEs in the presence of flow compared to the conventional methods. Finally, the proposed and existing methods were applied to experimental data without any flow. The proposed method performed very similarly to the existing methods (R² of 0.97 and MSEs smaller than 0.1) without ever using real data for training. The results show that the networks learn meaningful representations for predicting diffusion coefficients from particle images in various experimental scenarios.
Even though the current CNNs performed better than existing methods on data simulating real-world conditions, these algorithms still run on a compute cluster and are not deployed in the lab; hence, the current models are limited to offline use. Existing techniques [2] are often implemented on smartphones, as smartphones contain the optical systems to record videos and enough computational power to run those algorithms. As the proposed deep learning method currently has significantly higher computational requirements than the existing methods, these models would be pruned [43] to reach lower computational requirements. Next, the current CNNs were trained on Gaussian particles, which is a common assumption when modeling particles. However, as particles in a microscopy setting are often defocused, they produce more complicated shapes [44]. The next generation of DPD models would be trained on defocused particle images to overcome the Gaussian shape assumption. Another major assumption in this work was the physics of data generation: the current Brownian motion is a random walk and does not account for particle-particle and particle-fluid interactions. One future task would be to create physics-accurate datasets to train the CNNs and improve their performance.

Figure 1 .
Figure 1. Change in viscosity and diffusion coefficient for a single frame in the presence of a temperature gradient and no flow.

Figure 3 .
Figure 3. The ResNet [35]-inspired architecture: (a) overview, (b) residual block. The overview shows the first convolution layer (conv), n repeating residual blocks each with two convolution layers, and the final fully connected layer (fc).

Figure 4 .
Figure 4. An overview of the training and testing data loaders.

Figure 5 .
Figure 5. The difference in the accuracy of the top three CNN models evaluated on single crops vs. the entire image.

Figure 6 .
Figure 6. The change in accuracy of the top three CNN models with different starting frames for the averaged images.

Figure 7 .
Figure 7. The confusion matrix produced on the test set by the best classification model.

Figure 8 .
Figure 8. The impact of reducing the number of frames used for the averaged image on the performance of the network (quantified using R² on a single averaged image).

Figure 9 .
Figure 9. The difference in R² of the top three regression models evaluated using a single two-frame averaged image vs. five sets of two-frame averaged images.

Figure 10 .
Figure 10. Performance of the best regression models obtained when the convolutional neural networks were trained individually on the no flow and uniform flow datasets.

Figure 11 .
Figure 11. The performance of the best regression models obtained when training was done on one, two, and three simulation batches.

Figure 12 .
Figure 12. Performance of the convolutional neural network trained on four-frame averaged images with arbitrary flows.

Figure 13 .
Figure 13. Mean and standard deviation of the predictions when training was done on the no flow dataset and testing was done on three batches of the gradient with no flow dataset. The three diagonal lines show the minimum, middle, and maximum values of the diffusion coefficients possible for the image crops. The vertical dotted lines show the actual diffusion coefficient boundaries produced by the temperature ranges in the images.

Figure A1 .
Figure A1. An overview of the training process. The role of the outer loop is to provide good hyperparameter combinations, which are used by the inner loop to produce models with maximum accuracy/R².

Figure A2 .
Figure A2. Performance of the convolutional neural networks when trained on crops from single frames.

Figure A3 .
Figure A3. Network predictions when trained on the no flow dataset and tested on datasets with and without flow.

Table 1 .
R² and mean square error (MSE) between the conventional methods' predictions and the DPD predictions.