Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning

Pulse timing is an important topic in nuclear instrumentation, with far-reaching applications from high energy physics to radiation imaging. While high-speed analog-to-digital converters are becoming increasingly capable and accessible, their potential uses and merits in nuclear detector signal processing are still uncertain, partially because the associated timing algorithms are not fully understood and utilized. In this paper, we propose a novel deep learning method for timing analysis of modularized detectors without explicit needs of labelling event data. By taking advantage of the intrinsic time correlations, a label-free loss function with a specially designed regularizer is formed to supervise the training of neural networks (NNs) towards a meaningful and accurate mapping function. We mathematically demonstrate the existence of the optimal function desired by the method, and give a systematic algorithm for training and calibration of the model. The proposed method is validated on two experimental datasets based on silicon photomultipliers as the main transducers. In the toy experiment, the NN model achieves a single-channel time resolution of 8.8 ps and exhibits robustness against concept drift in the dataset. In the electromagnetic calorimeter experiment, several NN models (fully-connected, convolutional neural network and long short-term memory) are tested to show their conformance to the underlying physical constraint and to judge their performance against traditional methods. In total, the proposed method works well in both ideal and noisy experimental conditions, and recovers the time information from waveform samples successfully and precisely.


Introduction
Pulse timing is an important research topic in nuclear detector signal processing, with applications ranging from high energy physics to radiation imaging. Accurate time measurements are meaningful for precisely determining the vertex of interactions as well as the dynamics of incident particles. In the past decade, high-speed analog-to-digital converters (ADC) were designed for front-end electronics of nuclear detectors [1] and incorporated into their electronic dataflow. Traditionally, fixed algorithms such as leading edge discrimination or constant fraction discrimination [2] can be used to extract time information from a time series of pulse samples. However, when they work in noisy or changing conditions, the performance of these fixed algorithms drops significantly [3].
On the other hand, machine learning techniques, especially neural networks (NN) and deep learning, open another door for possible solutions to time extraction from waveform samples. Recent literature has demonstrated that NNs can approach the Cramér-Rao lower bound in a broad range of working conditions [3]. It is expected that NNs, the key components of intelligent front-end processing, will be widely used in future nuclear detector systems [4,5,6,7], empowered by ever-growing developments of hardware acceleration for NN inference [8,9,10].
However, one pre-requisite for NNs to achieve superior performance is a justified reference (ground-truth label) in training. While labelled data are easily available in computer vision and many other machine learning tasks, they are not so for nuclear detector experiments. To provide accurate time references for real-world nuclear detectors, additional timing instruments or specific experimental schemes are needed. For example, in Ref. [11], the timing resolution of a high purity germanium detector was optimized based on the timing reference of a BaF2 scintillation detector. In Ref. [12], the timing resolution and depth-of-interaction of monolithic scintillators were studied with a collimator crystal placed on the opposite side of the radioactive source. In Refs. [13,14,15], a back-to-back experimental setup was established to capture coincidence events, while NNs were used to predict the timing difference of paired detectors. In Ref. [16], NNs generated continuous value predictions based on the timing reference of leading edge discrimination. Although impressive results have been obtained in these studies, the significance of machine learning is limited either by the procedure (external calibration equipment) or by the predicted target (preset timing differences).
Therefore, it is worthwhile to exploit the built-in structure of nuclear detectors for potential timing correlations and make use of them in the training process. The idea of training NNs without explicit labels was originally invented to locate or detect particular objects in images in the domain of computer vision [17]. Research in multiple disciplines has combined this idea (i) to formulate loss functions that include a physics-constrained term [18,19,20,21,22,23], (ii) to eliminate the canonical loss term and rely entirely on physical relations [24,25], or (iii) to solve partial differential equations [26,27,28].
In this work, we focus on the aspect of timing analysis and propose a novel method building on the intrinsic modularization of detectors. We focus on silicon photomultipliers (SiPM) to generate electronic signals from detectable photons. The motivation of the work is to take full advantage of the universal approximation abilities of NNs while offering a practical way to avoid using labelled data and to ease optimization of the model. Compared to previous works which also used detector signals to estimate time-of-flight (such as [13,15]), the major innovations and contributions of the paper include:
• A practical methodology to use NNs for pulse timing within the conventional nuclear detector dataflow without explicit need of labelled data: a loss function from physical constraints in combination with a regularizer automatically guides the NN model to find the optimal solution.
• An algorithmic framework to generate the desired single-channel time estimates based on an arbitrary time origin by post-training calibration.
• Experiments with SiPM-based modularized detectors: validation on two experimental datasets demonstrates the feasibility and accuracy of the method when applied to detector signals.

System architecture
The diagram of the proposed system architecture is shown in figure 1. It is based on a paradigm of modern nuclear detectors. The modularized detector array is composed of many detection cells with similar structures and readout channels. In certain conditions, an incident particle will penetrate several detection cells and generate response signals in multiple channels. By monitoring the threshold-crossing amplitudes of signals, it is possible to select fired (hit with a considerable amplitude) cells and record their locations. The readout electronics keep track of electrical signals from fired cells and send them to the following stages. The next step is to digitize the analog signals from the readout electronics and to produce series of waveform samples at discrete, equispaced timestamps. As in the conventional treatment, the digitized waveform samples are ready for feature (time, energy, etc.) extraction by digital logic. To ensure the proper optimization of NNs, we add delay blocks to randomly adjust the time origin of each series of waveform samples and record the adjustment for later use in the loss function (see section 2.2).
NNs with shared weights are applied to the outputs of the delay blocks, and each of them is designed to generate a time-of-arrival. Weight sharing is the key to making the NNs work in a consistent manner, so that a single NN can be used for time prediction with an absolute measure. Here, we intend to incorporate NNs into the system either through offline data analysis or through online processing [29,30]. Application specific integrated circuits (ASIC) or field programmable gate arrays (FPGA) are candidates for inference devices. Finally, a loss function based on physical constraints, locations of fired cells and delay adjustments is formulated upon the outputs of the NNs, and the residuals are back-propagated to each shared weight with gradient descent optimization.

Mathematical perspectives
We start by considering the physical constraints inherent in modularized detectors. The most common kind of constraint is the linear constraint. It can be explained as follows: if we denote the moment when the incident particle hits the first cell as t_0, the moments when it hits subsequent cells can be denoted as (t_0 + a_1 t_c, t_0 + a_2 t_c, ..., t_0 + a_{N−1} t_c), where N is the number of fired cells, t_0 and t_c are unknown variables, and (a_1, a_2, ..., a_{N−1}) are known constants. For common geometric structures, linear constraints can be derived very naturally. For example, scintillator bars compactly arranged in an array can be characterized by a linearly constrained timing correlation. In another example, readout signals of different channels in a time projection chamber can also be linearly correlated if the incident particle is not significantly affected by the electric field. By equating linearly-correlated moments with model predictions, we get:

Y = A [t_0 t_c]^T,  (1)

where Y = [y_0, y_1, ..., y_{N−1}]^T collects the model predictions and A is the N × 2 matrix whose i-th row is (1, a_i), with a_0 ≡ 0.

When N ≥ 3, a_0 is fixed at 0, and a_1, a_2, ..., a_{N−1} are not all zeros, the system of linear equations in equation (1) is overdetermined and generally has no exact solution. Minimizing the sum of squared residuals ||Y − A [t_0 t_c]^T||_2^2 yields [t̂_0 t̂_c]^T = (A^T A)^{−1} A^T Y, and the solution can be plugged into the residuals:

I(Y) = ||Y − A (A^T A)^{−1} A^T Y||_2^2.  (2)

For convenience, we define M ≡ I − A(A^T A)^{−1} A^T. It can be easily verified that M is a symmetric matrix and M^2 = M. Therefore, equation (2) can be rewritten as:

I(Y) = Y^T M Y = ||M Y||_2^2.  (3)

Next, we will consider how to represent each model prediction y_i, the output of an individual NN. y_i depends on waveform samples, which originate from pulse signals generated by the nuclear detector and readout electronics. The time origins and shapes of pulse signals are determined by intrinsic responses of the detector and also by extrinsic events. Some assumptions can be made to simplify the analysis:

(i) The intrinsic responses of the detector are time-invariant. In other words, the distributions of pulse signals will not change because of time origins.
(ii) The intrinsic responses of different cells have good uniformity so that we can use the same pulse shaping function with similar parameters.
(iii) The uncertainties of time origins are homogeneous so that we can use a single normal distribution to characterize uncertainties of different events.
Assumption (i) disregards the long-term drift of nuclear detectors due to temperature, ageing, etc., which is justifiable after each calibration. Assumptions (ii) and (iii) are related to the weight sharing and loss formulation of NNs. For the theoretical analysis below, they can be weakened, but this would also introduce more complex notations and proofs. As a reasonable simplification, we do not consider those variations at the current stage.
Under these assumptions, we write y_i as:

y_i = f_NN({s_k(k t_s + ∆t_i | θ, n)}_{k=0}^{K−1}) − ∆t_i = f(t_0 + a_i t_c + ∆T_i + ∆t_i) − ∆t_i,  (4)

where f_NN(·) is the mapping function of NNs, s_k(·|θ, n) is the k-th waveform sampling point with parameters θ and noise n, t_s is the sampling period, and ∆t_i is an additional term for regularization. The second line of equation (4) is a simplified form of the mapping function under the above assumptions, where ∆T_i is a random variable representing the irreducible spread of time measurements coming from variations of pulse parameters (θ), noise (n) and the sampling process (s). Finally, we give a proposition to demonstrate the existence of the optimal solution in common settings:

Proposition 1 (Sufficient condition for a minimizer). Assume t_0, t_c are random variables, ∆t_0, ..., ∆t_{N−1} ∼ N(0, σ_1²) and ∆T_0, ..., ∆T_{N−1} ∼ N(0, σ_2²), both of which are i.i.d. random variables, and N ≥ 3. I(Y) is from equation (3) and y_i in Y is from equation (4), where a_0 is fixed at 0, and a_1, a_2, ..., a_{N−1} are not all zeros. Then, for f : R → R satisfying suitable regularity conditions, the functional E[I(Y)] attains a minimum at a linear function f(t) = kt + b. The proof is in Appendix A.

With this proposition, we are guaranteed to find a mapping function which is linearly correlated with the underlying physical constraint. Some remarks are made as follows:

(i) The regularization terms (∆t_i for i = 0, 1, ..., N−1) are essential for the mapping function. It can be seen that, if σ_1 → 0, then k → 0. This means that we would always find a constant function, which is the "safest" solution under intrinsic uncertainties. To prevent our model from converging to this trivial solution, we must add the regularizers in the formulation of the loss function.
(ii) The linear coefficient k is not unity unless there are no intrinsic uncertainties (σ_2 → 0). In real-world cases where intrinsic uncertainties are present, we must perform post-training calibration to achieve identical measures.
(iii) The offset b can be an arbitrary constant once the model is well-trained. To make it consistent with time origins in experiments, post-training calibration is needed to make the model free from bias.
Finally, it should be noted that the proposition is only a sufficient condition, which may not always be necessary. However, according to Occam's razor, a general principle in machine learning, a "simple" model (like the linear function) of the underlying physical process is usually preferred if the model capacity is appropriate to avoid over-fitting. We will supplement this conclusion with more observations in the section on experimental results.
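As a numerical illustration (not part of the paper's implementation), the projector M and the residual I(Y) = Y^T M Y introduced above can be sketched in a few lines of NumPy. The function name `physics_residual` is ours, not the paper's:

```python
import numpy as np

def physics_residual(y, a):
    """Least-squares residual I(Y) = Y^T M Y for a linearly constrained event.

    y : model predictions for the N fired cells.
    a : known geometric constants (a_0 = 0, a_1, ..., a_{N-1}).
    """
    a = np.asarray(a, dtype=float)
    A = np.column_stack([np.ones_like(a), a])              # N x 2 design matrix for (t_0, t_c)
    M = np.eye(len(a)) - A @ np.linalg.inv(A.T @ A) @ A.T  # M is symmetric and idempotent
    return float(y @ M @ y)                                # I(Y) = ||M Y||^2

# Predictions that exactly satisfy y_i = t_0 + a_i * t_c give zero residual.
a = [0.0, 1.0, 2.0, 3.0]
y_exact = 5.0 + 0.4 * np.asarray(a)
print(physics_residual(y_exact, a))   # ~0
```

Any deviation of the predictions from the straight line t_0 + a_i t_c produces a strictly positive residual, which is what the training minimizes.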

Algorithmic procedure
Based on the above theory and architecture, we propose the label-free model training and calibration procedure shown in Algorithm 1.
For training, the weights of the NN are initialized first. Then, in the main loop, the weights are optimized with the loss function constrained by physical information. We use ACQUIRE GEOMETRY() to represent the process of recording the number of fired cells and their locations, and ACQUIRE SIGNAL() to represent the process of reading out waveform samples from each fired cell to form examples‡. Regularization by random shift is applied to the examples. Instead of sampling the random shift from a continuous Gaussian distribution, we actually draw samples from a discrete Bernoulli distribution with equal probabilities of taking +1 and −1§. The shifted examples generated by the SHIFT WAVEFORM() process are propagated through the NN, and the amount of random shift is subtracted from the output. Finally, stochastic gradient descent is applied to the physics-constrained loss function to update the weights of the NN. This procedure is repeated until we reach the maximum number of training steps.
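The loss evaluation inside the training loop can be sketched as follows. This is a minimal stand-in, not the paper's Keras code: `model` is any callable acting as the shared-weight NN, and the circular `np.roll` shift is a simplification of SHIFT WAVEFORM():

```python
import numpy as np

rng = np.random.default_rng(0)

def label_free_loss(model, waveforms, a, t_s=1.0):
    """One evaluation of the physics-constrained loss for a single event.

    model     : callable mapping a 1-D waveform to a scalar time prediction.
    waveforms : (N, K) array, one row per fired cell.
    a         : known geometric constants of the fired cells (a_0 = 0).
    """
    N = len(waveforms)
    A = np.column_stack([np.ones(N), np.asarray(a, dtype=float)])
    M = np.eye(N) - A @ np.linalg.inv(A.T @ A) @ A.T   # projector from equation (3)
    # Regularization by random shift: discrete Bernoulli, +1 or -1 sample
    shifts = rng.choice([-1, 1], size=N)
    y = np.empty(N)
    for i in range(N):
        shifted = np.roll(waveforms[i], shifts[i])     # crude circular shift for brevity
        y[i] = model(shifted) - shifts[i] * t_s        # subtract the known shift from the output
    return float(y @ M @ y)                            # I(Y) = Y^T M Y
```

A shift-equivariant estimator (for instance, the peak index) compensates the random shifts exactly, so the loss reduces to the purely geometric residual; in the actual algorithm, gradient descent on the NN weights drives this quantity down.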
For calibration, the first loop gathers enough examples to form a calibration dataset regardless of their geometric information. In the next nested loop, each example from the calibration dataset is shifted multiple times with the minimum step. These shifted examples are then propagated through the NN with well-trained weights to get multiple outputs. The process LINEAR FITTING() is applied to the outputs and generates a slope and an intercept for each example in the calibration dataset. Finally, the process COMPUTE MEAN() is applied to all the slopes and intercepts to determine their optimal values. The NN model with well-trained weights is shifted and scaled to generate the normalized model.
With the normalized model, the output can then be re-calibrated based on the actual sampling period of the ADC at the front end, and on a required time origin synchronous to the experimental detector system.
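The calibration phase can be sketched as below. This is an illustrative reduction of Algorithm 1's second phase under our own naming (`calibrate`), again using a circular shift in place of the real windowed shift:

```python
import numpy as np

def calibrate(model, examples, shifts=range(-3, 4), t_s=1.0):
    """Post-training calibration (sketch): estimate the average slope and
    intercept of the raw model output versus the input shift, then normalize
    so that one sample of shift maps to exactly one sampling period t_s."""
    slopes, intercepts = [], []
    for w in examples:
        xs = np.array(list(shifts), dtype=float)
        ys = np.array([model(np.roll(w, int(s))) for s in xs])
        k, b = np.polyfit(xs, ys, 1)                 # LINEAR FITTING
        slopes.append(k)
        intercepts.append(b)
    k_mean = np.mean(slopes)                         # COMPUTE MEAN
    b_mean = np.mean(intercepts)
    # Normalized model: shift and scale the raw output into physical time units
    return lambda w: (model(w) - b_mean) / k_mean * t_s
```

After normalization, shifting an input by one sample moves the prediction by exactly one sampling period, which is the property the calibration is meant to enforce.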

A toy experiment
In this section, we consider a degenerate case: the predicted times from two cells should be equal. In this case, N equals 2, and the matrix A^T A will be singular and no longer invertible. However, we can still find a symmetric matrix M:

M = (1/2) [[1, −1], [−1, 1]],  (5)

‡ Throughout the paper, we use the term "sample" to refer to a scalar value, and the term "example" to refer to a vector of the ADC-sampled waveform composed of samples, or a time-related group of vectors, depending on the context.
§ It can be verified that Proposition 1 still holds for this distribution.
which satisfies M² = M and fully agrees with the constraint. In this case, Proposition 1 still holds, with the only difference being the value of N.
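A quick numerical check of this degenerate case (a sketch; the explicit matrix below is the standard symmetric idempotent projector consistent with the equal-time constraint):

```python
import numpy as np

# Two-cell case: projector onto the complement of span{(1, 1)}
M = 0.5 * np.array([[1.0, -1.0],
                    [-1.0, 1.0]])

assert np.allclose(M @ M, M)   # idempotent: M^2 = M
assert np.allclose(M, M.T)     # symmetric

# The residual collapses to half the squared timing difference:
y = np.array([3.2, 3.0])
print(float(y @ M @ y))        # (y0 - y1)^2 / 2 = 0.02
```

Any pair of equal predictions lies in the null space of M, so the loss vanishes exactly when the two channels agree.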

Experimental setup
The setup for the toy experiment is shown in figure 2. A rapid laser pulse generator (GZTech YFL-PP-1.5-GR) produces green light with 532-nm wavelength periodically at 100 kHz. An optical attenuator (Thorlabs NE40A) reduces the intensity of the laser light (effectively the number of photons) to a reasonable level. After that, a beam expander (YZL D60-M30) spreads the arriving photons uniformly onto two SiPM (Hamamatsu S13360-6025PE) devices soldered onto a printed circuit board. The SiPMs transduce the light signals into electronic pulses, which are conditioned by pre-amplification circuits and read out by a 10-GSPS digital oscilloscope (LeCroy WaveRunner 8254). The experiment is conducted in a black box to avoid ambient light.
Though not perfectly aligned, the arrival times of a few photons at the two SiPMs can be considered the same, and the intrinsic time resolution is largely determined by the electronics. Our goal is to produce a model capable of predicting the arrival time from waveform samples digitized by the oscilloscope, and also to find out the intrinsic time resolution of the analog channels.
We implement the NN model in Keras [31] with the TensorFlow [32] back end. The base model (f_NN(·) in equation (4)) is a classic convolutional neural network (CNN) comprising several one-dimensional convolution layers and several fully-connected layers. The details of the architecture are described in Appendix B.1. The network capacity is relatively small compared to those used in computer vision. We use CNN as a representative of NN models because it reaches near-optimal performance in the toy experiment, while more variants of NNs are discussed in section 3.2.
The input to the base model is a series of 2048 waveform samples with an interval of 0.1 ns, from either the first channel or the second channel. To ensure the proper functioning of the physical loss function, an appropriate amount of randomness is vital for NN models to converge to a meaningful mapping and generalize to unseen examples. The original data already exhibit intrinsic randomness to some extent (see figure C1). To enhance the variations, an additional shift of the sampling window is applied to each example (see Appendix C for more details).
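The sampling-window shift can be pictured as cutting a fixed-length window from a longer record at a random start position. This is a minimal sketch under our own naming (`random_window`); the window and record lengths are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_window(trace, window=2048):
    """Augmentation sketch: cut a fixed-length window from a longer record
    at a random start position, varying the time origin of each example."""
    start = rng.integers(0, len(trace) - window + 1)
    return trace[start:start + window]
```

Each call yields the same pulse seen from a slightly different time origin, which supplies the variation the loss function needs.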
The output of the calibrated base model is a continuous value of the predicted arrival time with regard to the time origin of the sampling session∥. Two base models with weight sharing are built for the two analog channels. For regularization, a random shift of +1 or −1 interval is applied separately to each channel input and works as ∆t_i in equation (4). We use the physical loss function in equation (3) to train the NN model with the Adam [33] optimization algorithm. More details about the configurations are described in Appendix C.1.

Main results
We follow Algorithm 1 to train the NN model for optimal weights, and to calibrate the model so that the scale of network outputs is adjusted to match the actual ADC interval. Figure 3a shows the correlation between the original (uncalibrated) outputs of the NN and the number of shift intervals. While the slopes are negative as a result of the shift method, it is straightforward to transform them into positive numbers for Algorithm 1 to be applied. By examining this figure, we find that the linear dependency of predicted values on shift intervals is almost exact, and the plotted 100 examples are very consistent in their slopes. Besides, the linear range of predicted values is reasonably wide, covering all examples in the dataset.
In figure 3b, we select the rising edge of an example in the test dataset on its original scale with physical meanings. Since the timing differences in most examples are very small, for visual convenience, we show an extreme case with the maximum timing difference (the rightmost point in figure 3c). It can be seen that the waveform is clean with a low noise level. The predicted values give a correct timing relation (+0.086 ns) for this example.
The overall performance on the test dataset is presented as a histogram of calibrated timing differences between the two channels, shown in figure 3c. Apart from several outliers, nearly all values lie near the origin and conform to a Gaussian distribution. It should be noted that the standard deviation of the Gaussian distribution is 12.5 ps (equivalent to 8.8 ps for a single channel), a remarkably low resolution level, demonstrating the good performance of both the experimental system and the NN model.

∥ In a large detector system, the time prediction will first be aligned to the ADC clock domain, and then synchronized to a sophisticated clock distribution system.
A comparison is conducted between the NN model and digital constant fraction discrimination (CFD) [2] (a traditional timing method in nuclear electronics, see Appendix C.1 for details). To simulate the analog channel, we preprocess the waveform with a second-order low-pass filter. Figure 4 shows the comparison under different critical frequencies of the analog channel. It can be seen that CNN performs consistently better than digital CFD. The best performance of CFD lies in the 1–2 GHz critical frequency range, because noise is more intense when the analog channel has higher bandwidth, which degrades the performance of CFD. However, CNN can still reach good performance even when no low-pass filtering is present, because the network architecture works as a non-linear filter and exploits the information in the input waveform samples as much as possible.
It should be noted that the optimal critical frequency will change if we decrease the sampling rate of the digitizer. At a lower sampling rate, in order to gather enough sampling points on the fast-rising edge (which is important for pulse timing), a low-pass filter with a narrower passband and a smaller critical frequency is preferred. This will rule out more high-frequency noise as well as rapidly-changing details of the electronic pulse. When the noise is not a dominant limiting factor, we expect the timing resolution to remain in the same order of magnitude [3].
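For reference, one common digital CFD formulation finds where the rising edge crosses a fixed fraction of the pulse peak, with linear interpolation between samples. This sketch is illustrative only; the fraction and sampling period here are assumptions, not the settings of Appendix C.1:

```python
import numpy as np

def digital_cfd(w, fraction=0.5, t_s=0.1):
    """Digital constant fraction timing (fraction-of-peak variant):
    locate the rising-edge crossing of `fraction` * peak amplitude,
    linearly interpolated between adjacent samples. Returns time in ns."""
    peak = int(np.argmax(w))
    thr = fraction * w[peak]
    # last sample before the peak that is still below the threshold
    below = np.nonzero(w[:peak] < thr)[0]
    if len(below) == 0:
        return 0.0
    i = below[-1]
    # linear interpolation between samples i and i+1
    frac = (thr - w[i]) / (w[i + 1] - w[i])
    return (i + frac) * t_s
```

Because the estimate uses only two samples around the crossing, it inherits the full noise of those samples, which is why its performance depends so strongly on the analog bandwidth.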

Robustness against concept drift
During long periods of data capture with the oscilloscope, the real-time correspondence of the two analog channels may change¶. In this condition, the premise of the experiment has changed; in machine learning this is called a concept drift: the arrival time of the two analog channels can no longer be regarded as the same for all examples in the dataset. By analysing the loss function, the concept drift introduces a bias term when predictions are assumed to be accurate. Though a theoretical proof of the property under this condition is non-trivial and lies beyond the scope of the paper, empirical demonstrations can be made with quantitative results.

Figure 5: Histograms of timing differences by CFD and CNN when two distinct modalities exist in the dataset. We fit the histograms to a Gaussian mixture with two components. CNN gives enough resolution to distinguish two peaks, while CFD is unable to display such a feature.
In figure 5, we visualize the distribution of timing residuals when the actual difference of arrival time between the two analog channels takes two distinct values. To our surprise, the NN model can be trained to provide dependable predictions against concept drift, and achieves fairly good performance. To be more specific, we fit the histograms to the following function:

h(x) = a · p(x; µ_1, σ_1) + (1 − a) · p(x; µ_2, σ_2),

where p(x; µ, σ) is the probability density of a Gaussian distribution. The fitting results are exhibited on the right side of the figure. For CNN, the standard deviation of each Gaussian component is ∼18 ps, which is slightly worse than in the single-modality condition but still remarkable. By examining the portion parameter (a) and mean parameters (µ_1, µ_2), it can be seen that the means of the two components are at a distance of approximately one sample interval (0.1 ns), and the combined mean value is located near the origin. For CFD, it is unable to separate the two Gaussian peaks due to its relatively poor resolution. The standard deviation (∼53 ps) shows accordance with the single-modality condition regardless of concept drift.
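A two-component fit of this kind can be sketched with a standard least-squares routine. The data below are synthetic and the parameter values are illustrative, not the experimental ones:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def two_gauss(x, a, mu1, s1, mu2, s2):
    """h(x) = a * p(x; mu1, s1) + (1 - a) * p(x; mu2, s2)"""
    return a * norm.pdf(x, mu1, s1) + (1.0 - a) * norm.pdf(x, mu2, s2)

rng = np.random.default_rng(1)
# Synthetic "timing residuals" (ns) with two modes one sample interval apart
data = np.concatenate([rng.normal(-0.05, 0.018, 6000),
                       rng.normal(+0.05, 0.018, 4000)])
counts, edges = np.histogram(data, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p0 = [0.5, -0.04, 0.02, 0.04, 0.02]   # rough initial guess: (a, mu1, s1, mu2, s2)
popt, _ = curve_fit(two_gauss, centers, counts, p0=p0)
```

With a sensible initial guess, the fit recovers the two component means and their separation, which is how the 0.1-ns mode distance and the per-component spreads in figure 5 would be extracted.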
For CNN, it is interesting to draw an analogy with solving a second-order minimization problem with a bias in one of two components. Here, the major difference is that the variable in the analogous problem is itself a mapping function with its own inputs and trainable parameters, which makes the analytical solution intractable.

Electromagnetic calorimeter at NICA-MPD
In this section, we study a detector in a real-world high energy physics experiment: the electromagnetic calorimeter (ECAL) [34] of the Multi-Purpose Detector (MPD) [35] at the NICA collider. ECAL is a huge and fully modularized detector assembled in several hierarchies (tower, module, half sector, sector). The basic unit for assembling is the tower, a shashlik-type sampling calorimeter composed of alternating lead and scintillator plates. In each tower, 16 wavelength-shifting fibres penetrate the bulk of the detection area and guide the scintillation light onto a SiPM, which is then read out by front-end electronics. A combination of 2 × 8 towers forms a module. The modules from different sectors differ in their shapes. In this experiment, we use a module from sector 8 (the sector in the outermost place of ECAL) as the subject.
When a particle with high energy (such as a muon in the cosmic ray) enters the tower, it generates an electromagnetic cascade, in which the absorber gradually deposits the energy and the scintillator transforms it into detectable light transmitted through the fibres. Reaching sub-ns timing resolution is meaningful for ECAL to work as an auxiliary time-of-flight device for better particle identification.

Experimental setup
The setup for the ECAL experiment is shown in figure 6. The ECAL module is put horizontally in a dark chamber. Two trigger devices (scintillator with photomultiplier) are set above and beneath the ECAL module to capture cosmic muons passing through 8 towers of the ECAL module (the other 8 towers are not used). The outputs of the two triggers are connected to the leading edge discriminator (CAEN N840) to generate step signals, which are reduced to a single pulse by the coincidence logic unit (CAEN N455). The coincidence pulse is transmitted as the trigger input to the switched capacitor digitizer (CAEN DT5742). The digitizer also takes the 8 analog channels of the ECAL module as inputs and samples the waveform at a rate of 5 GSPS. The digital data are sent to a desktop computer for storage. High voltage supply is provided to the two triggers and the electronics of the ECAL module, and low voltage supply is provided to the electronics.
In most captured events, the cosmic muon leaves energy deposits in every tower it passes. If we only consider the median interaction point between the particle and each tower, the time of incidence will exhibit a highly linearly-correlated relation between these towers+, as long as the trajectory of the cosmic muon reaches both triggers.
+ The relation is geometrically accurate except for the slightly different fibre lengths of individual towers. However, this will not influence the training process if we consider it as a constant bias to the loss function.

However, the actual process of energy deposition, scintillation and photon transmission is non-trivial for the detector, and there is obvious electronic noise relative to the signal amplitude. These factors make the linear correlation only a target approximation.
Nevertheless, by utilizing it to form a loss function, we can train a model to work as an arrival time predictor and find out the intrinsic time resolution of the ECAL module at the same time.
The NN model is implemented with settings similar to the toy experiment. In this section, we construct the base model with three types of NNs: the fully-connected (FC) network, the convolutional neural network (CNN) and the recurrent neural network (RNN) with long short-term memory (LSTM). The details of their architectures are described in Appendix B.2. The input to the base model is a series of 800 waveform samples with an interval of 0.2 ns, from one of the eight analog channels. More details about the configurations are described in Appendix C.2.

Main results
Similarly to the toy experiment, we follow Algorithm 1 to train and calibrate the NN models. Figure 7 shows the calibration plots when we linearly shift the waveform samples and use NNs for time prediction. It can be seen that all three models (FC, CNN and LSTM) have successfully learned the linear dependency on input data without explicit labels in training. By carefully examining the figures, we find two interesting phenomena: first, from the appearance of these figures, LSTM achieves the highest linearity of the three models; second, from the calculated average slope, CNN achieves the largest absolute value of the three. Intuitively, LSTM might be the best of the three because it conforms better to the prior (linearly shifted predictions corresponding to linearly shifted input data); practically, CNN could also be used, because it makes the best use of training data to reduce the loss function (reflected by the average slope), although its predictions fluctuate when the input data are shifted linearly.
The well-trained NN models are evaluated on the test dataset. In order to analyse the dependence of time resolution on the R value, and to fairly compare NN models with equal numbers of examples in the test dataset, we plot the timing performance versus the q-th quantile of the R value, as illustrated in figure 9. As the q value increases, a decreasing number of examples with improved linearity is included in the calculation. It can be seen that, in terms of time resolution at the same quantile, CNN performs steadily better than LSTM, and LSTM steadily better than FC. There is a nearly fixed gap between them when the q value is below 0.8, and the gap narrows quickly when the q value is 0.8 or larger.
Finally, we evaluate the NN models in comparison with traditional timing methods. Two well-known methods, already validated in many real-world circumstances, are studied: digital constant fraction discrimination (CFD) and leading edge discrimination with slewing correction (LED-SC). The parameters used when applying these two methods are discussed in Appendix C.2. To make the comparison fair, we use the distribution of the R value by LSTM as a reference to select examples with the particular linearity and test all methods on the same examples. The test result is shown in figure 10. It can be seen that the performance of the traditional methods is on the same order as the NN models, but apparently worse in this experimental condition. Since the ambient noise is significant and the signal-to-noise ratio is relatively low, accurate timing is challenging for traditional methods based on partial information in waveform samples, while NNs are more competent at this task by exploiting the information in the input data as much as possible.
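For context, LED-SC can be sketched in two steps: a leading-edge threshold crossing, followed by removal of the amplitude-dependent "walk". The linear-in-1/amplitude walk model below is a common illustrative choice, not necessarily the correction of Appendix C.2:

```python
import numpy as np

def led_time(w, threshold, t_s=0.2):
    """Leading edge discrimination: first threshold crossing, linearly
    interpolated between the two bracketing samples. Returns time in ns."""
    above = np.nonzero(w >= threshold)[0]
    i = int(above[0])
    if i == 0:
        return 0.0
    frac = (threshold - w[i - 1]) / (w[i] - w[i - 1])
    return (i - 1 + frac) * t_s

def slewing_correction(times, amplitudes):
    """Slewing correction (sketch): a fixed threshold on a linear rising edge
    crosses at a time proportional to 1/amplitude, so fit that dependence
    and subtract it from the raw leading-edge times."""
    x = 1.0 / np.asarray(amplitudes, dtype=float)
    k, b = np.polyfit(x, np.asarray(times, dtype=float), 1)
    return np.asarray(times) - (k * x + b)
```

On ideal linear edges the correction removes the walk entirely; with real pulse shapes and noise the residual walk is what limits LED-SC relative to the NN models.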

Discussion
Based on the observations above, we make the following points: (i) In general, NN models work well as desired when trained with the physical constraint in the label-free setting. Among them, LSTM achieves the highest linearity in its predictions, while CNN achieves the best timing performance. Which model to use in practice should be judged by the particular needs and the objectives of measurement.
(ii) In these two experiments, the physical quantity being learned is directly a time-of-arrival in the ADC clock domain, rather than the timing difference of physically coupled measurements [13,15]. This can be a prominent advantage if the detector works in stand-alone conditions, or in conditions where the matching property is not present. Besides, the self-supervised learning scheme eliminates the need for external equipment as a timing reference [11].
(iii) In the final part of the main results, we compare traditional methods with NNs and show the advantage of the latter. It should be mentioned that traditional methods can also reach the theoretically best resolution under proper conditions [3], which can be satisfied by sophisticated design. However, NNs are promising in the long run, because they provide more flexibility and adaptability with guaranteed performance. More benefits are expected as hardware acceleration of NNs on front-end electronics is adopted in more extensive situations.

Conclusion
In this paper, we propose a deep learning-based methodology and algorithmic implementation to analyse the timing performance of SiPM-based modularized detectors without explicitly labelling examples of event data. A formal mathematical framework is built to ensure the existence of the optimal mapping function desired by the method. The major innovation of the paper is a physics-constrained loss function and its associated regularizer customized for waveform sampling systems. Two experiments (a toy experiment and the NICA-MPD ECAL experiment) demonstrate the superior performance and sound physical implications of NN models. In the toy experiment, a timing resolution of less than 10 ps is achieved with a dual-channel coincidence setup.
In the ECAL experiment, a variety of NN models have been investigated and compared to traditional methods in terms of timing performance.
It should be noted that neural networks are versatile and can be integrated with other feature extraction tasks, such as energy estimation [36] and position estimation [12]. Besides, the ability of deep learning to exploit information in noisy conditions indicates possible applications in harsh experimental environments, such as the time-of-flight measurement of laser-accelerated particles under the interference of high-level electromagnetic pulses [37].
In the future, we will deploy the method on online processing hardware to achieve lower latency and higher energy efficiency. The existing hls4ml framework [8,9] targets FPGA platforms with digital signal processors (DSP) available for fast inference (several µs to several hundred µs latency). On ASICs, where no DSPs are available, fully-customized neural network accelerators [29,30] are currently in development (∼100 µs latency).

For convolutional layers, the kernel size is 4, and the stride is fixed at 2. For LSTM layers, we use the LSTM class in the Keras package and make each unit generate a sequence instead of a single value. In training, for each example, a series of 800 waveform samples (160-ns time window) is selected from the 1000 original samples. Figure C1b illustrates 30 such examples.
To increase the variation of the data, we randomly choose the origin (starting point) of the time window within a range of 25 sampling intervals (5 ns). The batch size is 32. For FC and CNN, the training lasts for 400 epochs; for LSTM, 50 epochs. For CNN, the training loss curve is shown in figure C2b. Each base model is calibrated with the training dataset and evaluated on the test dataset.
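The window-selection augmentation can be sketched as below; the helper name and use of NumPy's generator are illustrative, but the numbers (800-sample window, 25-interval jitter) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_window(waveform, length=800, jitter=25):
    """Select `length` consecutive samples from a longer record,
    randomizing the window origin within `jitter` sampling intervals
    (25 intervals = 5 ns at the 5-GHz sampling rate)."""
    start = int(rng.integers(0, jitter))   # random origin of the time window
    return waveform[start:start + length]
```

Randomizing the origin decouples the learned time-of-arrival from a fixed position within the sampling window.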
To test digital CFD, a second-order low-pass filter with a 0.02-GHz critical frequency (at the 5-GHz sampling rate) is applied to the waveform samples before the timing method (the critical frequency is chosen to produce the best results). The threshold is also set at 0.5 · S_max.
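A minimal sketch of this digital CFD chain is given below, using a Butterworth low-pass filter and linear interpolation of the threshold crossing; the exact filter family and interpolation scheme used in the paper are assumptions here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 5e9     # 5-GHz sampling rate
FC = 0.02e9  # 0.02-GHz critical frequency

def digital_cfd(waveform, fraction=0.5):
    """Low-pass filter the waveform, then find the time (in sampling
    intervals) where the leading edge crosses fraction * S_max."""
    sos = butter(2, FC, btype="low", fs=FS, output="sos")
    y = sosfiltfilt(sos, waveform)           # zero-phase 2nd-order low-pass
    thr = fraction * y.max()                 # 0.5 * S_max threshold
    i = int(np.argmax(y >= thr))             # first sample above threshold
    # linear interpolation between samples i-1 and i
    return (i - 1) + (thr - y[i - 1]) / (y[i] - y[i - 1])
```

The heavy low-pass filtering suppresses the ambient noise, at the cost of distorting the fast leading edge the method relies on.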
To test LED-SC, the leading edge of the waveform is fitted to a third-order polynomial function before solving for the time point crossing a fixed threshold. The time residual is calculated with the triple of adjoined analog channels, and the amplitudes are recorded alongside the time. To correct the timing result, a time-amplitude relation is fitted, where A_i is the amplitude from one of the analog channels. We fit all examples to this function with the amplitude from each channel by turns, and correct the time residuals by subtracting the fitted contribution, which depends only on the amplitude. The correction is repeated for 3 rounds.
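The iterative slewing correction can be sketched as follows. The low-order polynomial in the amplitude is an illustrative stand-in for the time-amplitude relation used in the paper, which is not reproduced in this excerpt.

```python
import numpy as np

def slewing_correct(residuals, amplitudes, rounds=3, deg=2):
    """Iteratively remove the amplitude dependence of the time residuals.
    `amplitudes` has one column per analog channel of the triple.
    The polynomial form is a placeholder for the paper's relation."""
    t = np.asarray(residuals, dtype=float).copy()
    for _ in range(rounds):                          # repeated 3 rounds
        for ch in range(amplitudes.shape[1]):        # each channel by turns
            coef = np.polyfit(amplitudes[:, ch], t, deg)
            t -= np.polyval(coef, amplitudes[:, ch]) # subtract fitted part
    return t
```

After the amplitude-dependent component is removed, the remaining spread of the residuals reflects the intrinsic timing resolution of the method.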

Figure 1 :
Figure 1: The system diagram of the proposed method based on the conventional nuclear detector dataflow. The modularized detector array provides information about fired cells and their locations, used by the readout electronics to select valid channels and by the training back end to construct the physics-constrained formulation of the loss function.

Algorithm 1
Label-free model training and calibration
Require: w_0: initial weights; f_NN(·; w): NN model; η: learning rate; T: steps for training; P(Δt): probability distribution for regularizers; D: size of calibration dataset; G: steps for linear shift.
Model training:
w ← w_0 ▷ Initialize weights for the NN model
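Only the initialization line of Algorithm 1 survives in this excerpt; the training phase can nevertheless be sketched as a generic gradient-descent skeleton. The gradient callback stands in for the physics-constrained, label-free loss, whose exact form is defined elsewhere in the paper.

```python
import numpy as np

def train_label_free(w0, f_nn, grad_loss, data, eta, T):
    """Skeleton of the training phase of Algorithm 1.
    grad_loss(w, f_nn, batch) is a placeholder for the gradient of the
    label-free, physics-constrained loss with its regularizer."""
    w = w0.copy()                       # w <- w_0: initialize weights
    for step in range(T):               # T steps of training
        batch = data[step % len(data)]  # cycle through the dataset
        w -= eta * grad_loss(w, f_nn, batch)  # gradient step, rate eta
    return w
```

Calibration (the linear-shift phase with G steps over D examples) then maps the raw model outputs onto the scale of sampling intervals.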

Figure 2 :
Figure 2: Devices and equipment for the toy experiment.

Figure 3 :
Figure 3: Demonstration of NN predictions in the toy experiment. (a) To calibrate the model, all examples are equidistantly shifted in a given range and fed to the base model for prediction. The average slope is computed from the slopes of all linear fits. Here, we plot the original (uncalibrated) model outputs of the first 100 examples, which show very high linearity and consistency. The calibration involves normalizing the model to the scale of sampling intervals and rescaling with the actual ADC interval (0.1 ns). (b) A segment showing the rising edge of an example in the test dataset, on its original scale, with the associated timing prediction results. (c) Histogram of all timing differences in the test dataset when no low-pass filtering is performed.
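The calibration step described in panel (a) can be sketched as below; the function and variable names are illustrative, and circular shifting stands in for whatever shifting scheme the paper applies to the sampled windows.

```python
import numpy as np

def calibrate_slope(model, examples, shifts):
    """Feed equidistantly shifted copies of each example to the base model,
    fit a line to prediction vs. shift, and average the slopes.
    The average slope normalizes raw outputs to sampling intervals."""
    slopes = []
    for x in examples:
        preds = [model(np.roll(x, int(s))) for s in shifts]
        slope = np.polyfit(shifts, preds, 1)[0]   # slope of the linear fit
        slopes.append(slope)
    return float(np.mean(slopes))                 # average over examples
```

Dividing raw predictions by this slope, then multiplying by the ADC interval (0.1 ns), yields timing in physical units.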

Figure 6 :
Figure 6: Devices and equipment for the ECAL experiment.

Figure 7 :
Figure 7: Demonstration of the calibration process (showing only the first 100 examples) with equidistantly shifted examples and their original (uncalibrated) model outputs (see figure 3a for reference): (a) fully-connected network; (b) convolutional neural network; (c) recurrent neural network with long short-term memory.

Figure 8 :
Figure 8: (a)(c)(e) Histograms of the R value (Pearson correlation coefficient) distribution obtained by linearly fitting the NN predictions of 8 time-correlated analog channels in the test dataset. The threshold (0.7) and the ratio of examples above the threshold are indicated. (b)(d)(f) Histograms of time residuals when examples with linearity above the threshold (0.7) are selected. We fill the histograms with triples of adjoined analog channels: (Ch_i + Ch_{i+2} − 2·Ch_{i+1})/√6.
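The two quantities filling these histograms can be computed as below; this is a minimal sketch with illustrative function names, assuming the R value is the Pearson coefficient of predictions against channel index.

```python
import numpy as np

def r_value(channel_indices, predictions):
    """Pearson correlation coefficient from linearly fitting the NN
    predictions of the time-correlated analog channels against their
    channel indices."""
    return np.corrcoef(channel_indices, predictions)[0, 1]

def triple_residual(t):
    """Combine the predictions of three adjoined channels so that a common
    linear trend cancels: (t_i + t_{i+2} - 2 * t_{i+1}) / sqrt(6).
    The 1/sqrt(6) factor normalizes the variance for independent,
    identically distributed per-channel errors."""
    return (t[0] + t[2] - 2.0 * t[1]) / np.sqrt(6.0)
```

For perfectly linear predictions across the 8 channels, R equals 1 and every triple residual vanishes, so the residual histogram width isolates the per-channel timing error.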

Figure 9 :
Figure 9: Comparison of timing performance of NN models at different q-th quantiles of the R value.

Figure 10 :
Figure 10: The lower histogram shows the distribution of the R value with the LSTM model. Based on this histogram, the upper plot shows the timing performance of the two traditional methods (digital constant fraction discrimination and leading edge discrimination with slewing correction) and of the NNs at different R value thresholds.

Figure C1:
Figure C1: Visualization of single-channel waveform samples in (a) the toy experiment and (b) the NICA-MPD ECAL experiment.

Figure C2:

Table B2 :
The architecture of FC in the ECAL experiment.

Table B3 :
The architecture of CNN in the ECAL experiment.

Table B4 :
The architecture of LSTM in the ECAL experiment.

In the experiments below, we perform neural network training on a laptop computer with an NVIDIA GeForce RTX 2070 graphics card (8 GB video memory). The training lasts from several minutes to one hour depending on the network complexity.
Appendix C.1. Toy experiment

In the toy experiment, 10024 examples of paired waveform samples are collected in total. The data collection takes 17 minutes at a rate of about 10 examples per second.