UvA-DARE (Digital Academic Repository) Deep-learning based reconstruction of the shower maximum Xmax using the water-Cherenkov detectors of the Pierre Auger Observatory

: The atmospheric depth of the air shower maximum 𝑋 max is an observable commonly used for the determination of the nuclear mass composition of ultra-high energy cosmic rays. Direct measurements of 𝑋 max are performed using observations of the longitudinal shower development with ﬂuorescence telescopes. At the same time, several methods have been proposed for an indirect estimation of 𝑋 max from the characteristics of the shower particles registered with surface detector arrays. In this paper, we present a deep neural network (DNN) for the estimation of 𝑋 max . The reconstruction relies on the signals induced by shower particles in the ground based water-Cherenkov detectors of the Pierre Auger Observatory. The network architecture features recurrent long short-term memory layers to process the temporal structure of signals and hexagonal convolutions to exploit the symmetry of the surface detector array. We evaluate the performance of the network using air showers simulated with three diﬀerent hadronic interaction models. Thereafter, we account for long-term detector eﬀects and calibrate the reconstructed 𝑋 max using ﬂuorescence measurements. Finally, we show that the event-by-event resolution in the reconstruction of the shower maximum improves with increasing shower energy and reaches less than 25 g/cm 2 at energies above 2 × 10 19 eV.


Introduction
The depth of the air shower maximum max induced by ultra-high energy cosmic rays is a key observable to estimate the mass of primary cosmic particles (see e.g. [1] for a review). In this work we introduce a new method to reconstruct max from signals recorded by water-Cherenkov detectors (WCDs), whose duty cycle is almost 100%, unlike fluorescence telescopes, which can only operate during moonless nights with a duty cycle of about 15%.
The Pierre Auger Observatory [2] was constructed to study cosmic rays at the highest energies with high accuracy. With its instrumented area of 3000 km 2 the observatory is sufficiently large to measure the rare high-energy events up to 10 20 eV with adequate statistics. Designed as a hybrid instrument, on the one hand the development of the air showers is directly followed by observing the fluorescence light emitted by nitrogen molecules in the air with 27 telescopes, constituting the fluorescence detector (FD). As measurements using the fluorescence technique are only possible during moonless nights it is used to collect data for an in-depth understanding of the air showers [3][4][5][6] with reduced statistics. On the other hand, secondary particles of the air showers are detected by 1600 water-Cherenkov Detectors (WCDs) located on the ground, which form the surface detector (SD). The detectors are located at a distance of 1.5 km from each other in a hexagonal configuration (Fig. 1a) and measure the time-dependent density of secondary shower particles. The measurement is based on Cherenkov light emitted by the relativistic particles as they pass through each water-filled detector. This light is recorded by three photomultiplier tubes (PMTs) attached to the top of the tank and processed using flash analog-to-digital converters with a sampling rate of 40 MHz, translating to bins with a width of 25 ns.
For the determination of the air shower direction and energy with the WCDs there are already established precise reconstruction methods [7] that use the arrival times of the shower particles and the spatial extent of the air shower. Reconstructing the depth of maximum of the shower, on the other hand, is a major challenge. In contrast to the fluorescence telescopes, there is no direct observation of the shower development in the atmosphere. Instead, the shower properties are encoded in the signal traces induced by the secondary particles traversing the WCDs.
During the longitudinal development of the air shower, the relative abundances of hadrons, muons, electrons and photons change in addition to geometric dependencies such as the distance to the shower core or the zenith angle. The signal traces generated by these individual components in the three PMTs of the WCDs are different and thus provide the possibility to extract information about the depth of the shower maximum max . Whereas muons do not interact with the atmosphere and reach the detector usually in earlier times, the electromagnetic component is strongly shielded and reach the detector later and more spread in time. See Fig. 1b for a simulated example trace with muons and electromagnetic shower particles.
Experimental proof of the successful use of the signal characteristics to determine the depth of maximum of the shower has already been achieved. It was demonstrated [8,9] that using shower universality [10,11] max can be measured with good accuracy by decomposing the signal traces using templates of typical waveforms induced by the different shower components using the universality of signals with respect to the shower development stage. Furthermore, the correlation between the risetimes of the signals in the WCDs was used to determine max [12]. With this method the average composition of UHECRs was determined up to 100 EeV. The measurement up to such high energies has been only made possible using the large exposure and high reconstruction efficiency of the SD, increasing the statistics by a factor 25 above 3 EeV (factor 12 above 30 EeV) when compared to the FD [13].
Our new method aims to measure, beside the average composition, the mixing of the composition ( max ) with high statistics by exploiting mass-sensitive information on an event-by-event basis. This opens up possibilities for event-based anisotropy studies using information on the cosmic-ray mass.
This method to reconstruct max using the measured signals of the WCDs is based on a deep neural network (DNN), whose architecture and optimization was developed specifically for the measurement conditions at the Pierre Auger Observatory. The network is a further development of [14] and takes into account current developments in machine learning, especially elements from speech recognition and utilization of spatial symmetries.   Figure 1: (a) Simulated signal pattern measured by the surface detector. The marker sizes indicate the amount of measured signal and the colors represent the arrival time of the shower at a given station (yellow=early, red=late). The arrow denotes the projection of the shower axis on the surface and its tip the shower core. (b) Simulated signal trace of a cosmic-ray event measured at a surfacedetector station at a distance of about 1000 m to the shower core. Different colors indicate signals from different shower components.
In the hybrid network architecture, all time intervals of the signal traces of the PMTs are first scanned using a recurrent subnetwork. Thereby each signal trace is characterized by a total of 10 "machine-learned features". In the following network layers these observables are merged with the particle arrival times at the respective WCD locations. When processing the signals at the given positions, the hexagonal symmetry of the observatory is generically mapped into the network architecture. The network parameters for forming the 10 observables to characterize the signal traces and for combining all available information are adjusted in a training procedure. The most important tools of this optimization are the training data and the objective function, which is minimized during network training.
This work is structured as follows. First, we specify the data sets for both the simulation studies and measured Auger hybrid data, which include information from the FD for validation purposes. We explain in detail how the simulated data are prepared and augmented for the optimization of the network parameters and the reconstruction of max . After that, we describe in detail the architecture and training of the deep network. Then we show the max reconstruction performance of the network on simulated data as a function of energy, zenith angle, mass of the primary particle, and the effect of using two hadronic interaction models different from the one used in the training. Finally, we verify the capabilities of the network by direct comparison of the measured maximum shower depth max of the network and of the FD. We correct for detector-aging effects resulting from long-term operation of the observatory. Subsequently, we calibrate the absolute max value of the network output, and determine the max resolution of the network as a function of the primary energy.

Data sets and their preparation
The measured air shower footprint consists of a characteristic pattern of several triggered WCDs arranged in a hexagonal grid (see Fig. 1a). Using three PMTs each triggered station measures the time-dependent density of particles encoded in three signal traces. An example of a simulated signal trace is shown in Fig. 1b.
The basic idea is to provide the network as input the raw data of a measured cosmic-ray event. The raw information for each triggered station consists of three signal time traces, the station position and the time of the first shower particles arriving at the station.
For successful adjustment of the network parameters, careful preparations of the data sets from simulation campaigns used for the optimization process are crucial. For example, the parameters can be set more easily if the numerical values of the input variables do not vary considerably. Therefore, both the amplitudes and the time values of the WCD-signal traces are re-scaled prior to their input into the network. Also, typical conditions when operating the observatory need to be included in the simulated training data used for adjusting the parameters.
Below we specify the data sets from simulation campaigns, and data with information from the FD used to validate the max reconstruction of the deep neural network.

Simulation libraries
For generating training and validation data we use CORSIKA 7.6400 [15] for the simulation of extensive air showers. The Pierre Auger Offline software [16] is used for the simulation of the detector responses, which is based on GEANT4 [17], and the reconstruction of the air shower parameters [2]. The CORSIKA showers are reused 5−30 times at other locations of the observatory along with a re-simulation of the detector responses. The simulation library consists of events for primary hydrogen, helium, oxygen and iron, uniformly distributed in azimuth angle and following a zenith angle distribution flat in cos 2 with 0 • ≤ < 65 • . The energy range covers 1 − 160 EeV (1 EeV = 10 18 eV) and follows a spectrum of −1 . For the training of the network we use exclusively simulations with the EPOS-LHC hadronic interaction model [18]. Overall, the training set contains around 400,000 events.
For testing the performance we prepare a test set with EPOS-LHC showers containing 50,000 events and two sets of showers simulated using the hadronic interaction model QGSJetII-04 [19] and Sibyll 2.3 [20] containing 450,000 events each. In contrast to the training data for the test sets we limit the energy range to 3 − 160 EeV and the zenith angle range to 0 − 60 • , where the standard SD reconstruction shows no significant reconstruction biases [2]. The extended phase space during training prevents incorrectly reconstructed energies and zenith angles at the edges of the phase space.

Hybrid dataset
The accuracy of the deep-learning based reconstruction of max is examined using a high-quality set of hybrid data where nearly unbiased max measurements are performed using the fluorescence technique. A complete description of the hybrid reconstruction and the high-quality event selection can be found in [3].
Here, events are only kept if max falls into the FD field of view and the fraction of the shower profile inside the field of view is large enough to allow an unbiased measurement of max . Additionally, we discard events below the full trigger efficiency of the surface detector at roughly 3.16 EeV to prevent biased reconstructions. Further, we remove events (less than < 0.5%) with area over peak (A/P) values [21], given in units of 25 ns, which are very low (< 2.7) or high (> 3.5) and cannot be corrected using a linear function (see section 5.1). To guarantee a precise SD reconstruction, we reject events with broken or non-existent stations in the first hexagon formed around the station with the largest signal [22]. After the selection 3109 events, collected from 1 January 2004 to 31 December 2017, remain for the evaluation of the DNN performance.

Pre-processing of data
Footprints of air showers on the ground can reach sizes of several tens of square kilometers. To reduce the memory consumption and increase generalization capacities of the DNN, we use only the information from a fixed size of the SD consisting of a sub-array of 13 × 13 WCDs around the WCD with the highest signal.
For air showers with zenith angles below = 60 • more than 99% of the triggered stations are contained within this sub-array. We expect the effect of enlarging the sub-grid to be negligible as only stations with very low signals lie outside the sub-grid.
As the positions of the WCDs feature a hexagonal grid several representations of the local neighborhood are possible [14]. Our algorithm is based on the so-called axial representation.

Signal traces
The total number of particles measured in each detector is expected to approximate a power law as a function of the distance to the shower axis. Hence, we use a logarithmic re-scaling of the signal traces Here VEM (vertical equivalent muon) is a unit equal to the signal induced by a muon traversing the WCD vertically and norm = 100 VEM normalizes signals of 100 VEM to unity. To allow for positive values only and to leave contributions with = 0 unchanged we use ( ) + 1.

Particle arrival times
The curvature of the shower front is reflected in the arrival times 0, of the first shower particles at the SD stations. For each triggered station in an event, the particle arrival time is normalized with respect to the arrival time center measured at the station with the largest signal and the standard deviation ,train data of the arrival times of the complete training set: Station states Air showers falling close to the edges of the detector grid can have footprints with a substantial fraction lying outside the surface-detector grid. Also, in rare cases, a WCD within the 13 × 13 sub-grid may not provide a signal owing to technical problems. To provide the information in the network training if the absence of a signal originates from air shower physics or from detector effects, we add a feature map for the given sub-grid encoding the station states as additional input (1 = ready to measure, 0 = broken or missing in the grid).
Network prediction During the training of the network the prediction (output) of the neural network is compared to the labels of the dataset in each iteration. As the input values of the network lie approximately in the range of [−1; 1] and the weights of the model are properly initialized, the expected output values of the network follow approximately a Gaussian with standard deviation = 1 and mean = 0.
To speed-up the learning process we also normalize the label values to lie in the same range using standard normalization ( = 1, = 0). Instead of normalizing the label values directly, the normalization is implemented as the final layer in our network for the max output allowing us to monitor the learning process in physics quantities.

Augmentation of simulated data
Depending on the sub-grid, the detector states of the WCDs vary on an event-by-event basis. Furthermore, there are differences between detector states in data and simulation. Even small deviations between simulation and data can have a non-negligible influence on the network prediction if there are differences in the phase space. For example, a broken WCD would create a hole in the measured footprint which then distracts the reconstruction if the model is not trained on such detector conditions. To make the algorithm robust to distortions and fluctuations, we implemented an on-the-fly augmentation. Hence during the training every particular simulated event will feature in each repeated iteration different detector states. Note that this augmentation is exclusively used in the network training procedure and is not applied when evaluating the network.
Photomultiplier tubes without response During detector operation, one or more of the three photomultiplier tubes (PMTs) of a WCD may not respond. To make the network robust to missing signal traces, we randomly mask signal traces during the training. The number of broken PMTs is approximated as Poisson distribution and was tuned to follow the data distribution.
Water-Cherenkov detectors without response As already stated above, a map of detector stations without response is used describing the current station states. During the network training, detector stations are randomly marked as broken, by setting all measured properties to zero, which is additionally accounted for in the map of stations without response. The number of broken stations is approximated as Poisson distribution and was tuned to follow the distribution of measured data.

Saturated signal traces
Shower geometries with a shower core close to a detector station can exceed the maximum signal that can be quantitatively recorded. This effect is even more frequent at the highest energies, where 50% events of air showers with 50 EeV exhibit typically one saturated detector station. Each surface-detector station is calibrated for the value of 1 VEM in units of the hardware every 60 s. This value changes for each detector station and PMT owing to individual hardware properties and other external influences. Since the number of units is limited, the saturation value is time dependent and differs for each detector station [7].
To make the network robust to different saturation values, we sample the maximum measurable values from a Gaussian distribution adapted by typical saturation values observed in data. In detail we make use of an one-sided truncated Gaussian to prevent very small and negative saturation values. Independently for each PMT, we finally clip the unsaturated trace provided by the simulation according to the sampled saturation value.

Deep neural network for reconstructing the shower maximum
Two challenges are to be mastered for reconstructing max from the signal traces of the water-Cherenkov Detectors: 1. To develop a network architecture that is optimally suited for the situation of the Pierre Auger Observatory and exploits all symmetries in the recorded data, 2. To adapt about 10 6 network parameters in a way that the correct shower depth is reconstructed under realistic operation conditions of the observatory.
In the following both the network architecture and the network training are described in detail.

Architecture
To reconstruct the shower maximum max using a deep neural network we designed an architecture featuring hexagonal convolutions [23,24] and recurrent layers illustrated in Fig. 2. We observed that using an architecture which supports the symmetry of the hexagonal grid of the detectors not only boosts the performance, but further makes the network robust to small fluctuations in the data which results in an improved generalization performance. The input for the neural network is a tensor containing the signal time traces, the map of the arrival times and a map of the station states. For connecting the different physics quantities and extracting the maximum information from the data, we adapted state-of-the art methods of computer vision and speech recognition.
We introduce a prior on physics relationships, hence learning first to extract information of the shower development using the signal traces before forming features of space and time. The architecture can be split into three parts, discussed in the following.
Detector-wise recurrent processing of signal traces In the first part of the architecture, the signal trace is processed. To resolve long-and short-term correlations in the signal trace we make use of 1 bi-directional layer and 1 one-directional layer of LSTM cells [25]. Using bidirectional layers allows to connect not only the past time steps with the current step but additionally all future steps.
The most important concept is the station-wise sharing of the recurrent layers. As the physics of the measured traces is the same for each station, we share the trace network over all stations in the 13 × 13 sub-grid. Hence, exactly the same parameters are used in each station.
Note that this concept still allows the network to extract many very different features and not only a distinct characteristic of the trace. Although the same feature is extracted in each station (e.g. slope of the rising edge) each respective feature value will vary (e.g. slope differs in each station) within one event.
In this sub-network, the tensor of signal traces with a length of 120 time steps is reduced to 10 feature maps encoding the longitudinal shower development. Note that these 10 features, which are extracted for each station, are not hand-designed but are learned by the network to be particularly useful for the reconstruction of max . As every station uses the same network, the resulting 10 feature maps share the same physics meaning, allowing to use convolutions in the spatial dimensions in the following part.

Densely-connected hexagonal convolutions
After analyzing the signal trace, the additional information, consisting of the arrival times of the first shower particles at the WCDs and of detector stations without response, is concatenated with the tensor already processed.
In this second part of the architecture, features relating space and time are extracted using convolutional operations on the WCD positions. Before applying hexagonal convolutions the representation needs to be transformed into the space of hexagonal rotations (P6), which is done individually for each feature map. In this space each filter holds 6 feature maps, one for each orientation [24].
The main concept of this block are densely-connected convolutions [26], which provide already extracted features in each subsequent layer. The resulting connections stabilize the training process by feature-reuse and an improved propagation of gradients. The block consists of 3 layers with ELUs [27] (exponential linear units) as activations and with kernels of size 7 covering one hexagon. HexagonalConv "3x3" (13,13,12) Figure 2: Architecture of the deep neural network used to reconstruct max . Convolutional operations are shown with their kernel sizes, e.g. a "3 × 3" hexagonal convolution translates to a kernel covering one hexagon (7 WCD locations). The numbers in brackets denote the output shapes of the operations.

Residual reconstruction modules
The last part of the network features residual modules introduced by ResNet [28]. In the standard layout of the AixNet architecture [29] for each reconstruction task (shower axis, shower core, energy and max ) an individual subpart exists. Each part consists of 2 residual modules with 2 layers each, ReLUs (rectified linear units) as activation functions and pooling operations between the blocks. The final layer of the max block predicts the shower maximum after rescaling back to physics quantities as described in section 2.3.

Training
For training of the network we use 400,000 simulated air shower events with a mixed composition of hydrogen, helium, oxygen and iron (in equal parts). During the training we use the data augmentation as described in section 2.4 to increase the amount of data and to improve the generalization capacities of the network.
To penalize composition biases we use an element-wise mean squared error (MSE) as objective function: Hereˆm ax, is the predicted shower maximum and max, the label, for the respective element . Var( · ) represents the batch-wise variance and · the batch-wise expectation value. The network and the training is implemented using Keras [30] and the TensorFlow [31] backend. We train our network using an Nvidia Geforce 1080 GTX GPU for 150 epochs with a batch size of 48, which takes about 60 h. We further stop the training if the objective function applied to validation data does not decrease after 10 epochs. We use the ADAM optimizer [32] with an initial learning rate of 0.001 and a decay of 2.5 · 10 −6 . Furthermore, we reduce the learning rate by multiplying by a factor of 0.7 if the objective function evaluated on the validation data did not decrease in the recent 4 epochs.
As part of the work, a wide variety of network architectures was tested. These tests included various changes of each of the three network parts. In particular, different types of recurrent layers and various convolutional approaches were tested. The final model which utilizes LSTMs for the processing of signals and hexagonal convolutions for analyzing the spatial relations showed the best performance and stability. In addition, we found that this design is relatively robust to small changes in the architecture. A reasonable change of the hyperparameters, e.g. the number of layers or the number of features extracted by the trace network, had no significant impact on the final result.

Performance on simulations
In this section we study the performance of the deep neural network on the simulated data. Although none of the hadronic interaction models included in the simulations describe the measurements correctly, especially with respect to the muon deficit compared to data [33], we can at least investigate differences arising in the max reconstruction when using different hadronic interaction models that predict different muon numbers. As latest results [34] indicate that EPOS-LHC is able to describe measurements of max better than QGSJetII-04, in this work the network was trained using the EPOS-LHC interaction model exclusively.
In the following we first investigate the results of the deep neural network when reconstructing max of events also simulated with the EPOS-LHC model. Next, to study the influence when modifying the hadronic interaction model, we evaluate the neural network on other hadronic interaction models than those used for the training. We use simulations based on QGSJetII-04 and Sibyll 2.3 as crosscheck in the following. We further tested the inverse setting, not presented in this work, observing that our findings are equivalent when flipping the sign of the biases. Subsequently, we discuss the combined energy and zenith dependency of the max reconstruction of the network, indicating the phase space region that can be used for a high quality reconstruction. Finally, we investigate the reconstructed max distributions.

Training and evaluation of the network using EPOS-LHC simulated events
In the following we show the results of the max reconstruction with the deep neural network as trained on simulations based on the EPOS-LHC interaction model. The events for evaluating the network originate from the separated test set and were also simulated using the EPOS-LHC model. In Fig. 3 the event-by-event correlation of the reconstructed and true max is shown for a benchmark  bin. The Pearson correlation coefficient is above 0.5 for all elements. Comparing Fig. 3b and Fig. 3a implies that element-specific shower-to-shower fluctuations are taken into account in the deep-learning based reconstruction. This results in very different max distributions for proton and iron showers reconstructed by the DNN.
In Fig. 4a we show the max reconstruction bias = max,DNN − max,MC as a function of the cosmic ray energy. Above 10 EeV the bias is below ±9 g/cm 2 independent of the composition of the simulated event set used for evaluation. The statistical errors obtained via bootstrapping are mostly hidden by the markers due to the large statistics of the test set (50,000 events). Thus, the expected precision in the determination of the first moment of max at the highest energies is excellent if the data were to look like the EPOS-LHC simulated data.
In Fig. 4b we show the event-by-event resolution of the max reconstruction as a function of the cosmic ray energy. The resolution improves with increasing energy, but exhibits a clear composition dependency. This dependency is expected as showers initiated by lighter nuclei exhibit larger shower-to-shower fluctuations. At 10 EeV the max resolution for proton-induced showers reaches 38 g/cm 2 which is below the max fluctuations of 40 g/cm 2 as measured by the Pierre Auger Observatory [34]. For iron induced showers the max resolution at 10 EeV is at the level of 20 g/cm 2 only.

Evaluation of the network on QGSJetII-04 and Sibyll 2.3 simulated events
In order to investigate the dependency of the deep neural network on the hadronic interaction model used for simulations, we evaluate our network trained on showers simulated using the EPOS-LHC model on simulations using QGSJetII-04 and Sibyll 2.3. In Fig. 5 we show the event-by-event resolution and bias of the max reconstruction as a function of the cosmic ray energy. In Fig. 5a we show the reconstruction bias as a function of the cosmic ray energy for QGSJetII-04. Above 10 EeV the bias is below ±10 g/cm 2 , however with a shift of approximately −5 g/cm 2 at larger energies. The shift observed for simulations using Sibyll 2.3 (see Fig. 5c) is slightly larger and amounts to roughly −15 g/cm 2 . We assign this effect to the hadronic interaction model as it was not visible in Fig. 4a. Besides the total shift, a slight energy-dependency can be observed when averaging among the compositions of Sibyll 2.3 and QGSJetII-04. Tracing back the differences to individual characteristics of the hadronic interaction models is highly complex since the method is based on time-dependent signals. Therefore, not only the abundance of the individual shower components but also their respective time-dependent shower development needs to be considered.
The resolution of the network consistently improves with increasing energy, but exhibits a clear composition dependency which is the same for both interaction models. At 10 EeV the max resolution for proton-induced showers reaches 38 g/cm 2 and for iron showers 20 g/cm 2 as visible in Fig. 5b and Fig. 5d. These values are close to the result shown in Fig. 4b, so we conclude that the resolution of max is almost independent of the hadronic interaction model.
The element-wise event-by-event correlations are very similar for all elements and energies. This is not surprising since the correlation is connected to the resolution which was found to be independent of the hadronic interaction model.

Zenith dependency of the reconstruction
To investigate the zenith dependency of the reconstruction, the evaluation of the neural network on air showers simulated with EPOS-LHC is presented. For the zenith angle in Fig. 6a, we observe a moderately varying, composition-dependent reconstruction bias of up to ±10 g/cm 2 at = 60 • . Also the max resolution in Fig. 6b exhibits moderate variations with the zenith angle. For protoninduced showers they are around 30 g/cm 2 , and for iron around 20 g/cm 2 . For all elements a particular good reconstruction is visible at ≈ 45 • and a worsening for angles > 55 • and < 20 • . The reason for this behavior is the concurrence of larger footprints and increasing absorption effects of the atmosphere for larger zenith angles, leading to more triggered WCDs but reducing the number of particles measured by an individual detector station. Further, for very vertical showers the shower maximum can be close to the ground or even underground which makes the reconstruction more difficult.
Note that the results shown in Fig. 6 are dominated by events with low energies. To verify a phase-space region in energy and zenith angle for high-quality reconstruction of max we present in Fig. 7 the zenith-energy dependency of the max bias separately for protons and iron nuclei which show the largest bias. To obtain the same number of events in each bin, we plot sin 2 and use logarithmic bins in energy. It is evident that a high-quality phase space at moderate zenith angles exists already at smaller energies, its size increases with energy and allows for reconstructions with a small bias. Outside this region for very vertical and horizontal showers, the zenith bias is particularly apparent for proton primaries, but remains mostly in the order of 15 g/cm 2 per energy bin over the complete energy range. For iron nuclei the bias is even smaller. In addition, vertical proton showers with high energies show an increased bias as for such shower geometries shower maxima below ground are more likely. These found dependencies can also be observed with showers simulated using QGSJetII-04 and Sibyll 2.3.
Overall the reconstruction of max with respect to its bias and resolution performs well on simulations. Above cosmic ray energies of 10 EeV, the expected max reconstruction bias is below ±10 g/cm 2 and the max resolution becomes better than 35 g/cm 2 even for the lightest particles. When investigating events with saturated stations, we do not find noteworthy differences in the network performance.
Although we observe slight differences caused by using different hadronic interaction models, the network reconstruction exhibits only minor dependencies on the interaction models.

Distribution of the reconstructed shower maxima
For estimating the cosmic-ray composition the correct shape of the max distribution is essential. Due to the fact that the DNN is trained using the mean squared error, the predicted distributions are not broadened by the resolution of the method, but tend to be truncated. This bias towards the mean of the true max distribution depends on the performance and creates a dependency on max . Consequently, the quality of the estimator cannot be determined by the resolution only and is therefore examined in detail. As a result, even if the resolution for iron showers is in the range of the shower-to-shower fluctuations, a distribution with correct shape can be predicted if the estimator is sufficiently precise. Precise means in this case that the DNN assigns proton-like values of max to proton-like showers and iron-like values of max to iron-like showers etc.
To investigate the reconstructed max distribution of the network for different energies we show in Fig. 8 the results of the max reconstruction with the deep neural network for two example bins. In Fig. 8a the reconstructed distribution at 20 − 30 EeV for showers simulated with EPOS-LHC is shown. The solid histograms represent the distribution of reconstructed max for proton (red) and iron (blue) showers. The simulated max values are denoted by dashed histograms. It is apparent that both, the proton distribution and the iron distribution, are in good agreement. In Fig. 8b the reconstructed max distribution for showers simulated using Sibyll 2.3 at the same energy is shown. Excepting a total bias of around 15 g/cm 2 which we expect from the observed hadronicmodel bias (see Fig. 5c) the simulated and reconstructed distributions match well. The successful reconstruction of the max distributions can be generalized for energies ≥ 10 EeV. At lower energies the reconstruction biases at low and high zenith angles (compare Fig. 7) superimpose the physics fluctuations. A fiducial selection could potentially reduce this effect. The observation that max distributions similar to the true values are reproduced for all elements and hadronic interaction models (despite an absolute offset), implies that the neural network can be used to gain insights beyond the first moment of the distribution. Beyond the measurement of ( max ), this opens up possibilities for in-depth analyses of the mass-fractions of UHECRs [4].

Application to hybrid data
In this section we evaluate the performance of the deep neural network on events which include measurements of the surface detector and the fluorescence detector. First, we correct for long-term aging effects of the WCDs, which are moderate over the many years of operation, but are still noticeable in the network prediction. Then we calibrate the absolute value of the reconstructed max using hybrid measurements. Finally, we will determine the max resolution to verify the potential for obtaining new information about the cosmic-ray composition at the highest energies from the first two moments of the max distribution.

Corrections for detector-aging effects
The exact shape of the signal traces has a decisive influence on the network predictions (see Fig. 1b). Besides the slightly varying response of each PMT (see section 2.4), over the long operating time of the observatory, aging effects in the signals of the WCDs occur as a combined effect of the water quality, the reflective Tyvek coating, the PMTs, and the electronics. The influence of these effects is continuously monitored by the local calibration using atmospheric muons that constantly pass through the detectors [22]. This allows to quantify measured signals in the unit of vertical equivalent muons (VEM). Every 60 s the calibration parameters VEM and VEM are determined using the signal pulses induced by atmospheric muons. Here, VEM corresponds to the pulse-height induced by a vertical traversing muon and VEM to the integrated pulse (bins of 25 ns).
To correct for slightly different pulse-shapes and effects of aging, the area over peak ( / ) variable has demonstrated its effectiveness [21]. It describes the ratio A/P = VEM / VEM , and hence is a measure of the pulse shape by quantifying the relation between the signal height and decay. Due to the FADC sampling rate of 40 MHz, / is given in units of 25 ns. As first order approximation the area over peak is averaged for each event over all PMTs of triggered stations, which results to a characteristic and event-wise / .
The gradual aging effects of the detectors are measurable in the detector monitoring and the predictions of the neural network. Since the beginning of the operation / decayed from roughly 3.2 to 2.95. During that time, the predicted max of the DNN decayed by about 1 g/cm 2  of the / -values in data and simulations, the average max prediction is increased by roughly 9 g/cm 2 using the / calibration.

Evaluation and calibration of the first moment of the distribution of shower maxima
To further quantify the reconstructions of the DNN and allow for an absolute calibration of max we use hybrid events which offer high-quality fluorescence measurements. In Fig. 10 the event-byevent correlation between the max reconstruction of the DNN and the FD is shown. The Pearson correlation coefficient amounts to 0.63 and remains above 0.6 even if max is corrected for the elongation rate observed by the FD [4].
In Fig. 11 we show the event-by-event difference Δ max = max, DNN − max, FD for 3 example energy intervals as reconstructed by the deep neural network and the FD. All distributions are rather narrow and follow a Gaussian. This observation indicates that the shower-to-shower fluctuations measured with the FD, which feature a Gaussian convolved with an exponential [35], are also covered in the SD-based reconstruction. The observed offset of around −30 g/cm 2 indicates that the predicted max values are too small on average. Hence, the neural network predicts shallower shower maxima. This shift is caused by slightly different shapes of the signal traces in simulations and data. One possible reason is the shower development. In fact, a bias was already present when evaluating the DNN using showers of hadronic interaction models different than used for the training. Also the size of the observed bias is not unexpected, since already previous analyses indicated problems of these models to precisely describe the lateral and longitudinal profiles of the muon component [12,36]. In addition, differences in the detector simulation of the signal characteristics of the WCDs compared to data can contribute to the observed bias.
With modern methods, the differences between data and simulations can be quantitatively determined and eliminated [37]. In that data-driven method the simulation is refined to look more data-like using a semi-supervised learning approach. In contrast, in this work we eliminate differences with a calibration using the independent measurements of the FD. The energy dependency of this reconstruction bias is summarized in Fig. 12a. Within the statistics of the hybrid data used, no significant energy dependency is observed. Only the lowest energy bin at 3 EeV exhibits a slightly smaller bias. Therefore, we calibrate the max value reconstructed by the network with the average bias as measured with the FD: SD max = max, DNN + 30.0 g/cm 2 . (5. 2) The offset was determined by a 1-parameter fit to 30.0 ± 0.6 g/cm 2 and is shown by the horizontal fitted line in Fig. 12a. Even if a constant gives a rather good fit ( 2 /ndf = 1), the first bins show an increased deviation from the FD measurements, which can be explained by biased reconstructions at low energies (compare Fig. 7). If these bins are not taken into account, the fit remains stable and moves by roughly −1 g/cm 2 .

Potential for determining the second moment of the distribution of shower maxima
In this section we will evaluate the resolution in the max reconstruction of the deep neural network using the hybrid events.
In Fig. 12b we show the standard deviation (Δ max ) of the event-wise differences between the max value as reconstructed by the deep neural network and the max value as measured by the FD. The symbols are located at the average energy of all events within the corresponding energy bin as indicated by the grey horizontal bars.
The vertical error bars indicate the uncertainty of (Δ max ) combined for both reconstruction methods of the network and the FD. In detail, this uncertainty was calculated by a bootstrapping method using 1000 random draws from the Δ max distribution and calculating the standard deviation for each set. The energy dependency of the combined max resolution is obtained by fitting an exponential function Δ max ( ) = · − ·(log 10 /eV−18.5) + to the data, which is shown as a red curve. The values of the coefficients are = 18.0 ± 2.5 g/cm 2 , = 2.9 ± 1.2 and = 27.7 ± 2.6 g/cm 2 . The max resolution [3] of the FD is indicated by the dashed grey curve. To obtain an estimate of the max resolution of the deep neural network, we perform a quadratic subtraction of the FD resolution (dashed grey curve) from the combined max resolution (solid red curve). The resulting max resolution of the deep neural network using only the measurements of the WCDs is shown as dashed red curve in Fig. 12b. We find that the resolution on the measured data reaches less than 25 g/cm 2 above 20 EeV, which is in the same order of magnitude as predicted by our simulation studies (Fig. 4b) reinforcing the finding that the resolution seems to be independent of the hadronic model. This will enable new insights into the cosmic-ray composition at high energies.

Summary
In this work we presented a new approach for reconstructing the maximum shower depth max using only the signal traces of the water-Cherenkov detectors (WCDs) placed on the ground, which record a tiny subset of the billions of shower particles. It was shown that the presented method is capable of exploiting the data measured by the WCDs more comprehensively than ever before by adapting deep learning techniques, resulting in an unprecedented performance for mass composition studies using the surface detector (SD). The new method yields both the large statistics of the SD and a measurement of max on an event-by-event basis. In consequence, our method opens up the possibility to measure the abundance of mass groups of UHECRs at 100 EeV and beyond for the first time on data.
As reconstruction method, we have developed an advanced deep neural network which is especially suited for the situation of the Pierre Auger Observatory. The signal traces of the WCDs are analyzed by the network using so-called LSTM cells and their measurements are combined according to the hexagonal symmetry of the detector grid.
A key issue to correctly adjust the network parameters is the proper preparation of the data used for the network training. In addition to re-scaling and normalization of the signal amplitudes and time measurements, we implement real operation-conditions in the simulation data as data augmentation during the training. This includes missing WCDs due to hardware failures or showers falling close to the edges of the detector grid, missing signal traces of single photomultipliers and detector stations with saturated signal traces owing to high-energy events or very close shower cores. By including such effects, we make the network robust against small differences between simulation and measured data, enhancing its generalization capacities and providing an accurate reconstruction of max for zenith angles up to 60 • and even for events with saturated station electronics.
Initially we evaluate the performance of the network on simulated data. When evaluating the network using disjunct data from the same simulation as used for training, we observe an almost bias free reconstruction of max . The max resolution improves with increasing cosmic ray energy and is composition dependent. For proton-induced showers the resolution is 38 g/cm 2 at 10 EeV and improves to 28 g/cm 2 at the highest energies. For iron primaries the resolution is better than 20 g/cm 2 above 20 EeV.
When changing the hadronic interaction model for the evaluation of the network, we find that only the absolute bias in the max reconstruction changes. In contrast to this shift in max , the resolution of max appears to be essentially independent of the hadronic interaction model. Additionally, we found that the network is able to reproduce the correct shape of the max distribution for all elements and hadronic interaction models at energies above 10 EeV.
Finally, we test the network performance using events which include measurements of the fluorescence detector. We first eliminated effects of detector aging from long-term operation of the observatory. Compared to the results of the fluorescence measurement a remaining shift of the reconstructed max value of the network of about 30 g/cm 2 was found. The shift is independent of the cosmic ray energy allowing for a straightforward calibration. The observed shift is larger than the constant shift of up to −15 g/cm 2 observed in the application of different interaction models. However, since all these models are not able to precisely describe the muonic component of air showers, a shift in the observed magnitude is not unexpected. Additionally, inaccuracies in the detector simulation could contribute to the observed bias.
We then estimate the max resolution of the network on data by subtracting the max resolution of the fluorescence detector in quadrature. We obtain a resolution of about 25 g/cm 2 above 20 EeV which is well compatible with the resolution expected from our simulation studies.
Thus, a high statistics measurement of the first moment max and the second moment ( max ) using the deep neural network reconstruction of the WCD-signal traces has a great potential to provide new insights into the cosmic-ray composition at the highest energies.