High throughput software-based gradient tree boosting positioning for PET systems

The supervised machine learning technique Gradient Tree Boosting (GTB) has shown good accuracy for the position estimation of gamma interactions in PET crystals in bench-top experiments, while its computational requirements can easily be adjusted. Transitioning to preclinical and clinical applications requires near real-time processing at the scale of full PET systems. In this work, a high-throughput GTB-based singles positioning C++ implementation is proposed, and a series of optimizations is evaluated regarding their effect on the achievable processing throughput. Moreover, the crucial feature and parameter selection for GTB is investigated for the segmented detectors of the Hyperion IID PET insert with two main models and a range of GTB hyperparameters. The proposed framework achieves singles positioning throughputs of more than 9.5 GB/s for smaller models and of 240 MB/s for more complex models on a recent Intel Skylake server. A detailed throughput analysis reveals the key performance-limiting factors, and an empirical throughput model is derived to guide the GTB model selection process and scanner design decisions. The throughput model allows for throughput estimations with a mean absolute error (MAE) of 175.78 MB/s.


Introduction
Positron emission tomography (PET) is a functional, tracer-based, non-intrusive imaging technique with broad applications in both the clinical and preclinical domain (Myers 2001, Phelps 2004). During PET scans, a ring of radiation detectors registers two incoming gamma particles originating from an electron-positron annihilation, which occurs due to the radioactive β+-decay of the used tracer. Scintillation crystals convert these gamma particles into optical photons, which are recorded by photosensor arrays. Based on the light pattern and timestamps registered by the photosensors, representing the raw sensor data, physically meaningful information about the gamma photon interaction needs to be derived. Typically, the spatial position and timestamp of the gamma photon interaction, as well as the amount of deposited energy in the scintillator, are calculated. This information is then used as an input for an image reconstruction algorithm. In previous work, we presented a fully integrated, software-based data acquisition and processing architecture (DAPA) (Goldschmidt et al 2013). The data acquisition and processing server (DAPS) receives unprocessed raw photosensor data of all connected radiation detectors and either processes singles or coincidences in real-time or saves all data to drives for later offline processing. The software-based implementation allows for high flexibility and reuse of existing software-based solutions. Furthermore, developed processing schemes are usually much simpler to port in a high-level programming language compared to hardware description languages as used for FPGAs.
The main processing steps of the current DAPS implementation (see figure 1) include initial corrections of the raw photosensor data and clustering of independent photosensor channel information. Interaction position and deposited energy are estimated, followed by the coincidence processing. This work focuses on the most time-consuming step of estimating the position of a single, which takes up around 50% of the total runtime. Additionally, the DAPS framework provides optional coincidence processing after the cluster process to reduce the subsequent workload, which was not used in this study.
Several data-driven modeling algorithms have been evaluated regarding their predictive capabilities for positioning gamma interactions, especially in the domain of monolithic scintillator concepts (Lerche et al). Statistical models provide promising use cases for detectors with segmented crystals as well, e.g., complex multi-layer staggered scintillators or local light-sharing approaches between single scintillator elements. Especially the incorporation of depth of interaction (DOI) resolving capabilities may significantly increase the complexity of algorithms and processing times. When progressing from bench-top experiments to full PET rings and production systems with near real-time throughput requirements, the runtime remains one of the main challenges for the wide application of statistical positioning algorithms.
Previous work has shown promising results for overcoming the runtime challenges while providing high spatial resolution through GTB methods (Müller et al 2018, 2019). GTB is a supervised machine learning technique that utilizes an ensemble of decision trees for classification or regression. While GTB predictors have been shown to require considerably less calibration data than other methods, GTB also scales well to three-dimensional positioning as its runtime only increases linearly with dimensionality (Müller et al 2018). For each positioning dimension, a separate GTB model is trained offline with a detector-specific calibration data set created by irradiating the detector at known positions using a dedicated coincidence collimator setup, e.g. a fan-beam collimator as described by Hetzel et al (2020). Furthermore, the GTB models can be configured via a set of hyperparameters, which control specific aspects of the training algorithm. A large space of parameter values allows for a trade-off between positioning accuracy and model complexity, which dictates the computational requirements.
In this work, we demonstrate the real-time processing capabilities of GTB-based singles positioning algorithms based on offline raw data of the Hyperion II D scanner. We extend the singles and coincidence processing (SCP) framework (Goldschmidt et al 2013) with the GTB method presented in Müller et al (2018, 2019), which acts as a drop-in replacement for the existing position estimation methods. Required GTB models were established based on a bench-top experiment with a single detector stack. The implementation is fully parallelized and optimized to handle online processing. To allow for an informed choice of suitable GTB parameter settings, we compare the processing throughput of different GTB model configurations and investigate the key factors which limit the processing throughput.

Figure 1. Processing steps performed by the software-based singles and coincidence processing (SCP) framework. Incoming data of every detector stack are processed fully in parallel, including raw data corrections and clustering of independent photosensor channel information related to a single gamma interaction. For positioning, several methods can be selected, including centre of gravity (COG), maximum likelihood (ML), and gradient tree boosting (GTB). In case of GTB, the positioning models are created based on a bench-top calibration and loaded into the DAPS framework.
These findings generalize beyond the specific characteristics of the used scanner and computing architecture. To estimate the achievable throughput of a given GTB model for hardware procurement, resource allocation, and similar scanner design decisions, we derive an empirical throughput model.

Hyperion II D insert
As a detailed description of the Hyperion II D PET insert can be found in Weissler et al (2015) and Hallen et al (2018), only a short overview of the main components required for the data processing chain is given: The Hyperion II D insert consists of ten singles detection modules (SDMs) mounted circularly on an MR-compatible gantry. Every SDM hosts six detector stacks in a 2×3 arrangement as shown in figure 2 and provides the infrastructure to operate the detector stacks, e.g., power distribution, liquid cooling, and RF shielding for operation inside the MR scanner. Every SDM is connected via User Datagram Protocol (UDP) to the DAPS, employing an optical gigabit Ethernet link for collecting and sending the raw data of all detector stacks. The Hyperion II D insert and the DAPS are controlled via a control PC connected to the DAPS. A separate synchronization unit distributes the reference clock and trigger signals.
Previous investigations (Goldschmidt et al 2016) found a maximum sustainable data rate of 95 MB/s per SDM, above which data messages of the detector stacks are randomly discarded by the SDM and not sent out. Measurements with hot-rod phantoms and cube sources revealed a nearly linear relation between activity and data input rate of 15.7 MB/(MBq s) to 19.1 MB/(MBq s). Depending on the configuration of the sensor tiles, the detector stacks themselves saturate in the region of 40 MBq to 45 MBq. The specific configuration settings of the Hyperion II D PET insert as well as details of the used phantoms are presented in Goldschmidt et al (2016). These estimates lead to a maximum data rate of the Hyperion II D insert of around 860 MB/s until the detector stacks saturate and approximately 950 MB/s for the SDMs. Although the presented data rates may vary with different system configurations and are approximate numbers, they allow an informed assessment of the following data throughput rates.

Detector stack
The detector stacks consist of a segmented scintillator array coupled via a lightguide to a sensor tile (see figure 2). The scintillator array is built up of 30×30 optically isolated LYSO scintillator crystals with a pitch of 1 mm and a height of 12 mm. Optical photons are spread by the 2 mm thick lightguide to multiple read-out channels of the sensor tile. The sensor tile is a custom-made variant made up of 16 digital SiPMs (DPC 3200-22, PDPC) with 2×2 read-out channels each, resulting in 64 read-out channels and 16 time-to-digital converters, which generate the timestamps. Further information regarding the DPC 3200-22 is presented in Degenhardt et al (2009) and Frach et al (2010).
Every DPC of the sensor tile acts as an independent unit following a two-level trigger scheme: After the first trigger condition, which logically connects 4 subregions of single read-out channels, is met, the DPC generates a timestamp and enters the validation phase of programmable length. The configurable validation criterion acts similarly to the trigger logic and is usually set to a higher level of required optical photons. Subsequent to a successful validation, the DPC starts the integration phase and, afterward, sums up all registered optical photons. Otherwise, the DPC resets and waits for the next trigger signal. After the integration phase, the photon counts of all four read-out channels of a DPC and their common timestamp are sent out.

Single-detector calibration setup
Data sets for training and testing the GTB models with known irradiation positions were established using a bench-top coincidence calibration setup described and characterized in detail in Hetzel et al (2020). The main components of the setup are a fan-beam collimator with adjustable slit width hosting two ²²Na sources and an electrically driven translation stage defining the irradiation positions (LIMES 90, Owis, Staufen im Breisgau, Germany). Both sources had an activity of 5.5 MBq, and the slit width was set to 0.4 mm for the detector calibration. The detector under test was placed on the translation stage and the coincidence detector on the opposing side. We utilized the PDPC technology evaluation kit (TEK) as the readout infrastructure for the bench-top experiment. Therefore, we mounted an identical scintillator array and lightguide to the compatible PDPC DPC-3200 sensor tile. The whole setup was placed in a light-tight temperature chamber keeping the air temperature at a constant 10 °C.

GTB-based positioning models
A detailed description of the GTB algorithm for predicting both planar and DOI positioning using monolithic scintillators is available in Müller et al (2018, 2019). As the method does not differ for this work, only the algorithm's main characteristics and the hyperparameters used in this study are summarized. The illustration in figure 3 outlines the central components of GTB-based positioning. As a supervised machine learning technique, GTB builds a predictive positioning model based on a training data set with known irradiation positions. GTB models can act as classifiers or regressors. As previously shown, regression models reduce the number of known classes in the training data set and, thus, the number of irradiation positions in the calibration process (Müller et al 2018). Therefore, only regression models are used in this work. The GTB algorithm accepts incomplete data for training and evaluation as well as arbitrary input features (Chen and Guestrin 2016). During training, GTB builds a set of weak prediction models, each consisting of a series of binary conditions (decision trees) (Kotsiantis 2013, Natekin et al 2013). The ensemble of decision trees is trained sequentially: while the first decision tree is based on the known irradiation positions, every following decision tree corrects the residuals of the former ensemble.
Three hyperparameters mainly influence the positioning accuracy of the used GTB model: (i) A higher number of trees in the model linearly increases the model's complexity and, thus, enables more accurate predictions at the risk of overfitting the training data.
(ii) A higher maximum tree depth has the same qualitative effect on the model's positioning accuracy as adding trees but increases the model's complexity exponentially.
(iii) To counteract overfitting, the applied learning rate may be decreased during training, thereby reducing the influence of a single tree on the final prediction result. Smaller learning rates may cause individual trees to grow deeper while still respecting the given maximum tree depth.
As shown in previous works, the positioning accuracy of GTB models can be influenced significantly by adding information as further input features to the photon counts (Müller et al 2018, 2019). This either improves positioning accuracy given the same hyperparameter settings or allows for smaller GTB models which achieve the same positioning accuracy. For this work, the following two input feature sets were considered: (i) The first feature set (F1) only contains the recorded photon counts of each pixel. As the DPC and pixel with the highest number of counted photons are already determined during previous steps of the DAPS processing pipeline, their indices were also added.
(ii) For the second feature set (F2), the COG as well as the row- and column-wise photon count sums of the DPC pixels were appended to the feature vector.
While extra calculations are required for computing and storing the additional feature values of F2 models, these features carry more information than the photon counts of individual pixels in F1.

Data generation and preprocessing
The calibration process was based on the single-detector calibration setup utilizing the fan-beam collimator. We irradiated the scintillator array centrally at every line of the 30 scintillator segments, following the pitch of 1 mm. Before this measurement, the scintillator array's edge was located by step-wise moving the detector into the beam as demonstrated in Ritzer et al (2017). As mentioned earlier, the GTB algorithm was strictly used as a one-dimensional positioning approach. Thus, the radiation detector was rotated by 90° and the process was repeated. The recorded raw data were preprocessed following the same scheme as shown in figure 1, employing an offline processing tool described in Schug et al (2015). All hits of read-out DPCs were merged into clusters using a cluster window of 40 ns. The timestamp of a cluster was defined by the earliest timestamp of all related hits. Coincidence clusters were searched using a sliding coincidence window of 10 ns. No further quality cuts were applied to the data. As every DPC of the sensor tile is independent, not all available DPCs validate and send out data. The clusters are arranged in vectors of 64 integers, marking missing data as -1 to allow a distinction from zero photon counts.

GTB model generation
GTB models were trained using the XGBoost package (Chen and Guestrin 2016), which is integrated in a self-developed Python-based processing framework. Per irradiated line of scintillator segments, we selected 20000 gamma interactions for the training data set, leading to 600000 clusters in total for both the x- and y-direction. The GTB models were validated and tested using validation and test data sets containing in total 180000 clusters per spatial direction. The labels used for the GTB positioning models described the line of scintillator segments, encoded by an integer number ranging from 0 to 30. The GTB regression models output a continuous estimate of the gamma interaction's scintillator line. Thus, the output value was mathematically rounded, in contrast to previous work. In case only integer values are used for the GTB evaluation within the DAPS framework (see section 3.3.4), the only continuous feature, the COG of feature set F2, was multiplied by 100 prior to an integer cast, retaining sufficient precision for this feature. All trained GTB models were qualified by calculating their predictive performance on the test data set. We used the prediction accuracy, defined as the percentage of predictions identifying the exact scintillator line corresponding to the irradiation position, as the evaluation metric. Due to Compton scatter inside the scintillator array, a prediction accuracy of 1 is not achievable. Nevertheless, the prediction accuracy allows for consistent and easy comparisons between the positioning models.

GTB-based singles positioning
Focusing on the main singles processing task of position estimation, a new GTB-based singles positioning algorithm extends the existing DAPS software by integrating into its extendable class hierarchy. An abstract base class for all positioning algorithms holds the necessary members for retrieving information about the used scanner arrangement and general configuration options common to all positioning algorithms. Appropriate child class methods for algorithm-specific options and the actual position calculations are invoked via virtual function calls. The new GTB-based singles positioning (Müller et al 2018, 2019) can therefore be chosen as an alternative to a COG calculation (Anger 1958) or an ML search (Gross-Weege et al 2016). Due to the framework's modular design, the remaining parts of the processing chain do not require any modifications. Therefore, in the following, 'throughput' refers only to the throughput rate of the singles processing step, as the other processing steps remained constant for all code versions.
As GTB positioning is performed by feeding feature vectors into pre-trained GTB models, the filenames of these models need to be specified in the application's configuration file. For each detector stack and positioning dimension, a separate model is read in, allowing for high flexibility in the chosen models and fine-grained control over the positioning accuracy. As described in section 3.1, feature set F2 requires calculations based on the measured photon counts. These calculations are performed within the GTB class to complete the input vector and are included in the throughput measurements.

GTB prediction implementations
The following sections present multiple runtime optimizations to accelerate GTB predictions. While the initial baseline utilizes library-exposed application programming interface (API) functions to load GTB models into in-memory data structures at runtime, the subsequent modifications build on the idea of prior model compilation introduced by Asadi et al (2014). All of the introduced changes maintain the positioning accuracy of the pre-trained GTB models.

XGBoost online prediction (xgboost)
Since the GTB models were trained using XGBoost (Chen and Guestrin 2016), online prediction was integrated by invoking the appropriate XGBoost C API functions. As no dedicated single-instance prediction routine is exposed, the data structures need to be rebuilt repeatedly for each positioned single. To constrain the regression outputs for the x and y coordinates to valid crystal indices, the resulting GTB predictions need to be rounded and clipped accordingly. The associated code is shown in listing 1. Alternatively, the GTB models could be trained as classifiers to eliminate the need for post-processing the predictions.

Treelite model compilation (treelite)
To overcome the above API limitations and implement a technique presented in related research labeled CodeGen (Asadi et al 2014), the Treelite library for the compilation of tree-based machine learning models was used (Cho and Li 2018). Pre-trained GTB models are fed into Treelite, which generates C source code consisting of a sequence of nested if-else conditions. After compilation to a shared object file, the model can be loaded during program initialization and used for later singles positioning via the exposed Treelite C API, as displayed in listing 2. Compared to XGBoost, Treelite allows for single-instance predictions and direct manipulation of feature values in the passed data structure, thus requiring fewer allocations and deallocations.
Listing 1. GTB Prediction using XGBoost's C API.

Primitive data type conversion (float)
In Treelite, feature values are implemented as unions. Missing features are indicated by setting the missing field to −1 and checking its value during prediction. These additional operations are eliminated by directly assigning a floating-point value of −1 to each missing photon count in the feature vector. This behavior has to remain consistent between training and prediction. As a consequence, the feature array only holds floating-point values and can be used directly for prediction. The implementation of this approach requires a transformation of the prediction interface: instead of using Treelite's functionality, the symbol of the predict function is looked up manually and called directly during prediction.

Float to integer quantization (int)
By converting floating-point feature values to an integer data type, prediction throughput may be increased further (Cho and Li 2018). On x86 architectures, integer comparisons are advantageous since constant values may be encoded into the operation directly. As a result, the processor does not need to execute an additional load instruction to retrieve the integer constant. Ultimately, fewer instructions are required and the executable code size is reduced.
In general, this transformation entails overhead to guarantee lossless conversions without affecting the positioning accuracy of the GTB model. However, as the input features for singles positioning only consist of discrete photon counts, the threshold in every branching condition may be replaced by its ceiling when using less-than comparisons. The only exception is the non-discrete COG coordinate of F2, which was multiplied by 100 before the integer cast to retain sufficient precision for this feature.

Conditional branch annotation (annotated)
To further reduce the runtime required for the evaluation of the models, branch annotations may be introduced into the generated C source code before compilation. They provide additional information to the compiler regarding likely execution paths. Through branch reordering in the generated binary, data locality may be improved and branch predictions may become more accurate. This in turn improves instruction pipelining, in which incoming instructions are split into sequential steps. Fewer pipeline flushes occur, which allows the waiting time caused by memory access latencies to be overlapped with the execution of other instructions more efficiently.
Treelite's built-in features are used (Cho and Li 2018) to generate annotations. The GTB model is evaluated on all training samples and executed paths are recorded individually for every branching condition. Conditions are marked as 'likely' if rendered true more than 50% of the time. The training data is assumed to approximate the online processed data with sufficient accuracy and, therefore, the provided annotations hold in the majority of cases.

Benchmarking methodology
For the evaluation of the implemented optimizations, a series of benchmarks was carried out on the RWTH compute cluster. Each compute node is equipped with two Intel Xeon Platinum 8160 Skylake central processing units (CPUs) clocked at a base frequency of 2.1 GHz and 192 GB of main memory. The CPUs were allowed to increase their clock frequency if their temperatures allowed for it (turbo boost). Each CPU features 24 physical cores; hyper-threading was disabled. The cluster was running CentOS Linux release 7.6.1810 during the tests, and the code was compiled with the Intel C/C++ Compiler 18.0.3 with the optimization options -g -O3 -xHost. The measurements cover increasing process counts, scaling from a serial execution to a fully utilized compute node. To ensure reproducibility and reduce throughput variability, executions for each configuration were repeated 20 times with each process assigned to a specific core (process binding). The median of these measurements was used as a robust estimate of the expected throughput.
Previous research (Müller et al 2018, 2019) hinted towards reasonable parameter settings for achieving acceptable spatial resolution. Four GTB models of increasing complexity (tiny, small, medium, and large), differing in their number of trees and maximum tree depth, were chosen to evaluate the impact of the implemented runtime optimizations on the throughput:

Empirical throughput model
An empirical throughput model allows for an informed GTB model choice and guides computing hardware procurement. A set of sensible parameters characterizing the achievable throughput of a given GTB model is presented and validated based on the findings of a low-level throughput analysis. These parameters can also be computed for previously untested GTB models; therefore, their throughput can be estimated without conducting runtime measurements.
Four main characteristics were considered to influence the parallel throughput of a GTB-based prediction model: (i) The average number of evaluated conditionals per prediction can be estimated by evaluating the GTB model on its training data and counting the number of samples that end up in each specific tree leaf. This distribution is aggregated into an average leaf depth and multiplied by the number of trees in the GTB model to get the desired figure (conditionals).
(ii) Memory related operations required for loading the binary instructions from the compiled GTB model are estimated by the GTB model's byte size on disk (memory).
(iii) Modeling the exact behavior of hardware-based branch predictors would provide the most accurate branch misprediction ratio estimates. However, due to their growing complexity and dynamic behavior in modern processor architectures (Uzelac and Milenkovic 2009), this approach is infeasible in this context. Instead, a model's branching behavior is represented by reusing the annotation data generated for each conditional to calculate how often the compile-time hint of the annotated version is incorrect, averaged across the whole model (hint). The resulting value represents a penalty for non-contiguous code execution.
(iv) The number of executing processes (processes) is set during application execution and can, therefore, be used in the throughput model directly.
In the following, this modeling approach is referred to as the universal throughput model. For an alternative throughput model, which only uses the initially supplied GTB model parameters, some of the above characteristics can be approximated or represented differently. The average number of evaluated conditionals per prediction is approximated by multiplying the maximum tree depth (depth) by the number of trees in the GTB model (trees). The GTB model's binary size is approximated by multiplying 2^depth by trees. It is not possible to capture the branching behavior of a GTB model using only its learning rate (learning), as this would not account for different input features. For this limited use case, we can indicate the used feature set F1 or F2 by setting features to 1 or 2, respectively. In the following, this modeling approach is referred to as the narrow throughput model.
Both throughput models were evaluated and compared. For both approaches, the benchmarking data described in section 3.4 was used to generate a linear least square regression model for the throughput of singles processing.

Results
For all but four of the collected measurements shown in figure 4, the 95% confidence interval fell within 10% of the reported medians, so the data are summarized in the following by taking the median throughput rate of all repetitions. The remaining four results showed no abnormal runtime behavior or outliers and were thus treated in the same manner. Percentages were calculated by taking the median percentages over all process counts, model sizes, feature sets, or code versions where appropriate.

Feature usage in GTB models
For all tested GTB parameter configurations, the F2 models deliver higher positioning accuracies than the F1 models with the same hyperparameter settings, as indicated in figure 5. This is achieved through the additional features to which F2 models have access during prediction, namely the 'cog' and 'sum' data. The tiny F2 model largely uses these calculated features to increase its positioning accuracy significantly above the F1 model's accuracy. With growing model size, the usage of the raw photon counts increases for both feature sets, allowing for fine-grained adjustments to the estimated interaction position.

Throughput by model complexity
For all measured configurations, the achievable processing throughput declines with growing GTB model complexity. While a maximum throughput of 9.6 GB/s is possible when using the tiny model, the throughput peaks at 240 MB/s for the large model. The small, medium, and large models achieve 23.6%, 10.8%, and 3.2% of the tiny model's throughput, respectively.
When comparing the achievable throughput of an F1 model to the F2 model with the same hyperparameter settings, the collected data show that in all cases the F1 model delivers a higher throughput. Overall, F2 models achieve 81.3% of the throughput of F1 models.

Throughput by code version
For all measured configurations, the achievable processing throughput increases when using compiled GTB models compared to the xgboost version. Exceptions occur when using the large model with more than 24 processes, where the treelite and float versions fall below the xgboost version. The treelite version achieves 465% of the xgboost version's throughput.
The float version achieves 108% of the treelite version's throughput. The int version improves the throughput further as it achieves 135% of the float version's throughput. While the annotated version only reaches 95.1% of the int version's throughput at worst, in the median its throughput reaches 107% of the int version's throughput.

Parallel scaling behavior
During parallel processing, all code versions achieve a sublinear median speedup-calculated as parallel throughput divided by serial throughput-between 25.7 and 36.5 in case of a fully utilized compute node with 48 processes using the tiny, small, and medium models. This corresponds to a parallel efficiencycalculated as speedup divided by number of used processes-between 0.54 and 0.76. For the large model, the xgboost version achieves a perfect speedup in the majority of cases. The only exception is the large  2 model, for which an execution with 48 processes only achieves a parallel efficiency of 0.5. The highest throughput of all other code versions is achieved when partially utilizing the available processor cores by executing with fewer than 48 processes. The process threshold where throughput begins declining is specific to each code version. The earliest drops simultaneously occur for treelite and float which are followed by int and lastly annotated. . Singles processing throughput of the five presented code versions using four differently sized GTB models each with  1 and  2 . In each subplot the same model is used for all code versions; thus their positioning accuracy is also identical. Figure 5. The mean relative frequency of each feature group occurring in a split condition for four differently sized GTB models each with  1 and  2 . The groups 'raw', 'max', 'cog', 'sum' contain the raw photon counts of each pixel, the max die and max pixel, the center of gravity, the row and column sums respectively. The frequencies are calculated by counting the split conditions which contain an index of the respective group divided by the number of total split conditions in the GTB model. The mean is derived by dividing by the number of individual indices in the respective feature group. The positioning accuracy of each GTB model configuration is indicated as ACC in the top left of each subplot. 4.3. Processing throughput analysis 4.3.1. 
Accessed memory levels
As the memory access latency varies between the different cache levels and main memory, the accessed level in the memory hierarchy has a substantial impact on the runtime of latency-bound code regions. The results of the lat_mem_rd benchmark of the lmbench3 benchmark suite (McVoy and Staelin 1996, Staelin 2002) in table 1 show that latencies for these memory accesses can be reduced by a factor of 4 or 5 depending on the memory level.
This impacts not only serial runtime but also parallel executions. While the first and second cache levels are managed on a per-core basis, the last level cache is shared between all cores of the investigated CPU.
The visualization of the benchmarking data in figure 6 highlights the relationship between the memory level a compiled GTB model fits into and its achievable throughput. The four groups of curves correspond to the four different model sizes. While the tiny models fit into the L1 cache, the small and medium models fit into the L2 cache for all code versions and both feature sets. For the large model, however, all code versions and both feature sets eventually fill up the L3 cache at high process counts, and the models then reside in main memory.
This shows that as long as one GTB model per executing process fits into at least the last level cache, throughput improves when adding more processes. Note that each executing process actually accesses two separate GTB models during each positioning calculation, one for each positioning dimension. However, since only parts of a GTB model need to be loaded depending on the taken branches, using the average size of just one GTB model per executing process leads to good approximations. When exceeding the described cache size threshold, achievable throughput rates may decline due to cache thrashing: the executing processes compete for the same limited cache storage when loading GTB models for prediction and, therefore, repeatedly need to access data from main memory. In these cases, the utilization of multiple threads per process enables the sharing of cache data, specifically of the GTB models. This could increase parallelism beyond the maximum number of processes and may lead to additional throughput improvements benefiting from the reuse of shared cache data. An evaluation of this proposition remains a subject for future work.
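The cache-capacity reasoning above can be sketched as a back-of-the-envelope calculation. The function name and all sizes below are illustrative assumptions, not the measured values of this work:

```cpp
#include <cassert>
#include <cstddef>

// Rough estimate of how many processes can run before their GTB models
// overflow the shared last-level cache, using the text's approximation of
// one average-sized model binary per executing process.
std::size_t max_cache_resident_processes(std::size_t model_bytes,
                                         std::size_t llc_bytes) {
    return model_bytes == 0 ? 0 : llc_bytes / model_bytes;
}
```

For example, assuming a 33 MB L3 cache, a 7 MB model binary leaves room for only four cache-resident processes, whereas a 20 kB model never exhausts the cache on a 48-core node.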

Evaluating branch prediction
For tiny, small, and medium models, adding branch annotations has only a negligible impact on the achievable throughput. For large models, however, the annotated version improves the throughput beyond the optimizations of the int version. Measuring the branch mispredictions shows that the misprediction ratio is reduced from 6.5% to 6.3% for the large F1 model and from 7.3% to 6.6% for the large F2 model. Consequently, the hardware prefetcher of the CPU can prefetch the correct data more frequently, which results in a measurable reduction of data traffic between the L2 and L3 cache: for the large F1 model, the data movement is reduced from 16.6 TB to 11.5 TB, while for the large F2 model it is reduced from 25.7 TB to 15.1 TB. Due to the improved accuracy of the prefetcher, memory access latencies can be hidden more frequently, resulting in higher throughput. Therefore, the annotated version of the models remains the preferred choice for further applications.

Figure 6. Interaction between the memory consumption of compiled GTB models and the achieved processing throughput, with one curve per code version, feature set, and model size combination.
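As a hedged illustration of what such a branch annotation can look like in compiled tree code: the sketch below marks a split condition as likely taken, in the spirit of the annotated code version. The feature indices, thresholds, leaf values, and the assumed 80% branch frequency are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Portable "likely" hint; GCC/Clang lower it to __builtin_expect so the
// compiler can lay out the hot path for the static branch predictor.
#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// One hypothetical compiled GTB tree: training data is assumed to show the
// first split taking the left branch ~80% of the time, hence the annotation.
float predict_tree(const int32_t* f) {
    if (LIKELY(f[12] < 57)) {
        if (f[3] < 9) return 0.125f;
        return 0.375f;
    }
    return 0.875f;
}
```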
As the throughput of F1 models surpasses that of F2 models with the same hyperparameter settings, their branch misprediction ratios can be compared, as shown in table 2. Again, lower misprediction ratios coincide with higher throughputs.

Empirical throughput model
Due to its higher throughput, only the annotated implementation is used in the following.
The benchmarking data described in section 3.4 is used to generate a linear least-squares regression model for the throughput of singles processing. The formula for the universal throughput model is:

For the measured data, a mean absolute percentage error (MAPE) of 17.33% and a mean absolute error (MAE) of 175.78 MB/s are achieved. As this regression model does not account for the throughput limitations imposed by the restricted L3 cache size, its accuracy degrades for GTB model configurations with a maximum depth of 10. When only GTB models that offer sufficient positioning accuracy and processing throughput are considered, i.e. when the configurations with the largest and smallest tested parameter settings for the number of trees and the maximum tree depth are ignored, the MAPE and MAE of the same regression model drop significantly to 10.59% and 106.18 MB/s, respectively.
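The fitting step itself is ordinary least squares. The sketch below shows the principle for a single predictor (e.g. a model's memory footprint in MB), whereas the universal model of this work regresses over several GTB model characteristics; the closed-form solution here is a simplification for the one-predictor case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Fit { double slope, intercept; };

// Ordinary least squares for throughput ≈ slope * x + intercept,
// via the closed-form normal equations for a single predictor.
Fit ols(const std::vector<double>& x, const std::vector<double>& y) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return { slope, (sy - slope * sx) / n };
}
```

Quality metrics such as the MAPE and MAE quoted above are then computed by comparing the fitted predictions against the measured throughputs.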
Utilizing the previously described approximations for each term in the narrow throughput model leads to the following regression model, which uses a nested regression model for the memory footprint:

Discussion
The presented implementation achieved data throughput rates between 240 MB/s and 9.6 GB/s for the investigated processing steps, depending on the GTB model. Compared to the throughputs of 12 350 MB/s and 3650 MB/s achievable by COG and ML positioning, respectively, on the same reference machine, this demonstrates the feasibility of GTB-based real-time processing on the scale of PET systems, considering the maximum data rate of the Hyperion IID insert of less than 1 GB/s. Furthermore, the DAPA allows the potential application of this technique to even larger PET systems, while the presented insights help to match PET hardware and the required processing architecture. When selecting a specific GTB model for a given processing task, the choice is mainly limited by two considerations: the desired positioning accuracy, which directly impacts the image quality, mainly in terms of the signal-to-noise ratio (SNR), and the processing throughput required for real-time data processing. These two aspects are investigated separately in the following, and the findings are then combined to guide an informed GTB model choice.
As each process receives its own set of irradiation events, singles positioning is performed independently without the need for synchronization. However, the expected ideal scalability does not hold in all cases, as the measurement results in figure 4 show. The irregular memory access pattern introduced by the chain of conditional jumps bounds the runtime of the GTB prediction by the latency of the memory subsystem. Therefore, its runtime is mainly influenced by the accessed memory level and the pipelining efficiency, as detailed in the following.
As the memory access latency varies between the different cache levels and main memory (see table 1), the accessed level in the memory hierarchy has a substantial impact on the runtime of latency-bound code regions and, thus, on the processing throughput of GTB models. When accessing the input feature vector, it will reside either in registers or the L1 cache due to its small size. However, while the compilation of the GTB model speeds up its execution significantly in the majority of cases, the resulting control flow dependencies introduce jumps through the dynamically loaded binary. Depending on the GTB model parameters and the code version, the binary's size can range from 20 kB to 7 MB. The throughput improvements achieved by the float and int versions can be attributed to a smaller model binary size. While the former eliminates one conditional check per if-condition, the latter optimization targets specific limitations of the x86 instruction set, further removing one load instruction per if-condition as described in section 3.3.4. As an effect, instructions are more often loaded from lower levels of the memory hierarchy.
To further improve the processing throughputs, one would need to lower the memory consumption of the GTB models beyond the already applied optimizations in this work. This includes the reduction of the number of distinct branching conditions through the union of similar subtrees (Nakamura and Sakurada 2019) and similar graph compression algorithms. However, these optimizations are typically lossy and may reduce the positioning accuracy.
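A lossless variant of such model compression can be sketched via hash-consing: structurally identical subtrees are stored only once and shared, shrinking the memory footprint without changing predictions. The lossy subtree-union approach cited above goes further by also merging merely similar subtrees. The node encoding below is a deliberate simplification.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

// Interning pool for tree nodes. A node is either a leaf (carrying a value)
// or a split (feature, threshold, child indices). Identical subtrees map to
// the same pool index, so they are stored only once.
struct NodePool {
    using Key = std::tuple<bool, int, int, int, int>;
    std::map<Key, int> interned;
    std::vector<Key> nodes;

    int intern(bool is_leaf, int a, int b, int left, int right) {
        Key key{is_leaf, a, b, left, right};
        auto it = interned.find(key);
        if (it != interned.end()) return it->second;  // reuse existing subtree
        int id = static_cast<int>(nodes.size());
        nodes.push_back(key);
        interned.emplace(key, id);
        return id;
    }
    int leaf(int value)   { return intern(true, value, 0, -1, -1); }
    int split(int feature, int threshold, int left, int right) {
        return intern(false, feature, threshold, left, right);
    }
};
```

Because children are interned before their parents, two subtrees are identical exactly when their keys match, so a simple map suffices for deduplication.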
The branch hints introduced by the annotated version only had a measurable effect on the runtime of the large model predictions. Due to improved branch predictions, the correct data is prefetched in more cases. Thus, memory access latencies are hidden more effectively, which directly improves the throughput of singles positioning. With increasing GTB model size, accesses to the memory level in which a model can reside become more costly, which in turn increases the importance of this optimization.
Even if the throughput requirements of a full PET system cannot be met by a single processing server, the flexibility of the DAPA allows the data stream of each singles processing unit (SPU) to be processed on independent machines. Only the final coincidence search requires aggregation into one coherent timeline. Thus, processing throughputs can be increased beyond the results presented here.
While the universal throughput model uses the specific characteristics of a trained GTB model, its general formulation allows arbitrary GTB models to be evaluated. As an alternative that only uses the initially supplied GTB model parameters, the narrow throughput model could be preferred in order to avoid the potentially time-consuming training process of the GTB models. However, its applicability to GTB models with different feature sets is limited. Evaluating the two regression models on the measured data shows that the accuracy of the narrow throughput model falls significantly short of that of the universal throughput model.
While these throughput models offer throughput estimates for a specific processor and node configuration, they could be generalized by incorporating characteristics like cache sizes to allow for their applicability to any underlying computing hardware. This remains a subject for future work.
The selection of a GTB model represents a multi-objective optimization problem that aims to maximize both the positioning accuracy of the chosen GTB model and its processing throughput. A set of optimal solutions can be generated by computing the Pareto frontier (Luc 2008). For these points, neither accuracy nor throughput can be improved without worsening the other criterion.
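A minimal sketch of computing such a Pareto frontier for (accuracy, throughput) pairs, assuming both objectives are to be maximized; the struct and field names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Config { double accuracy, throughput; };

// Pareto frontier for two maximized objectives: sort by accuracy in
// descending order, then keep each point that sets a new throughput
// maximum. Every kept point is dominated by no other point.
std::vector<Config> pareto_frontier(std::vector<Config> pts) {
    std::sort(pts.begin(), pts.end(), [](const Config& a, const Config& b) {
        if (a.accuracy != b.accuracy) return a.accuracy > b.accuracy;
        return a.throughput > b.throughput;
    });
    std::vector<Config> front;
    double best_tp = -1.0;
    for (const auto& p : pts)
        if (p.throughput > best_tp) {
            front.push_back(p);
            best_tp = p.throughput;
        }
    return front;
}
```

The resulting frontier is sorted from the most accurate (and typically slowest) to the fastest (and least accurate) configuration.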
Applying this technique to the collected benchmarking data yields the 16 GTB model configurations in table 3, each with a given number of executing processes, which serve as a baseline for the model selection.
The visualization in figure 7 shows this same set of points as part of the complete benchmarking data. The vertical grouping of points stems from multiple executions with the same GTB model utilizing different numbers of parallel processes. Provided positioning accuracy estimates for a given GTB model parameter configuration, the same methodology can be combined with either of the above throughput models to guide the model selection. Examining the trends in the Pareto frontier data reveals that, in order to increase positioning accuracy while retaining optimality, the additional calculated features should be added to a GTB model first. Although F1 models achieve higher throughputs than F2 models for the same hyperparameter settings, F2 models should almost always be preferred over F1 models when aiming for the same positioning accuracy. Next, it is most beneficial to increase the tree depth, even up to depth 10. Further trees should be added last, as this improves the positioning accuracy only marginally. The optimal learning rate decreases from 0.7 to 0.4 when trees are grown deeper; a value of 0.2 should only be chosen in large models for minor positioning accuracy gains.

Conclusion
This work demonstrates the real-world applicability of GTB-based singles positioning by providing a high throughput implementation in a parallelized processing framework for PET SCP. A range of optimizations, including model compilation and data type conversion, was shown to significantly increase the processing throughput, resulting in an achievable throughput of more than 9.5 GB/s. A throughput analysis revealed that GTB prediction is mainly limited by memory access latencies due to the irregular memory accesses of the compiled GTB model binary. Thus, reducing the memory footprint of these binaries such that they fit into lower-level caches was shown to be beneficial for increasing throughput. Additionally, branch annotations proved effective in improving the pipelining efficiency and hiding memory access latencies by lowering branch misprediction ratios. For the fully optimized implementation, a precise throughput model was derived that enables throughput estimation with a MAPE of 17.33% and an MAE of 175.78 MB/s. In the future, the processing throughput of GTB-based singles positioning could be further improved by lowering the memory consumption via compression of GTB models (Nakamura and Sakurada 2019). Similarly, GTB models could be split into ranges of trees that each fit into a specific cache level, evaluating one tree subset on all instances before progressing to the next (Ilic and Kuvshynov 2017). Moreover, previous research (Asadi et al 2014) suggests converting the control dependencies caused by branching into data dependencies by employing predication. Jumps in the underlying assembly code are thereby replaced by memory accesses, which also incur memory access latencies; however, vectorization may hide these latencies and speed up prediction overall. The presented throughput model may be generalized to become architecture-independent by incorporating specific cache sizes and the different memory access latencies within the memory subsystem.
This would enable more accurate throughput predictions when progressing to new computing hardware.
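The predication idea mentioned in the conclusion can be sketched for a complete binary decision tree: the comparison result feeds the index arithmetic instead of a conditional jump, so the traversal is branch-free. The node layout and all data below are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free traversal of a complete binary tree of the given depth.
// Internal nodes are stored in level order (children of node i sit at
// 2i+1 and 2i+2); leaves are stored in a separate array. The comparison
// yields 0 or 1 and selects the child without a conditional jump.
float predict_branchless(const int32_t* f, const int* feat,
                         const int32_t* thr, const float* leaf, int depth) {
    int idx = 0;
    for (int d = 0; d < depth; ++d)
        idx = 2 * idx + 1 + (f[feat[idx]] >= thr[idx]);
    return leaf[idx - ((1 << depth) - 1)];  // leaves start at index 2^depth - 1
}
```

This trades the latency of mispredicted jumps for data-dependent loads, which vectorization may then hide, as suggested by the cited work.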