Accelerating Dedispersion Using Many-core Architectures

Astrophysical radio signals are excellent probes of the extreme physical processes that emit them. However, to reach Earth, the electromagnetic radiation passes through the ionised interstellar medium, which introduces a frequency-dependent time delay (dispersion) into the emitted signal. Removing dispersion enables searches for transient signals like fast radio bursts, as well as repeating signals from isolated pulsars or those in orbit around other compact objects. The sheer volume and high resolution of data that next-generation radio telescopes will produce require high-performance computing solutions and algorithms in time-domain data-processing pipelines to extract scientifically valuable results in real time. This paper presents a state-of-the-art implementation of brute-force incoherent dedispersion on NVIDIA graphics processing units and on Intel and AMD central processing units. We show that our implementation is 4× faster (for 8-bit, 8192-channel input) than other available solutions, and we demonstrate, using 11 existing telescopes, that it is at least 20× faster than real time. This work is part of the AstroAccelerate package.


Introduction
An upcoming new generation of radio telescopes, such as the Square Kilometre Array (SKA) (Carilli & Rawlings, 2004; SKA, 2019), will simultaneously observe many different regions of the radio sky. Each simultaneous observation will have high time resolution and fine channelisation of the observed bandwidth, giving rise to large data volumes at high data rates. These data rates are expected to make storing all data for offline analysis impractical, necessitating faster-than-real-time data processing software.
To extract the very faint signals present in the noisy data produced by these telescopes, many processing steps have to be performed on the data. One of the more computationally expensive steps is dedispersion. The dedispersion process increases the signal-to-noise ratio (SNR) of received signals from the emitting object being studied by shifting samples in different frequency channels in time, thus correcting for the time delay introduced by dispersion. Samples at the same timestamp are then summed over frequency channels, increasing the SNR and the probability of detection.
The dispersion of the emitted pulse occurs due to the interaction between photons in the pulse and the ionised interstellar medium (ISM) through which they travel. Dispersion has the effect of causing a frequency-dependent time delay (∆τ) in the photons' propagation. Specifically, lower frequency photons within the pulse (f_low) are observed later than their high frequency (f_high) counterparts (see Lorimer & Kramer, 2005). This time delay is proportional to the inverse square of the frequency, given by the cold plasma dispersion law

∆τ = C_DM · DM · (1/f_low² − 1/f_high²) ,    (1)

with the constant of proportionality C_DM = 4148.8 × 10³ MHz² pc⁻¹ cm³ s. The parameter DM in Equation (1) is referred to as the dispersion measure and is defined as the integral of the electron column density (n_e) along the line of sight (distance d) between source and observer, i.e.

DM = ∫₀^d n_e dl .    (2)

Given the quadratic relationship between time delay and frequency (see Equation (1)), dispersion is governed by a single free parameter, the dispersion measure. When searching for new events from unknown sources, the distance between the object and the observer is unknown, which means all possible dispersion measures must be calculated and searched. Detecting and studying such events in real time on the scale required for the next generation of radio telescopes, together with the computational complexity of dedispersion, necessitates a high performance computing (HPC) solution for real-time observation and detection (Armour et al., 2012; Fluke, 2012; Barsdell et al., 2010).
To remove the effects of dispersion, two different approaches can be used: coherent and incoherent dedispersion. The coherent approach uses information about the observed phase of the pulse to reconstruct the pulse profile as it was emitted (within the limits of the inhomogeneous scattering of the ISM). The incoherent method applies the appropriate time delay to each independent frequency channel in channelised intensity data. Although the coherent method is more accurate and has higher sensitivity, its computational requirements are far more demanding than those of the incoherent method. As such, when performing surveys of the radio sky, it is common to employ incoherent dedispersion.
Several different codes exist with differing implementations of dedispersion specifically for Graphics Processing Units (GPUs) (Magro et al., 2011; Barsdell et al., 2012; Sclocco et al., 2016; Bassa et al., 2017; Zackay & Ofek, 2017; Kong et al., 2021). There are other implementations of the dedispersion transform, for example using Fast Fourier transforms by Bassa et al. (2022). This approach is, however, not capable of detecting fast radio bursts (FRBs) or accelerated pulsars due to their weak or absent Fourier response. In this article we present our implementations of dedispersion for different computer architectures: NVIDIA GPUs, and Intel and AMD CPUs. We compare the performance of these implementations with state-of-the-art packages, specifically looking at data processing rates and sensitivity.
Incoherent dedispersion is described in Section 2, whilst the CPU implementation is presented in Section 3 and the GPU implementation in Section 4. In Section 5 we discuss performance results of AstroAccelerate for different scenarios on selected many-core platforms and compare these results with the performance of other software packages such as Heimdall. Real-time performance on selected telescopes is presented in Section 5.5 and the conclusions are summarised in Section 6.

The Direct Dispersion Transform
Incoherent dedispersion is the process of shifting detected power data in time inside each of the individual frequency channels which collectively make up the total telescope bandwidth. Shifts are applied to counter the effect of interstellar (or intergalactic) dispersion before integrating the data over the frequency bandwidth of the telescope, increasing the signal-to-noise ratio of astrophysical signals detected by the telescope.
Here we present the direct (sometimes called brute-force) approach to de-dispersing measured power data. As well as being the simplest approach for performing the task of dedispersion, it has two significant advantages. The first is that the algorithm is exact; by this we mean that the errors associated with this approach are at the discretisation level of the instrument. The second is that the algorithm can be written in a way that is particularly suited to execution on many-core devices.
Incoherent dedispersion can be algebraically expressed as

DM(dm, t) = Σ_{f=0}^{N_f − 1} x(f, t + ∆t(dm, t, f)) ,    (3)

where a frequency-dependent shift in time ∆t(DM, t, f) is calculated for each digitised frequency channel. By substituting f_low = f_c − ∆f/2 and f_high = f_c + ∆f/2 into Equation (1) we can express the time shift in the form

∆t(DM, t, f) ≈ 8297.4 × 10³ DM ∆f / f_c³ ,

where f_c corresponds to the central frequency of the band and ∆f is a finite bandwidth with ∆f ≪ f_c. Applying the correct time shifts to each frequency channel results in a shifted signal that appears as though it has arrived at the same instant in time. The process of incoherent dedispersion is shown pictorially in Figure 1, in which dotted lines (red and black) correspond to different DM trials 1, 2, ..., N. From Equation (3) we can derive a simple pseudo-code (see Algorithm 1) that outlines the direct dedispersion approach.
From the pseudo-code (Algorithm 1) we see that for N_DM trial dedispersion searches over power data that have N_t time samples and N_f frequency channels, the computational complexity of the algorithm is O(N_DM N_t N_f). Algorithm 1: Pseudo-code for the direct dispersion transform, where N_t is the number of time samples, N_DM is the number of dispersion measures searched and N_f is the number of frequency channels.
The arithmetic intensity I is another important characteristic of an algorithm. The roofline model (Williams et al., 2009) defines I as the ratio of the number of floating point operations performed by the algorithm to the amount of data, in bytes, read or written by the algorithm from or to RAM (or GPU main memory in the case of GPUs). The value of I can help us to identify what will limit the performance of an algorithm. When the algorithm is limited by the number of floating point operations it needs to perform, it is called a compute-bound algorithm. If the memory bandwidth limits the algorithm by not providing enough data per second, we have a memory-bound problem.
In order to decide whether an algorithm on a given hardware platform (CPU, GPU) is compute-bound or memory-bound, we need to look at the critical arithmetic intensity I_crit, which represents the turning point between an algorithm being memory-bound and compute-bound. For a given hardware platform, the critical arithmetic intensity I_crit is the ratio of computational performance in FLOPS to memory bandwidth in bytes per second. If I_alg < I_crit the algorithm is memory-bound; for I_alg > I_crit, the algorithm is compute-bound. On modern hardware I_crit > 1; see Table 1 for values of I_crit.
The dedispersion arithmetic intensity for a single DM trial is given as

I = n_o / n_b = N_f / (N_f · 1 + 4) ≲ 1 ,

where n_o is the number of floating point operations performed, N_f is the number of frequency channels and n_b is the number of bytes required. We have assumed that the incoming data are 8-bit and that the output data from the dedispersion are FP32 (4 bytes). Thus the dedispersion transform will be memory-bound on most modern hardware platforms, and data reuse in the available cache must therefore be utilised to increase dedispersion performance.

CPU Implementation
Our CPU implementation of the incoherent dedispersion algorithm is written in the C programming language with OpenMP and Cilk Plus, where OpenMP is used for parallelisation across cores on multi-core CPUs and Cilk Plus is used to express fine-grained parallelism, allowing the compiler to effectively vectorise parts of the code. To improve data locality we use loop tiling (see Figure 2), also known as loop blocking, or strip mine and interchange (Wolf & Lam, 1991), which transforms the input domain (memory) into smaller chunks able to fit into cache, thereby improving the locality of data accesses in loops. This technique also helps reduce the number of cache misses². Specifically, we use tiling in frequency channels as well as in dedispersion trials (DMs), where each thread calculates: 1) several DM-trials for neighbouring time samples (achieving good spatial locality in cache); 2) several neighbouring DM-trials (achieving good temporal locality in cache³, as a cache line can be used multiple times to update multiple DM-trials). A schematic overview of our partitioning of the data space is shown in Figure 3.

Figure 2: On the left we see the situation when loop blocking is not used, and on the right the case when it is; the original large array is partitioned into smaller blocks (blue rectangle) which can fit into the cache.

Data
In the pseudo-code (see Algorithm 2) the sizes of the tiles are represented by D_f for frequency channels and D_dm in the case of the DM trials. For the code to achieve the best performance, optimal values of D_f and D_dm need to be found, as well as the optimal number of time samples per thread (D_t). These optimal values are dependent not only on the CPU used, but also on the input telescope data parameters (like central frequency, number of frequency channels, etc.) and the DM survey plan.
2 Loop tiling allows us to optimise for cache hits. In principle, a cache hit occurs when the cache contains data needed by a thread; otherwise a cache miss is generated, meaning that data have to be reloaded from main memory, thus reducing performance.
3 The amount of reuse is dependent on the closeness of neighbouring DM-trials.
Figure 3: Example of loop tiling in the case of the CPU dedispersion algorithm. One thread computes a partial sum of D_f frequency channels (f_1-f_4) for D_dm DM trials (DM_1 and DM_2) and for D_t time samples with the same DM trials (t_0-t_4); all of these are represented by the coloured dashed lines, where different colours correspond to different DM trials, the x-axis to the time samples and the y-axis to the frequency channels.

GPU implementation
For a GPU code limited by the bandwidth to the GPU main memory, it is essential to reuse data and effectively and efficiently use the L1/L2 cache or the user-managed cache called shared memory. For peak performance, we need to ensure three things. Firstly, the accumulator that stores the integrated value of the frequency channels, S_loc, must be stored in the fastest area of memory available. Secondly, the data for each frequency channel that will undergo the dedispersion transform must be readily available to the GPU's streaming multiprocessors, compute units analogous to CPU cores. Finally, to avoid costly evaluation of the power law by each thread, the value of the dedispersion shift should be calculated using as few operations as possible.
The advantage of shared memory over the L1/L2 cache is that the user can control data locality. The shortfall of shared memory is its size. Where the L1 cache can defer to the larger but slower L2 cache, shared memory has no such option. Any implementation that uses shared memory and relies on a custom data structure will be limited by its size. This limitation gives rise to the two different algorithms outlined below. In short, the shared memory version of the direct dedispersion transform can process most shift values with high performance; the cache version, while slower, can handle even the large shifts often present at lower central frequencies. The size of the area processed by a single thread-block is tunable. Data from the input x(f, t) are read in a coalesced manner, ensuring the best possible performance.
The higher performing shared memory version is described below. The cache version will not be described further. In the results section the cache version is only used where we cannot use the shared memory version of our code.

Shared Memory Algorithm
The pseudo-code for the GPU kernel is presented in Algorithm 3. In the shared memory GPU implementation of the dedispersion algorithm, each thread-block calculates a different part of the output DM(dm, t) space, as shown in Figure 4. A single thread from a thread-block loops over frequency channels with a step of D_ch channels, where it loads a single element of the x(f, t) data into a local buffer B_loc using the shift ∆t_bl = T_bl ∆t_pc, where the coefficient T_bl in Eq. (6) represents the lowest DM calculated by the thread-block, and ∆t_pc are the coefficients of the cold plasma dispersion law for each frequency channel. All threads in the thread-block thus form a contiguous (in time) block of x(f, t) data in shared memory. This is shown in Figure 5.
When the local buffer B_loc is loaded, each thread sums the appropriate elements in shared memory using the differential shift ∆t_diff = T_diff ∆t_pc and updates its value of the partial sum S_loc, held in registers, which in the end results in the dedispersed value expressed by Eq. (3). The coefficient T_diff represents the DM values calculated by different threads within the range of DM values calculated by a thread-block (D_dm). After this, threads iterate the loop over frequency channels and a new block of x(f, t) data is loaded.
To avoid costly evaluation of the power law by each thread, the calculation of the dispersion shift ∆t is split into two parts. The first part is an array of pre-calculated coefficients of the cold plasma dispersion law, ∆t_pc, which are evaluated as

∆t_pc(i) = C_DM ( (f_high − i∆f)⁻² − f_high⁻² ) ,

where ∆t_pc(i) represents the time shift per unit DM for the i-th frequency channel, f_high is the highest frequency of the telescope bandwidth and ∆f is the bandwidth of a single frequency channel. This array is then scaled in the thread-block by the DM value being evaluated. The second part required for the calculation of ∆t are the two DM coefficients: T_bl, which is constant within the thread-block, and T_diff, which relates to the DM calculated by a thread (see Algorithm 3).
Further performance improvements can be achieved by processing multiple time samples (N_REG) per thread. This exploits instruction-level parallelism, where a thread can process multiple independent instructions. The adverse effect is that processing too many time samples per thread increases register usage too much, leading to fewer active threads and lower performance.
When working with input data of 8 bits or fewer, we pack the data into 32-bit words and then use integer addition to achieve SIMD (single instruction, multiple data) within a word. In the case of 8-bit data, this allows us to process two time samples per operation, increasing throughput significantly.
The values of N_REG, D_t, D_dm and D_ch have a significant impact on the performance of this code and must be tuned for a given telescope configuration and DM plan to gain maximum performance.

Results
In this section, we first explore how the performance of our GPU dedispersion code, which is part of the AstroAccelerate package, depends on the parameters of the input data and the dedispersion plan. To compare our dedispersion to other implementations, we have considered two test cases. The first test measures the execution time of the direct dedispersion transform for a varying number of frequency channels using a fixed dedispersion plan (subsection 5.2). The second test, described in subsection 5.3, demonstrates the performance of our dedispersion implementation when used in a data processing pipeline. For this test, we have used three datasets with different central frequencies and dedispersion plans. We compare the output from the different implementations tested in subsection 5.4. Finally, we demonstrate AstroAccelerate performance on selected radio telescopes in subsection 5.5.
We compare our results with existing and currently used implementations of direct dedispersion suited for HPC environments. Specifically, for the first test and the numerical difference test, we use the codes "Dedisp" (Barsdell et al., 2012) and "Dedispersion" (Sclocco et al., 2016). Due to their similar names we will refer to these dedispersion implementations by their associated processing pipelines. The Heimdall pipeline is a GPU-accelerated transient detection pipeline that utilises "Dedisp", while "Dedispersion" is used in the Amber pipeline (Sclocco et al., 2020).
The Amber dedispersion code is written using the Open Computing Language (OpenCL), while Heimdall's Dedisp uses the CUDA C/C++ programming model. Both implementations can run in two modes. Heimdall, in "adaptive" mode, changes the DM step during dedispersion depending on the DM tolerance parameter set by the user; the number of DM trials thus varies. The second mode of Heimdall can be described as "fixed", meaning that the user selects a fixed step size in the DM range, and the pipeline therefore outputs a fixed number of DM trials. In the case of Amber, for a single DM, the frequency channels are divided into sub-bands which are first dedispersed ("step one"), and then dedispersed within each sub-band ("step two"). Our code performs a fixed number of DM trials with a fixed DM step for each DM range, akin to the "fixed" mode in Heimdall and one of the modes in Amber.
In all following tests, the input signal is generated using "fake" from the pulsar processing package SIGPROC. We generated synthetic filterbank files for 4-bit, 8-bit and 16-bit precisions.
We have used two GPU cards (NVIDIA Tesla V100, Volta generation; NVIDIA A100, Ampere generation), two Intel processors (Xeon Phi 7290, Knights Landing (KNL); Xeon Gold 6230, Cascade Lake) and an AMD processor (EPYC 7601, Naples). Their hardware specifications can be found in Table 1, where the theoretical peak performance is calculated as

FLOPS = cores × clock × (instructions/cycle) × (vector width) × (FLOPs/instruction),

where FLOPs/instruction = 2 in the case of fused multiply-add (FMA), and the vector width is 16 for single precision and 8 for double precision. The Intel CPUs (Xeon Gold 6230, Xeon Phi 7290) have two AVX-512 units, thus instructions/cycle = 2; for the AMD EPYC 7601, instructions/cycle = 1. When all cores utilise AVX-512 instructions the core clock frequency is reduced by 100-200 MHz (Intel Corporation, 2019, 2016).
For the time measurements of Heimdall and AstroAccelerate, we used the NVIDIA Compute profiler (ncu) for the GPUs and the omp_get_wtime function in the CPU case, whilst in the case of Amber, we followed the supplied timer using OpenCL functions. Where possible, we use the supplied autotuning scripts for each test and all codes to achieve the best performance. Unless otherwise stated, all execution times show kernel run-time only, that is, no data transfers from host to device (GPU) are taken into account. The GPU codes are compiled with the nvcc compiler and the CPU codes with the Intel compiler (ICC). Codes based on OpenCL are compiled with nvcc for GPUs and ICC for CPUs with the appropriate OpenCL flag enabled. The compiler flags used are summarised in Table 1:

-std=c99 -O2 -Wall -Wextra -qopenmp -march=core-avx2 -qopt-prefetch -fma -ftz -fomit-frame-pointer -finline-functions -qopt-streaming-stores=never

Xeon Phi 7290: -qopenmp -fp-model fast=2 -std=c99 -O2 -fma -xMIC-AVX512 -align -finline-functions -no-prec-div -ipo -DOPEMP SPEC -qopt-streaming-stores=never

Xeon Gold 6230: same as Xeon Phi 7290 except: -march=core-avx2

Performance dependency
To illustrate the dependency of AstroAccelerate's performance on different input data parameters, we have visualised all distinct cases in the form of a flow-plot shown in Figure 6. The parameters used are the number of frequency channels, the sampling time of the data, the observation time, the number of DM trials and the size of the DM step. The performance is expressed in GFLOPS as this can be understood as an average performance per second; the execution time will still depend on the size of the task. The left plot in Figure 6 shows that the right combination of time sampling and DM step size is essential for high performance. The plot on the right of Figure 6 shows that the performance of the AA dedispersion code does not depend on the observation time, as is the case for all other tested codes.

Frequency resolution test
The following test shows the behaviour of the execution time of all mentioned codes when we change the number of channels.
For testing, we created a synthetic signal with 4-bit, 8-bit and 16-bit samples with a time sampling of 64 µs for five different channelisations: 512, 1024, 2048, 4096 and 8192, where the last corresponds to the maximum number of channels that Heimdall can safely process due to integer overflow. The length of the input signal corresponds to 10 s of observation data with a central frequency of 1400 MHz and a total bandwidth of 300 MHz. This observation length is sufficient to obtain representative performance measurements. Moreover, it allows us to extend our testing up to 8192 channels for the Heimdall dedispersion code as well, which cannot be run at this number of channels for longer observation times. For the survey plan, we use three DM ranges without time binning (also known as downsampling), which together search for signals ranging from a DM of 0 to 500 pc cm⁻³. We limited the search to 500 pc cm⁻³ because at high DM it is common to use a downsampling/scrunch factor, which we do not include in this test. Moreover, the high DM searches are covered in the second test.
The dedispersion plan used is summarised in Table 2.
Figure 6: The left plot, where we have fixed the observation time and emphasised the DM step (colour) and sampling time (hue), shows that the performance is mainly affected by the combination of two parameters, the DM step and the sampling time. Lines of rich dark colour, where a high sampling time is combined with a large DM step, show the least performing case, where the shared memory GPU code cannot be used. Lines of brighter colour have increasing performance as the sampling time gets lower. The right plot, where the sampling time is fixed and only the DM step is emphasised, shows that the performance is independent of the observation time.

The execution times of the dedispersion plan using differing numbers of frequency channels and bit precisions for all tested codes are shown in Figure 7. The results of Amber Dedispersion for 4-bit and 16-bit are missing because the code did not return the correct results for the injected signal from SIGPROC. There are also no results for our CPU implementation because it only supports 8-bit precision.
Although the OpenCL parallel language can be used across platforms, Intel does not officially support KNL; therefore we do not show execution times for Amber on it. Figure 8 shows how performance, expressed in floating point operations per second (FLOPS), changes with increasing numbers of channels. By analysing our implementation of incoherent dedispersion on the GPU using NVIDIA Nsight Compute, we see that in the case of 4-bit and 8-bit precision our implementation is limited by the shared memory bandwidth on both GPU cards, whilst the 16-bit precision version is limited by the special function units (≈95% utilisation), also on both GPU cards. The special function units take care of the type conversions that are required by the 16-bit implementation. The summary of shared memory bandwidth utilisation by AstroAccelerate for each card and bit precision is presented in Figure 9.

Figure 7: Execution times of the dedispersion plan (see Table 2) for 4-bit, 8-bit and 16-bit precision (from top to bottom) input data with an increasing number of channels, observation time T = 10 s, central frequency f_c = 1400 MHz and total bandwidth of 300 MHz.

Processing pipelines
In this section, we analyse the execution time of AstroAccelerate running the dedispersion plan with different DM ranges (0-3000 pc cm⁻³) and time binning factors (also known as downsampling/scrunch factors). We compare the results alongside the GPU-accelerated pipeline Heimdall. We have not compared the Amber pipeline as this was compared to Heimdall by Sclocco et al. (2020).
As the radio telescopes operate on a wide range of central frequencies, we selected three scenarios to demonstrate the performance and behaviour of both AstroAccelerate and Heimdall.
The selected central frequencies are f_c = 400 MHz (low), f_c = 800 MHz (mid) and f_c = 1400 MHz (high), each with a typical bandwidth, sampling rate, number of channels, and DM survey plan (for details see Table 3). The synthetic input data were generated as in the previous cases using SIGPROC "fake". The observation lengths correspond to ∼300 s for the low central frequency case, and ∼50 s for the others.
To compare both pipelines fairly and use all the implemented features, we must ensure they use the same or a comparable DM plan. Heimdall, by default, uses an "adaptive" mode. That is, Heimdall generates a list of DMs during the start-up phase of the pipeline based on the smearing tolerance factor between individual DM trials (default value 1.25), thereby changing its DM step for each DM trial. AstroAccelerate, in contrast, uses a user-defined fixed DM step for each DM range. To obtain a similar DM plan for AstroAccelerate, we used the "continuous" plan from Heimdall and created the closest "discrete" plan for our pipeline. A visualisation of these plans for all three cases is shown in Figure 10. Note that the highest DM is different for each case (∼1500, 2000 and 3000 pc cm⁻³).
The execution times for each bit precision and many-core system (GPU or CPU) are summarised in Figure 11. Similarly to the previous subsection, we found that our implementation is limited by the shared memory bandwidth (4-bit and 8-bit) or by the special functional units (16-bit).

Figure 10: Visualisation of the continuous Heimdall survey DM plan with the discrete AstroAccelerate DM plan for all three cases: f_c = 400, 800 and 1400 MHz (from left to right). Details can be found in Table 3.

Comparison of the dedispersed planes
The dedispersion output planes, each normalised to its own maximum, from AstroAccelerate, Heimdall and Amber are presented in Figure 12. We used synthetic 8-bit data with an injected signal of DM = 90.0 pc cm⁻³. Each code was run in "fixed" mode with a simple survey plan searching for pulses between DMs of 0-200 pc cm⁻³ with a step of 0.5 pc cm⁻³. As shown in Figure 12, all implementations recover the correct dispersion measure of the injected signal. However, as shown in Figure 13, there are small numerical differences (∼1%). These are introduced by differences in the calculation of the time-shift function, including the use of a slightly different constant of proportionality (C_DM) in Eq. (1) in the different implementations. In addition, Heimdall internally rescales the output dedispersed values, thus incurring a round-off error. Finally, in Figure 14, we provide a view of a single DM trial from the dedispersed outputs of AstroAccelerate, Heimdall, Amber and our CPU implementation around the DM of the injected signal, together with their percentage difference.

Performance on selected telescopes
In this section, we provide performance results for AstroAccelerate for 10 selected telescopes and their settings on two GPUs, namely the Tesla V100 and the A100. We have determined the DM survey plan for each telescope setup using the DDplan.py tool from the PRESTO pulsar search and analysis software (Ransom, 2011). We measure the performance in units of fraction of real time, computed as

R = t_obs / t_c ,

where t_obs is the observation time that is being processed, and t_c is the execution time of the AstroAccelerate pipeline. The execution time t_c includes the time required for dedispersion, the transpose of the input data (if necessary) and downsampling (time binning). We do not include the input data transfer time from the host to the GPU device memory. Please note that R > 1 means that the computation is performed in real time or faster, that is, the observation time is longer than the time needed for its processing. Table 4 summarises the basic characteristics of the selected telescopes, the range of the DM survey plan and the performance of the auto-tuned AstroAccelerate software pipeline on the NVIDIA Tesla V100 and NVIDIA A100 GPUs in units of fraction of real time. As we can see, in all cases AstroAccelerate operates in real time or faster. In the worst case scenario, AstroAccelerate achieved R = 20 on the NVIDIA Tesla V100 GPU and R = 25 on the NVIDIA A100 GPU. For more details, such as the DM survey plans for individual telescopes, memory transfers, CPU performance tests and others, please see Armour et al. (2022).

Discussion and Conclusions
In this paper, we present our CPU and GPU implementations of the incoherent dedispersion method for removing the frequency-dependent delay introduced by the interstellar medium. Although dedispersion is only a part of the pulsar search process, its computational cost scales rapidly with the amount of data and as such becomes a substantial contribution to the total processing time of the pipeline. We compare three different many-core implementations of the incoherent dedispersion transform, namely Dedisp by Barsdell et al. (2012), which is part of the Heimdall pipeline, Dedispersion by Sclocco et al. (2016) from the Amber pipeline, and our implementation, which is part of the AstroAccelerate project. We demonstrate that our implementation of dedispersion is faster and covers a wider range of input data parameters.
In Figure 6, we show how the performance of AstroAccelerate depends on different input data parameters. The most important parameter for performance is the combination of sampling time and DM step. For a large enough DM step, the required data may no longer fit into the GPU's shared memory, and the cache dedispersion algorithm that relies on the L1 or slower L2 cache must be used. This leads to a performance loss of up to 4×. In such a case, the input data can instead be time-binned, benefiting from higher performance due to the smaller data size and a more cache-friendly regime.

Figure 12: Dedispersion output planes for Heimdall (top, green), Amber (middle, purple) and AstroAccelerate (bottom, orange) of an injected signal with DM = 90.0 pc cm⁻³.
In the first benchmark, in Section 5.2, we compare all three implementations (Heimdall, Amber and our AstroAccelerate) processing input data of 4-bit, 8-bit and 16-bit precision, producing a fixed number of DM trials but using different numbers of frequency channels. We demonstrate that the AstroAccelerate GPU implementation is at least 10× faster than Heimdall and from 6× (512 channels) to 3.4× (8192 channels) faster than Amber on the NVIDIA Ampere generation A100 GPU. Similar results apply to the previous generation card, the NVIDIA Tesla V100 GPU. The smallest speed-up over Heimdall is for 16-bit precision, where AstroAccelerate is only 2× faster.
Our CPU version of incoherent dedispersion is, on all tested CPUs, comparable in performance to the Heimdall code on the NVIDIA Tesla V100 GPU for higher numbers of frequency channels. Compared to the CPU version of Amber, our code is from 2× (512 channels) to 15× (8192 channels) faster on all tested CPUs.
Figure 8 shows that our implementations achieve, on average, stable performance in terms of FLOPS for all tested numbers of frequency channels, except for the case of 8192 channels with 4-bit input data, due to the algorithm change. Our implementation achieves an average of ∼6.5 TFLOPS on the Tesla V100 GPU for 4-bit, ∼4.5 TFLOPS for 8-bit and ∼1.6 TFLOPS for 16-bit input data, whilst on the A100 it achieves ∼8.5 TFLOPS for 4-bit, ∼5 TFLOPS for 8-bit and ∼2 TFLOPS for 16-bit input data. The performance improvement of the Ampere generation compared to the Volta generation is mainly due to the increased shared memory bandwidth (from ∼14 TB/s to ∼18 TB/s). On the tested CPUs, we get ∼0.4 TFLOPS on the KNL, ∼0.33 TFLOPS on the AMD EPYC 7601 and ∼0.31 TFLOPS on the Xeon Gold 6230.
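These FLOPS figures follow from a simple operation count divided by the measured execution time. The sketch below assumes one accumulation per (channel, DM trial, output sample); the exact counting convention used to produce Figure 8 may differ, so this is illustrative only:

```cpp
#include <cstddef>

// Rough operation count for the dedispersion transform: one accumulation per
// (channel, DM trial, output time sample), converted to GFLOPS using the
// measured execution time. The counting convention is an assumption.
double dedispersion_gflops(std::size_t n_chan, std::size_t n_dm,
                           std::size_t n_samp, double exec_time_s) {
    const double ops = static_cast<double>(n_chan) *
                       static_cast<double>(n_dm) *
                       static_cast<double>(n_samp);
    return ops / exec_time_s / 1.0e9;
}
```

Under this convention, halving the execution time at fixed problem size doubles the reported FLOPS, which is why the shared-memory kernel's 4× advantage over the cache kernel shows up directly in Figure 8.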
The performance of the other tested codes is mostly stable or improves with an increasing number of channels. However, Heimdall has a particular problem on the NVIDIA Tesla V100 GPU, where its performance decreases significantly for high numbers of frequency channels, something not observed with the same code on the NVIDIA A100 GPU. We have also observed unusual behaviour of the Amber pipeline on the AMD CPU, where the performance decreases significantly for higher numbers of frequency channels. This contradicts our expectations based on Amber's performance on the Intel CPU and the NVIDIA GPUs. It may indicate the use of platform-specific optimisations in Amber or a lack of OpenCL support for the AMD EPYC CPU. It also shows that even though the OpenCL programming model is easily portable to different many-core systems, it does not always guarantee good performance.
In Section 5.3, we perform a pipeline test for three different central frequencies (f_c = 400 MHz (low), f_c = 800 MHz (mid) and f_c = 1400 MHz (high)), with DM searches running from 0 up to 1500, 2000 and 3000 pc cm −3 respectively, and with the sampling rate and the number of channels depending on the central frequency. We execute all benchmarks for AstroAccelerate and Heimdall for 4-bit, 8-bit and 16-bit precision input data (where applicable) on the GPUs as well as on the CPUs. We find that both AstroAccelerate and Heimdall operate in the real-time regime on all tested platforms. That is, the end-to-end execution time of a pipeline (which includes all required operations such as time binning, data transfers to the GPU memory, dedispersion, etc.) for the selected dedispersion plan is lower than the observation time of the input data. Overall, our GPU version is 4-8× faster than Heimdall for all tested central frequencies and input data precisions.
The performance of our CPU implementation is comparable with Heimdall running on either the NVIDIA V100 GPU or the NVIDIA A100 GPU; only at the mid central frequency is Heimdall substantially faster. Depending on the structure of the pipeline, the CPU dedispersion code may offer a way to distribute tasks between the CPU and the GPU, where the GPU can, for example, perform an FDAS (Ransom et al., 2002; Dimoudi et al., 2018b) or JERK search (Andersen & Ransom, 2018; Adámek et al., 2020) while the CPU calculates DM trials, thus enabling the use of heterogeneous systems.
Lastly, we run AstroAccelerate on several telescope setups with different survey plans. The plans are created with the DDplan.py tool from PRESTO up to the typical dispersion measure ranges given in the corresponding telescope articles. On both the NVIDIA V100 GPU and the NVIDIA A100 GPU, we achieve real-time performance in all cases, with fractions of real time ranging from 20 to 5000, and even 200 000 for the VLA telescope.

Figure 1 :
Figure 1: Representation of the incoherent dedispersion approach. The top shows the input data and the bottom the result (output). For simplicity, we present a clear single signal. The dotted lines correspond to individual DM trials (the summation of data points along the line) computed for a given time sample (shown for DM trials 1, 2, i and N). The sum along a line maps to one point (cross) in the output field.

Figure 2 :
Figure 2: Schematic example of the tiling optimisation technique in the case of two dimensions (i and j). The left shows the situation when loop blocking is not used and the right when it is. The original large array is partitioned into smaller blocks (blue rectangle) that fit into the cache.
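The loop-blocking scheme of Figure 2 can be sketched in plain C++ as follows; this is an illustrative example of the technique, not the paper's actual CPU code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Loop blocking / tiling: traverse a 2-D array in BI x BJ tiles so each tile
// stays resident in cache while it is processed, instead of streaming the
// whole row dimension through the cache on every pass over the array.
void scale_tiled(std::vector<std::vector<float>>& a, float s,
                 std::size_t BI = 64, std::size_t BJ = 64) {
    const std::size_t NI = a.size();
    const std::size_t NJ = NI ? a[0].size() : 0;
    for (std::size_t ii = 0; ii < NI; ii += BI)        // tile row start
        for (std::size_t jj = 0; jj < NJ; jj += BJ)    // tile column start
            for (std::size_t i = ii; i < std::min(ii + BI, NI); ++i)
                for (std::size_t j = jj; j < std::min(jj + BJ, NJ); ++j)
                    a[i][j] *= s;                      // work on the tile
}
```

The two outer loops iterate over tiles (the blue rectangles in Figure 2) and the two inner loops over the elements of one tile; the result is identical to the unblocked loop nest, only the traversal order changes.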

Figure 4 :
Figure 4: The output (dm, t) plane is partitioned into sections of size (D_dm, D_t), each processed by a single thread-block, shown in blue. Both of our GPU algorithms are written in the CUDA C/C++ programming language and use the following methodology: one GPU thread processes several time elements for a fixed value of dispersion, storing the accumulated values in registers. Nearby values of time and DM in the output DM(dm, t) space are grouped together into thread-blocks (Fig. 4) such that a single thread-block calculates D_t time samples and D_dm dispersion steps.
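The register-accumulation idea described above can be sketched serially as follows. This is an illustrative emulation of what one thread does, not the actual CUDA kernel; D_T and the function name are assumptions:

```cpp
#include <cstddef>

// Serial sketch of the register blocking in Figures 4 and 5: a "thread" keeps
// D_T partial sums for one DM trial in local variables ("registers") and
// updates all of them from each run of contiguous time samples it loads.
constexpr std::size_t D_T = 4;  // time samples accumulated per thread

void accumulate_channel(const float* channel, std::size_t shift,
                        std::size_t t0, float (&regs)[D_T]) {
    for (std::size_t k = 0; k < D_T; ++k)
        regs[k] += channel[t0 + k + shift];  // contiguous loads, shift reused
}
```

Because the per-channel shift is computed once and reused for all D_T samples, each loaded cache line of constant frequency and contiguous time updates several output values, which is the data reuse that makes the shared-memory kernel fast.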

Figure 5 :
Figure 5: Data are loaded from cache lines of constant frequency and contiguous time (grey boxes). Multiple dedispersion trials (DM = constant, t values; red lines) are held in registers, and each thread updates a set of these in parallel.

Figure 6 :
Figure 6: Flow plot showing how AstroAccelerate (AA) performance (GFLOPS) depends on the parameters of the data and the dedispersion plan. Each line symbolises one run of the auto-tuned AA GPU dedispersion with data and a dedispersion plan described by the parameters shown in the figure: the number of channels, DM trials, DM step, sampling time of the data and observation time. For simplicity, only 8-bit input data and a telescope central frequency of 1400 MHz are shown. The left plot, where we have fixed the observation time and emphasised the DM step (colour) and sampling time (hue), shows that the performance is mainly affected by the combination of two parameters, the DM step and the sampling time. Lines of rich dark colour, where a high sampling time is combined with a large DM step, show the worst-performing case, where the shared-memory GPU code cannot be used. Lines of brighter colour have increasing performance as the sampling time gets lower. The right plot, where the sampling time is fixed and only the DM step is emphasised, shows that the performance is independent of the observation time.

Figure 7 :
Figure 7: The execution time of the corresponding dedispersion plan (see Table 2) for 4-bit, 8-bit and 16-bit precision (from top to bottom) input data with an increasing number of channels, observation time T = 10 s, central frequency f_c = 1400 MHz and total bandwidth of 300 MHz.

Figure 8 :
Figure 8: Performance in FLOPS with an increasing number of channels for all tested codes. The first three rows correspond to 4-bit, 8-bit and 16-bit precision.

Figure 9 :
Figure 9: Percentage of the theoretical maximum shared memory bandwidth achieved by AstroAccelerate dedispersion for different bit precisions and an increasing number of channels. The theoretical maximum shared memory bandwidth for each GPU card is given in the top right corner.

Figure 10 :
Figure 10: Visualisation of the continuous Heimdall survey DM plan together with the discrete AstroAccelerate DM plan for all three cases: f_c = 400, 800 and 1400 MHz (from left to right). Details can be found in Table 3.

Figure 11 :
Figure 11: Execution times of AstroAccelerate and Heimdall on different many-core platforms for three central frequencies and bit precisions, operating on a simulated signal with an observation length of ∼300 s for the low central frequency and ∼50 s for all other central frequencies. The top row corresponds to the central frequency f_c = 400 MHz, the middle row to f_c = 800 MHz and the bottom row to f_c = 1400 MHz (low, mid, high). The left column shows the results for 4-bit precision, the middle column for 8-bit precision and the right column for 16-bit precision. The grey boxes show the execution time including all PCIe transfers and overheads needed to finish the plan. The coloured (lime green, dark green) bars show the execution time of all kernels needed by the pipeline, for example transposing the input data, downsampling, etc.

Figure 12 :
Figure 12: Dedispersed output for Heimdall (top, green), Amber (middle, purple) and AstroAccelerate (bottom, orange) of an injected signal with DM=90.0 pc cm −3 .
Figure 13 :
Figure 13: Percentage difference of the normalised dedispersion outputs. From top to bottom: AstroAccelerate and Heimdall; AstroAccelerate and Amber.

Figure 14 :
Figure 14: Comparison of the dedispersed (DDTR) outputs and their percentage difference at DM = 90 pc cm −3 between AstroAccelerate, Heimdall and Amber at the top, and our CPU implementation at the bottom.

Table 1 :
Hardware and compiler specifications for the tested GPUs and CPUs. The memory bandwidth value in brackets for the Xeon Phi 7290 corresponds to the case of using the 16 GB Multi-Channel DRAM (MCDRAM).

Table 3 :
Specifications of the input data used for the pipeline comparison. The observation length corresponds to ∼300 s for the low central frequency and ∼50 s for the mid and high central frequencies.

Table 4 :
List of selected radio telescopes, their characteristics, search ranges and the performance of AstroAccelerate in fractions of real time.