A novel CNN-based image segmentation pipeline for individualized feline spinal cord stimulation modeling

Objective. Spinal cord stimulation (SCS) is a well-established treatment for managing certain chronic pain conditions. More recently, it has also garnered attention as a means of modulating neural activity to restore lost autonomic or sensory-motor function. Personalized modeling and treatment planning are critical aspects of safe and effective SCS (Rowald and Amft 2022 Front. Neurorobotics 16 983072, Wagner et al 2018 Nature 563 65–71). However, the generation of spine models at the required level of detail and accuracy requires time- and labor-intensive manual image segmentation by human experts. This study aims to develop a maximally automated segmentation routine capable of producing high-quality anatomical models, even with limited data, to facilitate safe and effective personalized SCS treatment planning. Approach. We developed an automated image segmentation and model generation pipeline based on a novel convolutional neural network (CNN) architecture trained on feline spinal cord magnetic resonance imaging data. The pipeline includes steps for image preprocessing, data augmentation, transfer learning, and cleanup. To assess the relative importance of each step in the pipeline and our choice of CNN architecture, we systematically dropped steps or substituted architectures, quantifying the downstream effects in terms of tissue segmentation quality (Jaccard index and Hausdorff distance) and predicted nerve recruitment (estimated axonal depolarization). Main results. The leave-one-out analysis demonstrated that each pipeline step contributed a small but measurable increment to mean segmentation quality. Surprisingly, minor differences in segmentation accuracy translated to significant deviations (ranging between 4% and 13% for each pipeline step) in predicted nerve recruitment, highlighting the importance of careful workflow design.
Additionally, transfer learning techniques enhanced segmentation metric consistency and allowed generalization to a completely different spine region with minimal additional training data. Significance. To our knowledge, this work is the first to assess the downstream impacts of segmentation quality differences on neurostimulation predictions. It highlights the role of each step in the pipeline and paves the way towards fully automated, personalized SCS treatment planning in clinical settings.


Introduction
Electroceutical and neuroprosthetic therapies seek to modulate neural activity using electrical stimulation. Recent high-profile examples include intraspinal spinal cord stimulation (SCS) neuromodulation of the spinal cord itself [3,47], as well as epidural stimulation of the spinal roots to relieve chronic pain [8,24] or restore motion following spinal injury [45,58]. The ability to selectively stimulate individual spinal roots is critical for the development of safe and effective therapies, as each root comprises a functionally distinct neural population controlling various aspects of sensing and signaling in the body. However, inter-subject variability is large, and personalization of implant placement and stimulation parameters is required for optimal outcomes [29,45]. The time cost and effort of subject-specific tuning in adjustment sessions at specialized medical centers can be prohibitive. Recent studies have demonstrated the potential for both dramatic acceleration and improved therapeutic outcomes via computational approaches that use subject data to construct personalized spine models, perform electromagnetic exposure simulations, and predict induced neural responses [29,45]. Unfortunately, key aspects of personalized spine model generation, especially the accurate segmentation of physiologically important but anatomically complex structures (e.g. spinal roots), are difficult and time-consuming, with additional limitations imposed by image quality. Typically, the segmentation of a complete spine dataset requires weeks of effort, as a human expert laboriously labels magnetic resonance imaging (MRI) pixels in each image slice. Hence, there is a need for high-resolution/high-contrast imaging and a maximally automated procedure capable of reliably producing anatomical models with the necessary level of detail and accuracy. In addition to speed, automated labeling tools promise a degree of objectivity by relying less on the 'art' of the human expert. Significant
efforts have already established the broad utility of machine learning (ML) techniques in the field of medical imaging [30], and specifically the application of convolutional neural networks (CNNs) for spinal MR data segmentation [16,27,52]. In general, the robustness and quality of network predictions depend sensitively on the quality, resolution, and quantity of available training data, making the acquisition of sufficient application-specific data an important bottleneck. To overcome these challenges, a customized MR imaging protocol was established to provide high-quality data suitable for the reconstruction of detailed spine models with multiple tissues. Additionally, a novel CNN was implemented to perform automated semantic segmentation on the generated MR images. This network was embedded in an image processing pipeline with automated pre- and postprocessing routines for further reductions in the need for manual intervention.
In summary, the goals of this study were to: (i) establish an automated pipeline for the robust generation of multi-tissue, high-quality spinal cord models with minimal human input; (ii) assess pipeline performance with respect to both segmentation quality metrics and nerve recruitment predictions; (iii) quantify and study the impact of individual steps in the segmentation pipeline on the aforementioned metrics; and (iv) deploy the pipeline on o²S²PARC (Open Online Simulations for Stimulating Peripheral Activity to Relieve Conditions (SPARC), https://docs.osparc.io/), a cloud-based platform for open, FAIR (findable, accessible, interoperable, reusable), and collaborative computational life sciences developed in the context of the NIH SPARC initiative [40]. Importantly, the CNN-based image segmentation pipeline described here was embedded in a comprehensive workflow for the generation and simulation of functional spine models as part of an ongoing effort to advance SCS for therapeutic bladder control [22,23]. The flexibility and generality required for integration within a broader computational modeling pipeline ensures that our image segmentation pipeline is compatible with a spectrum of future applications.

Bladder control pipeline

Pipeline steps
The SCS modeling pipeline for bladder control therapy comprises the following steps: (i) CNN-based image segmentation; (ii) interactive expert inspection (and correction as needed) in iSEG [65]; (iii) anatomical model extraction and discretization with Sim4Life.io [66]; (iv) implant insertion, application of boundary conditions, and tissue property assignment, including diffusion tensor imaging (DTI)-derived or diffusion simulation-based determination of anisotropic conductivity tensors; (v) reconstruction of spinal root fiber trajectories and neurophysiological model assignment; (vi) neuronal dynamics simulations; and (vii) optimization of selective spinal root recruitment. See [23] for additional information related to the overall pipeline. The present work chiefly concerns the development and assessment of CNN-based tissue labeling of the subject imaging data (step (i) above), complemented by an analysis of the impact of CNN architecture and the different pre- and postprocessing steps on the resulting segmentation accuracy and recruitment predictions.

o²S²PARC implementation
We implemented the bladder control pipeline on o²S²PARC, which offers a variety of computational and interactive 'services' that enable the creation of modular, multi-stage pipelines. The overall SCS pipeline makes use of the following services: iSeg (semi-manual image segmentation); Sim4Life.io (geometric modeling, electromagnetic simulation, visualization); and a GPU-based, CUDA-enabled JupyterLab notebook pre-configured for ML applications. The o²S²PARC hardware setup comprises a 32-core AMD Threadripper 3970X with 256 GB of DDR4 RAM and two Nvidia GigaByte RTX 3090 24 GB GPUs with NVLink. See section 4.5 for more details.

Sample preparation and imaging
Three cat spinal cords were scanned to obtain the MRI datasets used in this study: (i) LS1: lumbosacral spine of cat 1, with a field of view spanning the L6-S2 vertebrae (4 cm long, 800 slices); (ii) LS2: lumbosacral spine of cat 2, with two scans merged, resulting in a field of view spanning L5-S3 (7.7 cm long, 770 slices); and (iii) CS1: cervical spine of cat 3, with a field of view spanning C4-T1 (4 cm long, 800 slices). The tissue samples of interest were first dissected from formalin-preserved cats (Ward's Science), then re-hydrated and contrast enhanced in 0.2% Gadavist in 1x phosphate buffered saline. Finally, they were submerged in Fomblin and imaged in a Bruker vertical-bore 11.7 T micro-imaging system. A set of high-resolution MR scans was acquired for each sample: a T2-weighted image (TE: 17-22 ms, TR: 300-500 ms, flip angle: 90°, 50 µm in-plane resolution), a fat enhanced image (TE: 35-38 ms, TR: 300-500 ms, 100 µm in-plane resolution), and a 19F image (TE: 16-28 ms, TR: 500-600 ms, 100 µm in-plane resolution).
The images of each sample were manually coregistered to the T2-weighted image and re-sampled to 50 µm in-plane resolution with the open 3D image software, FSLeyes [39]. To facilitate neural network generalization and transfer learning, axial image slice dimensions (x-y plane) were held consistently at 384×512 pixels. In addition, the diffusion weighted images were used to calculate diffusion tensors to inform dielectric heterogeneity in the electromagnetic modeling in Sim4Life.io and to reconstruct spinal white matter fiber tracts for the neural recruitment assessment using DSI Studio [63]. Additionally, two other independent operators manually segmented six uniformly distributed slices drawn from the validation datasets (LS1 and CS1) to estimate the variability associated with human expert segmentation, a quantity that we refer to as inter-operator variability (IOV). Every operator segmented the following tissues: cerebrospinal fluid (CSF), dorsal roots, epidural fat, gray matter, ventral roots, and white matter.

Image preprocessing
Preprocessing was applied to improve segmentation performance and generalization across different facilities, scanners, and, to a limited degree, imaging protocols.

Bias field correction
Low-frequency intensity heterogeneity, commonly referred to as bias field or gain field, is a well-known artifact affecting MR imaging. See [5,18,57] for additional background and a discussion of corrective measures. Here, we employed N4ITK bias correction [55], wherein a B-spline approximation is applied to remove the gain field using a hierarchical optimization scheme. It was selected for its performance and simplicity of implementation via the Insight Toolkit (ITK) [33].
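The principle behind this correction can be sketched in a few lines. The snippet below is a simplified log-domain stand-in: a wide Gaussian filter estimates the smooth multiplicative bias where N4ITK instead fits a hierarchical B-spline. All names and parameter values are illustrative, not the pipeline's actual implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def remove_gain_field(image, sigma=32.0, eps=1e-6):
    """Simplified log-domain gain-field correction.

    The multiplicative bias b(x) is assumed smooth, so log(image) =
    log(true intensity) + log(b); a wide Gaussian estimates log(b),
    which is then subtracted (N4ITK instead fits a hierarchical
    B-spline, but the principle is similar).
    """
    log_img = np.log(np.clip(image, eps, None))
    log_bias = gaussian_filter(log_img, sigma)
    # Re-center so the mean log-intensity is preserved.
    return np.exp(log_img - log_bias + log_bias.mean())

# Synthetic check: constant tissue intensity modulated by a smooth gain.
xx = np.broadcast_to(np.arange(128.0), (128, 128))
img = 100.0 * (1.0 + 0.5 * np.sin(xx / 40.0))
fixed = remove_gain_field(img)
```

After correction, the relative intensity variation of the synthetic image drops substantially, since the smooth gain dominates its spatial structure.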

Masking
In most cases, ML segmentation performance can be improved by focusing on the region-of-interest (ROI) and eliminating location ambiguity. Furthermore, normalization methods ensure scale correspondence across datasets. In the present application, despite a wide field-of-view, only a limited region encompassing the spine matters. Therefore, an adaptive masking algorithm was applied to extract the ROI. This algorithm proceeds as follows: First, the union of all non-background labels is extracted from each slice, and any holes or gaps are filled. Next, a morphological dilation operator and erosion operator are applied (operator parameters: 10% of the larger slice dimension; five pixels less than the larger slice dimension, respectively). This regularizes the mask and preserves enough of the spine neighborhood for the CNN to also learn and segment the spine surface. Finally, the slice masks were interpolated over the space between the sampled slices using a level-set approach (zero isosurface of the linearly interpolated signed-distance transformation, implemented using SITK [4,32,34,62] and Numpy [1]). Extrapolation beyond the reference slices was realized through linear extrusion. See figure A1 for an illustration of the mask generation process.
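The per-slice mask regularization and the level-set interpolation between slices can be sketched as follows. This is a simplified scipy-based illustration with names of our choosing; the pipeline's erosion step and exact operator parameters are omitted.

```python
import numpy as np
from scipy import ndimage

def make_roi_mask(label_slice, dilate_frac=0.10):
    """Regularized ROI mask from one labeled slice (0 = background).

    Follows the text's first steps: union of non-background labels,
    hole filling, then a morphological dilation of roughly 10% of the
    larger slice dimension.
    """
    mask = label_slice > 0
    mask = ndimage.binary_fill_holes(mask)
    r = max(1, int(dilate_frac * max(label_slice.shape)))
    structure = ndimage.generate_binary_structure(2, 2)
    return ndimage.binary_dilation(mask, structure, iterations=r)

def interpolate_masks(mask_a, mask_b, t):
    """Level-set interpolation between two slice masks.

    Each mask becomes a signed-distance field (negative inside); the
    zero level set of the linear blend yields the in-between mask.
    """
    def sdf(m):
        inside = ndimage.distance_transform_edt(m)
        outside = ndimage.distance_transform_edt(~m)
        return outside - inside
    return (1 - t) * sdf(mask_a) + t * sdf(mask_b) < 0
```

Blending the signed-distance fields of two circles of different radii, for instance, yields an intermediate circle, which is the behavior exploited when filling the gaps between reference slices.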

Intensity normalization
MR image intensity normalization was employed to reduce variability arising from differences between scanners and recording processes [42]. From a range of possibilities (e.g. Z-score, Gaussian mixture models, piecewise linear histogram matching), we selected Nyul normalization [38] due to its stability and compatibility with masked data. It was implemented using the open-source intensity-normalization software library [42]. Additionally, Nyul normalization enhances network training performance by shifting input values to a meaningful range. The histograms of each input channel of LS1 and CS1 were normalized against those of LS2 (the choice of reference had no apparent impact). Figure 3 depicts the matched intensity histograms.
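The landmark-matching principle at the heart of Nyul normalization (intensity percentiles of the input mapped piecewise-linearly onto those of the reference) can be sketched in a few lines. The study used the full intensity-normalization package; all names here are ours.

```python
import numpy as np

def match_landmarks(image, reference, pcts=tuple(range(1, 100, 10))):
    """Map intensity percentiles of `image` onto those of `reference`.

    A bare-bones illustration of piecewise-linear landmark matching;
    the full Nyul method learns a standard intensity scale from many
    scans rather than matching to a single reference.
    """
    src = np.percentile(image, pcts)
    dst = np.percentile(reference, pcts)
    # np.interp performs the piecewise-linear mapping between landmarks.
    return np.interp(image, src, dst)
```

After matching, the central tendency of the transformed image sits close to that of the reference, which is what stabilizes network inputs across scanners.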

Performance metrics
Various metrics were used to evaluate: (i) segmentation performance; (ii) the impacts of certain pipeline design choices; and (iii) neural network training.These metrics can be categorized as pertaining to either segmentation accuracy or nerve recruitment predictions.

Segmentation accuracy
For segmentation accuracy quantification, we distinguish two subcategories of metric: The first includes metrics used for determining the quality of the segmentation workflow result, while the second consists of a single quantity that is optimized during the CNN training process. The principal segmentation accuracy metric employed in this study is the Jaccard index J. Let A, B ⊆ U be two sub-regions in a finite region U ⊂ R^n. Then, the Jaccard index J is defined as the map

J(A, B) = |A ∩ B| / |A ∪ B|.

For labeled images, |A| can be obtained as the number of pixels with the corresponding label (pixel sizes/volumes cancel out in the ratio). A higher Jaccard index signifies a higher overlap, and thus a greater correspondence between ground truth and prediction.
However, deviations in boundary locations may ultimately have important impacts on nerve recruitment predictions, and the Jaccard index is less sensitive to such deviations (especially when A is bulky, i.e. it has a high volume-to-surface ratio). Therefore, a second segmentation performance metric was considered. The Hausdorff distance H is defined as

H(A, B) = max( sup_{a∈A} inf_{b∈B} d(a, b), sup_{b∈B} inf_{a∈A} d(a, b) ),

i.e. the largest distance from a point in one region to the closest point of the other. During supervised training of an artificial neural network, a loss function is optimized. In this study, the Dice loss DL_λ was used, which is based on a smoothed variant of the Dice coefficient. Given a positive smoothing parameter λ ∈ R≥0, a segmentation ground truth v⃗_i (a vector of dimension equal to the number of distinguished tissue classes, here with 0 or 1 entries according to the segmentation ground truth), and a network prediction w⃗_i (here the vector of tissue class membership likelihoods), where i denotes the pixel index, the smoothed Dice coefficient is defined as

DC_λ := (2 Σ_i v⃗_i · w⃗_i + λ) / (Σ_i |v⃗_i|² + Σ_i |w⃗_i|² + λ),

and the Dice loss as DL_λ := 1 − DC_λ. Usage of the smoothed Dice coefficient was empirically observed to yield better optimization results than the Jaccard index. The smoothing parameter λ was set to 1, which was found to reliably and efficiently identify low-loss configurations.
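The three metrics can be expressed compactly in numpy/scipy; the sketch below (illustrative names, not the study's code) mirrors the definitions above.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for boolean label masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance between two point sets (N x d arrays)."""
    return max(directed_hausdorff(a_pts, b_pts)[0],
               directed_hausdorff(b_pts, a_pts)[0])

def dice_loss(v, w, lam=1.0):
    """Smoothed Dice loss DL_λ = 1 - DC_λ (λ = 1 in the study).

    `v`: one-hot ground truth, `w`: predicted class likelihoods, both of
    shape (num_pixels, num_classes).
    """
    num = 2.0 * (v * w).sum() + lam
    den = (v * v).sum() + (w * w).sum() + lam
    return 1.0 - num / den
```

Note that for a perfect prediction the smoothed Dice loss is exactly zero regardless of λ, since numerator and denominator coincide.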

Inter-operator variability benchmark
To facilitate interpretation of the performance metrics, IOV was computed as a reference benchmark. All three experienced segmentation specialists were asked to distinguish the tissue classes used in this study in the same twelve slices of the validation datasets (six from LS1 and six from CS1). Their segmentations were compared pairwise using the Jaccard index and the Hausdorff distance, and the comparison results were averaged to provide the IOV baseline.
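The pairwise comparison and averaging can be sketched as follows (an illustrative helper with names of our choosing, shown for the Jaccard index only):

```python
import numpy as np
from itertools import combinations

def mean_pairwise_jaccard(masks):
    """Mean pairwise Jaccard index across operators' masks for one slice.

    `masks`: list of boolean arrays, one per operator. The IOV baseline
    aggregates such per-slice, per-tissue scores over all slices.
    """
    scores = []
    for a, b in combinations(masks, 2):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        scores.append(inter / union)
    return float(np.mean(scores))
```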

Nerve recruitment predictions
Nerve recruitment predictions were based on electromagnetic simulations (see below) of the exposing electric potential ϕ, from which the E-field and current density are obtained as E⃗ = −∇ϕ and j⃗ = σE⃗ (σ: conductivity). The second derivative of ϕ along an axonal trajectory correlates with the membrane charge accumulation rate (and thus action potential initiation) [41].
As such, the maximum of k_i ∂²ϕ/∂l² along the trajectory l_i of the ith fiber (where k_i is a fiber parameter-dependent proportionality constant) may be used to gauge stimulability; this quantity is referred to as the activating function (AF). It was first proposed by Rattay [41] as an analytical measure for predicting action potential initiation in axon models.
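Given potential samples along a fiber, the peak AF can be approximated with finite differences. A minimal sketch, with illustrative names and assuming equidistant sampling:

```python
import numpy as np

def peak_af(phi, ds, k=1.0):
    """Peak activating function max_l k * d²φ/dl² along one fiber.

    `phi`: potential sampled at equidistant points (spacing `ds`) along
    the fiber trajectory; `k`: fiber-dependent proportionality constant.
    """
    # Two applications of np.gradient give a centered second difference.
    d2phi = np.gradient(np.gradient(phi, ds), ds)
    return k * d2phi.max()
```

For a quadratic potential φ(l) = l², the interior second difference is exact, so the peak AF evaluates to 2k.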

Sim4Life.io simulations
We conducted simulations of the electric potential field distribution arising from a normalized stimulating current (1 mA) applied through a 0.3 mm² electrode located on the surface of the epidural fat near the midpoint between the cranial and caudal spine domain boundaries (i.e. one of the central electrode array contacts). Relative neural stimulability was assessed using the AF concept described above, i.e. fibers with higher peak AF values were considered more likely to fire a spike in response to an exposing potential. This assessment is specific to the currents injected through the different electrode contacts, but only weakly depends on stimulation parameters such as pulse strength, pulse width, and input frequency. All electromagnetic simulations were carried out using Sim4Life.io v7.2's [66] structured, Ohmic current-dominated electro-quasistatic solver, which solves the heterogeneous Laplace equation ∇·(σ(x)∇ϕ) = 0 with suitable boundary conditions. Anisotropic conductivity tensors for white matter tracts were inferred from current diffusion simulations to account for the significantly enhanced conductivity in the longitudinal direction [54]. For the other tissues in the model, heterogeneous dielectric property maps were assigned using a validated tissue properties database [14]. Simulated electrical stimulation was applied to the spine using a 0.2 mm-thick, rectangular, silicone lead paddle with an array of 24 contacts arranged in three rows (see figure 1). For a detailed description and discussion of the electromagnetic modeling, see [22,23]. In this study, the fiber trajectories l_i were generated using the following procedure:
• Segmentation label-fields were cropped along the z-axis to provide a flat tissue surface at the cranial and caudal ends of the spine.
• A homogeneous diffusion simulation was run for each model, with Dirichlet boundary conditions assigned to the cranial (value of 1 au) and caudal (value of 0 au) surfaces of the white matter, dorsal roots, and ventral roots, and insulating Neumann boundary conditions to the remaining surfaces.
• Diffusion streamlines were extracted using VTK routines [48] from seed points located on the cranial and caudal surfaces of the white matter, dorsal roots, and ventral roots.
• After discarding those of insufficient length (<8 mm for the lumbosacral model, <4 mm for the cervical), the streamlines were imported as splines into Sim4Life and modeled as 6 µm Sensory MRG axons [11] in accordance with the fiber diameter distributions used in [22,23].
• The union of the white matter and root tissues was computed for each segmentation workflow variant, and the intersection volume of all these unions was determined.
• Only fibers fully contained within this intersection volume were retained.
The above procedure resulted in two sets of fiber trajectories (one for each of the two spinal regions) that are compatible with all the tissue models produced by each segmentation workflow, thereby permitting the comparison of fiber activations across conditions (see figure 1).

Statistics
Statistical significance of segmentation differences was assessed using the two-sided Mann-Whitney-Wilcoxon test [36] with α = 0.05. Throughout the work, we use the standard notation to indicate the significance level based on the p-value: ns (not significant, p ⩾ 0.05), * (p < 0.05), ** (p < 0.01), and *** (p < 0.001).

Network architecture
A popular choice of neural network architecture for 2D (and occasionally 3D) image segmentation is a U-Net-style encoder-decoder, frequently built on a residual network (ResNet) backbone. As ResNets typically require larger datasets than available here to produce stable results, we selected a HarDNet-inspired architecture [9] instead, as this architecture exhibits good performance on smaller datasets. HarDNet blocks are characterized by an internal layer connection scheme that is optimized for efficient GPU memory bandwidth utilization [9]. Using fewer weights was found to result in strong convergence acceleration, while simultaneously enhancing accuracy. For segmentation applications, Huang et al [19] combined the HarDNet architecture with a decoder consisting of receptive field blocks (RFB) and Dense Aggregation [31] to produce an autoencoder for binary masking ('HarDNet MSEG'). In the same work, it was demonstrated that HarDNet MSEG not only produces faster predictions as a result of the optimized HarDNet (and RFB) wiring, but also outperforms all other network classes in terms of accuracy (mean Dice metric), likely due to the optimized block connections. Furthermore, the authors conclude that HarDNet MSEG works well with limited data, making it an ideal candidate for the present application.

Implementation
We implemented our artificial neural network using Google's TensorFlow [2] library. A specific implementation of the original HarDNet MSEG is defined by the choice of the corresponding HarDNet backbone (typically 39, 68 or 84) and the number of residual filters of the RFB decoder (typically 16 or 32). In our implementation, several additional modifications were made to achieve a higher pixel-level segmentation accuracy (see figure 2): (i) Additional blocks were connected from the encoder (situated after the max pooling operations) to the decoder to transmit information prior to downsampling. The combination of layers in the final aggregation step was reformulated to incorporate the newly added connections. We denote the number of extra connections by a '−N' at the end of the model name, where N ∈ {0, 1, 2}. Thus, N = 0 denotes the unmodified (original) encoder-decoder connectivity. (ii) The final dense aggregation layer was removed, as it tended to reduce pixel-level detail. In addition, the kernel size of the penultimate layer was increased from 1 to 3 to improve the level of detail during aggregation prior to upsampling. (iii) Upsampling was adjusted for the number of layers used in the network architecture. Specifically, '−0' models were upsampled by a factor of 4x, '−1' models by 2x, and '−2' models were not upsampled at all. (iv) A softmax layer with seven filters (background + 6 tissues) was appended at the end to predict tissue type probabilities. Each pixel was assigned the most probable label.
We will refer to the HarDNet MSEG network with the above alterations as 'modified HarDNet MSEG'. The notation HarDNet MSEG 68-32-1 indicates a HarDNet68 encoder with 32 residual filters in the decoder and 1 extra connection from the lowest layer in the U-Net layout. The modifications in the dense aggregation and the upsampling are implicitly included.

Training

Data augmentation
Data augmentation, i.e. the creation of synthetic image variants, was used to compensate for the limited training data and increase segmentation robustness. Various augmentation strategies have been shown to boost the ability of a network to generalize to other datasets [49,60,61]. Intensity, brightness, and other grey-scale transfer function augmentations are less important, as such differences are already handled through the histogram normalization step, and only affine augmentations (shifting, rotation, scaling) were employed.
Training data was augmented using the Python library albumentations [7] because of its compatibility with TensorFlow. We favored small augmentation ratios a; i.e. for each image in the training dataset, a new image variants were created by applying transformations to the original image with parameters randomly sampled from the transformation parameter space (see table 1). The parameter space was chosen to ensure that the spine never intersects with the image borders. Unassigned voxels outside the transformed image were zeroed (i.e. assigned the background label).
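The sampled affine transformations can be illustrated in plain scipy (the pipeline itself used albumentations; the function name and parameter ranges here are ours):

```python
import numpy as np
from scipy import ndimage

def affine_augment(image, label, rng, max_shift=10.0, max_angle=10.0,
                   scale_range=(0.95, 1.05)):
    """Sample one affine variant (shift, rotation, scale) of an image-label pair.

    The same sampled transform is applied to image and label so they stay
    aligned; the label uses nearest-neighbour interpolation (order=0) and
    exposed pixels are zero-filled, i.e. assigned the background label.
    """
    angle = np.deg2rad(rng.uniform(-max_angle, max_angle))
    shift = rng.uniform(-max_shift, max_shift, size=2)
    scale = rng.uniform(*scale_range)
    c, s = np.cos(angle), np.sin(angle)
    # Inverse mapping (output -> input) expected by affine_transform.
    mat = np.array([[c, -s], [s, c]]) / scale
    center = (np.array(image.shape, dtype=float) - 1) / 2.0
    offset = center + shift - mat @ center
    img_out = ndimage.affine_transform(image, mat, offset=offset, order=1, cval=0.0)
    lab_out = ndimage.affine_transform(label, mat, offset=offset, order=0, cval=0)
    return img_out, lab_out
```

With zero shift and rotation and unit scale the transform reduces to the identity, which makes the image-label alignment easy to verify.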

Training details
Backpropagation was performed using the Adam [25] optimizer implementation distributed with TensorFlow. Training slices were shuffled after each epoch to eliminate data order bias. A default learning rate value of 1 × 10⁻³ was used for all epochs, apart from the last ten, over which the learning rate was reduced exponentially to reach a final value of 1 × 10⁻⁴ at the last epoch. This helped to learn finer details once network stability had been achieved. Various combinations of hyperparameters were compared in terms of their influence on convergence and generalization; those above were found to offer the best performance on the validation datasets.
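The described schedule can be sketched as follows; the exact decay curve is our reading of the description, and the function name is ours.

```python
def learning_rate(epoch, total_epochs, base_lr=1e-3, final_lr=1e-4, taper=10):
    """Constant rate with an exponential taper over the last `taper` epochs.

    Mirrors the schedule described in the text: 1e-3 throughout, decaying
    exponentially to 1e-4 over the final ten epochs.
    """
    start = total_epochs - taper
    if epoch < start:
        return base_lr
    frac = (epoch - start + 1) / taper  # runs 1/taper .. 1 across the taper
    return base_lr * (final_lr / base_lr) ** frac
```

Such a function could be attached to training with, e.g., tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: learning_rate(epoch, 100)).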

Postprocessing
Applying a trained network slice by slice along the principal axis of a 3D spine image dataset may result in discontinuities. Furthermore, misclassification can lead to holes and isolated islands in the segmentation. To avoid the need for manual intervention, an automatic clean-up routine was developed. This routine proceeds by identifying connected components for a given label and discarding all but the largest one(s) based on the expected anatomical topology (see table 2). Per tissue, connected structures were identified using a maximally diagonally connected component analysis and sorted by size. The resulting list was culled according to the heuristically established rule that a connected component is only retained if its size is at least 50% that of the previously retained component (i.e. the component is comparable in volume to previously retained structures from the same tissue class) or if the anatomically expected number of components has not yet been identified. Discarded components were relabeled later by identifying the unlabeled pixel nearest to already labeled pixels and assigning it the label of its nearest neighbor. This process continued iteratively until all pixels from discarded regions (typically <0.5% of the masked volume) were reassigned. This clean-up procedure results in a single, connected background component, and ensures that background labels are excluded from the spine interior.
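The retention rule can be sketched for a single tissue label as follows (a scipy-based illustration with names of our choosing; the nearest-neighbor relabeling step is omitted):

```python
import numpy as np
from scipy import ndimage

def cull_components(mask, expected_n=1, keep_ratio=0.5):
    """Apply the component-retention rule to one tissue's binary mask.

    Components (maximal diagonal connectivity) are sorted by size; a
    component is kept while the expected anatomical count `expected_n`
    has not been reached, or while it is at least `keep_ratio` the size
    of the previously kept one.
    """
    structure = np.ones((3,) * mask.ndim, dtype=bool)  # diagonal connectivity
    labels, n = ndimage.label(mask, structure=structure)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep, prev = [], None
    for idx in np.argsort(sizes)[::-1]:  # largest first
        if len(keep) < expected_n or sizes[idx] >= keep_ratio * prev:
            keep.append(idx + 1)
            prev = sizes[idx]
        else:
            break
    return np.isin(labels, keep)
```

A mask with one large blob, one comparably sized blob, and a single-pixel speckle keeps the first two and drops the speckle, mirroring the 50% heuristic.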

Training pipeline
The overall image segmentation pipeline (see figure 3) begins with the application of an image preprocessing workflow to each of three input channels, one per imaging modality (figure 3, blue box).We refer to the set of preprocessed outputs with their respective ground truth segmentations as a dataset.
After preprocessing, these datasets are fed into the learning workflow.
The learning workflow begins with bias correction and masking followed by Nyul normalization, in which the histograms of the pixel intensities of all datasets are rescaled with respect to a reference dataset (arbitrarily chosen to be the first dataset).Subsequently, dataset slices are assigned to either training or validation groups.

Investigations

Architecture comparison
We investigated differences in segmentation performance between our modified HarDNet MSEG architecture and several alternative state-of-the-art encoder-decoder U-Net backbones. In particular, using implementations available in the open-source, Python-based Segmentation Models library [20], we compared our network against ResNet34 [15], ResNet50, ResNet101, Vgg16 [50], and Vgg19. The list of network architectures compared here includes those that were previously shown in [19] to be among the best performing U-Nets for image segmentation. Furthermore, these networks are utilized in the implementation of the nnU-Net framework [21], a state-of-the-art segmentation tool trained specifically for medical imaging applications. As such, our selection of network variants provides a fair benchmark for gauging the performance of the modified HarDNet MSEG architecture. For each architecture, 100 epochs of training were repeated on masked, non-augmented LS1 data 25 times (with random network parameter initializations). By fixing the number of epochs, rather than training to saturation, the efficiency of network learning using backpropagation can be compared across different architectures. The quality metrics from section 2.4 were computed to assess the performance of each network. This corresponds to the baseline route in figure 3.

Generalization and transferability
The ability of the network trained on one dataset to segment another dataset, with and without additional training, was investigated by comparing:
• baseline: Segmentation performance, assessed in terms of quality metrics, on the same dataset (LS1 or CS1) used for training. Again, 100 epochs of training were repeated for 25 random network weight initializations with normalization and data augmentation (a = 2).
An overview of the full pipeline can be seen in figure 3. Performance metrics were computed on validation slices from the lumbosacral (LS1) and cervical (CS1) regions. The procedure was repeated with and without masking to investigate its impact on baseline, generalization, and transfer performance.

Impact analysis
The influence of each component of the segmentation pipeline on segmentation quality and nerve recruitment predictions was quantified for both LS1 and CS1.This was achieved by skipping or replacing individual steps of the pipeline (see table 3) to produce several pipeline variants as shown in figure 3; a single bout of 100 training epochs was performed for each variant.
The AF (see section 2.4) was used to evaluate the impact of segmentation quality on fiber recruitment predictions. To this end, we generated scatter plots (see appendix) comparing the logarithm of the peak AF magnitudes of every fiber for each segmentation workflow variant against those produced by the full workflow. The decision to compute the logarithm of the peak AF reflects the expectation of a log-normal distribution of the electromagnetic exposure. A linear fit with a fixed slope of 1 was performed on the AF logarithms, so that the exponential of the intercept (β) is a measure of the relative magnitude of the AF predictions, and the coefficient of determination (R²) a measure of their equivalence up to a scaling constant. To assess the significance of the β deviation from 0 (i.e. whether the segmentation approach results in significantly different stimulation threshold magnitude predictions), the standard error of the intercept from the fitting procedure was used to assess whether β's 99% confidence interval encompassed zero.
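With the slope pinned to 1, the fit reduces to simple statistics on the residuals. A minimal sketch with illustrative names (not the study's code):

```python
import numpy as np
from scipy import stats

def compare_log_af(variant, reference, alpha=0.01):
    """Fixed-slope-1 fit of variant vs. reference log peak-AF values.

    With the slope fixed at 1, the least-squares intercept β is the mean
    residual; its standard error yields the 99% confidence interval used
    to test whether β differs significantly from 0.
    Returns (beta, r_squared, significant).
    """
    variant = np.asarray(variant)
    resid = variant - np.asarray(reference)
    beta = resid.mean()
    se = resid.std(ddof=1) / np.sqrt(resid.size)
    t_crit = stats.t.ppf(1 - alpha / 2, df=resid.size - 1)
    significant = abs(beta) > t_crit * se
    ss_res = np.sum((resid - beta) ** 2)
    ss_tot = np.sum((variant - variant.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return beta, r_squared, significant
```

For data that are an exact constant offset of the reference, the fit recovers that offset with R² = 1, while identical data yield β = 0 and no significant deviation.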

Architecture comparison
We first sought to identify the specific CNN architecture achieving the highest possible segmentation accuracy from among a list of candidates, as shown in tables A1 and A2. Segmentation accuracy was considered on a per-tissue basis, and for the mean across tissues. The results in the tables and figure 4 demonstrate that HarDNet-68-32-1 significantly outperformed the other network architectures considered. This architecture exhibited the highest mean Jaccard index (0.840) and lowest mean Jaccard variability (0.005), implying both high performance and consistency. Furthermore, of particular relevance to predictions of neural recruitment, HarDNet-68-32-1 excelled at segmenting fine geometrical structures, in particular the ventral and dorsal roots; dorsal root activation is known to be an important factor determining the efficacy of SCS for functional recovery [44]. Although our implementation of HarDNet-68-32-1 required approximately triple the training time of ResNet34 (4.19 min) in absolute terms, the duration (14.4 min) was short enough to be of little practical concern.

Inter-operator variability
The variability of inter-operator segmentation performance per tissue class can be seen in figure 5 for the lumbosacral (LS1) and cervical (CS1) regions. These values were used as a comparative performance benchmark throughout. The difference between IOV scores for LS1 and CS1 (table 4) was large when quantified by the Jaccard index (0.853 ± 0.007 vs. 0.778 ± 0.013), but not by the Hausdorff distance (201 ± 7.5 µm vs. 200 ± 6.9 µm). As expected (see section 2.4), the Jaccard indices for CS1 were higher (i.e. better) in CSF, epidural fat, and gray matter (i.e. bulkier tissues), where contour deviations carry proportionally less weight; the Jaccard index is sensitive to tissue area/volume, while the Hausdorff distance is not. Surprisingly, this trend was not observed within LS1. Hausdorff distances for the CS1 dorsal/ventral roots, which house the afferent/efferent axonal projections, were below those (i.e. better) of the bulkier tissues, with important implications for the accuracy of nerve stimulation predictions.
As figure 5 illustrates, CNN segmentation performance was comparable to or better than the IOV when masking was applied. This implies that the performance of our pipeline is limited by the inherent uncertainty associated with human labeling of the underlying images, and that no further improvement can be expected without improved ground truth data. In addition, the CNN is potentially more consistent and objective in its label determinations than expert human operators. The small but notable superiority of network performance relative to the IOV may be explained by the CNN adopting the idiosyncratic 'style' of the operator on whose data it was trained, thus inflating Jaccard/Hausdorff scores on test/validation datasets whose ground truth labels were provided by that operator. In contrast, we expect the segmentation styles of different humans to be uncorrelated, producing relatively weaker IOV scores.

Generalization and transfer learning
Figure 5 illustrates segmentation performance, generalization, and the impact of transfer learning in terms of mean Jaccard index and mean Hausdorff distance. Metric means were computed as unweighted averages over all tissue classes for the lumbosacral (LS1) and cervical (CS1) datasets. The analysis was repeated with and without masking. The corresponding numerical data can be found in tables A3 and A4, respectively. Corresponding tissue-wise results are shown in figures A3 and A4. Statistical significance is reported in table A5. Segmentation quality generalized well to novel data acquired from the same anatomical region of a different subject; the mean Jaccard index dropped from 0.84 to 0.73 (with masking). In contrast, generalization performance was limited for novel data originating from an anatomical region not present in the training dataset; the mean Jaccard index dropped drastically to 0.18 and the Hausdorff distance approximately doubled. While masking on its own only moderately improved network performance, generalization was significantly degraded in its absence, highlighting its importance for maintaining performance on novel datasets.

Impact of pipeline steps
The impacts of each pipeline step and of the choice of backbone on segmentation quality and nerve recruitment (AF peaks) for the cervical and lumbosacral datasets are depicted in figure 6 (see table A6 for the corresponding data). Segmentation performance of the full pipeline (HarDNet backbone) was relatively insensitive to the presence or absence of data augmentation. Furthermore, all tested workflows performed with high absolute accuracy; in the lumbosacral region, mean Jaccard indices ranged between 0.814 and 0.834, and mean Hausdorff distances between 178 and 187 µm. In the cervical region, these performance metrics spanned 0.742–0.789 and 178–191 µm, respectively. Despite the low variance of segmentation performance across workflow variants, the downstream, functional consequences for axonal recruitment predictions were statistically significant in each case. Deviations in predicted recruitment for each pipeline variant were between 4% and 13% (absolute magnitudes), as measured against the full workflow.
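The recruitment-deviation comparison described above can be sketched in a few lines of NumPy. This is an illustrative paraphrase, not the study's exact code: the normalization by the reference peaks is an assumption, as is the function name.

```python
import numpy as np

def recruitment_deviation(af_peaks_ref, af_peaks_variant):
    """Mean relative change (in %) of per-fiber AF peak values between
    the reference (full) pipeline and a pipeline variant.

    Both inputs are 1D arrays of peak activating-function values, one
    entry per modeled fiber."""
    ref = np.asarray(af_peaks_ref, dtype=float)
    var = np.asarray(af_peaks_variant, dtype=float)
    return 100.0 * np.mean(np.abs(var - ref) / ref)
```

For example, a variant whose AF peaks all differ from the reference by 10% yields a deviation of 10, matching the 4%–13% range reported for the pipeline variants.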

Machine learning for image processing
Transfer learning is used in a wide range of machine learning applications to enhance network training speed and performance [17, 64]: the weights of a network trained previously in a separate context (w) are reused to initialize the weights of the network to be optimized (w′). As w already encodes some geometrical/topological knowledge, it is almost certainly closer to an optimal configuration for w′ than random values, thus reducing the time to convergence. This approach is especially beneficial when available training data is limited, as the pre-trained weights need only be 'fine-tuned' with the available labels. In this study, transfer learning not only provided rapid convergence during training, but was also found to improve the final segmentation quality.
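The initialization scheme described above can be sketched in PyTorch as follows. The tiny network shown is a hypothetical stand-in for the HarDNet-based model (the channel counts and learning rate are illustrative assumptions, not the study's actual values):

```python
import copy
import torch
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for the segmentation network:
    # 3 MR input channels (e.g. T2, Fat, 19F), 7 output channels
    # (e.g. 6 tissue classes + background).
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 7, kernel_size=1),
    )

# w: network trained previously in a separate context (e.g. on LS2 slices).
source = make_model()
# ... training on the source dataset would happen here ...

# w': initialize the target network from w instead of random values,
# then fine-tune on the limited target-region labels (LS1 or CS1).
target = make_model()
target.load_state_dict(copy.deepcopy(source.state_dict()))
optimizer = torch.optim.Adam(target.parameters(), lr=1e-4)  # small lr for fine-tuning
```

Because the copied weights already encode some structural knowledge, the fine-tuning loop that follows typically converges much faster than training from random initialization.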
Segmentation performance was also shown to be sensitive to the choice of network architecture, with HarDNet MSEG providing the best generalization to new data. These results thus support the findings of [19] regarding HarDNet MSEG generalization and accuracy with sparse training data. The pipeline described here achieves comparable or superior results to those shown in [46, 52, 56] while requiring less data, a benefit to segmentation applications beyond the current study. The improved performance partly results from extending the original network architecture to read multiple MR input layers and predict several output tissues at once. Furthermore, our modification of HarDNet MSEG results in fewer weights, which has been shown to reduce overfitting [51] and improve generalization.
Additional study outcomes include a preprocessing pipeline for the preparation of input data [28] and automated postprocessing routines, both critical for achieving high-quality segmentations. Preprocessing was found to benefit notably from masking, which has the additional advantage of facilitating a compressed (sparse array) data representation, thus reducing storage and computational resource requirements.

Clinical application
As discussed previously, the work presented here was motivated by the need for enhanced automation and increased throughput in the modeling phase of a broader study of therapeutic SCS for feline bladder control [23, 29]. In this context, rapid and robust model generation can facilitate the planning of animal experiments and the optimization of individualized stimulation configurations. Ultimately, such experiments target translation to the human spine, where automated workflows for fast, image-based model generation yield a substantial reduction in the effort barrier to creating personalized treatment plans. Precedent for machine learning-enhanced personalized medicine exists, e.g. in cancer therapy planning, where a radiologist's time and expertise are limiting factors, and AI-derived time savings or reductions in knowledge requirements are likely to incentivize clinical adoption. To this end, our pipeline features an inspect-and-adapt step, where an iSeg module in o²S²PARC permits the radiologist to review the segmentations and make adjustments as needed. The same module can also be used to semiautomatically segment additional slices should transfer learning be required. In addition, o²S²PARC offers a 'Guided Mode' for hiding the complexity of the underlying pipeline and reducing the user interface to a series of simple, interactive steps, thus creating a user experience appropriate for clinical (research) environments.

Figure 6. Similar segmentation accuracy across pipelines does not guarantee similarity in predicted fiber recruitment. Assessment of the impact of different segmentation pipeline steps on segmentation performance (top row: Jaccard index; middle row: Hausdorff distance) and nerve recruitment predictions (bottom row). Recruitment prediction impact (bottom row) was quantified as the mean change between the AF peak values (all fibers) computed in the reference (full) pipeline and a given variant. Blue box insets: log-log scatter plots of the AF peaks for each fiber as predicted by the reference pipeline (x-axis) and the variant (y-axis). See figures A5 and A6 in the appendix for extended data.
Considering the morphological and physiological similarities between the cat and human lumbosacral spine [26, 53], we expect the CNN segmentation pipeline to readily generalize to human MRI data using transfer learning. There are, admittedly, salient anatomical differences between cat and human spines, such as the size of the spinal canal (larger in humans), which results in a thicker CSF layer and a larger separation between the roots and the spinal cord in humans. However, there is cause for optimism in light of the demonstrated success of transfer learning in generalizing from the cat lumbosacral spine to the cervical spine, where the CSF is thinner and both the shapes and locations of the roots differ.

Benefits
The key benefits of the developed approach include: (1) high-quality segmentations rivaling manual expert segmentation in terms of IOV, (2) dramatic acceleration of previously labor-intensive manual segmentation, (3) creation of detailed spinal cord models featuring six distinct tissue types, (4) automated pre- and postprocessing routines for image preparation, training, and segmentation correction/refinement, (5) an improved network architecture requiring comparatively little training data and offering superior convergence, (6) analysis of the impacts and benefits of each pipeline step, and (7) analysis of both segmentation quality and metrics reflecting predicted physiological impacts (i.e. axonal recruitment).

Limitations and potential future work
Various aspects of this research present opportunities for improvement and extension in future pipeline iterations. First, the depth and breadth of the training data were constrained in several ways, including: the number of segmented slices per spine, owing to the difficulty of manually generating high-quality labels; stylistic variation, i.e. the restriction to a single expert operator; the limited specimen number (more could enhance robustness to anatomical variation); and scanner diversity, owing to limited resources and the customized imaging protocols employed. Thus, assessing the pipeline's robustness and generalization to various training data sources remains an open and important challenge. Further gains in robustness, generalization, and transfer could be achieved through more extensive data augmentation, e.g. by mimicking imaging noise, extending the affine transformation space, or adding non-affine deformations (the latter is particularly expected to improve translation across different spine regions). Moreover, generative adversarial networks (GANs), a technique wherein two ML agents compete, resulting in the generation of new data with the same statistics as the training set [10], could be used to synthesize additional slices with the goal of enhancing generalization.
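A minimal augmentation sketch of the kind discussed above is shown below. The specific transformations and parameters (flip probability, rotations, noise level) are illustrative assumptions, not the study's actual augmentation ranges, which are given in table 1; the key point is that geometric transformations must be applied identically to image and label, while noise is added to the image only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def augment(image, label):
    """Jointly augment an MR slice and its label map (illustrative only).

    Geometric operations (flip, rotation) are applied to both arrays so
    that labels stay aligned with the anatomy; intensity noise mimicking
    imaging noise is applied to the image alone."""
    if rng.random() < 0.5:                     # random horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    k = int(rng.integers(0, 4))                # random multiple of 90 degrees
    image, label = np.rot90(image, k), np.rot90(label, k)
    image = image + rng.normal(0.0, 0.01, image.shape)  # additive Gaussian noise
    return image, label
```

Non-affine deformations (e.g. elastic warps) would slot into the same joint image/label transformation step.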
A further strategy to increase segmentation performance and reduce training time could be to use publicly available, pre-trained neural networks. Any network trained on T1- or T2-weighted imaging data with expert labels, even one that distinguishes fewer tissue classes, is a candidate, since the first and last network layers can be replaced by variants reflecting the required numbers of input and output channels. The weights of the network could then be used in place of random initialization as a first guess during training. Even if the application of the public network differs substantially from the target application, this strategy is likely to reduce the time to convergence, since even a distantly related network's weights are expected to be closer to an optimal solution than random weights (though the risk of becoming trapped in a local optimum must be carefully studied).
While we believe that our pipeline has the potential to generalize to human cadavers (see the discussion above), the relative resolution of spinal cord MR imaging in living humans is typically lower, posing a significant challenge to immediate clinical translation. Fine structures (e.g. rootlets) may not be sufficiently resolved to permit accurate segmentation-based reconstructions [44, 45]. Additionally, 19F image sequences are not possible in living humans, since these require substitution of the CSF with a fluorine solution. However, segmentations of cadaver spine images obtained using our pipeline could serve as templates to inform the segmentation of otherwise poorly resolved structures in living humans. In future work, generalization to other imaging protocols, especially those commonly employed in the clinic, should also be studied, likely requiring further investigation of normalization methodologies.
Furthermore, a pipeline for the subject-specific prediction of spinal root activations such as that described herein could be used in future studies to (1) investigate relevant electromagnetic-electrophysiological interaction mechanisms and identify exposure and response quantities of interest, (2) design implants with superior performance (e.g. stimulation selectivity, recruitment, power efficiency), (3) guide patient-specific selection among different implants, (4) optimize and personalize stimulation parameters (implant insertion location, current strengths, pulse shapes), (5) perform in silico studies on implant safety and efficacy to establish regulatory evidence that guides, complements, or even replaces animal and human trials, and (6) aid in the design of intelligent (e.g. model-based) control strategies.
To our knowledge, this is the first study to quantify the impact of pipeline design and segmentation quality on downstream nerve recruitment predictions. Ideally, however, the implications of segmentation outcomes would be assessed at the level of functional impacts (bladder control, motion restoration, pain suppression), which could be achieved experimentally or via a reliable physiological response model. The functional consequences of segmentation quality remain to be addressed in future research.

o²S²PARC
o²S²PARC was chosen to host our pipeline as it provides convenient access to customized services designed to support all of the ML, segmentation, geometric modeling, electromagnetic/neural simulation, scripting, analysis, visualization, and optimization functionalities required. In addition, the following o²S²PARC benefits are of fundamental importance for the larger modeling and planning pipeline for SCS for bladder control, of which this work constitutes one portion:
• Rapid prototyping and creation of complex pipelines from a library of reusable building blocks supporting a wide variety of software technologies;

Stimulation optimization
While the subject of this research is the automated creation of spinal cord models from image data, it is worth remarking on the utility of such models; namely, fast and optimal personalized treatment planning through hybrid simulations of the electromagnetic fields produced by neurostimulation devices and the consequent nerve fiber activations. Detailed models represent one strategy for mitigating the difficulties of optimizing/personalizing treatment in light of significant inter-subject anatomical and physiological variation and the curse of dimensionality associated with stimulation parameter combinations. However, other strategies that rely on purely data-driven approaches have also been proposed. For example, in [13], an elegant and efficient control scheme is described wherein two complementary deep neural networks are trained online using the electromyography responses to various stimulation parameter combinations. Advantages of data-driven approaches include: (1) the avoidance of complex computational models (which entail various uncertainties and may not capture downstream musculoskeletal responses), and (2) the possibility to continuously learn and adapt (e.g. accounting for recovery or injury/disease progression). However, model-driven approaches offer the benefit of making personalized predictions prior to implantation. This makes it possible to personalize the choice of implant and its placement, the optimality of which has also been shown to be highly subject-specific [45].
We note here that the AF has been used as a fast predictor for SCS optimization in prior studies, e.g. in [35], where it was used to facilitate multiobjective optimization. In previous work, we extended the classical AF formulation by introducing analytical Green's functions to consider leakage currents, axial conductance, and temporal stimulation dynamics, yielding the so-called generalized activating function (GAF) [37]. The GAF provides superior predictive power at comparable computational cost, and offers the possibility of considering (and optimizing) complex pulse shapes and dynamic stimulation protocols.
Despite the significant computational advantages of the AF/GAF, high-dimensional optimization also requires algorithmic sophistication. For example, [6] demonstrated that Gaussian-process (GP)-based Bayesian optimization can outperform traditional algorithms in the high-dimensional parameter spaces typical of high-density neural interfaces, while [37] successfully employed back-propagation to select an optimal stimulation configuration in a model of SCS. Using the GAF, we have also shown that hybrid GP surrogate model (SuMo) generation and SuMo-based multigoal genetic optimization can be used to quickly obtain superior pulsing schemes [12]. Note that these algorithms are compatible with both data- and model-driven approaches.

Conclusions
We developed an automated MR image segmentation and cat spine model creation pipeline based on a novel, HarDNet-inspired CNN architecture. This pipeline includes routines for image preprocessing, data augmentation, segmentation, clean-up, and transfer learning. Furthermore, we evaluated the impact of various pipeline steps on tissue segmentation quality (Jaccard index and Hausdorff distance) and fiber recruitment predictions (fiber depolarization estimated via the AF). The analysis of predicted nerve recruitment, a therapeutically relevant metric, permitted a clinically meaningful assessment of various pipeline design decisions. Importantly, it was observed that even when segmentation quality is high, small deviations in tissue labels can manifest in significant downstream differences in predicted fiber recruitment metrics. Therefore, the functional predictions of image-based biophysical models are sensitive to subtle variations in tissue geometry, a topic that merits further research. Furthermore, we examined the generalization and transferability of the trained CNN to new image data, including data from different cat spine regions. It was found that transfer learning not only significantly improved generalization, but also enhanced segmentation quality overall. Finally, our pipeline exhibited segmentation performance similar to that of human experts, as inferred from IOV scores. This work helps lay the groundwork for a comprehensive workflow for personalized SCS for bladder control in humans using model-based biophysical simulations.

Table A5. Generalization methodology and masking significantly affect segmentation. p-values of the segmentation performance comparisons for the transfer learning investigation described in section 2.9.2. Values below 1% are considered significant and depicted in bold (see section 2.4.5).

H(A, B) := max( sup_{a∈A} d(a, B), sup_{b∈B} d(b, A) ), where d(a, B) := inf_{b∈B} d_E(a, b) and d_E(·, ·) denotes the standard Euclidean distance. In other words, H(A, B) is the maximal separation of the boundaries of A and B, and thus a measure of their similarity. Smaller H values signify better surface fidelity.
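For concreteness, both evaluation metrics can be computed in a few lines of NumPy. This is an illustrative sketch (the extraction of boundary point sets from label maps and any per-tissue bookkeeping are assumed to happen elsewhere); the 50 µm pixel size from table 4 is used to express distances in micrometers.

```python
import numpy as np

def jaccard_index(mask_a, mask_b):
    """Intersection-over-union of two boolean segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0  # identical empty masks agree fully

def hausdorff_distance(points_a, points_b, pixel_size_um=50.0):
    """Symmetric Hausdorff distance between two boundary point sets
    (N x 2 arrays of pixel coordinates), scaled to micrometers."""
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    # max over both directed distances: sup_a inf_b and sup_b inf_a.
    return max(d.min(axis=1).max(), d.min(axis=0).max()) * pixel_size_um
```

Note the complementary sensitivities discussed above: the Jaccard index depends on tissue area, whereas the Hausdorff distance depends only on boundary separation.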

Figure 1 .
Figure 1. High-level study workflow. Structural MR images (T2, Fat, and 19F) from different cat spine regions are used as inputs to a CNN to establish subject-specific anatomical models by assigning tissue labels to pixels. The models are functionalized with spinal root axon trajectories, and electromagnetic and neuronal dynamics simulations are performed to predict neural fiber recruitment.

Figure 2 .
Figure 2. Illustration of the modified HarDNet architecture developed for this study. Shown is the network layout of HarDNet MSEG 68-32-1. The red lines/boxes indicate modifications of the original HarDNet MSEG implementation, and their numbering refers to the descriptions listed in the main text (e.g. the red square represents the modified aggregation layer). For additional details, see Huang et al [19].
• generalization: segmentation performance on a dataset (LS2) distinct from the original, without further training.
• transfer: segmentation performance after adapting the weights of a network pre-trained on LS2 by continued training on slices of the target dataset (LS1 or CS1), i.e. the benefit of pre-training on another dataset.

Figure 4 .
Figure 4. Superior HarDNet 68-32-1 performance. Comparison of segmentation performance metrics for different network architectures. Left: Jaccard index per tissue class (higher is better), with means in the right-most column. Right: Hausdorff distance per tissue class (lower is better; note the reversed scale).

Figure 5 .
Figure 5. Human-like performance is achieved with masking. Mean segmentation metrics for the different pipelines compared to inter-operator variability (top row: Jaccard index; bottom row: Hausdorff distance; left column: LS1; right column: CS1).

Figure 7 .
Figure 7. Overview of the complete Bladder Project pipeline on o²S²PARC. The yellow box indicates the contribution of this work. *: image data for the AI segmentations (input); **: segmentation (output); ***: AF predictions derived from the EM simulations (output).

Figure A3 .
Figure A3. Transfer learning improves segmentation performance on the original training data, and masking improves generalization and the Jaccard index for all tissue classes. Segmentation performance (Jaccard index) for the individual tissues and their mean when applying the network to data from the same region it was trained on (baseline), when applying it to a different dataset without additional training (generalization), and when employing some transfer learning first (transfer). Results are shown with (a), (b) and without (c), (d) the masking step in the pipeline: (a) application of a network trained on LS2 to LS1 with masking, (b) application of a network trained on LS2 to CS1 with masking, (c) application of a network trained on LS2 to LS1 without masking, (d) application of a network trained on LS2 to CS1 without masking.

Figure A4 .
Figure A4. The pipeline performs above inter-operator variability in terms of Hausdorff distance. The Hausdorff distances for all tissues and their mean are shown for: (a) application of a network trained on LS2 to LS1 with masking, (b) application of a network trained on LS2 to CS1 with masking, (c) application of a network trained on LS2 to LS1 without masking, (d) application of a network trained on LS2 to CS1 without masking.

Figure A6 .
Figure A6. Impact of pipeline steps on stimulation predictions (CS1). Scatter plots comparing the logarithms of the peak AF predictions obtained for the cervical region (CS1) using the full pipeline or one of its variants.

Table 1 .
Data augmentation ranges. Transformation parameter space used as part of the data augmentation. Example images can be found in figure A2.

Table 2 .
The number of connected components per tissue class that are retained during the clean-up step.

Figure 3 .
Figure 3. Segmentation pipeline and workflows for the assessment of transfer learning capabilities. Blue box: comparison of the pixel intensity histograms across the three spines in the masked region after normalizing with respect to LS2. As evident in the histograms, normalization ensures that pixel intensities are comparable across datasets, thus facilitating network transferability. Bottom: workflow for extracting the baseline, generalization, and transfer learning data, including all steps of the pipeline.

Table 3 .
Pipeline configurations. Pipeline variants used to assess the impact of different steps and backbone choices on the resulting segmentation performance and nerve stimulation predictions.

Table 4 .
Inter-operator variability baseline. IOV for six slices from two spine datasets, in terms of Jaccard index and Hausdorff distance. These quantities serve as benchmarks for the performance of the automated segmentation pipelines. Pixel size: 50 µm.