Patient specific intracranial neural signatures of obsessions and compulsions in the ventral striatum

Objective. Deep brain stimulation is a treatment option for patients with refractory obsessive-compulsive disorder. A new generation of stimulators holds promise for closed-loop stimulation, in which stimulation adapts in response to biological signals. Here we aimed to discover a suitable biomarker in the ventral striatum in patients with obsessive-compulsive disorder using local field potentials. Approach. We induced obsessions and compulsions in 11 patients undergoing deep brain stimulation treatment using a symptom provocation task. We then trained machine learning models to predict symptoms using the intracranial signal recorded from the deep brain stimulation electrodes. Main results. Average areas under the receiver operating characteristic curve were 62.1% for obsessions and 78.2% for compulsions for patient-specific models. For obsessions it reached over 85% in one patient, whereas performance was near chance level when the model was trained across patients. Optimal performance for obsessions and for compulsions was obtained at different recording sites. Significance. The results from this study suggest that closed-loop stimulation may be a viable option for obsessive-compulsive disorder, but that intracranial biomarkers are patient specific and not disorder specific. Clinical Trial: Netherlands trial registry NL7486.


Introduction
Obsessive-compulsive disorder (OCD) is a psychiatric disorder characterized by intrusive obsessive thoughts (obsessions) and repetitive behaviors (compulsions). Approximately 10% of OCD patients continue to experience symptoms despite routine treatment [1]. For those patients, deep brain stimulation (DBS) is an emerging treatment with a responder rate of around 50%-60% [2][3][4][5]. In DBS, electrodes are implanted in specific brain regions, which can then be stimulated with electrical pulses. For OCD, electrodes are most often implanted in striatal regions such as the nucleus accumbens (NAc), the ventral capsule/ventral striatum and the ventral anterior limb of the internal capsule (vALIC) [6,7]. Typically, DBS delivers the electrical pulses continuously at a constant level, whereas in closed-loop stimulation the target region is stimulated in response to some neural activity. Continuous constant stimulation requires adjustment of stimulation parameters until maximal treatment response is achieved, while closed-loop stimulation could adapt its stimulation automatically in response to a biomarker. This requires a device that can both stimulate and record neural activity, and that can be programmed with models that react to neural activity. Such devices have become available in recent years, including the Activa PC + S prototype [8] and the recently available Percept PC, both from Medtronic.
Closed-loop stimulation has been explored in movement disorders such as essential tremor and Parkinson's disease [9][10][11]. Two recent studies, each using a small sample of essential tremor patients, have demonstrated that it is possible to perform closed-loop stimulation with therapeutic efficacy equivalent to that of continuous stimulation and with reduced energy demands [12,13].
However, closed-loop DBS for psychiatric disorders is still at a relatively early stage of development, largely due to the lack of brain biomarkers that are unique to symptoms [14]. Only two case studies in OCD using chronic local field potential (LFP) recordings from the PC + S have been published. In the first study, data from a patient with subthalamic nucleus leads [15] revealed theta oscillations in the ventral subthalamic nucleus that were not found in the recordings from a comparison group of Parkinson's disease patients. In the second case study, a PC + S device was used to simultaneously record and stimulate in the supplementary motor area and ventral capsule/ventral striatum in a patient with OCD [16]. While the patient reported subjective improvement, no such improvement was seen in the clinical rating scales.
In the current investigation, we endeavor to find unique neurobiological signatures of symptoms in a dataset of 11 patients with implanted DBS electrodes capable of LFP recordings. To elicit OCD symptoms, we used a symptom provocation paradigm that was inspired by the induction of symptoms during exposure therapy. We investigated whether a candidate biomarker could be derived using deep learning on time series recorded from DBS electrodes. For our goal of time-series classification, a deep-learning model has previously been found to yield compelling results [17].
We specifically tested whether we could distinguish the symptomatic state from baseline using the recordings from the DBS leads implanted in the vALIC. We further explored whether a biomarker would be generic across patients or patient specific. This allowed us to directly investigate whether one model could be built at the group level to predict symptom states in a new patient, and whether a patient-specific model could predict symptom states in the same patient at a later time.

Participants
Eleven patients with treatment-refractory OCD were recruited from the outpatient clinic for DBS at the department of Psychiatry of the Amsterdam University Medical Center, location Academic Medical Center (AMC), in Amsterdam from 2015 to 2020. Patients were eligible either if they had an existing DBS system implanted, had responded to treatment and needed a stimulator replacement (one patient), or if they were indicated for implantation of a DBS system (ten patients). The exclusion criterion was alcohol or substance abuse during the last 6 months. The study was approved by the medical ethics board of the AMC, and all patients consented to participate in this study and signed an informed consent form. The trial was registered in the Netherlands Trial Register (Trial NL7486).

Symptom provocation experiment
Two to three weeks after implantation or stimulator replacement the patients underwent a symptom provocation task. The DBS implantation is described in the supplementary material. For ten of the patients the therapeutic DBS was turned on for the first time after participation in our experiment. In one patient (Patient 2), due to a scheduling conflict, the therapeutic DBS was turned on before participation in our experiment. For this patient, the therapeutic stimulation was switched off for the duration of the experiment.
Prior to the start of the experiment, all participants had their symptom severity assessed using the Yale-Brown Obsessive Compulsive Scale (Y-BOCS) [18], the Hamilton Depression Rating Scale (HAM-D) [19] and the Hamilton Anxiety Rating Scale (HAM-A) [20] for OCD, depressed mood and anxiety, respectively. Then the patient underwent four rounds of a symptom provocation sequence (figure 1(a)). First, the patients were instructed to sit and watch a neutral movie for three minutes while LFPs were recorded. Then obsessions were induced in a patient-specific manner (table 1). For example, patients with contamination obsessions were asked to touch the floor. Once obsessions had been induced, the patient was instructed to sit and focus on their obsessions for three minutes while LFPs were recorded. Then three minutes of recording followed in which the patient was allowed to act out their compulsions, e.g. washing hands for contamination obsessions. After feeling relieved of the compulsions, the patient was instructed to sit calmly while LFPs were recorded. Recordings were made in the left and right hemispheres simultaneously. Before and after each task the patient rated their symptoms on a visual analogue scale (VAS) measuring anxiety, agitation, mood, obsessions, compulsions and avoidance. The VAS scores were entered into a repeated measures ANOVA to test for interaction and main effects of symptoms (six items) and time (seven VAS measurements). The main effect of rounds (four items) was tested to account for effects of habituation. If an interaction was significant, one-way post-hoc ANOVAs were carried out to further elucidate the effects of symptoms and time.

Data acquisition and modeling
LFPs were recorded using the 8180 Sensing Programmer SW (Medtronic Inc.) with a sampling rate of 422 Hz using 0.5 Hz high pass filtering and 100 Hz low pass filtering [8]. Due to amplifier settling, the first 1000 samples were removed from each recording. Furthermore, because of the presence of a marker signal at 104.5 Hz to measure channel saturation [21,22], the signal was band pass filtered between 3 and 99 Hz using a zero phase finite impulse response filter in MNE-python [23]. Three Hz was chosen as the high pass threshold to minimize artefacts. In addition, there were non-physiological artefacts in the frequency band between 47 and 53 Hz in three patients, so a zero phase finite impulse response band stop filter was used to filter this band out in all patients. Since the LFPs were low pass filtered below 100 Hz, we downsampled the signals by a factor of two to speed up computations. Before building our prediction models we wanted to see if we could build predictive features from group level differences in frequency band power across the different states. We used Welch's method to calculate power in the delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), low gamma (30-50 Hz) and high gamma (50-100 Hz) bands. The power was calculated per 3 s sample, normalized to the overall power in the sample, and then averaged within and across channels, resulting in one power value per round, band, side and state for each subject. The results were then entered into a repeated measures ANOVA in which state, band, round and side were within-subject factors. For significant interactions, post hoc tests were performed on changes from baseline, with Bonferroni correction for multiplicity.
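As an illustration of the band-power features described above, the following minimal sketch computes relative power per band for one three-second sample with a hand-written Welch estimate in NumPy. The 211 Hz rate (422 Hz downsampled by two) and the band edges follow the text; the one-second segment length, 50% overlap, Hann window and all function names are assumptions made for this sketch, not the settings used in the study.

```python
import numpy as np

FS = 211.0  # Hz; 422 Hz recording downsampled by two (from the text)
BANDS = {   # band edges in Hz, as listed in the text
    "delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
    "beta": (13, 30), "low_gamma": (30, 50), "high_gamma": (50, 100),
}

def welch_psd(x, fs, nperseg=211):
    """Average Hann-windowed periodograms over 50%-overlapping segments.

    The segment length and overlap are assumed values. The one-sided
    factor of two is omitted because only relative power is used below.
    """
    step = nperseg // 2
    window = np.hanning(nperseg)
    scale = fs * np.sum(window ** 2)
    psds = []
    for start in range(0, len(x) - nperseg + 1, step):
        seg = x[start:start + nperseg]
        seg = (seg - seg.mean()) * window  # remove the segment mean, then taper
        psds.append(np.abs(np.fft.rfft(seg)) ** 2 / scale)
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, np.mean(psds, axis=0)

def relative_band_power(x, fs=FS):
    """Power in each band divided by the total power of the sample."""
    freqs, psd = welch_psd(np.asarray(x, dtype=float), fs)
    total = psd.sum()
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].sum() / total)
            for name, (lo, hi) in BANDS.items()}

# Synthetic three-second sample: a pure 6 Hz oscillation (theta band).
t = np.arange(int(3 * FS)) / FS
demo = np.sin(2 * np.pi * 6.0 * t)
powers = relative_band_power(demo)
```

For the synthetic 6 Hz signal nearly all relative power falls in the theta band, which is a quick sanity check that the band masks and normalization behave as intended.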
Since this experiment was part of a larger study, baseline or resting state data were available from several other experiments. The baselines from this experiment, together with baselines from other sessions in other experiments the patient performed, were used to predict which patient the data came from. We performed this experiment to identify whether the time series would be sufficiently comparable to enable group-level state predictions. These other baseline data consisted of resting state data from the outermost contact pair, recorded while the patient sat still for 3 min with eyes either open or closed. The amount of baseline data available per patient was on average 10.3 min (SD: 4.6). The baselines were split into a training and a test set by using the experiment with the smallest amount of data as the test set and all other sessions as the training set. The training set was then sampled randomly 300 times and the test set 100 times per patient. The duration of the samples was three seconds, chosen on the basis of validation set performance from among half, one, three and five seconds. Twenty percent of the training set was used as a validation set for model selection.
For group level state prediction, the data from the described experiment were used. One patient was selected randomly as the test patient and one patient as the validation set. This was then repeated for all sets of test patients. We drew 180 random three-second samples per round per state, in total 2880 samples per patient for the training set. This means we used oversampling by a factor of three for the training set as a form of data augmentation. For the test set 60 samples per round per state were used or 240 samples per patient. For single patient state prediction, the first three rounds in the experiment were used for training and the last round for testing. Twenty percent of the training set was used as a validation set. An overview of the training procedure for all tasks can be seen in supplementary figure 2. The number and size of samples were the same as for the group level prediction. In all experiments data from both the left and right electrodes were used. The recordings were made in 'montage sweep' mode, in which the recording device sweeps through each available recording configuration consecutively. Each three-minute task in each round consisted of 30 s of measurement from each channel configuration of the electrode. There were four contact points and thus six bipolar channels, each covering a slightly different position in the patient's brain; six configurations measured for 30 s each give a total of 3 min. Each channel measures the voltage between two contacts on the electrode. If the four contact points are numbered 0, 1, 2 and 3 from the most superficial one to the deepest one, then the sequence of channel recordings for each 30 s was 0-1, 0-2, 0-3, 1-2, 1-3 and 2-3.
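The sweep order and sample counts above can be reproduced with a short sketch. The sampling rate of 211 Hz (422 Hz downsampled by two), the 30 s per configuration and the 180 draws per round per state come from the text; the function name draw_windows, the uniform choice of configuration, and the rule that a window never straddles two configurations are assumptions made for illustration.

```python
import itertools
import random

FS = 211                 # Hz after downsampling (422 Hz / 2, from the text)
SAMPLE_LEN = 3 * FS      # one three-second sample = 633 points
SEGMENT_LEN = 30 * FS    # 30 s recorded per channel configuration

# All bipolar pairs of contacts 0..3; combinations() yields them in the
# sweep order given in the text: 0-1, 0-2, 0-3, 1-2, 1-3, 2-3.
CHANNEL_PAIRS = list(itertools.combinations(range(4), 2))

def draw_windows(n_windows, rng):
    """Draw (pair, start) tuples for random three-second windows.

    Each window lies entirely inside one 30 s configuration segment
    (an assumption; the text does not say how boundaries were handled).
    """
    windows = []
    for _ in range(n_windows):
        pair = rng.choice(CHANNEL_PAIRS)
        start = rng.randrange(SEGMENT_LEN - SAMPLE_LEN + 1)
        windows.append((pair, start))
    return windows

rng = random.Random(0)
train_windows = draw_windows(180, rng)   # 180 samples per round per state
samples_per_patient = 180 * 4 * 4        # 4 rounds x 4 states = 2880
non_overlapping = (3 * 60) // 3          # 60 disjoint 3 s windows in 3 min
oversampling_factor = 180 // non_overlapping
```

A three-minute task holds 60 non-overlapping three-second windows, so 180 random draws correspond to the factor-of-three oversampling mentioned in the text.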
First, to test whether the computed power in the frequency bands was predictive, we used gradient-boosted trees implemented with the catboost package [24]. For hyperparameter tuning we searched over the number of trees, the depth of each tree, l2 leaf regularization and border count (splits per numerical feature) on the validation set. We used permutation tests to assess significance. The targets were permuted separately within the train and test sets. A permutation test provides only approximate p-values unless all possible permutations are computed for an exact test [25]. A sequential approximation to the general permutation test (SAPT) was used to stop permuting once an adequate number of permutations had been done to either reject or accept the null hypothesis of no difference [26,27]. This method controls the type I error rate while achieving power close to that of the exact permutation test, but with a substantial reduction in the number of permutations. For selected models we explored feature importance using Shapley additive explanations (SHAP) [28,29]. The deep learning model used was InceptionTime [17], which has state-of-the-art performance on raw time-series classification. This model took as input the raw time series from each sample. The model consists of inception blocks and residual blocks. In the inception blocks, convolutional filters of three different sizes and a max pooling filter are simultaneously slid over the input. The results from these operations are then concatenated to form the input to the next block. There are three inception blocks in one residual block. The input of each residual block is added to its output, forming a residual connection. After the residual blocks a global average pooling layer was used, followed by a fully connected layer with as many neurons as there are classes to predict. All model parameters were found using hyper-parameter tuning on the validation set of the subject prediction task.
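To make the idea of a label-permutation test concrete, here is a minimal sketch. It is deliberately not SAPT: it runs a fixed number of Monte Carlo permutations instead of a sequential stopping rule, and it permutes only the test labels against fixed predictions rather than retraining on permuted training labels as described in the text. The function names, the plain accuracy metric and the permutation count are assumptions for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_p_value(y_true, y_pred, metric=accuracy, n_perm=1000, seed=0):
    """Monte Carlo p-value for the null of no label-prediction association.

    Simplified stand-in for SAPT: fixed number of permutations, test
    labels only, no model retraining.
    """
    rng = random.Random(seed)
    observed = metric(y_true, y_pred)
    shuffled = list(y_true)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if metric(shuffled, y_pred) >= observed:
            n_extreme += 1
    # The +1 terms count the observed labelling as one permutation, so the
    # estimated p-value can never be exactly zero.
    return (n_extreme + 1) / (n_perm + 1)

labels = [0, 1, 2, 3] * 15                 # four balanced states
p_perfect = permutation_p_value(labels, list(labels))
p_constant = permutation_p_value(labels, [0] * len(labels))
```

A classifier that always predicts the same state scores identically under every permutation, so its p-value is 1, whereas perfect predictions give the smallest p-value attainable with the chosen number of permutations.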
The model was trained in PyTorch using a learning rate of 1 × 10⁻⁴ and a batch size of 128 with the Adam optimizer [30]. If the validation loss stopped decreasing for three epochs the learning rate was reduced tenfold, and if the validation loss stopped decreasing for six epochs the training was stopped. As in Ismail Fawaz et al [17], for every task we trained five instances of the model with different random initializations and averaged the outputs for the final predictions. This is in line with Mehrer et al [31], who show that predictions are more stable when using ensembles of deep learning models. Permutation testing was used to assess significance, using SAPT as for the boosted trees. Targets were permuted within the train and test sets separately. To further test the capabilities of the patient-specific models, five consecutive samples were drawn from the test set. The total number of samples drawn was reduced by a factor of five so that the total duration in seconds was the same as before. The model used for this temporal ensemble is the model already trained on that patient's training set. This model is used to create predictions on five consecutive samples from the time series, which should all have the same label. We then used majority voting among those predictions to decide which label to predict for all five samples. Linked permutations were used, in which the labels were permuted and both methods were applied to the permuted labels to build a null distribution of the difference between the methods.
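The temporal ensemble described above amounts to majority voting within blocks of five consecutive predictions. A minimal sketch follows; the function name temporal_ensemble, the abbreviated state labels, and the handling of ties and incomplete trailing blocks are assumptions, since the text does not specify them.

```python
from collections import Counter

def temporal_ensemble(predictions, group_size=5):
    """Replace each block of consecutive predictions by its majority label.

    Ties resolve to the label seen first in the block (the ordering used by
    Counter.most_common); a trailing block shorter than group_size is
    dropped. Both choices are assumptions made for this sketch.
    """
    voted = []
    for start in range(0, len(predictions) - group_size + 1, group_size):
        block = predictions[start:start + group_size]
        winner = Counter(block).most_common(1)[0][0]
        voted.extend([winner] * group_size)
    return voted

# Per-sample predictions for two blocks of five consecutive samples.
preds = ["obs", "obs", "base", "obs", "comp",
         "comp", "comp", "base", "comp", "comp"]
voted = temporal_ensemble(preds)
```

In this toy example the first block is voted to "obs" and the second to "comp", so isolated misclassifications inside a block are overruled by the block majority.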
To analyze differences in performance by location the approach of Horn et al [32] was used. A pre-surgery anatomical T1 MRI was linearly coregistered to a post-surgery CT using Advanced Normalization Tools (ANTs) [33]. These were then normalized into the ICBM 2009b NLIN asymmetric MNI space [34] using ANTs. The electrode trajectories and contact positions were reconstructed using the PaCER method [35]. The atlas used for visualization was from Pauli et al [36]. The model sensitivity for obsessions and compulsions was calculated per hemisphere for channels using measurements from adjacent contact points on the test set. For four contact points this results in three measurements per patient per hemisphere. Each measurement was mapped to the midpoint between the electrode contacts in the common space. The measurements were then interpolated using a scattered interpolant and the result smoothed with a Gaussian kernel of 0.7 mm. The resulting map was then visualized using Lead Group [37].
To assess the clinical utility of deciding when to stimulate, compared with always stimulating, decision curve plots were used [38,39]. The scenario of detecting and stimulating in response to obsessions was used, since decision curve analysis requires a binary prediction problem. The net benefit of making decisions about when to stimulate vs always stimulating was assessed. A higher net benefit indicates that the model captures more true positives without increasing false positives.
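For readers unfamiliar with decision curve analysis, the sketch below computes net benefit with the standard definition, true positive rate minus false positive rate weighted by the odds of the threshold probability. The function names and the toy labels and scores are hypothetical; whether the study used exactly this implementation is not stated in the text.

```python
def net_benefit(y_true, y_score, threshold):
    """Net benefit of stimulating whenever the predicted probability of
    obsessions is at least `threshold` (labels: 1 = obsessions, 0 = not)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of always stimulating, i.e. treating every sample as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy data: two samples with obsessions, six without.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.7, 0.1, 0.1, 0.05, 0.15, 0.1]
nb_model = net_benefit(y_true, y_score, threshold=0.2)
nb_all = net_benefit_treat_all(y_true, threshold=0.2)
```

A decision curve is obtained by evaluating both functions over a range of thresholds; the model is useful at thresholds where its net benefit exceeds both always stimulating and never stimulating (net benefit zero).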

Symptom provocation
Patient demographics, responder state and medication use can be seen in table 2. As can be seen from figure 1(b), all the measured VAS scores except mood and avoidance rose sharply during the induction of obsessions. After the patients were allowed to act out their compulsions the scores started to decrease, and once relief was achieved the symptoms had approximately returned to their baseline levels. Repeated measures ANOVA showed a significant symptom × time interaction (F(30) = 11.2, P < 0.001). There was no significant effect of habituation across rounds (F(3) = 0.63, P > 0.05). Post-hoc one-way ANOVAs for each symptom revealed that anxiety, agitation, obsessions and compulsions changed significantly over time (P < 0.001, Bonferroni corrected), whereas mood and avoidance did not.

Group effects of spectral power
The interaction of frequency band and state was significant (F(15, 135) = 2.253, p < 0.007). Post-hoc tests revealed that power in all bands increased during compulsions and relief compared to baseline (figure 2). The increase was more prominent in lower frequencies such as delta and theta.

Patient prediction
We first explored whether the individual patients could be identified using data obtained at rest using either frequency power and the boosted trees or the deep learning model on the raw time-series. For the boosted trees the overall balanced accuracy was    figure 3). This indicates that some patients have a highly specific neural signature, whereas others do not and that we did not have enough subjects to capture the between subject differences in our model.

State prediction
Next, we tested whether the models could distinguish between the symptom states across patients. We used data from the symptom provocation experiment. Data for each patient were tested using a model that was trained on data from the other patients in an iterative manner. For the boosted trees the balanced accuracy was 30% (SD: 4%), while for the deep learning model the balanced accuracy was 31% (SD: 6%). Chance level was 25% for the four states. Since these models were computationally expensive to run and the accuracy was close to chance level, we did not run permutation testing. We then tested whether patient-specific models could identify the symptom state a particular patient was in. For this four-state problem we used balanced accuracy to evaluate the results. However, to gain further insight into the performance for each state, the area under the receiver operating characteristic curve (ROC-AUC) was used per state. Since that measure is only defined for binary classification, this requires comparing each state of interest vs. the rest. For the boosted trees the balanced accuracy was 32.5%, while for the deep learning model the average balanced accuracy across patients was 38.8%, with considerable variation across patients (see table 3 and supplementary table 1). State prediction with deep learning was significantly better than chance level in 9 out of 11 patients as assessed with permutation testing. The top and bottom 2.5% quantiles of the null distribution per patient are found in supplementary table 2. The models do best for compulsions with an average AUC of 78.2%, followed by obsessions with 62.0%. ROC-AUCs for baseline and relief are on average 58.7% and 59.7%. For the patient with the best performing model, patient two, the results were driven by good performance on baseline and obsessions. For the patients with the next best performing models, patients 4, 5, 8 and 11, the results were driven by good performance on compulsions.
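The two evaluation measures used here can be written down compactly. The sketch below computes balanced accuracy as the mean per-class recall and a one-vs-rest ROC-AUC as the probability that a sample from the state of interest receives a higher score than a sample from any other state. The function names and the toy labels and scores are hypothetical and are not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; chance level is 1 / number of classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def roc_auc_one_vs_rest(y_true, scores, positive):
    """AUC for `positive` vs. all other states: the fraction of
    (positive, negative) pairs ranked correctly, with ties counted as half."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example with the four states used in the experiment.
y_true = ["baseline", "obsession", "compulsion", "relief"] * 2
y_pred = ["baseline", "obsession", "compulsion", "baseline",
          "relief", "obsession", "compulsion", "relief"]
obsession_scores = [0.5, 0.6, 0.1, 0.2, 0.7, 0.4, 0.1, 0.3]

bal_acc = balanced_accuracy(y_true, y_pred)
auc_obsession = roc_auc_one_vs_rest(y_true, obsession_scores, "obsession")
```

Because the AUC is computed per state against the rest, a model can reach a high AUC for one state, such as compulsions, while its four-class balanced accuracy stays modest.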
For the boosted trees the models did best for compulsions with average ROC-AUC of 66.3% while performance for other states was lower (supplementary table 1).
We next tested whether ensemble learning could improve model performance for the deep learning model. Five consecutive samples were ensembled together and majority voting was used to determine the prediction for test data. Linked permutations showed that this procedure improved balanced accuracy significantly in three patients, decreased it significantly in two patients, and did not change it significantly for the other patients (figure 3). See supplementary table 2 for the top and bottom 2.5% quantiles of the null distributions. In patient two the accuracy reached 67%, the ROC-AUC for obsessions reached 94.2% and the ROC-AUC for compulsions was 98.1%. Patient 4 reached an accuracy of 54%, while the ROC-AUCs for obsessions and compulsions were 73.1% and 86% respectively.

Localization of deep learning model performance
In figure 4 the performance by location for obsessions and compulsions can be seen. For obsessions the highest sensitivities were on the left side and seemed to be higher in gray matter such as the globus pallidus externa (GPe) and the NAc. For compulsions, the sensitivities were also higher on the left side, and the peak performance was in the white matter just above the NAc, which incidentally was an area of low performance for obsessions. In supplementary figure 4 the same model performance maps can be seen with the active contact points of responders and non-responders overlaid. While it is hard to see a clear pattern, some of the responders' active contacts do overlap with the cluster of high sensitivities for obsessions on the left side.

Cardiac artifact
In patient two there was a heartbeat-like artifact in the left hemisphere in all channel pairs that included contact number zero. This was a known issue in the Activa PC + S, caused by fluid leakage past a seal on top of the stimulator, which affected contact points zero and eight most [22]. The artifact rate was approximately 78-94 beats per minute. Since this artifact could potentially be correlated with symptom states, the predictions for patient two were repeated with the artifact-free channels from the left hemisphere copied over the channels with the artifact in the training set, and the model was re-trained. This had little influence on the accuracies on the test set, with an overall accuracy of 53.8% and state-specific ROC-AUCs of 91.2%, 80.9%, 92% and 49% (baseline, obsessions, compulsions and relief).

Feature importance for boosted trees
Since the boosted trees models use power in frequency bands as features, they lend themselves to feature importance methods like SHAP for analyzing how the frequency power affects the model output. In figures 5 and 6 mean absolute SHAP values are plotted for obsessions and compulsions. The SHAP values differ considerably between patients and no obvious pattern is observable.

Clinical utility
The net benefit of various models for patient 2 can be seen in figure 7. Higher net benefit indicates more true positives are captured by the model without increasing harm due to false positives. We compare the boosted trees based on frequency features with the deep learning model with and without temporal ensembling. The boosted trees are better than always/never stimulating from about 8% up to about 32%. InceptionTime is better than the boosted tree and always/never from about 5% up to around 46%. Although the temporally ensembled model does improve discrimination it only improves clinical utility above a high threshold of 40%.

Discussion
In the current investigation we aimed to identify a biomarker that is suitable for closed-loop DBS in OCD. We induced symptoms in 11 OCD patients using a symptom provocation paradigm and trained boosted trees on frequency power features and a deep learning model on time series recorded from the electrodes used during DBS for OCD. Our main finding was that models based on data from multiple patients were hardly better than chance. Models for individual patients were more successful in identifying different symptom states in the majority of patients, though with low to moderate accuracy. The deep learning models were considerably better than the boosted trees models, even though there were group level differences in frequency band power between compulsions and relief vs baseline. Our results suggest that LFP signatures of obsessions and compulsions are highly patient specific. This study is the first of its kind in using machine learning algorithms on deep-brain recordings in a group of psychiatric patients.
We did find group effects of increases in frequency power across all bands during compulsions and relief. For compulsions the largest increase was in the lower frequency bands such as delta and theta. For relief, although significant, the increase was minor for all bands. Since the compulsion state involved performing a task, such as washing hands, these differences could be driven by motor activity, which needs to be ruled out in future experiments. The obsession task, however, did not suffer from this limitation. The subject-specific models built using these frequency features performed only moderately well for compulsions and poorly for the other states. Although there were also differences in frequency power for relief, the effect was small and the predictive models did not pick up on those differences. The feature importances from SHAP show no obvious pattern, indicating that the signatures the models are using can be quite different between patients. However, due to the poor performance of most of the boosted trees models it is hard to generalize these findings. The overall performance of the deep learning models was significantly better than that of the boosted trees. This suggests that using the raw time series is better than hand-engineering features for use in the model. However, this also makes the interpretation of the models harder due to the black box nature of deep learning models. Future work needs to take this into account and look at black box explainability methods to visualize and interpret the model performance.
The performance of our patient-specific models and the identification of compulsive and obsessive states indicate that the model is picking up on an underlying signal of cognitive and/or affective processes. Since obsession and compulsion signals were differentially detected, there may be different underlying signals for these symptom states in the activity of the ventral striatum. This is further corroborated by the model performance map in figure 4, where the peak for compulsions on the left side corresponds to low performance for obsessions, which therefore seem to relate to different neural sources. The electrode leads were targeted so that the most ventral contact points were located in the NAc. The other contact points were in the white matter in the vALIC as well as the GPe (see figure 4). All the recordings were in a bipolar configuration, where the signal is dominated by neurons local to the electrode and the spatial reach of the LFP signal is on the order of two to five mm, with the spatial reach increasing with the distance between the recording contacts [40]. For analyzing performance of the model by location only adjacent contacts were used, so the spatial reach was highly localized. Thus, it seems our models are most accurate in predicting obsessions when using signal from gray matter in the GPe and NAc. Supplementary figure 4 shows that some of the active contact points of responders overlap with clusters of high performance for obsessions on the left side, but others do not. Due to the low absolute number of responders, it is hard to extract a meaningful pattern from this overlap. The GPe and NAc gray matter signal, where model sensitivity for obsessions is high, reflects the input to those regions [41]. Both regions are part of the cortico-striato-thalamo-cortical (CSTC) loops, which are believed to be central to the pathophysiology of OCD [42].
The CSTC receives projections from regions in the frontal cortex which terminate in the striatum and then travel by either a 'direct' (net excitatory) or an 'indirect' (net inhibitory) pathway to the thalamus before returning to the frontal cortex. It has been hypothesized that in OCD the network is biased towards the direct pathway, resulting in failure to inhibit abnormal and repetitive behaviors [43]. The NAc is part of the input to the CSTC network while the GPe is part of the indirect pathway, though recent studies suggest a more complicated role for the GPe in which it links information from both direct and indirect pathways [44]. How this fits our results is not clear, although these are close and interconnected structures.
For the compulsions the best accuracy was in the white matter close to the NAc. Traditionally, LFP signals from white matter have been considered to reflect volume conduction from nearby gray matter regions and to be otherwise absent. Recently, it was shown that bipolar referencing minimizes volume conduction effects and that some signal remains in the white matter, which could originate from white matter tracts [45]. This signal is of lower amplitude and spectral power than in gray matter, which could explain why obsessions are better distinguished in gray matter. White matter tracts have been increasingly used as targets for DBS in OCD, where an increase in response has been associated with targeting the frontopontine tract [46][47][48]. Without diffusion MRI data we could not investigate how our results overlap with these tracts.
The model performance map also suggested that the peaks for obsessions and compulsions were located in the left hemisphere, suggesting that the signals could be lateralized. A resting-state functional connectivity study in healthy individuals suggests that ventral striatum connectivity is lateralized. Connectivity is stronger within the same hemisphere, and left ventral striatum connectivity is stronger to the dorsomedial prefrontal cortex and posterior cingulate cortex. The authors therefore suggested that the left ventral striatum could have a more important role in linking saliency response to self-control and other internally directed processes [49]. Although our results do not provide evidence for laterality, the internal nature of obsessions and compulsions is consistent with a greater involvement of the left ventral striatum.
Using the symptom provocation task and creating a model performance map to aid in optimizing electrode targeting might seem tempting. However, the current best practice for targeting the vALIC for treatment of refractory OCD is to guide the targeting with individual pre-operative data [50]. The model performance map, on the other hand, is an aggregated normalized map using data that can be gathered intraoperatively at the earliest, after which a model would need to be trained, a step that currently takes about 30 min. Each individual contributes only three points per hemisphere to the map. This is further complicated by the fact that the model performance varied considerably between individuals. While the map is informative for guiding future research, we do not think such normative data are suitable for guiding DBS targeting [51].
Our findings suggest that building patient-specific models works better than building population models. There is variability between patients with regard to the target location of the DBS, the clinical presentation of symptoms and medication use. It should not be surprising that, with only 11 patients, it is hard to build a single model that can predict symptoms for a new patient. The prediction of patient identity did not work well. This is probably because of the variability between patients and may also be partly explained by not having enough data for the model to learn this variability. However, these results also indicate that patient-specific models are more likely to result in clinical tools than population models. In practice, a model would first need to be trained for each newly implanted patient.
The current strategy of always stimulating guarantees stimulation whenever obsessions are present, but at the cost of a high number of false positives (stimulating when there are no obsessions). As can be seen from figure 7, the deep learning models could improve the clinical benefit, depending on the threshold chosen. The threshold implies a trade-off between true positives and false positives. For example, accepting that you need to stimulate five times to capture one true positive would indicate a threshold of 20%. At this threshold the net benefit of treating all as positive is 9.85%, while with InceptionTime it would be 13.3%. The difference of about 3.5 percentage points indicates that a reduction in stimulation of 14% is possible [38], thus possibly saving battery power in the device. Lowering the threshold would produce more false positives but a smaller chance of missing true positives. Raising the threshold offers the possibility of greater battery savings at the cost of missing some true positives.
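The arithmetic behind these figures can be reproduced with the standard decision curve analysis definitions, sketched below. We assume here that the cited method [38] uses the usual formulas: net benefit is the true-positive rate minus the false-positive rate weighted by the threshold odds, and the net reduction in interventions is the difference in net benefit divided by those odds. The function names and the example counts are illustrative only.

```python
def net_benefit(true_pos: int, false_pos: int, n: int, threshold: float) -> float:
    """Net benefit at a probability threshold (standard decision curve form)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    odds = threshold / (1.0 - threshold)  # weight given to a false positive
    return true_pos / n - (false_pos / n) * odds


def net_reduction_in_stimulation(
    nb_model: float, nb_treat_all: float, threshold: float
) -> float:
    """Fraction of stimulations avoided without missing additional true positives."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    odds = threshold / (1.0 - threshold)
    return (nb_model - nb_treat_all) / odds


# Values quoted in the text for a threshold of 20%.
threshold = 0.20
nb_treat_all = 0.0985   # always stimulating
nb_model = 0.133        # InceptionTime

reduction = net_reduction_in_stimulation(nb_model, nb_treat_all, threshold)
print(f"net reduction in stimulation: {reduction:.3f}")  # roughly 0.14

# Illustrative counts (hypothetical, not study data): 20 true and 40 false
# positives out of 100 windows give a net benefit of 0.20 - 0.40 * 0.25.
example_nb = net_benefit(true_pos=20, false_pos=40, n=100, threshold=threshold)
```

At a threshold of 20% the odds are 0.25, so a net benefit difference of about 0.035 corresponds to a reduction of about 0.035 / 0.25 = 0.14, matching the 14% quoted above.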
To date, our experiment has the largest sample of participants undergoing symptom provocation for OCD while LFPs were recorded. The symptom provocation tasks were selected before the experiment in consultation with the patients and were highly successful at inducing anxiety, agitation, obsessions and compulsions (figure 1). In contrast, Miller et al [52] performed perioperative symptom provocation in one patient, while Rappel et al [15] collected both resting-state and symptom provocation LFP data from two OCD patients and compared them with data from four patients with Parkinson's disease. However, one of the OCD patients dropped out early, so they presented symptom provocation data from only one patient. Miller et al did find gamma oscillations in the NAc that were modulated by symptom provocation. However, they found opposite effects in the two locations measured (2 mm apart), suggesting that these modulations vary with location. Rappel et al measured LFPs in the STN, so their results are hard to compare with our data. Provenza et al [53] found a significant negative correlation between delta band power and self-rated OCD symptoms during natural exposures at home in one patient. This relationship was preserved during planned exposures. They presented data from two other patients during planned exposures, but none of the correlations were significant. To our knowledge, we are the only group so far to present symptom provocation results in such a large sample of patients.
Several studies have recently aimed to identify neural signatures that could enable closed-loop DBS for major depressive disorder (MDD). Scangos et al published a series of papers in 2021 involving a single patient with treatment-resistant MDD [54,55]. In the first paper, they identified personalized symptom-specific biomarkers along with a treatment location where stimulation improved symptoms. In the second paper, they applied this and showed an improvement in symptoms using ventral capsule/ventral striatum stimulation with bilateral amygdala gamma power as the biomarker. In another study, Veerakumar et al used long-term LFP recordings during subcallosal cingulate cortex stimulation in four patients and identified an increase in 1/f slope as a candidate biomarker for predicting treatment response [56]. There is, however, no follow-up study exploring the 1/f slope as a biomarker. The personalized approach of Scangos et al in MDD seems to support our finding that there is great individual variability and that a tailored approach for each patient might be required at every stage (targeting, biomarker discovery and modeling) for optimal results.
There are some limitations to this study. Although this is the largest OCD study with sensing LFP data so far, the sample size is small, owing to the limited number of available devices. Additionally, the recording time was limited, in part because of limits on how much data could be recorded before it had to be stored on a computer. This will improve in newer generations of these devices. In the patient-specific models, we predict symptoms in the last round using the other rounds as training data. This means we are predicting symptoms that occur a few minutes later. A more realistic task would be to predict symptoms in a future session, separated by weeks or even months. As Provenza et al [53] show in one patient, there can be significant variability between sessions in the correlation between OCD symptoms and spectral power. We used a black box model, and it is hard to determine what the model uses to predict the symptom state. Furthermore, these are chronically ill patients who were on different medications. Owing to the small sample size, we could not control for medication use or OCD subtypes. Most of the patients were female and eventual non-responders, which might have affected the results, though this may be a more important consideration for the group models than for the patient-specific models. Because of the nature of the task, most patients were moving during the compulsion state (e.g. washing hands), so it cannot be ruled out that motion-related changes in neural activity influenced those results. Further experiments will need to take this confound into account and control for it when predicting compulsions. To analyze model performance by location, the electrodes were localized by warping into MNI space. There is some uncertainty in this; however, the algorithm used is optimized for maximal precision of subcortical deformations [57].
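The hold-out scheme described above, training on the earlier rounds and testing on the last, can be sketched as follows. This is a minimal illustration under the assumption that each recording window carries an integer round label; the function name `split_last_round` and the toy labels are hypothetical and do not reproduce the study's code.

```python
from typing import List, Sequence, Tuple


def split_last_round(round_ids: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx), holding out every window of the final round."""
    if not round_ids:
        raise ValueError("round_ids is empty")
    last = max(round_ids)
    train_idx = [i for i, r in enumerate(round_ids) if r != last]
    test_idx = [i for i, r in enumerate(round_ids) if r == last]
    if not train_idx:
        raise ValueError("need at least two distinct rounds")
    return train_idx, test_idx


# Toy example: six windows recorded over three provocation rounds.
round_ids = [1, 1, 2, 2, 3, 3]
train_idx, test_idx = split_last_round(round_ids)
print(train_idx, test_idx)  # [0, 1, 2, 3] [4, 5]
```

Passing session labels instead of round labels to the same function would give the more demanding across-session evaluation, in which the held-out data are separated from the training data by weeks or months rather than minutes.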
Future work should focus on gathering more data per patient. The individually tailored symptom provocation used here works well to induce symptoms. Newer devices will be capable of storing more data on the device. With more data, better models could be built and more research could be done on the conditions under which the models can detect symptoms. Future research should also use explainability methods to visualize and interpret model decisions. That would give insight into how the model identifies symptoms as well as into the disease itself.

Data availability statement
The data cannot be made publicly available upon publication because they contain sensitive personal information. The data that support the findings of this study are available upon reasonable request from the authors.