Late rectal bleeding after 3D-CRT for prostate cancer: development of a neural-network-based predictive model

Published 21 February 2012 © 2012 Institute of Physics and Engineering in Medicine
Citation: S Tomatis et al 2012 Phys. Med. Biol. 57 1399, DOI 10.1088/0031-9155/57/5/1399

Abstract

The aim of this study was to develop a model exploiting artificial neural networks (ANNs) to correlate dosimetric and clinical variables with late rectal bleeding in prostate cancer patients undergoing radical radiotherapy and to compare the ANN results with those of a standard logistic regression (LR) analysis. 718 men included in the AIROPROS 0102 trial were analyzed. This multicenter protocol was characterized by the prospective evaluation of rectal toxicity, with a minimum follow-up of 36 months. Radiotherapy doses were between 70 and 80 Gy. Information was recorded for comorbidity, previous abdominal surgery, use of drugs and hormonal therapy. For each patient, a rectal dose–volume histogram (DVH) of the whole treatment was recorded and the equivalent uniform dose (EUD) evaluated as an effective descriptor of the whole DVH. Late rectal bleeding of grade ≥ 2 was considered to define positive events in this study (52 of 718 patients). The overall population was split into training and verification sets, both of which were involved in model instruction, and a test set, used to evaluate the predictive power of the model with independent data. Fourfold cross-validation was also used to provide realistic results for the full dataset. The LR was performed on the same data. Five variables were selected to predict late rectal bleeding: EUD, abdominal surgery, presence of hemorrhoids, use of anticoagulants and androgen deprivation. Following a receiver operating characteristic analysis of the independent test set, the areas under the curves (AUCs) were 0.704 and 0.655 for ANN and LR, respectively. When evaluated with cross-validation, the AUC was 0.714 for ANN and 0.636 for LR, which differed at a significance level of p = 0.03. When a practical discrimination threshold was selected, ANN could classify data with sensitivity and specificity both equal to 68.0%, whereas these values were 61.5% for LR. 
These data provide reasonable evidence that results obtained with ANNs are superior to those achieved with LR when predicting late radiotherapy-related rectal bleeding. The future introduction of patient-related personal characteristics, such as gene expression profiles, might improve the predictive power of statistical classifiers. More refined morphological aspects of the dose distribution, such as dose surface mapping, might also enhance the overall performance of ANN-based predictive models.

1. Introduction

The prediction of radiotherapy-related toxicity in prostate cancer patients is an important issue that has attracted growing interest in the scientific literature in recent years.

The pretreatment estimation of the toxicity risk associated with radiation therapy has essentially been based only on dose–volume effects (Fiorino et al 2009a, 2009b). Only in recent years have a few studies included clinical risk factors in these models (Peeters et al 2006, Valdagni et al 2011, Defraene et al 2011). An integrated approach, including clinical and possible genetic risk factors (Valdagni et al 2009, Peters et al 2008), is becoming increasingly important in the intensity-modulated/image-guided radiation therapy era, in which the dose is increasingly shaped around the target tissue to minimize the dose received by organs at risk.

The general issue of predictive modeling in prostate cancer for the detection of both the disease and the adverse effects related to radiotherapy involves the use of different tools based on either traditional methods or, more recently, nonlinear approaches, such as artificial neural networks (ANNs) or support vector machines (Valdagni et al 2008, Gulliford et al 2004, Buettner et al 2009, Pella et al 2011, El Naqa et al 2006, Su et al 2005).

The National Working Group on prostate radiotherapy within the Italian Association of Radiation Oncology (AIRO) began a prospective multicenter trial in 2002 (AIROPROS 0102), focusing on late rectal bleeding. Twenty-two centers enrolled 1132 patients, 718 of whom were available for the evaluation of late toxicity after a minimum follow-up of three years.

Our goal was to use this database to develop an ANN model to predict late rectal toxicity and to compare the results with those of a standard logistic regression (LR) model. This should provide clinicians with a quantitative tool to better tailor radiation treatments to individual patients and to assist patients in the decision-making process.

2. Materials and methods

2.1. The AIROPROS 0102 trial

From July 2002 to March 2004, 1132 patients were enrolled in the AIROPROS 0102 prospective nonrandomized multicenter trial. A detailed description of the protocol, aims, selection criteria, radiotherapy techniques and technical/dosimetric issues has been reported previously (Vavassori et al 2007, Fiorino et al 2008, Fellin et al 2009). Patients with histologically confirmed prostate adenocarcinoma were selected. All patients were treated with three-dimensional conformal radiation therapy (3D-CRT) with a total International Commission on Radiation Units and Measurements (ICRU) dose ≥70 Gy, delivered with conventional fractionation at 1.8–2 Gy/fraction.

Information was recorded on comorbidity (with particular attention to hypertension, cardiovascular history, diabetes mellitus and autoimmune diseases), previous abdominal surgery (rectum–sigma resection, kidney resection, cholecystectomy and appendectomy), use of drugs (in particular, anticoagulants or antiaggregants, antihypertensives, hypoglycemics or insulin), previous or concomitant locoregional diseases related to the bladder, ileum, colon, rectum, anus or prostate and associated pharmacological treatments. The physicians were asked to collect information on the type and duration of hormonal therapy, if prescribed.

Potential bias in contouring, rectal length definition and dosimetric consistency were controlled by implementing rigid pretreatment recommendations, previously agreed upon by the participating institutions (Foppiano et al 2003).

The following technical and dosimetric data were considered for each patient: prescribed dose for each irradiated volume (pelvis, seminal vesicles, prostate) and maximum and mean rectal doses.

Rectal dose–volume histograms (DVHs) of the whole treatment were recorded for all patients and the percentage fractions of the rectum receiving more than 20, 30, 40, 50, 60, 70 and 75 Gy (designated V20Gy → V75Gy) were considered. The equivalent uniform dose (EUD; Niemierko 1999) was also calculated from the DVH, as a possible useful dosimetric predictor of late rectal toxicity through the power-law relationship

\[ \mathrm{EUD} = \left( \sum_i \upsilon_i \, D_i^{1/n} \right)^{n} \qquad (1) \]

where D_i and υ_i correspond to the values for the ith point on the differential DVH and n is a nonnegative parameter. For small n (n → 0; serial organs), the EUD tends to the maximum dose, whereas for n = 1 (parallel organs), EUD is the mean dose. In this study, n = 0.03 was chosen for the G2/G3 late rectal bleeding endpoint. This value derives from a maximum-likelihood fit that was calculated for this population (Rancati et al 2011). Tables 1 and 2 show the distributions of the clinical and dosimetric/technical parameters, respectively, in this subset of patients.
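As an illustration of the EUD definition, the sketch below assumes the usual generalized-mean form, EUD = (Σ υ_i D_i^(1/n))^n, with υ_i the fractional volume of each differential DVH bin. The function name `eud` and the three-bin DVH are hypothetical and are not taken from the study; the maximum dose is factored out only to keep the large exponent 1/n numerically stable.

```python
def eud(doses_gy, volumes, n):
    """Equivalent uniform dose from a differential DVH.

    doses_gy: dose of each DVH bin (Gy)
    volumes:  volume of each bin (any unit; normalized internally)
    n:        volume parameter, must be > 0
    """
    if n <= 0:
        raise ValueError("n must be positive")
    total = float(sum(volumes))
    d_max = max(doses_gy)
    # Factor out the maximum dose so that (D_i / d_max) ** (1 / n) stays in [0, 1].
    s = sum((v / total) * (d / d_max) ** (1.0 / n)
            for d, v in zip(doses_gy, volumes))
    return d_max * s ** n


# Hypothetical three-bin DVH, for illustration only.
doses = [20.0, 50.0, 75.0]
vols = [0.5, 0.3, 0.2]
eud_mean = eud(doses, vols, 1.0)     # n = 1 reproduces the mean dose
eud_serial = eud(doses, vols, 0.03)  # small n is dominated by the highest doses
```

With n = 1 the result equals the mean dose of the toy DVH, while with n = 0.03 it lies just below the maximum bin dose, which is the behavior described in the text.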

Table 1. Patient characteristics according to clinical features.

Characteristic                          n (%)
Diabetes                                45 (6.3%)
Hormonal therapy                        558 (77.9%)
Hemorrhoids                             151 (21.2%)
Use of anticoagulants/antiaggregants    150 (21.0%)
Use of antihypertensives                334 (46.8%)
Previous abdominal surgery              69 (9.7%)
Pelvic-node irradiation                 39 (5.4%)
Seminal-vesicle irradiation             547 (76.5%)

Table 2. Distributions of dosimetric parameters.

Parameter                       Min(a)   Max(a)   First quartile   Median   Third quartile
Dose to pelvic nodes (Gy)       39.6     52.0     45.0             45.0     50.0
Dose to seminal vesicles (Gy)   19.8     81.0     60.0             66.6     74.0
ICRU dose (Gy)                  70.0     81.6     72.0             74.0     76.0
Mean rectal dose (Gy)           11.6     69.1     45.2             51.5     56.2
Maximum rectal dose (Gy)        63.6     85.2     72.7             74.8     77.0
V50Gy (%)                       7.1      100.0    37.5             49.3     63.0
V60Gy (%)                       3.7      89.0     26.8             36.0     46.9
V70Gy (%)                       0.0      68.5     11.3             18.3     25.0
V75Gy (%)                       0.0      40.3     0.0              0.6      7.0
EUD (n = 0.03) (Gy)             59.9     75.8     67.7             69.5     72.0

(a) Min (Max): minimum (maximum) value in the population under study.

2.2. Definition of late rectal toxicity scores and endpoints

The patients were examined before they commenced treatment, once a week during radiotherapy, at the end of radiotherapy, one month after the completion of treatment and every six months thereafter. A self-administered questionnaire was used to score rectal and intestinal toxicities. This questionnaire was completed by the patients before the treatment (basal), within one month after the end of therapy and then every six months up to three years after the completion of radiotherapy. The responses to the questionnaire were used to classify the gastrointestinal symptoms according to the SOMA/LENT (subjective, objective, management and analytic/late effects of normal tissue) scoring systems for late radiation morbidity.

To develop both the ANN and logistic-based predictive models, we focused on late rectal bleeding, which was defined as follows. (a) G2: late rectal bleeding that occurred > 2 times per week; (b) G3: late rectal bleeding when daily bleeding was registered or if a blood transfusion and/or laser coagulation and/or a surgical procedure was necessary. We defined 'bleeders' as those patients showing a G2/G3 bleeding event at any time from five months after the completion of 3D-CRT, even in cases of full recovery.

2.3. Statistical methods

According to a review of the medical literature, all relevant clinical features, comorbidity, use of drugs and dosimetric parameters can be included in predictive models to assess the risk of toxicity after radiation therapy. Toxicity was coded as a binary (yes/no) output variable according to the positive or negative occurrence, respectively, of the endpoint considered. According to this classification, 52 bleeding events were observed in the 718 patients of the total population.

The performance of the systems in distinguishing positive from negative cases was described with a receiver operating characteristic (ROC) analysis. The ROC curve shows all possible pairs of sensitivity and specificity values obtained as the decision criterion (i.e. the cut-off in the complication probability or classifier output) is systematically varied. The area under the ROC curve (AUC) was used to quantify the classification capacity of the system. The ROC analysis, including testing the differences between the curves, was performed with MedCalc® (MedCalc software, Mariakerke, Belgium).
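The quantities used in the ROC analysis can be sketched in a few lines. The example below is not the MedCalc procedure; it assumes the standard rank interpretation of the AUC (the probability that a positive case receives a higher score than a negative one, with ties counted as one half), and the function names and toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Toy classifier outputs (illustration only): 2 bleeders, 4 non-bleeders.
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.4]
labels = [1, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
sens, spec = sens_spec(scores, labels, cutoff=0.4)
```

Varying `cutoff` over all observed scores and plotting sensitivity against 1 − specificity traces the ROC curve for these data.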

2.3.1. Feature selection

The genetic algorithm (GA; Goldberg 1989) was initially applied as a data reduction technique to assist in the selection of a subset of significant predictors. GA is an optimization procedure that can search efficiently for binary strings. In our case, the bits in the strings indicate whether to accept (1) or reject (0) each possible input variable. GA then evaluates their fitness (i.e. how good a solution they represent) and eliminates poor ones, breeding replacements using the artificial genetic operators, mutation and crossover. For each string, a specific (probabilistic) neural network is created to allow performance (fitness) to be evaluated in terms of the difference (error) between the observed and estimated probabilities. An elevated network error is an indicator of irrelevant input variables.

The GA was implemented with the standard Holland method with elitism (unaltered retention of the best string from each generation) and roulette selection, biased according to the relative fitness of the strings within generations (Goldberg 1989). Fitness is normalized linearly before selection so that the ratio of best:worst fitness is 2:1. This ensures that a constant selective pressure is maintained throughout the duration of the algorithm. During crossover, a random point on the string is selected, and the two strings 'swap ends' to create two children. Crossover allows two strings with different good features to combine them together into a single, more powerful string. During mutation, some of the bits are randomly flipped; the corresponding mutation rate was set to 1 per thousand bits (0.001), as suggested by Goldberg (1989). The population of strings at each generation was set to 100, with a limit of 100 generations.
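The selection scheme described above can be sketched as follows. This is a minimal illustration, not the Statistica implementation: the fitness function is a hypothetical stand-in (agreement with an arbitrary target mask) for the network-error-based fitness used in the study, and the names `linear_scale`, `run_ga` and `toy_fitness` are assumptions made for this example.

```python
import random


def linear_scale(fitness, ratio=2.0):
    """Rescale fitness linearly so that the best:worst selection weight is ratio:1."""
    lo, hi = min(fitness), max(fitness)
    if hi == lo:
        return [1.0] * len(fitness)
    return [1.0 + (ratio - 1.0) * (f - lo) / (hi - lo) for f in fitness]


def run_ga(fitness_fn, n_bits, pop_size=100, generations=100,
           mutation_rate=0.001, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fit = [fitness_fn(s) for s in pop]
        best = pop[fit.index(max(fit))]
        history.append(max(fit))
        weights = linear_scale(fit)
        children = [best[:]]  # elitism: the best string survives unaltered
        while len(children) < pop_size:
            # Roulette selection biased by the rescaled fitness.
            a, b = rng.choices(pop, weights=weights, k=2)
            cut = rng.randint(1, n_bits - 1)  # one-point crossover: swap ends
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Mutation: flip each bit with a small probability.
                children.append([bit ^ 1 if rng.random() < mutation_rate else bit
                                 for bit in child])
        pop = children[:pop_size]
    fit = [fitness_fn(s) for s in pop]
    history.append(max(fit))
    return pop[fit.index(max(fit))], history


# Hypothetical fitness: number of bits agreeing with an arbitrary target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]


def toy_fitness(bits):
    return sum(1 for b, t in zip(bits, TARGET) if b == t)


best_mask, history = run_ga(toy_fitness, n_bits=len(TARGET))
```

Because of elitism, the best fitness recorded in `history` can never decrease from one generation to the next, which is the qualitative behavior one would expect of the 'elite' string.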

Figure 1 shows the typical behavior of the fitness value for the 'elite' string at different generations of the GA over the selected range.

Figure 1. An example of the normalized fitness values of the best string as a function of the generation number.

The genetic algorithm was run with the Statistica neural network data reduction module (StatSoft, Statistica Neural Network reference manual).

2.3.2. Network setup and validation

Extended documentation on the use of ANN and the instruction and validation techniques involved can be found in Bishop (1995).

A multilayer perceptron model with one hidden layer and one output neuron was chosen as the basis for all the different network configurations examined. Basically, this ANN was composed of three layers (i.e. input, hidden and output layers), with one or more nodes, called 'neurons'. Each neuron is connected to all the neurons of the adjacent layer. The connections and neurons are characterized by weights and bias values, respectively, and define the actions of the ANN. A dichotomous response is obtained by choosing one single output neuron and comparing its output value with a decisional threshold.
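The forward pass of such a network can be sketched as below. The logistic activation in both layers, the 0.5 threshold and all weight values are assumptions made for illustration; the text does not state the activation functions or the fitted weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_output(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron with a single output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def classify(output, threshold=0.5):
    """Dichotomous response: compare the output neuron with a decision threshold."""
    return 1 if output >= threshold else 0


# Hypothetical weights for 5 inputs and 3 hidden neurons (illustration only).
x = [0.4, 1.0, 0.0, 1.0, 0.0]
w_hidden = [[0.5, -0.2, 0.1, 0.3, -0.4],
            [-0.3, 0.8, 0.2, -0.1, 0.0],
            [0.1, 0.1, -0.5, 0.2, 0.6]]
b_hidden = [0.0, -0.1, 0.2]
w_out = [1.2, -0.7, 0.5]
b_out = -0.3
p = mlp_output(x, w_hidden, b_hidden, w_out, b_out)
label = classify(p)
```

Each row of `w_hidden` holds the connection weights into one hidden neuron, so every input is connected to every hidden neuron, as described above.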

The variables selected after the data reduction phase were inserted as input to set the ANN classifiers.

For ANN training, the 'hold out' method (Bishop 1995) was first applied as follows. The overall population was divided into three sets: training (359 cases, including 26 bleeders), verifying (179 cases, including 13 bleeders) and test (180 cases, including 13 bleeders). To lower the perturbation of the classifier outputs caused by data partitioning, specific attention was paid to maintaining the original (overall) condition of 7.2% positive cases in each selected group, according to a practice known as 'stratification' (Kohavi 1995, Forman and Scholz 2010). The training and verifying sets were both involved in network instruction: the training set was used to optimize the inner fitting parameters of the ANN by means of an iterative algorithm (back propagation), whereas the verifying set was used, according to the 'early stopping' method, to monitor the progress of the iterations and to stop the process at the minimum value of the verifying error. The test set was omitted from the instruction process and used only to evaluate the predictive power of the system with independent data. The same instruction and test sets were used to set up the logistic model.
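A stratified three-way split of this kind can be sketched as follows. The 50%/25%/25% fractions are inferred from the reported set sizes, and the shuffling, seed and function name are assumptions; the actual assignment procedure used in the study is not described.

```python
import random


def stratified_three_way(labels, fractions=(0.5, 0.25), seed=0):
    """Split case indices into training, verifying and test sets, class by class."""
    rng = random.Random(seed)
    sets = ([], [], [])
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = int(len(idx) * fractions[0])
        n_verify = int(len(idx) * fractions[1])
        sets[0].extend(idx[:n_train])
        sets[1].extend(idx[n_train:n_train + n_verify])
        sets[2].extend(idx[n_train + n_verify:])  # remainder goes to the test set
    return sets


# 52 bleeders and 666 non-bleeders, as in the study population.
labels = [1] * 52 + [0] * 666
train, verify, test = stratified_three_way(labels)
sizes = [len(s) for s in (train, verify, test)]
bleeders = [sum(labels[i] for i in s) for s in (train, verify, test)]
```

Splitting each class separately is what keeps the proportion of bleeders close to 7.2% in all three sets; with these fractions the set sizes and bleeder counts match those quoted above.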

During training, the difference (error) between the network prediction and the target output was monitored down to its final value as an indication of the fitting performance. Many different network configurations were considered to couple the selected set of input variables to a proper structure of the hidden layer. To that end, a sequence of models with the number of hidden neurons increasing from 1 to 10 was instructed, repeating the instruction 10 times for each model. The final error of the verifying set was averaged within each group with the same number of neurons to help in the choice of an optimal network complexity. The cross-entropy function, which involves the products of the target values and the logarithms of the network outputs, was selected as suitable for classification purposes. For more details, see Bishop (1995).
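For reference, with a single output neuron and a binary target, the cross-entropy error is usually written as below. The symbols are notation introduced here rather than taken from the paper: t_k is the 0/1 target of case k, y_k is the corresponding network output and N is the number of cases. Whether the software sums or averages over cases is not stated in the text.

```latex
E = -\sum_{k=1}^{N} \left[ t_k \ln y_k + (1 - t_k) \ln (1 - y_k) \right]
```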

The relative importance of an input variable in the selected ANN was determined with a procedure called 'sensitivity analysis' (Hunter et al 2000, Zurada et al 1994). This method is based on the evaluation of the variation in the network error when the variable is treated as missing in the model (i.e. replaced by a constant value). The higher this difference, the greater the importance of the variable.
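The error-ratio idea can be sketched as follows. Replacing a variable by its mean is one possible choice of the 'constant value' and is an assumption here, as are the cross-entropy error measure, the function names and the two-variable toy model, which deliberately ignores its second input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(targets, outputs, eps=1e-12):
    """Mean cross-entropy error for binary targets."""
    total = 0.0
    for t, y in zip(targets, outputs):
        y = min(max(y, eps), 1.0 - eps)  # guard against log(0)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(targets)


def error_ratios(predict, rows, targets):
    """Ratio of the error with each variable 'missing' to the baseline error."""
    baseline = cross_entropy(targets, [predict(r) for r in rows])
    ratios = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        masked = [r[:j] + [mean_j] + r[j + 1:] for r in rows]
        ratios.append(cross_entropy(targets, [predict(r) for r in masked]) / baseline)
    return ratios


def toy_model(row):
    # Hypothetical classifier that depends only on the first variable.
    return sigmoid(4.0 * (row[0] - 0.5))


rows = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
targets = [0, 1, 0, 1]
ratios = error_ratios(toy_model, rows, targets)
```

A ratio well above 1 marks a variable the model relies on, whereas a ratio of 1 (as for the unused second variable here) indicates that removing it leaves the error unchanged.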

In addition to the 'hold out' technique explained above, a fourfold cross-validation procedure was applied to obtain classification results for a full set of independent data. All data were uniformly split into four groups (folds). For each fold, used as the test set, a model was fitted to the other three (two used for training and one for verifying) devoted to network instruction. The role assigned to each group was rotated for different folds to allow the collection of test results for all data. In this case, stratification was also applied, i.e. the number of bleeders in each fold was chosen to reflect the 7.2% ratio in the full population.
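The fold construction and role rotation can be sketched as below. Dealing the shuffled cases of each class out in turn is one simple way to stratify; this scheme, the choice of the next fold as the verifying set and the function names are assumptions, not details given in the text.

```python
import random


def stratified_folds(labels, k=4, seed=0):
    """Deal the shuffled cases of each class out to k folds in turn."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def rotations(k=4):
    """Yield (training folds, verifying fold, test fold) for each rotation."""
    for test_fold in range(k):
        verify_fold = (test_fold + 1) % k
        train_folds = [f for f in range(k) if f not in (test_fold, verify_fold)]
        yield train_folds, verify_fold, test_fold


# 52 bleeders and 666 non-bleeders, as in the study population.
labels = [1] * 52 + [0] * 666
folds = stratified_folds(labels)
fold_sizes = [len(f) for f in folds]
fold_bleeders = [sum(labels[i] for i in f) for f in folds]
plan = list(rotations())
```

Each case appears in exactly one test fold, so merging the test-set scores from the four rotations yields one independent prediction per patient.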

Statistical analyses, including the ANN setup and sensitivity analysis, were performed with commercial software (Statistica, StatSoft, Tulsa, OK, USA).

3. Results

At the end of the GA-based instruction process, five variables were selected as predictive of the diagnostic output. The variables finally selected were: EUD, abdominal surgery, presence of hemorrhoids (hemor), use of anticoagulants (antico) and androgen deprivation (AD). The final value for normalized fitness after 100 generations (see figure 1) was 0.849. The EUD was calculated for each patient from the corresponding DVH (equation (1) with n = 0.03).

Figure 2 shows the variation of the relative error of the ANN model depending on the number of hidden network nodes. The figure shows that increasing the number of hidden units beyond three to four neurons does not cause a marked reduction in the corresponding error. Thus, a network architecture with three neurons in the hidden layer was finally selected as the one best able to balance the stability of the system with optimal classification performance.

Figure 2. Plot of the normalized network error for different numbers of neurons in the hidden layer. Dots are the mean values for a given model; bars are the standard deviations (SD) for repeated runs.

The network architecture is shown in figure 3. The relative importance of each input variable for the classification, evaluated with a sensitivity analysis, is also shown in the figure. Training was performed with the backpropagation algorithm with 48 epochs. The final training and verification errors were 0.569 and 0.598, respectively. The corresponding test error was 0.616.

Figure 3. ANN configuration for late rectal bleeding. The relative importance of the input variables (error ratio) is represented by the bars on the left and was derived with a sensitivity analysis.

Figure 4 shows the classification results for late rectal bleeding with respect to the ANN and the logistic regression classifiers with the 'hold out' technique, according to the ANN model shown in figure 3. The curves are shown separately for all the data involved in the instruction phase (joined training and verification sets) and for the independent test data.

Figure 4. ROC plots for late rectal bleeding based on the 'hold out' method according to ANN and LR. Thick line: training/verifying sets; thin line: independent test set.

Figure 5 shows the ROC plots for cross-validation. All scores evaluated from the test set of each fold were used to compute the ROC curve for the full dataset. A comparison between ANN and the reference LR is also shown in the figure.

Figure 5. Fourfold cross-validation ROC analysis for late rectal bleeding with different classifier types. All plots were obtained after training each fold and merging the late complication probabilities of all independent test sets. Thick line: ANN; thin line: LR; circles: points with equal sensitivity and specificity; closed symbol: ANN; open symbol: LR.

The plots in figure 5 can be interpreted as a realistic estimate of the behavior of an 'average' classifier. Based on this model, optimal classification could be achieved, in practice, by setting the probability output cut-off to produce equal sensitivity and specificity values. This choice leads to a sensitivity (and specificity) of 68.0% and 61.5% for ANN and LR, respectively (circles in figure 5).
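Choosing the cut-off at which sensitivity and specificity coincide can be sketched as a simple scan over candidate thresholds. The function name and the toy scores are hypothetical; with real data the two values rarely match exactly, so the sketch returns the cut-off with the smallest gap between them.

```python
def equal_sens_spec_cutoff(scores, labels):
    """Cut-off minimizing |sensitivity - specificity| (scores >= cut-off are positive)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        sens = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c) / n_pos
        spec = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c) / n_neg
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, c, sens, spec)
    return best[1], best[2], best[3]


# Toy classifier outputs (illustration only).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, sens, spec = equal_sens_spec_cutoff(scores, labels)
```

For these toy data the scan settles on a cut-off of 0.5, where three of the four positives and three of the four negatives are classified correctly.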

Table 3 reports the numerical evaluations of the ROC analysis. The significance of the differences between ANN and LR is reported for both the 'hold out' and cross-validation data according to the method devised by DeLong et al (1988), a nonparametric approach to the analysis of areas under ROC curves that uses the theory of generalized U-statistics to generate an estimated covariance matrix.

Table 3. Evaluation of the performances of the predictive models.

Experiment         Set                  AUC (ANN)   AUC (LR)   p(a)
Hold out           Training/verifying   0.730       0.658      0.03
Hold out           Test                 0.704       0.655      0.43
Cross-validation   Full                 0.714       0.636      0.03

(a) Difference between ANN and LR.

Within each model (ANN and LR), no significant differences were observed between the AUC of the training/verifying set and that of the test set (table 3).

Table 4 reports the results of the logistic regression analysis: the odds ratios and the corresponding statistical evaluations.

Table 4. Results of the logistic regression analysis(a).

Variable   OR     SE     p      95% CI lower   95% CI upper
EUD        1.16   0.08   0.03   1.014          1.316
Surgery    2.58   1.18   0.04   1.050          6.341
Hemor      1.69   0.61   0.15   0.832          3.446
Antico     0.61   0.28   0.29   0.247          1.513
AD         0.55   0.21   0.11   0.266          1.143

(a) OR: odds ratio; SE: standard error; CI: confidence interval.

4. Discussion

This study shows the results of a first attempt to apply ANN to the prediction of late rectal complications in patients treated in Italy for prostate cancer with high-dose conformal radiation therapy. Because of its well-known nonlinearity, ANN is generally considered a very flexible tool for data fitting, especially in complex situations, regardless of the distribution of the input parameters. On the other hand, ANN suffers from a lack of 'transparency' in data interpretation compared with traditional statistical methods. To limit this problem, we tried to assign roles to the different input variables and ranked their importance in the model according to a sensitivity analysis (figure 3).

Based on this evaluation, dose appears to be the factor most responsible for late rectal bleeding. Here, dose was represented by the variable EUD calculated with a volume parameter, n = 0.03. This was previously found to be the best fit for this population (Rancati et al 2011). By definition (equation (1)), the EUD is associated with the DVH components at different doses, and it is increasingly dominated by the higher doses as n decreases. The choice of n = 0.03 should therefore be interpreted as a strong bias toward the highest doses of 70–75 Gy.

Rectal dose is a well-known and outstanding risk factor for rectal bleeding, as has been extensively demonstrated by many authors in recent years (Peeters et al 2006, Söhn et al 2007, Valdagni et al 2011). As a consequence, strict DVH constraints on treatment planning have been proposed and were applied in the AIROPROS 0102 population within the 40–70 Gy range. This practice of constraining the dose up to 70 Gy may (at least partly) explain the minor role played by the DVH components related to doses lower than 70 Gy on the ANN classification.

Besides EUD, our data suggest that four additional clinical features contribute to rectal bleeding: abdominal surgery, presence of hemorrhoids, use of anticoagulants and hormonal therapy, with abdominal/pelvic surgery playing a major role. As with dose, the importance of surgery has been emphasized in other reports (Peeters et al 2006, Fiorino et al 2008, Fellin et al 2009). According to these studies, the higher sensitivity acquired by patients after previous surgery is associated with the observed increased risk of bleeding.

A recent report based on the same AIROPROS 0102 dataset (Valdagni et al 2011) indicated that high rectal dose (V75Gy) and surgery are the main contributors to a nomogram developed for the prediction of late rectal bleeding. The variables hemor, antico and AD were also included in this nomogram as a third variable, called 'nomacu', which was shown to represent acute Radiation Therapy Oncology Group (RTOG)-defined toxicity (Valdagni et al 2008). Therefore, both the ANN- and nomogram-based classifiers share similar input variables of high rectal dose, surgery and acute toxicity.

From table 4, the major roles played by EUD and surgery also appear to be confirmed for LR. Although a borderline (for hemor and AD) or weak (for antico) association with late rectal bleeding is observed for the remaining clinical components, this figure does not seem to lead to biased conclusions. In fact, no better classifications were obtained by either removing the three variables from the analysis, leading to a reduction of 0.017 in the AUC, or by substituting them with nomacu, to more closely approach the set of input variables reported by Valdagni et al (2011). In the latter case, a small reduction of 0.006 in the AUC was observed.

To the best of our knowledge, only a few studies have reported the use of ANN models to predict late rectal bleeding after radiotherapy (Gulliford et al 2004, Buettner et al 2009), and this is the first report in which clinical factors have been added to the usual volume–dosimetric information to enhance the predictive power of an ANN classifier.

An early paper (Gulliford et al 2004) reported a population of 119 patients with an incidence of 18 (15%) G2/G3 RTOG graded events for bleeding, recorded in a retrospective study. The data were tested with a fourfold cross-validation technique, with a resulting sensitivity of 66.1% and specificity of 56.4%. These results are slightly worse (especially for specificity) than our 68% sensitivity/specificity classification obtained with cross-validation.

Another more recent report by the same research group (Buettner et al 2009) dealt with a sophisticated application of dose surface maps (DSMs) to classify rectal bleeding with locally connected ANNs and ensemble learning in a population of 329 patients with 53 (16%) observed occurrences of late rectal bleeding. The paper focused on the significant increase in the predictive power of the classifier (AUC = 0.64) compared with the traditional dose surface histogram approach (AUC = 0.59) after morphological information was included in the form of DSMs. This improvement indicates that additional predictive power is inherent in the shape of the dose distribution.

Given the different aims and approaches of these studies and ours (reporting an AUC = 0.71; table 3, cross-validation), any comparisons should be made with caution. Instead, the conclusions of these studies suggest that better results could be achieved with ANNs using both the morphological aspects of the dose distribution, such as dose surface mapping, and relevant clinical factors, i.e. surgery and pretreatment indicators of acute toxicity.

The substantial reduction of the initial pool of candidate predictors to only five, performed with the GA-based feature selection procedure, may seem surprising at first sight. However, this result agrees well with the choice made by Valdagni et al (2011), as discussed above, indicating that it is reasonable to use a relatively small fraction of the starting variables as the input set. The observed similar structure of the input variables is probably ascribable to the population shared by the two studies rather than to the loss of some important input feature missed by the GA. This agreement in feature selection between the two studies also supports the soundness of our comparison between ANN and LR, because both models were based on a meaningful choice of the input variables. It should also be noted that the use of the EUD in place of each of the different DVH components is itself a way of reducing the dimensionality of our dataset in radiobiological terms (Rancati et al 2011). A set of a few (2–6) prognostic variables is usually used in the elaboration of nomograms and models that integrate clinical and dosimetric information, developed to predict different gastrointestinal toxicities (Valdagni et al 2009).

Because no rules exist that precisely define the number of inputs and hidden neurons to be used, the final ANN configuration was chosen to produce a sufficiently simple architecture and to satisfy some general rules of thumb, such as the Berry and Linoff (1997) rule (number of hidden neurons < 2 × number of inputs) or Uncle Bernie's rule (the total number of weights and biases should be less than one-tenth of the number of training cases) (Widrow 1987, Baum and Haussler 1989, Carrara et al 2007).
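As a worked check, the two rules of thumb can be applied to the selected configuration (5 inputs, 3 hidden neurons, 1 output neuron, 359 training cases). The parameter count assumes a fully connected network with one bias per hidden and output neuron; the function name is hypothetical.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected one-hidden-layer perceptron."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


n_inputs, n_hidden, n_training_cases = 5, 3, 359

n_params = mlp_parameter_count(n_inputs, n_hidden)         # 18 + 4 = 22
hidden_rule_ok = n_hidden < 2 * n_inputs                   # 3 < 10
uncle_bernie_ok = n_params < n_training_cases / 10         # 22 < 35.9
```

Under these assumptions both conditions hold for the three-neuron network, whereas a network with five hidden neurons (36 parameters) would already violate the one-tenth rule for 359 training cases.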

Another point currently of great interest is the development of increasingly complex and complete normal tissue complication probability (NTCP)-based models to assess the risk of rectal toxicity. To improve the capacity of these models, efforts in this research field have recently been directed toward the inclusion of predisposing clinical factors in traditional dose–effect analyses. In a recent study, Defraene et al (2011) reported a maximum AUC value of 0.77 for rectal bleeding when surgery and cardiac history were included in three different NTCP-based approaches (the Lyman–Kutcher–Burman model, a relative seriality model and a logistic model). The authors concluded that comparable results could be achieved with all the different models, and that these were improved by the inclusion of clinical factors. However, the absolute AUC values should be confirmed by analysis with an independent test sample.

4.1. 'Hold out' method

Substantial agreement, with no significant differences in AUC, was observed between the two instruction populations and the test population in the 'hold out' experiment, for both ANN and LR. This indicates that the models can be generalized to accommodate new data. However, the use of training data, even for model comparisons (reporting p = 0.03 between models, table 3), can only be considered a preliminary result, because the results should really be validated with an independent test set.

The analysis by 'hold out' involves the training of a single model for use in the prediction of new cases. As discussed, this technique allows us to assign roles to the network input variables and to inspect the characteristics of the different training/validation phases. However, for validation, this method makes use of only a fraction of the total available population, with a consequent risk of producing partial results with little statistical power. In particular, the observed difference between ANN and LR was not significant for the test set.

4.2. Cross-validation

Cross-validation was used to extend our classification to the whole dataset, increasing both its statistical power and the representativeness of the data. Focusing on these results, the AUC difference of about 0.08 between ANN and LR, significant at p = 0.03, is of clinical interest (table 3). The results of the cross-validation experiment reasonably support the conclusion that ANNs have better predictive power than LR.

With stratification, we ensured the same proportion of bleeders in each fold as in the general population (7.2%). It has been shown (Kohavi 1995) that stratification increases the model stability during cross-validation and that if the model (inducer) is stable (i.e. the induced classifiers make the same predictions), the validation results are unbiased, regardless of the number of folds involved. For the classification in our folds, we calculated the average AUC ± SD to be 0.71 ± 0.01, indicating a rather stable behavior, with a reasonably low risk of bias. Although the use of more than four folds (5 or 10) may be considered a more standard practice, the choice of fourfold cross-validation has been reported by other authors using ANNs to predict rectal bleeding (Gulliford et al 2004).

It might be asked here whether further improvement could be achieved, regardless of the model under study, by selecting new variables that carry additional independent information. This has been suggested in the literature by some authors who have analyzed the gene profiles of particular patients to explain why, despite reasonable satisfaction of the DVH dose constraints, they experienced rectal bleeding, whereas other patients who did not satisfy the DVH dose constraints did not (Valdagni et al 2009). In other words, a particular genetic susceptibility, if taken into account and used as input in an automated classifier, could significantly increase the predictive power of the model. Unfortunately, no gene profiling was available for our dataset, precluding any direct verification of this hypothesis.

5. Conclusions

In conclusion, our data reasonably support the observation that the ANN approach performs better than LR. With cross-validation, a clinically relevant difference of about 0.08 in the AUC was identified between models, at a significance level of p = 0.03. A specific choice of the ANN output score cut-off led to equal sensitivity and specificity values of 0.68.

The main predictors of late rectal bleeding were EUD (n = 0.03) and surgery. Other input variables were found to be clinical predictors of RTOG acute toxicity.

Data from this study and from the literature suggest that the results could be enhanced by taking into account the morphological aspects of the dose distribution, such as dose surface mapping, and certain specific characteristics of the patient, such as the patient's genetic profile.

Acknowledgments

We thank Dr Mauro Carrara and Professor Giacomo Aletti for their helpful discussions on the technical aspects of neural networks.
