Feasibility of machine learning algorithms for classifying damaged offshore jacket structures using SCADA data

Best practice for structural damage detection currently relies on the installation of structural health monitoring systems that collect dedicated high-frequency measurements. Switching to the wind turbine's SCADA (Supervisory Control and Data Acquisition) signals and their commonly recorded low-frequency statistics can reduce both the number of ad-hoc monitoring sensors and the quantity of data required. In this paper, aero-hydro-servo-elastic simulations of a turbine model are used to assess its loads and any changes in its dynamics in a healthy state and in a damaged configuration case study. To prove the feasibility of damage detection through low-resolution data, the statistics of the signals typically recorded by the SCADA and the structural monitoring systems are fed into a database for the training and testing of classification algorithms. The ability of the machine learning models to generalise the classification under both stochasticity and uncertainties in the environmental conditions is tested. Decision tree-based classifiers were able to capture the damage for the majority of the operating conditions considered. Although the traditional SCADA sensor setup had to be supplemented with an additional structural health monitoring sensor, the detection of the damage was shown to be feasible by referring to low-frequency statistics only.


Introduction
Several strategies have been investigated and exploited to reduce the levelized cost of energy (LCOE) of offshore wind farms. So far, the focus has been on a 'race to the bottom' for development costs, pushing for the optimization of design parameters and installation strategies. During the last decade, LCOE reduction has been targeted through the reduction of operation and maintenance costs [1], which are expected to reach 30% of the asset's lifetime cost for the next generation of offshore farms [2]. Although structural damage is not very likely, its late detection can lead to critical consequences that result in costly mitigation actions [3]. On the other hand, technical assessments and knowledge of the status of the turbine's support structure are necessary to prove that operating assets can maintain the required safety levels during lifetime extensions [4].
Current practice for structural damage detection relies strongly on inspections, i.e. practical on-site assessments, which are associated with significant costs and risks due to the offshore environment, especially for structural failures below sea-water level [5]. With respect to data-driven techniques, research on and applications of vibration-based structural damage detection can be found in the literature [6], [7]. The proposed methods mostly identify damage using either natural frequencies or mode shapes and their derivatives, such as the displacement modal curvature [8]. Natural frequencies [...]

Methodology
Figure 1 shows the flowchart of the approach. The methodology developed in this paper is based on the following three main steps:
• The dynamics of the structure are analysed in response to environmental and operational conditions (EOC) and as a consequence of the structural failure. Semi-coupled simulations of the finite element (FE) model of the support structure, in healthy and damaged condition, and of the model of the flexible structure of the turbine tower are run for a representative set of EOC. The time histories of the loading conditions and of the corresponding structural responses are then post-processed into ten-minute statistics, similarly to SCADA data.
• The data are collected into a database, which is sequentially accessed to construct the datasets for the training and testing of the machine learning algorithms. Each dataset for varying environmental loadings of the structure has a 50:50 ratio of healthy and damaged conditions. The features of each dataset, i.e. the set of independent variables used for the prediction, are standardised before being used in the models. The dataset under analysis is then divided into subsets, named "Tr" for the tuning/training subsets and "Te" for the testing subsets.
• The tuning, training and testing methods are applied in the same manner for all datasets considered and all algorithms investigated. The detection capability of classifiers based on logistic regression, support vector machine, k-nearest neighbour, random forest, and Gaussian naïve Bayes is examined. Each algorithm's best performance is obtained through an iterative process. If a classifier performs unsatisfactorily on the test set for stochastic variation of the loadings, feature and dimension reduction techniques are applied (see Figure 1, in green). The most promising algorithms are then tested on the subsets for variations of the environmental parameters (see Figure 1, with "dashed" lines in the blocks for "synthetic data generation" and "data processing").
If insufficient detection performance is exhibited at this stage, the subset for the training of the classifier is extended to include additional features and/or the data from the misclassified cases. Tuning, fitting and testing are then repeated (see Figure 1, in blue).
Finally, a recommendation of the best training set and algorithm for the damage detection task is given.
Figure 1. Workflow for the data generation, the dataset processing and the algorithm training and testing. The iterative processes for the re-training of an algorithm are shown with "dotted lines".
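The standardisation step described in the methodology can be sketched as follows. This is a minimal illustration on randomly generated placeholder features, not the authors' pipeline; fitting the scaler on the training subset only and reusing it on the test subset is an assumption made here to avoid leaking test information, and the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical feature matrices: rows are ten-minute records,
# columns are signal statistics with very different scales.
X_train = rng.normal(loc=[5.0, 200.0], scale=[1.0, 50.0], size=(120, 2))
X_test = rng.normal(loc=[5.0, 200.0], scale=[1.0, 50.0], size=(60, 2))

# Assumption: mean and standard deviation are estimated on the training
# subset ("Tr") and then applied unchanged to the test subset ("Te").
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

After this step each training feature has zero mean and unit variance, so scale-sensitive classifiers such as support vector machines and k-nearest neighbour are not dominated by the signals with the largest numerical range.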

Synthetic data generation
It is very unlikely to have access to real data of damaged states, as such events rarely happen for the subsystem under consideration, and such data would be unlikely to be shared for research purposes for confidentiality reasons. For the creation and collection of the synthetic data, Ramboll's in-house software ROSAP (Ramboll Offshore Structural Analysis Programs) [21] and the LACFlex aero-servo-elastic software [22] are used in a semi-coupled approach for a model of a turbine installed on a jacket foundation (cp. Figure 1). The ROSAP packages are used for the definition of the as-designed jacket geometry, the material properties, and the environmental loads of the jacket structure. The full model of the substructure is reduced into Craig-Bampton superelements [23], which approximate its dynamics by including only a limited number of deformation modes. The LACFlex aero-servo-elastic software allows an accurate modelling of the tower and the rotor-nacelle assembly, together with the control strategies in response to the stochastic turbulent wind loads acting on the turbine. The superelement files derived in ROSAP are then coupled with the wind turbine model in LACFlex for each combination of wind and wave loads, and integrated simulations are carried out.
Damage implementation
As a case study, the full loss of a cross-member of the jacket structure is simulated. The stiffness of a brace close to the seabed, which diagonally connects two of the legs of the jacket, is reduced to a value close to zero. This failure location is selected because it is associated with a high deviation of the global natural frequencies. It is worth mentioning that this approach, which is based on simulated data, carries some uncertainties. On the one hand, coupling the foundation dynamics via the superelement inherently introduces some small discrepancies in the substructure dynamics (up to 10 Hz) with respect to the full model [23]. On the other hand, the global damping of the structure is assumed to be the same as that defined at the design phase, thus neglecting the possible effect of the structural failure. These model uncertainties are acknowledged here but are not tackled in this paper. Nonetheless, the analysis is harmonized by extending the first assumption to the healthy model as well. The second assumption is judged acceptable, since the aim of the investigation is only to demonstrate the feasibility of the detection.
Simulations setup
Focusing on detection during normal operational conditions (power production) of the turbine, the fatigue limit state load case is used for the setup of the EOC, as specified in DLC 1.2 [24]. A set of representative load combinations is then investigated. Considering the geometry of the jacket and the implemented damage, only 4 wind directions and 12 wave directions are simulated, for 6 values of average wind speed at hub height (3 below and 3 above rated conditions). Nine realizations of the wind and wave time histories are processed for each loading combination, to guarantee that the detection algorithms can distinguish the response due to load stochasticity from that of the damaged status. Therefore, a total of 2,592 simulations per turbine status are performed.
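The total number of simulations follows directly from the factorial design stated above. The short sketch below enumerates the load combinations with placeholder integer indices (the actual direction and wind speed values are not given here, so none are assumed) and reproduces the count per turbine status.

```python
from itertools import product

# Placeholder indices only; the physical values of directions and
# wind speeds are not specified in the text and are not assumed here.
wind_directions = range(4)
wave_directions = range(12)
wind_speeds = range(6)      # 3 below rated, 3 above rated
realizations = range(9)     # stochastic seeds per load combination

load_cases = list(product(wind_directions, wave_directions, wind_speeds))
simulations = list(product(load_cases, realizations))

n_load_cases = len(load_cases)
n_per_status = len(simulations)
print(n_load_cases, n_per_status)  # 288 2592
```

Since both the healthy and the damaged model are run on the same grid, the design load case alone yields twice this number of ten-minute records.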
To account for the uncertainty associated with the real operational conditions of the turbine, the healthy and damaged structural responses are derived for changes in the wind farm flow conditions, as illustrated in Figure 2. The wind shear exponent (WS), which is potentially correlated with multiple factors [25], is varied from the design base specification to a minimum of 0.08 (WSL) and a maximum of 0.3 (WSU), changing the distribution of the normal wind profile. The 90th percentile of the effective turbulence intensity (TI) is used as reference, a value considered representative for the fatigue design calculations [26]. Based on the experience from a similar farm, an upper-bound (TIU) curve is defined to represent the missing extreme cases. Similarly, a lower-bound (TIL) curve is drawn corresponding to the 10th percentile of the effective turbulence intensity. These are then implemented in a Mann turbulence model to represent the fluctuating wind field.
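To illustrate how the wind shear exponent changes the wind profile, the sketch below assumes the power-law form of the normal wind profile, V(z) = V_hub (z / z_hub)^alpha. The exponents 0.08 and 0.3 are the bounds quoted above; the hub height and hub wind speed are hypothetical values chosen only for the example.

```python
import numpy as np

def wind_profile(z, v_hub, z_hub, alpha):
    """Power-law wind profile: mean wind speed at height z (assumed model)."""
    return v_hub * (np.asarray(z, dtype=float) / z_hub) ** alpha

# Hypothetical values, for illustration only.
Z_HUB = 100.0   # hub height in m
V_HUB = 10.0    # mean wind speed at hub height in m/s

heights = np.array([25.0, 50.0, 100.0, 150.0])
v_wsl = wind_profile(heights, V_HUB, Z_HUB, alpha=0.08)  # lower bound (WSL)
v_wsu = wind_profile(heights, V_HUB, Z_HUB, alpha=0.30)  # upper bound (WSU)

for z, lo, hi in zip(heights, v_wsl, v_wsu):
    print(f"z = {z:5.1f} m  WSL: {lo:5.2f} m/s  WSU: {hi:5.2f} m/s")
```

Both profiles coincide at hub height; a larger exponent lowers the wind speed below the hub and raises it above, which changes the load distribution over the rotor disc.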

Pre-analysis and datasets processing
Effect of environmental parameter variations
As outlined in [5], any structural health monitoring method employed in the detection task must be able to distinguish between signal variations related to EOC and those corresponding to a structural anomaly. The damage implemented was observed to lead to significant changes of the second modes and their natural frequencies with respect to the healthy status. It is then reasonable to expect that this has an impact on the loads at the tower base, which forms the interface between the turbine and the jacket structure. However, variations of the environmental conditions could affect the global response in a similar manner.
Therefore, a pre-analysis on a reduced number of simulations is performed to investigate the influence of the wind flow parameters (upper- and lower-bound values) on the structural response. The deviation of the loads at the tower interface with respect to the design load case is presented in Figure 3 for a below rated load combination. The box plots of the time histories of forces and moments in the fore-aft (y) and the side-side (x) direction are reported for the nine stochastic variations of wind and wave loadings. It can be noticed that the wind shear exponent does not seem to significantly affect any of the loads. On the contrary, a high level of turbulence intensity is associated with higher load ranges and standard deviations compared to the design base scenario. The opposite behaviour is observed for the low-turbulence level. Consequently, datasets for variations of the TI need to be fed to the machine learning models during their training, as opposed to variations of the wind shear, which can be fully captured by the structural dynamics of the design load case.
Detectability and signals deviation
The 50 Hz time histories output by the aero-hydro-servo-elastic simulations are post-processed into ten-minute minimum (min), maximum (max), average (mean), and standard deviation (std) values. At first, all the measurable signals are collected into the database of operational conditions, as they are potentially meaningful indicators of the structural failure.
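The reduction of a 50 Hz time history to ten-minute statistics can be sketched as below. This is a minimal illustration: the function name is hypothetical, incomplete trailing windows are simply discarded, and the population standard deviation is used, all of which are assumptions rather than details stated in the text.

```python
import numpy as np

FS_HZ = 50                      # sampling frequency of the simulation output
WINDOW_S = 600                  # ten minutes
SAMPLES_PER_WINDOW = FS_HZ * WINDOW_S  # 30000 samples per window

def ten_minute_stats(signal):
    """Return min, max, mean and std of each full ten-minute window."""
    signal = np.asarray(signal, dtype=float)
    n_windows = signal.size // SAMPLES_PER_WINDOW
    # Assumption: samples beyond the last full window are dropped.
    windows = signal[: n_windows * SAMPLES_PER_WINDOW].reshape(
        n_windows, SAMPLES_PER_WINDOW
    )
    return {
        "min": windows.min(axis=1),
        "max": windows.max(axis=1),
        "mean": windows.mean(axis=1),
        "std": windows.std(axis=1),   # population std (ddof=0), an assumption
    }

# Toy signal: one window of zeros followed by one window of ones.
signal = np.concatenate([np.zeros(SAMPLES_PER_WINDOW), np.ones(SAMPLES_PER_WINDOW)])
stats = ten_minute_stats(signal)
print(stats["mean"])  # [0. 1.]
```

Each simulated time history therefore contributes four features per signal and window, which is the resolution at which SCADA statistics are typically stored.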
The potential predictors of the damage are then investigated by quantifying the deviation of the statistics between the healthy and damaged status data, given the scatter due to the stochastic variation of the loading. A visual representation of this assessment is given in Figure 4, where trends in the mean of the tower top acceleration and in the damage equivalent load (DEL) of the tower base bending moments are presented. With respect to the DEL of the bending moments, only the MyF0 related to the side-side motion (Figure 4) deviates slightly from the healthy status for some wind-wave misalignment angles. The DEL of the MxF0 in the fore-aft motion of the tower base (Figure 4), as well as the DELs of the moments at the tower top, at the foundation and at the blade roots, remained mainly unaffected. On the contrary, it was observed that the tower top acceleration (Figure 4), the power output range (for the extreme turbulence case), the tower bottom acceleration and rotations, and some of the forces and moments in the drivetrain (mainly at the main bearing) record discrepancies from the healthy operating conditions throughout the load combinations considered (not all shown for brevity).

Data subsets for training and testing
A summary of the datasets and subsets considered is given in Table 2, whereas Table 3 details the sensors considered. Detection through standard SCADA signals (sensor setup S0) is preferred and attempted first. Initially, the feasibility of the status classification is investigated on the dataset of the design load combinations (dataset D0). Then the data derived for the different TI levels are added to this base scenario, as significant variations in the loadings and response of the structure are expected (datasets D1, D2 and D3). Each of the training subsets (Tr#) consists of 67% of the data of the current set, obtained by randomly selecting six out of the nine realizations for each load case. The remaining three realizations per load case, i.e. 33% of the data in the set, are collected in the subset Te33 and used for testing. Additionally, the mid-upper and mid-lower TI curves (Figure 2) are derived, and the statistics associated with these loadings are collected into the test sets Te3 and Te4, in addition to those for the upper- and lower-bound values (test sets Te1 and Te2). To achieve satisfactory classification results, the sensor setup is also redefined during the iterative training/testing steps (cp. Table 3). Additionally, an inclinometer and strain gauges are assumed to be installed at the tower base for the indirect and direct measurement of strain. The advantages of selecting these sensors' signals and location are: 1) avoiding the installation of monitoring devices below the water level, and 2) keeping the analysis as independent as possible from measurements from the drivetrain, which are highly dependent on the specific control strategies and optimization.
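The realization-based split described above (six of nine realizations per load case for training, the remaining three for testing) can be sketched as follows. The 288 load cases correspond to the 4 x 12 x 6 combinations of the simulation setup; the random seed and the index bookkeeping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only

N_LOAD_CASES = 288     # 4 wind directions x 12 wave directions x 6 wind speeds
N_REALIZATIONS = 9
N_TRAIN = 6

train_idx, test_idx = [], []
for case in range(N_LOAD_CASES):
    # Shuffle the nine realizations of this load case independently.
    order = rng.permutation(N_REALIZATIONS)
    train_idx += [(case, int(r)) for r in order[:N_TRAIN]]
    test_idx += [(case, int(r)) for r in order[N_TRAIN:]]

print(len(train_idx), len(test_idx))  # 1728 864
```

Splitting per load case, rather than over the pooled records, keeps every load combination represented in both subsets, so the test on Te33 isolates the effect of load stochasticity.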

Training and testing of the detection algorithms
Training and testing methods
Since there is no clear indication of which of the signals (in Table 3) could be the best indicator(s) of damage, the different sensor sets are used for the supervised classification to identify the best predictors based on the classifiers' performance. The algorithms are implemented with the Python machine learning package scikit-learn [27]. The parameter tuning approach and the training methods applied are briefly described in the following. Subsequently, the key performance indicators for the assessment of the detection quality are introduced.

Grid search cross-validation and training.
A grid search is applied to identify the best hyperparameters for each model. The combination giving the best overall performance on the folds of the training set is selected (cross-validation). These folds, i.e. subsets, are selected by applying a stratified k-fold approach [28], which divides the training set into splits with homogeneous proportions of healthy and damaged data samples. Each model, tuned with its optimal set of hyperparameters, is then fitted to the full set of training data.
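A minimal sketch of this tuning procedure with scikit-learn is given below. The synthetic data, the hyperparameter grid, the number of folds and the scoring metric are all placeholder assumptions; only the combination of a grid search with stratified k-fold cross-validation, followed by a refit on the full training set, reflects the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data standing in for a balanced healthy/damaged training subset.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.5, 0.5], random_state=0
)

# Hypothetical grid; the values actually searched are not given in the text.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
    "max_features": ["sqrt", 0.5],
}

# Stratified folds keep the healthy/damaged ratio equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X, y)

# With the default refit=True, the best combination is refitted on all of X, y.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other classifiers by swapping the estimator and its grid.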

Performance estimation.
The key performance indicators of the models' detection capability are derived from the confusion matrix (Table 4). These are the accuracy (acc), the true damage (positive) rate, here referred to as true detection rate (TDR), and the false damage (positive) rate, here referred to as false detection rate (FDR), as defined in equations (1) to (3), respectively. While the acc gives an overall indication of the quality of the classifier, the TDR refers to the models' ability to detect the damage. A TDR above 70% is considered satisfactory. The FDR, instead, gives an indication of the percentage of false alarms raised by the classifier. An FDR below 30% is considered satisfactory. Receiver operating characteristic (ROC) curves [29] and feature ranking plots are employed during the tuning and training phases to track the performance and validity of the prediction. Additionally, when performing classification, it is advisable to investigate the probability corresponding to the predicted category. This probability gives a measure of the confidence in the prediction and is presented in the so-called reliability curve [30].
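The three indicators can be computed from the four entries of the confusion matrix as sketched below. The formulas assume the standard definitions of accuracy, true positive rate and false positive rate, with "damaged" as the positive class; the counts in the example are invented for illustration and the function name is hypothetical.

```python
def detection_metrics(tp, fn, fp, tn):
    """Accuracy, true detection rate and false detection rate.

    tp: damaged records classified as damaged
    fn: damaged records classified as healthy
    fp: healthy records classified as damaged (false alarms)
    tn: healthy records classified as healthy
    """
    acc = (tp + tn) / (tp + fn + fp + tn)
    tdr = tp / (tp + fn)   # assumed equal to the true positive rate
    fdr = fp / (fp + tn)   # assumed equal to the false positive rate
    return acc, tdr, fdr

# Invented counts for a balanced test set of 200 records.
acc, tdr, fdr = detection_metrics(tp=80, fn=20, fp=10, tn=90)
satisfactory = tdr > 0.70 and fdr < 0.30   # thresholds quoted in the text
print(acc, tdr, fdr, satisfactory)
```

Note that FDR in this paper denotes the false alarm rate among healthy records, which differs from the "false discovery rate" used elsewhere in statistics.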

Results
The results of the training and testing of the classification algorithms are reported in Table 5. Unsatisfactory performance at the tuning and training stage is shown by the following models:
• Gaussian naïve Bayes (NB). Because of the strong (naïve) independence assumptions between the features, it is implemented here by selecting the first 9 principal components, which already account for more than 90% of the total variance (as explained in [31]). Yet, unacceptable performance is obtained, probably because the normal distribution assumption for the numerical predictors is violated.
• K-Nearest Neighbour (KNN). Despite the broad and finely stepped range of K given during the cross-validation tuning and fitting, this algorithm, given its implementation in [27], fails to find boundaries separating the two classes.
On the contrary, results worthy of further investigation are obtained by:
• Logistic Regression (LR). Logistic regression is a linear method that models a binary dependent variable, where the predictions are transformed using the logistic (sigmoid) function. As for linear regression, the model can overfit if there are multiple correlated inputs [32]. This does not seem to happen here, despite the high dependency of some of the features.
• Support Vector Machine (SVM). The SVM approach aims to find a line, surface or hypersurface that separates the classes. When applied to the data, it fails to find a linear hyperplane for a correct classification. On the other hand, by projecting the data into a higher-dimensional space defined by polynomial (poly) and Gaussian radial basis functions (rbf), the models manage to capture the nonlinearity of the classification problem, at the cost of a higher computational time.
• Random Forest (RF). This method fits a number of decision tree classifiers on various subsamples of the learning datasets, providing as output the average of the single trees' predictions. This 'trick', together with the limitation of the number of features for the trees and of their depth, aims to control over-fitting.
By testing the LR, SVM and RF classifiers on the test set for stochastic variations of the environmental conditions, it is observed that, for below rated conditions, their predictions are generally satisfactory, successfully distinguishing the normal operating conditions from the damaged status (cp. Table 5). In contrast, substandard performance is obtained for above rated conditions. This could be explained by the higher fluctuation of the tower top acceleration in above rated operating conditions.
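The five classifier families discussed above can be instantiated in scikit-learn as sketched below. The data are synthetic and every hyperparameter value is a placeholder, not a tuned value from the study. For the naïve Bayes pipeline, the sketch keeps the principal components needed to explain 90% of the variance, which is an assumed stand-in for the fixed choice of 9 components described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the healthy/damaged feature table.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=0)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

classifiers = {
    # PCA decorrelates the features before the naive independence assumption.
    "NB": make_pipeline(
        StandardScaler(), PCA(n_components=0.90, svd_solver="full"), GaussianNB()
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM-poly": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3)),
    "SVM-rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    # Limiting depth and features per split is what controls over-fitting in RF.
    "RF": RandomForestClassifier(
        n_estimators=100, max_depth=6, max_features="sqrt", random_state=0
    ),
}

scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te) for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name:9s} accuracy = {acc:.2f}")
```

The accuracies printed for this synthetic problem carry no information about the performance reported in Table 5; the sketch only shows how the models are set up on a common footing.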
In a next step, the algorithms are tested on the subsets of data corresponding to the response of the structure to variations of the TI, according to the curves in Figure 2 (test subsets Te1 to Te4). It is evident from Table 5 that none of the models can perform such a generalization. Consequently, a re-training iteration is carried out by updating the training subset, either by increasing the number of data samples considered (varying the D datasets) or by changing the number and/or type of sensors employed (varying the S sensor setup per dataset). For brevity, the results are reported in the following only for the RF classifier, in Table 6. The results of LR and SVM are given in the Appendix.
Table 5. On the left, the performance of the best classifiers for the dataset D0 (trained on Tr0), sensor setup S0. On the right, the colour legend (targeted performances in green).

Varying training datasets.
The algorithms are first re-trained by adding the data samples corresponding to the varying turbulence levels to the dataset of the design load case (datasets D1, D2 and D3). Acceptable results are achieved by the RF classifier only, while LR and SVM do not exceed 60% accuracy. The targeted accuracy and detection rate are generally achieved for predictions on the test set Te33 and on the mid-low TI conditions (test set Te4), when adding either the low-TI or both the extreme-TI data samples to the training set (Tr2 and Tr3, respectively). However, this achievement is associated with a high false alarm rate (up to 45% for above rated conditions).
Table 6. RF classifier performance for the datasets D1, D2 and D3, in combination with the sensor setups S0, S1, S2 and S3.

Varying sensor sets.
By adding the ten-minute statistics of the time signals from the tower base inclinometer to the SCADA data (sensor setup S1), significant improvements are achieved by RF. Because of the still high number of false alarms for TI levels above the 90th percentile curve (test sets Te1 and Te2), the classifier is re-trained by adding the statistics of the tower base strain gauge signals (sensor setup S2), including the DELs as additional features. While only a slight improvement is recorded again for RF (mainly at below rated conditions), it is interesting to observe that LR and linear SVM accomplish exceptionally good results (cp. Table A and Table B in the Appendix). This is identified as a symptom of the instability of the models due to the introduction of collinearity into the analysis. The rotation and bending moment signals are, indeed, strongly correlated. The effect of this phenomenon is somewhat reduced in the RF classification thanks to the random selection of a reduced number of features at each node. In line with this logic, and recognizing a generally high correlation of the acceleration signals with the power, the shaft rotational speed and the wind speed (R values above 0.9), a re-training based on the inclinometer statistics instead of those of the tower top accelerometer is attempted (sensor setup S3). The results confirm the hypothesis, showing improved performance for the RF classifier and unsatisfactory results for the LR and SVM classifiers. Finally, by extending the training subset with data from the lowest TI level (Tr2), it is observed that the targeted performance is achieved on all the test sets (Te33, Te1, Te3 and Te4) for the RF classifier, at both below and above rated conditions. It is worth noting that the SVM classifier also achieves generally satisfactory performance for this dataset-sensor combination (D2-S3). In particular, an acceptable detection rate and a low number of false alarms are achieved on the test sets Te33 and Te4, at below rated conditions, by employing an rbf kernel transformation (cp. Table C in the Appendix).
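A simple screening for the collinearity discussed above is to list the feature pairs whose correlation coefficient exceeds the 0.9 level mentioned in the text. The sketch below does this on synthetic data; the feature names and the way the signals are generated are hypothetical and only mimic a rotation signal that is almost proportional to a bending moment.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, R) for all feature pairs with |R| > threshold."""
    r = np.corrcoef(X, rowvar=False)   # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

rng = np.random.default_rng(1)
n = 500
# Hypothetical ten-minute statistics.
rotation = rng.normal(size=n)
moment = 2.0 * rotation + rng.normal(scale=0.05, size=n)  # nearly proportional
wind_speed = rng.normal(size=n)                            # independent signal

names = ["tower_base_rotation", "tower_base_moment", "wind_speed"]
X = np.column_stack([rotation, moment, wind_speed])

pairs = highly_correlated_pairs(X, names)
print(pairs)
```

Dropping one member of each flagged pair before fitting linear models such as LR or a linear SVM is one common way to limit the instability attributed to collinearity; tree ensembles are less sensitive because each split sees only a random subset of the features.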

Discussion
The confidence of the classifiers in their predictions is discussed based on the reliability curves in Figure 5. The predicted probabilities for the damaged class are divided into bins along the x-axis. The number of damaged events is then counted for each bin and normalized on the y-axis (observed relative frequency). A well-calibrated binary classifier should classify the samples such that, for instance, among the samples to which a probability of 0.9 is assigned, approximately 90% of the cases belong to the damaged class. Therefore, the more reliable a forecast is, the closer the points will lie to the main diagonal ("perfectly calibrated"). For points of the curve below the diagonal, the model has over-forecast the probability, while above the diagonal the forecast probability is too small. The reliability curves of the RF classifiers present their typical sigmoid shape (Figure 5) [33]. This means that the algorithm is overconfident for small predicted probabilities and underconfident for large predicted probabilities. This behaviour is common for RF, because the averaged predictions from the base-level trees can have high variance due to feature sub-setting. A slightly better confidence of the RF prediction is obtained when the classifier is trained on D3-S0 and tested on medium-high TI levels (Te3). Applying specific calibration techniques seems necessary for the predictions associated with high-TI levels, where a higher under-confidence of the RF models is shown by the histogram peaks moving further away from 0 and 1 (Figure 5, right). Therefore, although the derivation of the optimal algorithm is not within the scope of this paper, a re-calibration activity is suggested for further study, as it could potentially improve the RF classifiers' performance [33]. When trained/tested on the D2-S3 combination, it is shown in Figure 5 that the RF-based algorithms are able to extend their prediction to all operating conditions, even for significant variations of the TI levels.
Although the SVM models do not exhibit as good a performance as the RF classifiers, the rbf-SVM classifier in the combination D2-S3 for below rated conditions is very confident in its predictions (Figure 5, right).
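A reliability curve of the kind discussed above can be computed with scikit-learn as in the following sketch. The data are synthetic, the number of bins is an arbitrary choice, and label 1 is assumed to denote the damaged class; the sketch does not reproduce Figure 5.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in; label 1 plays the role of the damaged class.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
p_damaged = rf.predict_proba(X_te)[:, 1]   # predicted probability of damage

# frac_damaged: observed relative frequency of damage in each bin (y-axis)
# mean_pred:    mean predicted probability in each bin (x-axis)
frac_damaged, mean_pred = calibration_curve(y_te, p_damaged, n_bins=10)

for x, f in zip(mean_pred, frac_damaged):
    position = "above" if f > x else "on/below"
    print(f"predicted {x:.2f} -> observed {f:.2f} ({position} the diagonal)")
```

If re-calibration is pursued, scikit-learn's `CalibratedClassifierCV` offers sigmoid and isotonic mappings that could serve as a starting point, although the text does not state which technique would be used.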

Conclusions and future work
The analysis carried out highlights the feasibility of an approach for the indirect monitoring of a structural failure through the low-frequency statistics of the operational data from the offshore wind turbine. Supervised algorithms are trained on ten-minute SCADA data derived from simulating the structural response and power performance associated with the healthy-status jacket structure and with the disconnection of one of the jacket brace members. The overfitting of the algorithms is controlled by applying a cross-validation approach and by extensively testing their performance on subsets for variations of the stochastic representation of the wind and wave time histories and of the turbulence intensity levels.
It is observed that, although the tower top accelerometer can give indications of the presence of the structural damage, its signals are highly affected by variations in the environmental conditions, which makes the classification task harder for the algorithms. Acceptable accuracy and detection rate of the tree-based classifiers are obtained mainly for below rated conditions. However, the dataset for the algorithm training must be extended with additional data samples to correctly classify the structural integrity status over a wide range of turbulence intensity levels. Furthermore, the high number of false alarms recorded can be reduced either by prior information on the turbulence intensity level (installing a specific sensor for this purpose) or, better, by replacing the information from the tower top accelerometer with that obtained from an inclinometer positioned at the tower base. In this way, significant improvements are achieved in the detection skills of the random forest classifier at all operating conditions, and of the support vector classifier for the below rated case. This suggests the use of this sensor setup for further analysis of this type of damage.
It should be noted that the classification models developed in this paper depend on either the availability of data associated with the damaged structure, or their simulation through a true digital twin model of the structure [34]. Furthermore, a broad set of isolated ten-minute SCADA data is required for the off-line training and testing of the algorithms, to ensure satisfactory detection performance across the various operating and environmental conditions. The practical value of this approach for the detection of the status of an operational system can be achieved by extending the analysis to multiple damaged conditions, either by increasing the number of labels classified or the number of classifiers. Therefore, future work needs to prove the applicability of this detection method by addressing the following:
• understand the limits of an evaluation based on simulated data and establish the requirements for the integration of real data from the operating turbine.
• investigate the capabilities of unsupervised (or semi-supervised) approaches for the creation of normal behaviour models of the turbine.
• handle the detection of multiple failure modes, failure locations, and their potential simultaneous occurrence.

Appendix
The results of the algorithms based on logistic regression (LR) and support vector machine (SVM) are presented here. For brevity, the performance on the test sets is reported in terms of accuracy (acc) only. However, the confusion matrix (CM) of each classification test is given to allow the other performance indicators to be derived (see Table 4). Overfitting is identified when a strong collinearity is introduced among the classification predictors, i.e. when both the inclinometer and the strain gauge measurements are used (sensor setup S2). Indicators of this phenomenon are: i) the improvement of the accuracy for both the LR and SVM classifiers (cp. Table A and Table B), and ii) the switch to a linear separation function for the SVM classifier, while more complex kernel transformations are employed for all the other sensor combinations. Finally, the performance of the SVM-based classifiers on the combination D2-S3 is reported in Table C. At below rated conditions, the SVM model implementing an rbf kernel scores a satisfactory accuracy (above 90%), detection rate (of about 90%) and alarm rate (below 10%) on the Te33 and Te4 test sets.
Table A. LR classifiers performance on the dataset D0, sensor setups S0, S1, S2 and S3.
Table B. SVM classifiers performance on the dataset D0, sensor setups S0, S1, S2 and S3.