Cost-effective, vision-based multi-target tracking approach for structural health monitoring

The displacement response of structures is an important parameter in structural health monitoring (SHM). Displacement responses can be applied in both structural performance monitoring and structural dynamic characteristics monitoring. To overcome the shortcomings of traditional contact sensors, a vision-based multi-point structural displacement measurement system equipped with an inexpensive surveillance camera and a consumer camera was developed herein. In addition, to reduce the computing time of target tracking, an improved region-matching algorithm based on the prior knowledge of structural deformation was proposed. Numerical results revealed that the improved region-matching algorithm could save computing time without reducing location accuracy. Moreover, static and dynamic loading tests were conducted on a scale model of a suspension bridge to validate the effectiveness of the proposed vision-based measurement system. Displacement responses and modal parameters obtained from the vision-based measurement system were compared with those of traditional contact sensors, and a satisfactory consistency was obtained. Hence, the proposed vision-based measurement system could be a cost-effective alternative to conventional displacement sensors and accelerometers for SHM.

Keywords: modal parameter identification, adaptive hierarchical localization algorithm

Introduction
Bridge structures are subjected to various external loads, such as vehicles, wind, and earthquakes. Structural health monitoring (SHM) is used to monitor and evaluate the performance of bridges, detect potential safety hazards in time, and guide corresponding maintenance and reinforcement measures [1,2]. The displacement response of a structure under load is an important parameter for SHM because it reflects the overall stiffness of the structure.
Traditional contact displacement sensors, such as dial indicators and linear variable differential transformers (LVDTs), are generally installed on the bottom face of bridges. Although contact displacement sensors yield accurate measurements, brackets must often be installed under bridges to provide fixed reference points, and wiring is required for data transmission and instrument power supply; hence, their operation in practice is complicated. Total stations and levels can only measure static displacements of a structure and cannot capture real-time displacement responses under different loads [3][4][5].
To overcome the shortcomings of traditional contact sensors, different non-contact measurement methods, such as GPS, laser Doppler vibrometry (LDV), radar interferometry, and vision-based methods, have been developed. GPS technology can measure both static and dynamic responses with a maximum measurement distance of up to 30 km; however, it is mainly restricted to long-span bridges and high-rise structures because of its low measurement accuracy and low sampling frequency [6][7][8]. LDV has high measurement accuracy, but it can only measure structural displacements at short range. Moreover, LDV is costly, which makes multi-point synchronous displacement measurement very difficult [9][10][11]. Radar interferometry can achieve remote measurements with good resolution; however, it requires reflective surfaces mounted on the structure [12]. With the continuous development of computer vision technologies and image acquisition equipment, structural displacement monitoring methods based on computer vision have attracted great attention in civil engineering because they can be applied to large-range structural monitoring with high accuracy [12]. In these methods, videos of the monitored structures are captured by cameras, the motion trajectories of the measuring points are calculated by target tracking, and the displacements of the monitored structures are then determined by a scaling factor converting image coordinates to physical coordinates. Cameras are generally fixed far from the structures, eliminating the fixed support points required by contact displacement monitoring methods. In addition, because the field of view (FOV) of a camera can cover multiple measurement areas on a structure, multi-point measurements are easy to realize at low cost.
The advantages of vision-based methods have been verified in numerous applications [13][14][15][16][17][18][19][20]. For example, Busca et al [13] measured the vertical displacement of a bridge by tracking a target plate fixed on the bridge using a vision sensor.
A square optical target was used for tracking, and the dynamic displacements of multiple points on the bridge were measured simultaneously. Ribeiro et al [15] measured the dynamic displacement of a railway bridge based on the random sample consensus algorithm. Based on a robust orientation code matching algorithm, a vision sensor system for real-time displacement measurement was developed by tracking a natural target on the bridge. Feng et al [16] proposed an advanced template matching algorithm for real-time displacement extraction from video images. Bhowmick et al [17] measured full-field displacement responses of continuously vibrating edges of structural members by calculating the horizontal and vertical components of the Lagrangian displacement of each edge pixel in each frame with sub-pixel accuracy.
Optical targets, such as LED lamps and laser spots, are often tracked to obtain high-precision measurement results even in a dark environment. Dong et al [21] used LED lamps and black spots as tracking targets for structural dynamic displacement measurements of a simply-supported rectangular steel beam. Oh et al [22] measured the vibration of a three-story shear frame by tracking multi-LED targets. Ye et al [23] compared three types of image processing algorithms for structural dynamic displacement measurements and measured the displacement influence lines of an arch bridge by tracking optical targets. Wahbeh et al [24] developed a target equipped with low-power LEDs and measured the displacement of a long-span bridge with the proposed vision-based system. Park et al [25] used a motion capture system equipped with colorful LED lamps to measure three-dimensional displacements of a three-story frame. Maksymenko et al [26] measured the displacement of a concrete beam by tracking a laser spot mounted on the structure. Zhang et al [27] proposed a new method based on laser projection and image processing technologies for monitoring medium- and small-span bridges. Miguel et al [28] developed a novel sensing device equipped with a laser beam, a video camera, and LED lights to monitor displacements and rotations of bridges and structures. Tian et al [29] conducted field, remote, and multi-point deflection measurements of the Wuhan Yangtze River Bridge by tracking six LED targets mounted on the structure.
In the present work, to improve efficiency and reduce cost, a vision-based multi-point structural displacement measurement system equipped with an inexpensive surveillance camera and a consumer camera was designed, and an improved template matching algorithm was proposed. Both static and dynamic displacements were measured, and modal parameters were identified in a non-contact, cost-effective, and time-saving way. Static and dynamic loading tests were conducted on a 1/30 scale model of the Taohuayu Yellow River Highway suspension bridge to validate the effectiveness of the proposed vision-based measurement system. Displacement results obtained from the vision-based measurement system and LVDTs, as well as modal parameter results obtained from the vision-based measurement system and accelerometers, were compared.

Construction of vision-based measurement system
A vision-based measurement system is generally composed of an optical target consisting of multiple LED lamps, a camera, and a computer (figure 1). Multiple LED lamps are fixed laterally along the longitudinal direction of a bridge, and the camera is set on one side of the bridge. By adjusting the position of the camera, all measuring points are placed in its FOV. The position of the optical target changes with the deformation of the bridge.
The bridge vibration video is captured by the camera and stored in the computer. The video is then converted into grayscale images frame by frame. LED lamps are detected as multilight spots in grayscale images. The displacement of a measuring point is calculated by tracking the change in the center coordinates of a light spot.
All light spots in an image are first identified and marked, and the template matching algorithm is performed for each spot.
In figure 2, a single spot is denoted $T_i^j$, where $j$ is the index of the measuring point within a single image and $i$ is the frame number of the image in the video. The center coordinates of target $T_1^j$ in the first frame are $(x_1^j, y_1^j)$, whereas the center coordinates of the same target $T_i^j$ in the $i$th frame are $(x_i^j, y_i^j)$. Hence, the pixel displacement of the target can be calculated as

$$\Delta x = x_i^j - x_1^j, \qquad \Delta y = y_i^j - y_1^j.$$

For most bridges, especially small and medium-sized ones, it can be assumed that there is no transverse or longitudinal deformation, since the vertical deformation is much larger than the deformation in the other two directions; moreover, the vertical displacement under the design load is at the millimeter level, so the vertical position of the LED lamps changes only on a small scale. Taking $\Delta x = 0$, the pixel displacement of target $T_i^j$ simplifies to

$$d = \Delta y = y_i^j - y_1^j.$$

To obtain structural displacements from pixel displacements, a scaling factor is used to establish the relationship between pixel coordinates and physical coordinates [16]. In figure 3, when the optical axis of the camera is perpendicular to the object surface, the scaling factor can be calculated as

$$SF = \frac{H}{h} \quad \text{or} \quad SF = \frac{d_{pixel}\, Z}{f},$$

where $H$ is the physical length on the object plane, $h$ is the corresponding pixel length on the image plane, $d_{pixel}$ is the pixel size (mm pixel$^{-1}$), $Z$ is the distance between the camera and the object, and $f$ is the focal length. The actual displacement is then

$$D = d \times SF,$$

where $SF$ is the scaling factor between image and physical dimensions.
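As a concrete illustration, the pixel-to-physical conversion above can be sketched in a few lines of NumPy. The camera parameters below are illustrative values, not those of the test setup:

```python
import numpy as np

def scaling_factor(d_pixel_mm, Z_mm, f_mm):
    # SF = d_pixel * Z / f (mm per pixel), valid when the optical axis
    # is perpendicular to the object surface
    return d_pixel_mm * Z_mm / f_mm

def vertical_displacement(y_centers_px, sf):
    # Physical vertical displacement of one target across frames,
    # relative to the first frame (Delta x is assumed to be zero)
    y = np.asarray(y_centers_px, dtype=float)
    return (y - y[0]) * sf

# Illustrative numbers: 3.45 um pixels, Z = 24 m, f = 50 mm
sf = scaling_factor(d_pixel_mm=0.00345, Z_mm=24000.0, f_mm=50.0)
disp_mm = vertical_displacement([120.0, 121.5, 119.2], sf)
```

With these assumed parameters, each pixel of vertical motion corresponds to roughly 1.66 mm of physical displacement.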

Improved template matching algorithm based on leaping computation
The traditional template matching algorithm searches for a template in an image by a global correlation calculation, and its matching criterion is the normalized correlation coefficient $N(i,j)$:

$$N(i,j)=\frac{\sum_{m=1}^{M}\sum_{n=1}^{N}\left[M_{i,j}(m,n)-\bar{M}_{i,j}\right]\left[T(m,n)-\bar{T}\right]}{\sqrt{\sum_{m=1}^{M}\sum_{n=1}^{N}\left[M_{i,j}(m,n)-\bar{M}_{i,j}\right]^{2}\;\sum_{m=1}^{M}\sum_{n=1}^{N}\left[T(m,n)-\bar{T}\right]^{2}}},$$

where $M_{i,j}(m,n)$ is the template selected from the first image after translation to position $(i,j)$, $\bar{M}_{i,j}$ is the average gray value of all pixels in $M_{i,j}(m,n)$, $T(m,n)$ is the image at time $t$, $\bar{T}$ is the average gray value of the pixels in $T(m,n)$ covered by $M_{i,j}(m,n)$, $(i,j)$ are the coordinates of the template translation, and $M$ and $N$ are the width and height of the template, respectively. When $N(i,j)=1$, a perfect match is attained. The traditional template matching algorithm searches for the target pixel by pixel over the whole matching region and calculates the correlation coefficient $N(i,j)$ at every position, which is computationally expensive. Therefore, as shown in figure 4, an adaptive hierarchical localization algorithm based on leaping computation is proposed in this paper to reduce the cost and time of the matching calculation. The step size and the matching region of each layer are reduced gradually. The steps of the proposed algorithm are described below.
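For reference, the correlation criterion and the traditional exhaustive search can be sketched as follows (a minimal NumPy version, not the authors' exact implementation):

```python
import numpy as np

def ncc(template, patch):
    # Zero-mean normalized correlation coefficient; equals 1 at a perfect match
    t = template - template.mean()
    p = patch - patch.mean()
    denom = np.sqrt((t * t).sum() * (p * p).sum())
    return (t * p).sum() / denom if denom > 0 else 0.0

def match_full(image, template):
    # Traditional search: evaluate N(i, j) at every pixel of the matching region
    M, N = template.shape
    H, W = image.shape
    best, best_ij = -2.0, (0, 0)
    for i in range(H - M + 1):
        for j in range(W - N + 1):
            score = ncc(template, image[i:i + M, j:j + N])
            if score > best:
                best, best_ij = score, (i, j)
    return best_ij

rng = np.random.default_rng(0)
img = rng.random((40, 40))
tpl = img[12:22, 7:17].copy()   # template cut from the first image
loc = match_full(img, tpl)      # recovers the template position
```

The double loop over every pixel is exactly the cost that the hierarchical algorithm below is designed to avoid.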
Step 1: Take the upper-left corner pixel of the original image (matching region R1) as the origin and D1 as the calculation step size. Calculate the correlation coefficient at each discrete pixel in this layer. Select a new matching region R2 centered at the point (x1, y1) with the largest correlation coefficient, with S1 as the radiation distance.

Step 2: Take the upper-left corner pixel of matching region R2 as the origin and D2 as the calculation step size. Calculate the correlation coefficient at each discrete pixel in this layer. Select a new matching region R3 centered at the point (x2, y2) with the largest correlation coefficient, with S2 as the radiation distance.

Step 3: Take the upper-left corner pixel of matching region R3 as the origin and D3 as the calculation step size. Calculate the correlation coefficient at each discrete pixel in this layer. Select a new matching region R4 centered at the point (x3, y3) with the largest correlation coefficient, with S3 as the radiation distance.

Step 4: Take the upper-left corner pixel of matching region R4 as the origin and D4 = 1 as the calculation step size. Calculate the correlation coefficient at every pixel in this layer. Select the pixel coordinates (x4, y4) with the maximum correlation coefficient as the best matching point.
It should be noted that the selection of step size mainly depends on the matching region size and the template size. The step size must be less than the template size so that no pixels are missed during correlation calculations. Moreover, in order to improve calculation efficiency, the step size should not be less than 1/60 of the matching region size. Hence, based on these two basic rules, the step size can be selected as 1/30-1/60 of the matching region size.
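The four-layer procedure above can be sketched as a coarse-to-fine search. The step sizes and radiation distances below are illustrative defaults chosen in the spirit of the rules just described, not values taken from the paper:

```python
import numpy as np

def ncc(template, patch):
    # Zero-mean normalized correlation coefficient
    t = template - template.mean()
    p = patch - patch.mean()
    d = np.sqrt((t * t).sum() * (p * p).sum())
    return (t * p).sum() / d if d > 0 else 0.0

def hierarchical_match(image, template, steps=(8, 4, 2, 1), radii=(16, 8, 4)):
    # Each layer scans its matching region with step D_k, then recentres a
    # smaller region of radiation distance S_k around the best point found;
    # the last layer scans with D = 1 pixel to obtain the final location.
    M, N = template.shape
    H, W = image.shape
    r0, r1, c0, c1 = 0, H - M, 0, W - N       # current matching region
    best_ij = (0, 0)
    for k, D in enumerate(steps):
        best = -2.0
        for i in range(r0, r1 + 1, D):
            for j in range(c0, c1 + 1, D):
                s = ncc(template, image[i:i + M, j:j + N])
                if s > best:
                    best, best_ij = s, (i, j)
        if k < len(radii):                     # shrink the region for the next layer
            S = radii[k]
            r0, r1 = max(0, best_ij[0] - S), min(H - M, best_ij[0] + S)
            c0, c1 = max(0, best_ij[1] - S), min(W - N, best_ij[1] + S)
    return best_ij

# Smooth synthetic image with a known target position (illustrative only)
yy, xx = np.mgrid[0:60, 0:60]
img = np.exp(-((xx - 20.0) ** 2 + (yy - 33.0) ** 2) / 120.0)
tpl = img[28:38, 15:25].copy()
loc = hierarchical_match(img, tpl)
```

Because each layer only evaluates a sparse grid, the total number of correlation evaluations is far smaller than in the exhaustive search, while the final unit-step layer preserves pixel-level accuracy.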
In order to verify the accuracy of the proposed algorithm, template matching was conducted on two images with different resolutions: (a) original image size = 300 × 300 pixels with a template of 10 × 10 pixels and (b) original image size = 1536 × 1536 pixels with a template of 50 × 50 pixels. The adaptive hierarchical localization algorithm was carried out with one, two, three, and four layers, and the corresponding results are presented in table 1. It is noticeable that the adaptive hierarchical localization algorithm greatly reduced the computing time.
According to the vibration range of the structure, a calculation program was written to locate the template within a small vertical-strip matching region. For example, in figure 5, target objects of 25 × 25 pixels and matching regions of 225 × 30, 425 × 30, 625 × 30, and 825 × 30 pixels were trimmed from an image of 1080 × 1920 pixels, and the corresponding object location results are listed in table 2.
When the matching area size was doubled, tripled, and quadrupled, the computing time increased by about 13%, 20%, and 50%, respectively.

Spot center location method based on Gaussian surface fitting
The template matching method generates pixel-level displacements, which are not accurate enough for practical applications; hence, the Gaussian surface fitting method is used to achieve subpixel-level displacements.
The bridge deformation video is converted into grayscale images frame by frame. Let $g(x,y)$ denote the gray value of the pixel at $(x,y)$; an image with $M \times N$ pixels can then be expressed as the matrix

$$G=\begin{bmatrix} g(1,1) & g(1,2) & \cdots & g(1,N)\\ g(2,1) & g(2,2) & \cdots & g(2,N)\\ \vdots & \vdots & \ddots & \vdots\\ g(M,1) & g(M,2) & \cdots & g(M,N) \end{bmatrix}.$$

The LED lamps fixed laterally along the longitudinal direction of the bridge are detected as light spots in the grayscale images. In figure 6, the gray value of a light spot decreases gradually from the center to the edges; thus, the ideal distribution of the gray values is a two-dimensional Gaussian distribution (figure 7). To accurately measure the change in the position of a light spot, its center must first be located precisely.
When the FOV is fixed, the most effective way to improve the location accuracy is to enhance the resolution of the image acquisition equipment (increase the number of pixels). However, to track the center position of a light spot with the consumer cameras, smartphones, and surveillance cameras available on the market, the accuracy of target location in an image needs to be improved by an image processing algorithm. In the fields of computer vision and pattern recognition, numerous sub-pixel location algorithms have been developed to locate the center of a light spot. Spot center extraction methods can be divided into intensity-based and threshold-based algorithms [30]. Intensity-based algorithms include the gray centroid method, the Gaussian surface fitting method, and the paraboloid fitting method [31,32], whereas threshold-based algorithms include the ellipse fitting method and the Hough transform method. Intensity-based algorithms have been reported to be more accurate than threshold-based algorithms [30]. Among the intensity-based algorithms, the gray centroid method has weak anti-noise ability, which limits its accuracy; although the weighted centroid method improves the anti-noise ability, its stability is still insufficient. The Gaussian surface fitting method has high precision and good stability, although its computation is relatively complex. The paraboloid fitting method is a simplified form of Gaussian surface fitting and has relatively poor accuracy and stability. Therefore, the Gaussian surface fitting method is used here to calculate the center position of a light spot.
The least squares method is used for the two-dimensional Gaussian fitting, and the maximum gray value point of the fitted surface is taken as the gray center point. The two-dimensional Gaussian distribution can be expressed as

$$I(x,y)=A\exp\left[-\frac{(x-x_0)^2}{2\sigma_x^2}-\frac{(y-y_0)^2}{2\sigma_y^2}\right], \tag{8}$$

where $I(x,y)$ is the illuminance of the light spot at $(x,y)$, $A$ is the light intensity amplitude, $(x_0, y_0)$ is the spot center, and $\sigma_x$ and $\sigma_y$ are the standard deviations in the $x$- and $y$-directions, respectively. Taking logarithms on both sides of equation (8) gives

$$\ln I(x,y)=\ln A-\frac{(x-x_0)^2}{2\sigma_x^2}-\frac{(y-y_0)^2}{2\sigma_y^2}. \tag{9}$$

Introducing $F=\ln I(x,y)$, $a=-\frac{1}{2\sigma_x^2}$, $b=\frac{x_0}{\sigma_x^2}$, $c=-\frac{1}{2\sigma_y^2}$, $d=\frac{y_0}{\sigma_y^2}$, and $e=\ln A-\frac{x_0^2}{2\sigma_x^2}-\frac{y_0^2}{2\sigma_y^2}$, equation (9) becomes the linear model

$$F(x,y)=ax^2+bx+cy^2+dy+e, \tag{10}$$

where $[a,b,c,d,e]$ are the parameters to be estimated and $(x_i, y_i, F(x_i, y_i)) \in B$ represents all points in the template used for the Gaussian fitting. According to equation (10), the objective function is

$$\varepsilon^2=\sum_{(x_i,y_i)\in B}\left[ax_i^2+bx_i+cy_i^2+dy_i+e-F(x_i,y_i)\right]^2, \tag{11}$$

and the parameters $[a,b,c,d,e]$ are estimated by the least squares method.
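The log-linearized least-squares fit can be sketched as follows (a minimal NumPy version; the synthetic spot and its centre are illustrative values):

```python
import numpy as np

def gaussian_center(patch):
    # Fit ln I = a x^2 + b x + c y^2 + d y + e by least squares, then recover
    # the sub-pixel centre as x0 = -b / (2a), y0 = -d / (2c)
    ys, xs = np.nonzero(patch > 0)            # use strictly positive pixels
    F = np.log(patch[ys, xs])
    B = np.column_stack([xs**2, xs, ys**2, ys, np.ones_like(xs)])
    a, b, c, d, e = np.linalg.lstsq(B, F, rcond=None)[0]
    return -b / (2 * a), -d / (2 * c)

# Synthetic spot with a known off-grid centre at (7.3, 6.6)
yy, xx = np.mgrid[0:15, 0:15]
spot = np.exp(-((xx - 7.3) ** 2 / 8.0 + (yy - 6.6) ** 2 / 10.0))
x0, y0 = gaussian_center(spot)
```

On noise-free data the fit recovers the off-grid centre essentially exactly; on real images the accuracy is limited by noise in the spot intensities.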
Setting $\frac{\partial \varepsilon^2}{\partial a}=0$, $\frac{\partial \varepsilon^2}{\partial b}=0$, $\frac{\partial \varepsilon^2}{\partial c}=0$, $\frac{\partial \varepsilon^2}{\partial d}=0$, and $\frac{\partial \varepsilon^2}{\partial e}=0$ yields the system of normal equations (12), which can be written in matrix form as equation (13). The coefficient matrix of equation (13) is positive definite, so the system can be solved by the Householder transformation method. The center position and light intensity amplitude of the light spot are then recovered as

$$x_0=-\frac{b}{2a},\qquad y_0=-\frac{d}{2c},\qquad A=\exp\left(e-\frac{b^2}{4a}-\frac{d^2}{4c}\right).$$

Laboratory test of a self-anchored suspension bridge scale model
Seven LED lamps were fixed on the main span at equal distances along the span direction of the bridge. One accelerometer was attached at each LED lamp position, and one LVDT was fixed at the middle of the main span. One LED lamp and one LVDT were fixed at the middle of each of the two side spans. Measuring points from south to north were numbered P1–P9 (figure 8).
The sampling frequency of contact sensors (LVDTs and accelerometers) was 100 Hz. Displacement and acceleration responses were collected by a dynamic signal testing and analysis system (Model #DH5902N, Donghua Testing Technology Co. Ltd, China). Displacement responses of the model bridge were obtained by tracking the changes in the center positions of LED lamps using a DAHUA network surveillance camera and a Sony 4K camera. The specific model and technical parameters of each data acquisition equipment are listed in table 4.
Static loading tests were mainly carried out on the two side spans, and a static concentrated load (two people) was applied in two levels at the middle of each side span (numbered S1 and S2). During the dynamic loading tests, the bridge was excited in two ways: continuous jumping and running of one person. The continuous jumping scenario was numbered J1, and the low-speed and high-speed running scenarios across the bridge were numbered R1 and R2, respectively. The specific descriptions of each scenario are listed in table 5. The DAHUA camera and the Sony camera were fixed on stable ground in front of the north side of the model bridge, and the straight-line distance between the cameras and the middle of the bridge was 24.0 m. The cameras were mounted on tripods, and their optical axes were kept perpendicular to the bridge facade. To prevent disturbances caused by manual operation, the cameras were triggered by a Bluetooth remote control. Figure 9 shows the test site.

Test results and analysis
3.2.1. Multi-spot center location. When the recorded video was converted into grayscale images frame by frame, it was found that the gray values of the LED lamps were significantly larger than those of the other areas (figure 10). In this experiment, 5 × 5 square filter windows were used for the median filtering of the images. Median filtering is a common nonlinear filtering method in which the gray value of each pixel is replaced by the median of the gray values of all pixels in the neighbouring window of that point. Median filtering can effectively remove random noise and impulse interference from an image while preserving edge information. The center pixel coordinates of a light spot at different times under scenario J1 are displayed in figure 11. The coordinates and gray values of each pixel in the region of interest were fitted with the two-dimensional Gaussian model $I(x,y)=A\exp\left[-\frac{(x-x_0)^2}{2\sigma_x^2}-\frac{(y-y_0)^2}{2\sigma_y^2}\right]$, and the center point of the fitted surface was determined by the least squares method. The displacements of each measuring point at different times are presented in figures 12 and 13. The vibration displacement time histories of the targets were obtained by concatenating the processed results of each frame over time: if the center coordinates of the $j$th light spot in the $i$th frame are $(x_i^j, y_i^j)$, the vertical displacement at that instant is $(y_i^j - y_1^j)\times SF$.
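The 5 × 5 median-filtering step can be sketched with plain NumPy (a minimal version; the authors' exact implementation is not specified, and reflection padding at the border is an assumption):

```python
import numpy as np

def median_filter_5x5(img):
    # Replace each pixel by the median of its 5 x 5 neighbourhood;
    # the border is handled by reflection padding
    H, W = img.shape
    p = np.pad(img, 2, mode='reflect')
    stack = np.stack([p[i:i + H, j:j + W] for i in range(5) for j in range(5)])
    return np.median(stack, axis=0)

# A 5 x 5 bright spot plus one impulse-noise pixel (illustrative)
img = np.zeros((9, 9))
img[2:7, 2:7] = 200.0
img[0, 0] = 255.0            # isolated noise pixel
clean = median_filter_5x5(img)
# the noise pixel is suppressed while the spot interior is preserved
```

This is the behaviour described in the text: isolated impulses are removed because they never occupy the majority of a window, while compact spot regions survive.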

Displacement measurement under static loading scenarios. Static loading tests were carried out to measure the displacements of P1 and P9. The distances between these two points and the camera were 15.5 m and 33.6 m, respectively; according to the camera specifications, the corresponding scaling factors were 5.5 mm pixel$^{-1}$ and 11.9 mm pixel$^{-1}$. The concentrated load was applied at the middle of the side span in two stages. After the first loading stage, data were collected for about 10 s; after the second loading stage, data were acquired for 35 s. Two stages were also adopted during unloading. LVDTs and the DAHUA camera were used to record the displacement curves of P1 and P9 during the loading process (figures 14 and 15), and the displacement curves detected by the camera were well consistent with those collected by the LVDTs. The value in each loading stage was calculated by averaging all measured values. The displacement measurement errors between the LVDTs and the camera in the different loading stages are listed in table 6. The maximum errors at P1 and P9 were 5.54% and 8.45%, respectively, fulfilling the requirements of engineering measurements. It is noticeable from figure 16 that a high degree of consistency was achieved between the displacement results of the two measurement systems. A cost comparison reveals that the proposed system is much cheaper than the conventional system. Moreover, according to table 4, as LED lamps are much cheaper than contact sensors, the spatial resolution of the vibration modes can be improved by increasing the density of LED lamps on the bridge facade with little increase in cost.

Conclusions
In order to improve efficiency and reduce costs, a vision-based measurement system equipped with an inexpensive surveillance camera and a consumer camera was developed. In addition, an improved target tracking algorithm based on the prior knowledge of structural deformation was proposed for displacement measurement and modal parameter identification. Static and dynamic loading tests were conducted on a 1/30 scale model of the Taohuayu Yellow River Highway suspension bridge to validate the effectiveness of the proposed vision-based measurement system. Displacement results obtained from the vision-based measurement system and LVDTs, as well as modal parameter results obtained from the vision-based measurement system and accelerometers, were compared. The main observations of this investigation are summarized below.
(a) The improved template matching algorithm based on leaping computation greatly reduced the computing time. According to the characteristics of bridge deformation, the matching region was pre-selected to reduce the computation; when the size of the matching region was quadrupled, the computing time increased by only about 50%.
(b) In the static loading tests, good consistency between the displacement results obtained from the inexpensive surveillance camera and the LVDTs was observed, with a maximum error of 8.45% at a distance of 33.6 m.
(c) In the dynamic loading tests, the dynamic displacements obtained from the consumer camera and the LVDTs showed great consistency, with a maximum error of 2.90%.
(d) Low-order frequencies and mode shapes were obtained by modal analysis of the vibration displacement time histories. The maximum errors of the low-order frequencies for the consumer camera and the surveillance camera were 0.93% and 1.82%, respectively. Hence, the natural frequencies and mode shapes identified by the cameras were in good agreement with those detected by the accelerometers.
Therefore, the proposed vision-based measurement system could offer a non-contact, cost-effective, and time-saving alternative to conventional displacement sensors and accelerometers for SHM. Although dynamic displacement results obtained from the inexpensive surveillance camera were not ideal, modal parameter identification results were found to be satisfactory.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).