An accurate and robust visual-inertial positioning method

The human–machine integrated coordinate measurement is a promising coordinate measurement method with high flexibility and efficiency for complex working environments. The cameras installed on the head-mounted measurement device achieve accurate global positioning by observing the uncoded LED landmarks, and the result is then combined with the local measuring to obtain 3D coordinates. However, limited by the frame rate of the camera, the fast movements of the operator's head may cause landmark misidentification and visual positioning failure. In order to improve the robustness, a visual-inertial positioning method is proposed in this paper. An inertial measurement unit (IMU) is added to compensate for the deficiency of the visual positioning and enhance the dynamic performance. An adaptive extended Kalman filter (EKF), which adjusts the measurement noise covariance matrix based on the visual positioning uncertainty, is established to obtain the optimal state estimation. An efficient initialization procedure is presented to implement the initial registration of uncoded landmarks based on the normal distribution transform algorithm and to determine the initial state of the IMU. Furthermore, the residual chi-square test is employed to detect false pose estimates in real time and to avoid positioning failure. The experiments demonstrate that the proposed method has high static positioning accuracy (0.681 mm) and high dynamic positioning robustness. The adaptive EKF realizes reliable landmark identification under fast movements and provides a higher accuracy than the common EKF.


Introduction
The three-dimensional (3D) coordinate measurement is a key technology of intelligent manufacturing and plays a crucial part in the improvement of product quality and working efficiency [1]. However, existing measurement methods show poor flexibility in complex working environments such as aerospace and shipbuilding. Since the working space is narrow and the objects are densely distributed, measurement systems such as the laser tracker and indoor GPS are limited by the layout and obstruction, resulting in low accuracy and efficiency [2][3][4].
The human-machine integrated coordinate measurement is a promising solution for the complex working environments. As shown in figure 1, the operator wears the head-mounted measurement device (HMMD) and obtains the 3D coordinates of the target by the global positioning and local measuring. And the virtual information such as the measurement result is overlaid onto the target based on the augmented reality (AR) technology. The human-machine integrated coordinate measurement method maximizes the operator's flexibility and greatly enhances the portability and interactivity. In the measurement process, the global positioning that determines the position and orientation of the HMMD is the key to the accuracy control. Benefiting from the development of the camera performance and image processing techniques, visual positioning methods show great advantages in accuracy and flexibility. The camera mounted on the HMMD can achieve autonomous positioning by observing the point features and line features in the surrounding environment. However, limited by the frame rate of the camera, various factors such as motion blur and feature mismatch may cause the visual positioning to fail. Furthermore, the fast movements of the operator's head, which can reach a peak speed of 360° s−1 [5, 6], raise high requirements for the dynamic performance of the global positioning. Therefore, it is essential to develop a positioning method with high accuracy and robustness.
Visual positioning methods can be classified into two categories: natural feature based methods and artificial feature based methods. Visual simultaneous localization and mapping (SLAM) systems are typical natural feature based positioning systems and are widely used in AR, autonomous navigation and 3D reconstruction [7][8][9][10][11]. The popular wearable AR device Microsoft HoloLens adopts the visual SLAM algorithm to realize motion tracking [12]. Furthermore, much work has been done to combine SLAM and deep learning algorithms to boost the performance of depth prediction and feature matching [13][14][15]. However, the natural feature based methods can hardly detect enough features in weak texture environments such as a ship cabin, which reduces the positioning stability.
Compared with natural features, customized artificial features are easier to identify and thus provide higher stability. Li et al developed a visual localization system based on encoded square patterns and eliminated the accumulated errors through the Bayes estimation method [16]. Zhong et al designed artificial landmarks patterned with a set of concentric circular rings in black and white and used a bilayer recognition algorithm to improve the positioning robustness [17]. Kong et al presented a novel centripetal circular coded target to realize an automatic and accurate approach for marking ground control points [18]. However, these passive landmarks are easily interfered with by illumination, which limits their applications in complex working environments. In contrast, active landmarks represented by LEDs have strong anti-interference capability and receive increasing attention. Welch et al developed the HiBall Tracking System using infrared LEDs installed in ceiling panels to achieve head tracking with high accuracy and low latency [19]. Weng et al utilized primary and assistant LED markers to avoid the synchronization between the markers and the camera [20]. Hijikata et al presented a simple self-localization method that used a single camera observing several LEDs set at known positions [21]. Liu et al developed a vision localization system with LED array targets and a camera installed on a rotating platform for automated guided vehicle (AGV) navigation [22]. In order to better meet the needs of the human-machine integrated coordinate measurement, an accurate positioning method using a multi-camera system (MCS) with non-overlapping views was previously proposed [23]. By observing the specially designed LED landmarks with known 3D coordinates, the MCS can locate itself based on the space resection adjustment algorithm. The proposed method gave a root mean square (RMS) error of 0.393 mm in a 5 m × 5 m × 3 m room. However, since the LED landmarks are not coded, fast movements of the operator's head can cause the landmark identification to fail.
With the advantages of high sampling rate and strong independence, the inertial positioning technology offers good dynamic performance and can be an effective complement to the visual positioning [24]. Inertial positioning can provide accurate pose measurements in a short time to avoid failure of visual positioning under fast movements. Meanwhile, the absolute accuracy of visual positioning can be utilized to correct the accumulated error of inertial positioning over time. Hence, the visual-inertial fusion method shows great potential for accurate and robust positioning, and has become a research hotspot. According to whether the image features are added to the state vector, visual-inertial positioning methods can be classified into two categories: loosely coupled methods and tightly coupled methods [25].
Loosely coupled methods process visual information and inertial information separately, and then fuse the pose estimation results of the two positioning modes. Compared with tightly coupled methods, loosely coupled methods have lower computational cost and thus are easier to run in real time. The fusion method based on the extended Kalman filter (EKF) is the most commonly used method in the loosely coupled visual-inertial positioning [26]. The inertial measurements are used for the EKF prediction, and the visual pose estimation is treated as a black box and used in the update phase. The EKF based visual-inertial fusion relies on accurate system models and noise characteristics. However, the system noise and measurement noise may vary considerably in highly dynamic and complex environments. Filters with constant noise parameters will lose accuracy and may even diverge. Hence, various adaptive Kalman filters have been proposed to enhance the robustness of sensor fusion, such as the Sage-Husa adaptive filter, the fuzzy adaptive filter and the neural network adaptive filter [27][28][29]. Zhang et al proposed an adaptive visual-inertial odometry via interference quantization based on image information entropy to eliminate the impact of dynamic objects in the scenes [30]. Yue et al presented a three-stage adaptive filtering algorithm based on the limited memory exponential weighting theory to improve the robustness of visual-inertial odometry systems [31]. However, these adaptive filters have complex algorithms and poor real-time performance, which limit their responsiveness to abrupt changes in the system state.
In this paper, an accurate and robust visual-inertial positioning method is proposed for the human-machine integrated coordinate measurement in complex working environments. Based on the previous work, an inertial measurement unit (IMU) is added to the MCS to improve the dynamic performance. An adaptive EKF, which adjusts the measurement noise covariance matrix based on the visual positioning uncertainty, is developed to achieve visual-inertial sensor fusion. In addition, an efficient initialization procedure is presented to carry out the initial registration of the landmarks and the initial state estimation of the IMU, including the gravity direction, velocity and gyroscope bias. The residual chi-square test is also used to detect tracking failures caused by the misidentification of landmarks.

Visual-inertial positioning methodology
In this section, a visual-inertial positioning method is elaborated for the human-machine integrated coordinate measurement. As shown in figure 2, the method mainly consists of three procedures: initialization, adaptive EKF and chi-square test. These three procedures form a closed-loop framework, enabling accurate and robust positioning.

Adaptive EKF
As the core of the visual-inertial positioning method, the sensor fusion based on the adaptive EKF is divided into three stages: state prediction, measurement and state update. In the state prediction stage, the IMU compensates for the deficiency of the MCS in dynamic measurement and thus enables reliable identification of landmarks. Based on the visual positioning uncertainty, the measurement noise covariance matrix is updated in real time to obtain the optimal state estimation.

State prediction.
The visual-inertial positioning model is illustrated in figure 3. The MCS implements the inside-out positioning using its observations of landmarks with known 3D coordinates. And the IMU performs pose estimation by sensing the angular velocity ω and acceleration a:

ω_m = ω + ω_b + n_ω,  a_m = a + a_b + n_a

The subscript m indicates the measured value, the subscript b denotes the measurement bias, and n represents the zero-mean white Gaussian noise. The biases are modeled as random walk processes ω̇_b = n_{ω_b} and ȧ_b = n_{a_b}. The transformation (rotation matrix R_I^M and translation vector T_I^M) between the IMU frame O_I−X_I Y_I Z_I and the MCS frame O_M−X_M Y_M Z_M is fixed and needs to be calibrated in advance. The IMU frame is set as the HMMD frame, and thus the system state x is written as

x = [p^T v^T q^T ω_b^T a_b^T]^T

where p, v and q represent the position, velocity and orientation (in quaternion form) of the IMU with respect to the world coordinate system O_W−X_W Y_W Z_W. And the state estimation x̂_{k+1} at the time t_{k+1} can be predicted based on the following expressions:

p̂_{k+1} = p̂_k + v̂_k Δt + ½ (R_I^W ā + g^W) Δt²
v̂_{k+1} = v̂_k + (R_I^W ā + g^W) Δt      (3)
q̂_{k+1} = (I + ½ Ω(ω̄) Δt) q̂_k

where Δt is the sampling period, g^W is the gravity vector in the world frame, and ω̄ = ω_m − ω̂_b and ā = a_m − â_b represent the angular velocity estimate and acceleration estimate respectively. The matrix Ω(ω̄) describes the quaternion derivative q̇ = ½ Ω(ω̄) q and is written as

Ω(ω̄) = [ 0, −ω̄^T ; ω̄, −⌊ω̄×⌋ ]

The rotation matrix R_I^W can be expressed in terms of the quaternion components as equation (5). The Kalman filter with the above state vector can directly describe the dynamic process of the positioning system. However, the direct Kalman filter has poor numerical stability, which may lead to singularities. Hence, the indirect Kalman filter with the error state vector is employed to reduce computational complexity and improve state estimation accuracy.
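As a concrete illustration, the prediction step of equation (3) can be sketched in a few lines of numpy. This is a minimal sketch with illustrative function names; the gravity-sign convention (true acceleration = R_I^W ā + g^W, with g^W pointing downward) is an assumption of this sketch rather than a detail taken from the paper.

```python
import numpy as np

def omega(w):
    """Quaternion-derivative matrix Ω(ω) for q = [qw, qx, qy, qz]."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx,  0.0,  wz, -wy],
                     [wy, -wz,  0.0,  wx],
                     [wz,  wy, -wx,  0.0]])

def quat_to_rot(q):
    """Rotation matrix R_I^W from the attitude quaternion."""
    qw, qx, qy, qz = q
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)]])

def predict(p, v, q, w_m, a_m, b_w, b_a, g_w, dt):
    """One prediction step: bias-corrected ω̄, ā drive p, v, q forward."""
    w = w_m - b_w                                 # ω̄ = ω_m − ω̂_b
    a = a_m - b_a                                 # ā = a_m − â_b
    R = quat_to_rot(q)
    p_new = p + v*dt + 0.5*(R @ a + g_w)*dt**2
    v_new = v + (R @ a + g_w)*dt
    q_new = q + 0.5*(omega(w) @ q)*dt             # first-order quaternion integration
    return p_new, v_new, q_new / np.linalg.norm(q_new)
```

For a stationary IMU measuring only the specific force that cancels gravity, the predicted state stays constant, which is a quick sanity check of the sign convention.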
The error state Δx is defined as the difference between the true state x and the estimated state x̂:

Δx = [Δp^T Δv^T Δθ^T Δω_b^T Δa_b^T]^T

The angle vector θ is comprised of the yaw angle ψ, pitch angle ϕ and roll angle φ. And the rotation angle error Δθ is derived from the following relation:

Δq = q̂^{−1} ⊗ q = [cos(∥Δθ∥/2), k^T sin(∥Δθ∥/2)]^T ≈ [1, Δθ^T/2]^T

where the operator ⊗ denotes the quaternion multiplication and k is the unit vector of the rotation axis. Since the rotation corresponding to the quaternion error Δq is assumed to be small, the small angle approximation is used in the above equation. Then the variation characteristics of the error state variables with respect to time are given:

Δṗ = Δv
Δv̇ = −R_I^W ⌊ā×⌋ Δθ − R_I^W Δa_b − R_I^W n_a
Δθ̇ = −⌊ω̄×⌋ Δθ − Δω_b − n_ω
Δω̇_b = n_{ω_b},  Δȧ_b = n_{a_b}

The operator ⌊×⌋ represents the skew-symmetric matrix. The above error state equations can also be expressed as

Δẋ = F_c Δx + G_c n

where n = [n_ω^T n_a^T n_{ω_b}^T n_{a_b}^T]^T is the noise vector, and F_c and G_c are the system matrix and noise matrix respectively. The system matrix F_c and noise matrix G_c are assumed to be constant in a sampling period, and thus the error state transition equation is given by

Δx_{k+1} = F_d Δx_k

The error state transition matrix F_d is defined as

F_d = exp(F_c Δt) = I + F_c Δt + (F_c Δt)²/2! + …

In the practical application, the number of retained expansion terms is determined depending on the sampling frequency and computing power. The noise covariance matrix Q_d can then be computed with

Q_d = ∫₀^{Δt} F_d(τ) G_c Q_c G_c^T F_d(τ)^T dτ

where Q_c = diag(σ²_{nω}, σ²_{na}, σ²_{nωb}, σ²_{nab}) denotes the covariance matrix corresponding to the noise vector n. Eventually, the error state covariance matrix is obtained by

P_{k+1|k} = F_d P_k F_d^T + Q_d

2.1.2. Measurement. When the MCS is triggered, the cameras simultaneously capture the LED landmarks in their respective fields of view. Since the landmarks are not coded, the inertial measurement information is used to achieve reliable landmark identification. First, the state of the ith camera with respect to the world frame (R_{ci}, T_{ci}) can be estimated from the predicted state of the IMU (p̂, R_I^W) by chaining the calibrated transformations, where R_M^{ci} and T_M^{ci} describe the transformation between the MCS frame and the ith camera frame O_{ci}−X_{ci} Y_{ci} Z_{ci}. The coordinates of the image points can then be predicted by projecting the landmarks on the image plane of the camera:

û_j = u_{0i} + (c_i/dx_i)(X_{ci j}/Z_{ci j}),  v̂_j = v_{0i} + (c_i/dy_i)(Y_{ci j}/Z_{ci j})      (17)

where [X_{ci j} Y_{ci j} Z_{ci j}]^T are the coordinates of the jth landmark in the ith camera frame. The above prediction process is based on the pinhole camera model: c_i is the principal distance, (u_{0i}, v_{0i}) are the image coordinates of the principal point, (dx_i, dy_i) denote the pixel size, and (X_{Wj}, Y_{Wj}, Z_{Wj}) are the coordinates of the jth landmark in the world frame. The real image points are then matched with the predicted image points using the Nearest Neighbor method, which determines the landmark ID corresponding to each real image point. The reciprocity check is also employed to remove the outliers.
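The projection-and-matching step above can be sketched as follows. The helper names are hypothetical, and two simplifying assumptions are made: a distortion-free pinhole matrix K replaces the per-axis (c_i, dx_i, dy_i) parameters, and the world-to-camera convention X_c = R_c (X_W − T_c) is assumed.

```python
import numpy as np

def project(landmarks_w, R_c, T_c, K):
    """Project 3D landmarks into the image with an ideal pinhole model."""
    X_c = (R_c @ (landmarks_w - T_c).T).T        # world -> camera frame
    uv = (K @ X_c.T).T
    return uv[:, :2] / uv[:, 2:3]                # perspective division

def match_nearest(pred_pts, real_pts):
    """Mutual nearest-neighbour matching with a reciprocity check:
    a pair (i, j) is kept only if i and j are each other's nearest point."""
    d = np.linalg.norm(pred_pts[:, None, :] - real_pts[None, :, :], axis=2)
    fwd = d.argmin(axis=1)       # predicted point -> nearest real point
    bwd = d.argmin(axis=0)       # real point -> nearest predicted point
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```

The reciprocity check drops ambiguous correspondences instead of forcing every point into a match, which mirrors the outlier-removal step described above.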
Once the image point identification is completed, the accurate pose of the IMU (p_x, p_y, p_z, ψ, ϕ, φ) can be obtained based on the global space resection adjustment method:

Here (e_{xj}, e_{yj}) denote the reprojection error, (Δu_j, Δv_j) indicate the lens distortion, T is equivalent to R_M^{ci} T_I^M + T_M^{ci}, and R_{ci} can be expressed in terms of the Euler angles (ψ, ϕ, φ) as equation (19). Eventually, the following objective function is established to obtain the globally optimal position and orientation:

min Σ_i Σ_j (e_{xj}² + e_{yj}²)

And the Levenberg-Marquardt algorithm [32] is employed to minimize the reprojection errors of all observed landmarks.
The measurement error of the EKF is then defined as the difference between the visual pose estimate and the predicted pose, where the rotation angles (ψ, ϕ, φ) are derived from the rotation matrix. Hence, the measurement model can be written as

Δz = H Δx + n_m

where H is the measurement matrix, and the measurement noise n_m follows the normal distribution with mean µ = 0 and covariance matrix R_c = diag(σ²_p, σ²_θ). The measurement noise covariance matrix is usually constant during the filtering process. However, the visual positioning uncertainty may vary greatly under different observation conditions. A constant measurement noise correspondingly fails to yield the optimal state estimates and can even lead to filter divergence. Therefore, it is necessary to adaptively adjust the measurement noise based on the visual positioning uncertainty. According to the visual positioning model in equation (17), the uncertainty arises from the 2D coordinates of the image points and the 3D coordinates of the landmarks. However, there is a complicated coupling relation between the observations and the parameters to be determined, which makes it impossible to directly compute the sensitivity matrix. Consequently, the implicit function is introduced to analyze the uncertainty propagation law.
The visual positioning model can be expressed in the implicit function form as

f(C, D) = 0

where C denotes the observations (the image coordinates and the landmark coordinates) and D denotes the position and orientation parameters. The covariance matrix U_D corresponding to the position and orientation is then defined as

U_D = S U_C S^T

where S = ∂D/∂C is the sensitivity matrix. The observations are usually assumed to be independent of each other and follow Gaussian distributions, and thus the covariance matrix U_C is written as a diagonal matrix of the observation variances. Based on the derivative method of the implicit function, the sensitivity matrix can be indirectly computed by

S = ∂D/∂C = −(∂f/∂D)⁻¹ (∂f/∂C)

The above equations demonstrate that the positioning uncertainty can be determined using the nonlinear optimization result. Since the measurement noise n_m is positively correlated with the visual positioning uncertainty, an adaptive factor λ based on the visual positioning uncertainty is added to adjust the measurement noise covariance matrix R_c in real time. The adaptively adjusted covariance matrix R_c^{k+1} is then used in the state update stage, enabling more accurate pose estimation.

2.1.3. State update. With the predicted covariance matrix P_{k+1|k} and the measurement noise covariance matrix R_c^{k+1}, the Kalman gain K is given by

K = P_{k+1|k} H^T (H P_{k+1|k} H^T + R_c^{k+1})⁻¹

which describes the relative weight of the state prediction and the measurement. Then the error state and the corresponding covariance matrix can be updated by

Δx_{k+1} = K r_{k+1},  P_{k+1} = (I − K H) P_{k+1|k}

And the system state x_{k+1} can also be updated with the state estimation x̂_{k+1} and error state Δx_{k+1} as

x_{k+1} = x̂_{k+1} ⊕ Δx_{k+1}      (34)
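A minimal sketch of the adaptive update step follows. The adaptive factor λ is shown here as a simple scalar multiplier on R_c; the exact functional form of λ from the visual positioning uncertainty is not reproduced, so both the function name and this scaling are illustrative assumptions.

```python
import numpy as np

def ekf_update(P_pred, r, H, R_c, lam):
    """Error-state EKF update with adaptively scaled measurement noise.
    P_pred : predicted error-state covariance P_{k+1|k}
    r      : measurement residual
    lam    : adaptive factor derived from the visual-positioning uncertainty
    """
    R_k = lam * R_c                               # adaptively adjusted R_c^{k+1}
    S = H @ P_pred @ H.T + R_k                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    dx = K @ r                                    # predicted error state is zero
    P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return dx, P, K
```

A larger λ (higher visual uncertainty) shrinks the gain, so the filter leans more on the inertial prediction, which is exactly the behaviour the adaptive scheme above is meant to produce.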

Initialization
The initialization of the visual-inertial positioning system needs to be completed before the EKF based positioning. The initialization procedure is to obtain an accurate initial state estimation of the IMU including the gravity direction, velocity and gyroscope bias, which can speed up the convergence of sensor fusion. The initial registration of landmarks is also indispensable for the visual positioning. An efficient approach is proposed to accomplish the visual-inertial initialization by smoothly moving the HMMD over a distance, as shown in figure 4.

Visual initialization.
The visual initialization is to determine the initial pose of the visual-inertial positioning system with respect to the world frame, which is the prerequisite for the continuous EKF process. However, the landmarks are not coded, which brings great challenges to the landmark identification and absolute orientation. Hence, the visual initialization is divided into three steps: relative orientation, triangulation and landmark registration.
During the smooth movement of the HMMD, the image points of two adjacent frames can be matched using the Nearest Neighbor method. When n (n ⩾ 8) pairs of image points are matched between the first frame and the last frame, the relative orientation can be determined based on the epipolar constraint. First, the matched image points are normalized, which can effectively improve the numerical stability. The epipolar constraint can then be expressed as

(x^l)^T F x^f = 0

where the superscripts f and l denote the first frame and the last frame respectively, and the fundamental matrix F describes the transformation of image coordinates. Since the epipolar constraint holds true with different scale factors, the fundamental matrix actually has eight degrees of freedom, which is why at least eight pairs of matched points are required for relative orientation. The following linear equation is then obtained:

A f = 0

where f = [F_{11} F_{12} … F_{33}]^T represents the elements of the fundamental matrix in a vector form. The above equation can be solved by the singular value decomposition (SVD) to obtain an estimate of the fundamental matrix. Then the essential matrix E is computed by

E = K^T F K

where K denotes the camera intrinsic matrix. And the essential matrix can be further decomposed to recover the rotation matrix R_f^l and translation vector T_f^l of the last frame with respect to the first frame. The essential matrix is equivalent with different scale factors, and thus its SVD satisfies the following expression:

E = U diag(ρ, ρ, 0) V^T      (40)

Four possible solutions of (R_f^l, T_f^l) can be obtained from equation (40), and ρ is the scale factor. Only one of the four solutions satisfies the actual condition where the observed landmarks are in front of the camera. In order to determine the correct solution, the 3D coordinates of the observed landmarks need to be calculated by triangulation. Using the relative orientation parameters, the coordinates of the landmarks in the camera coordinate system corresponding to the last frame X_{cl j} = [X_{cl j} Y_{cl j} Z_{cl j}]^T can be computed, and the coordinates corresponding to the first frame are obtained in the same way. Then the only correct solution of (R_f^l, T_f^l) is determined with the criterion Z_{cl j} > 0 and Z_{cf j} > 0. The above calculation process is based on the ideal triangulation model, resulting in significant coordinate errors. Hence, the coordinates X_{cl j} are further optimized by iteratively minimizing the reprojection errors.
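The linear eight-point step can be sketched as follows. Normalized image coordinates are assumed, the rank-2 constraint is enforced by a second SVD, and the function name is illustrative.

```python
import numpy as np

def eight_point(x1, x2):
    """Linear eight-point estimate of the fundamental matrix.
    x1, x2 : (n, 2) matched, normalised image points (n >= 8),
             x1 from the first frame, x2 from the last frame."""
    ones = np.ones(len(x1))
    # Rows of A encode (x2)^T F (x1) = 0 with F flattened row-major.
    A = np.column_stack([
        x2[:, 0]*x1[:, 0], x2[:, 0]*x1[:, 1], x2[:, 0],
        x2[:, 1]*x1[:, 0], x2[:, 1]*x1[:, 1], x2[:, 1],
        x1[:, 0],          x1[:, 1],          ones])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)           # null vector -> matrix estimate
    U, s, Vt = np.linalg.svd(F)        # enforce the rank-2 constraint
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt
```

With exact synthetic correspondences the recovered F satisfies the epipolar constraint to numerical precision, which is a convenient self-check before moving on to the essential-matrix decomposition.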
Then, the normal distribution transform (NDT) algorithm is employed to complete the registration process [33], which is to determine the ID of the observed landmark. The NDT algorithm finds the optimal match based on the probability density distribution of the point cloud. For the observed landmarks, the centroid X_{cl c} and the covariance matrix Σ_{cl} are written as

X_{cl c} = (1/n) Σ_j X_{cl j},  Σ_{cl} = (1/n) Σ_j (X_{cl j} − X_{cl c})(X_{cl j} − X_{cl c})^T

By decomposing the covariance matrix Σ_{cl}, three eigenvalues (λ_1, λ_2, λ_3) (λ_1 ⩾ λ_2 ⩾ λ_3) and their corresponding eigenvectors (w_1, w_2, w_3) are obtained. In order to intuitively reflect the statistical properties of the point cloud, the local coordinate system of the observed landmarks O_L−X_L Y_L Z_L is established. The origin is located at the centroid X_{cl c}, and the X_L axis, Y_L axis and Z_L axis are along the directions of vector w_1, vector w_2 and vector w_1 × w_2 respectively. The coordinates X_{L j} of the observed landmarks in the O_L−X_L Y_L Z_L frame can then be obtained. The coordinates are further normalized to eliminate the effect of the scale, and they are finally denoted as

X_{LN j} = X_{L j} / ∥X_{L j}∥_2

where ∥X_{L j}∥_2 is the Euclidean norm of X_{L j}. And the probability density function of the observed landmarks can be expressed as

p(x) = (2π)^{−3/2} |Σ_{LN}|^{−1/2} exp(−½ (x − µ_{LN})^T Σ_{LN}⁻¹ (x − µ_{LN}))

where Σ_{LN} indicates the covariance matrix of the point cloud X_{LN}, µ_{LN} is its mean, and |Σ_{LN}| denotes the determinant. Then n landmarks that are adjacent to each other are selected from the actual landmarks in the world frame. According to the processing method used above, the local coordinate system of the n actual landmarks ÕL−X̃L ỸL Z̃L is established, and the normalized coordinates X̃_{LN j} are also obtained. After the above coordinate transformation, the frame O_L−X_L Y_L Z_L and the frame ÕL−X̃L ỸL Z̃L are assumed to be coincident, and the matching degree between the point clouds X_{LN} and X̃_{LN} is evaluated by comparing their likelihood. Hence, the point cloud X̃_{LN} is directly mapped to the O_L−X_L Y_L Z_L frame, and the likelihood of the two point clouds is obtained by evaluating the probability density of each mapped point X̃_{LN j}. Then another point cloud containing n actual landmarks is selected to calculate the likelihood with respect to the observed landmarks. When all eligible point clouds are evaluated, the one with the maximum likelihood is the optimal match of the observed landmarks. And the observed landmarks X_{LN j} are matched to the actual landmarks X̃_{LN j} one by one based on the Nearest Neighbor method, which means the corresponding landmark IDs can be determined.
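The construction of the scale-free local frame used for registration can be sketched as follows. The per-point normalization X_{L j}/∥X_{L j}∥₂ follows the formula above; the eigen-ordering relies on numpy's ascending `eigh`, and the function name is illustrative.

```python
import numpy as np

def local_normalized_coords(X):
    """Express a point cloud in its own PCA-based local frame and
    normalise each point to unit length (scale-free coordinates)."""
    c = X.mean(axis=0)                    # centroid -> local origin
    cov = np.cov((X - c).T)               # 3x3 covariance of the cloud
    w, V = np.linalg.eigh(cov)            # eigenvalues in ascending order
    x_axis, y_axis = V[:, -1], V[:, -2]   # major and middle principal axes
    z_axis = np.cross(x_axis, y_axis)     # w1 x w2 completes the frame
    R = np.column_stack([x_axis, y_axis, z_axis])
    XL = (X - c) @ R                      # coordinates in the local frame
    return XL / np.linalg.norm(XL, axis=1, keepdims=True)
```

Because both the observed and the candidate actual landmarks are expressed this way, the two clouds can be compared directly without knowing the rigid transformation between them.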
After the landmark registration is completed, the position and orientation of the HMMD with respect to the world frame during the movement can be determined through the visual positioning as described in section 2.1.2.

Inertial initialization.
The inertial initialization is performed on the basis of the visual initialization results, which include the position p_k and orientation q_k of the IMU with respect to the world frame when taking the kth image. These discrete poses provide constraints for the motion model of the IMU. Since there exist multiple inertial measurements between two adjacent images, the relative motion can be calculated by the preintegration, where α_{I_k I_{k+1}}, β_{I_k I_{k+1}} and γ_{I_k I_{k+1}} denote the integration of position, velocity and rotation with respect to the state of the IMU at time t_k. However, when the biases a_b and ω_b are adjusted, the above integration terms need to be recalculated, which greatly increases the computational cost. Hence, the first-order Taylor approximation is performed for the integration terms:

α_{I_k I_{k+1}} ≈ α̂_{I_k I_{k+1}} + J^α_{a_b} Δa_b + J^α_{ω_b} Δω_b
β_{I_k I_{k+1}} ≈ β̂_{I_k I_{k+1}} + J^β_{a_b} Δa_b + J^β_{ω_b} Δω_b
γ_{I_k I_{k+1}} ≈ γ̂_{I_k I_{k+1}} ⊗ [1, ½ (J^γ_{ω_b} Δω_b)^T]^T

where Δa_b and Δω_b are the bias corrections, and J indicates the corresponding Jacobian matrix. The approximation avoids repeating the complex integration process in the bias optimization. The relative rotation in the time interval [t_k, t_{k+1}] can also be derived from the visual positioning results as q_k^{−1} ⊗ q_{k+1}. The relative rotation constraint is then given by

γ_{I_k I_{k+1}} ≈ q_k^{−1} ⊗ q_{k+1}

And the optimization function for the bias correction Δω_b can then be established as

min_{Δω_b} Σ_{k∈Γ} ∥ℑ[(q_k^{−1} ⊗ q_{k+1})^{−1} ⊗ γ_{I_k I_{k+1}}]∥²

where Γ represents all images during the movement, and ℑ denotes the imaginary part of the quaternion. Through the optimization, the gyroscope bias ω_b = ω̂_b + Δω_b is obtained. The preintegration terms α_{I_k I_{k+1}}, β_{I_k I_{k+1}} and γ_{I_k I_{k+1}} are then recalculated with the updated gyroscope bias.
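The rotation preintegration γ can be sketched with a simple Euler quaternion integration. This is a coarse sketch with illustrative names; a full implementation would typically use midpoint integration and also propagate α, β and the bias Jacobians.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in [w, x, y, z] order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def preintegrate_rotation(gyro_samples, dt, b_w):
    """Accumulate the relative rotation gamma_{I_k I_{k+1}} from the
    high-rate gyro samples between two images (Euler integration)."""
    q = np.array([1.0, 0.0, 0.0, 0.0])          # identity at time t_k
    for w_m in gyro_samples:
        half = (w_m - b_w) * dt / 2.0           # bias-corrected half angle
        dq = np.concatenate(([1.0], half))      # small-angle increment
        q = quat_mul(q, dq / np.linalg.norm(dq))
    return q
```

Because γ depends only on the gyro samples and the bias, it can be recomputed cheaply after the bias correction Δω_b is found, as described above.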
Based on the state propagation model, the preintegration terms α_{I_k I_{k+1}} and β_{I_k I_{k+1}} satisfy the following constraints:

α_{I_k I_{k+1}} = R_W^{I_k} (p_{k+1} − p_k − v_k Δt_k − ½ g^W Δt_k²)      (52)
β_{I_k I_{k+1}} = R_W^{I_k} (v_{k+1} − v_k − g^W Δt_k)

The above equations contain the velocities of the IMU and the gravity direction in the world frame, which need to be estimated in the inertial initialization and are written as the unknown vector χ = [v_0^T, v_1^T, …, v_n^T, (g^W)^T]^T. The parameters without unknowns in equation (52) can be collected on one side, and thus the constraints can also be expressed in the linear form Hχ = b. Consequently, the inertial initialization is transformed into a linear optimization problem, min_χ ∥Hχ − b∥². With the least squares method, the estimation of velocity v̄_k and gravity direction ḡ_W is achieved. Since the magnitude of the gravity vector G is usually known, the gravity vector is restricted to two degrees of freedom, and thus it can be further refined by

g^W = G ǧ_W + η_1 ⃗ϵ_1 + η_2 ⃗ϵ_2      (57)

where ǧ_W is the unit vector along the gravity direction, and the vectors ǧ_W, ⃗ϵ_1 and ⃗ϵ_2 form an orthonormal basis (ǧ_W ⊥ ⃗ϵ_1 ⊥ ⃗ϵ_2). The refined gravity vector is obtained by optimizing the variables η_1 and η_2. The gravity vector g^W in equation (52) is replaced by equation (57), and the constraint equation with η_1 and η_2 is then derived as equation (58). After solving the above equation, the velocities and gravity vector are updated. At this point, the initialization process is completed, and the results will be used in the adaptive EKF.

Residual chi-square test
The proposed visual-inertial positioning method greatly improves the dynamic performance of the HMMD. However, due to the complexity of working environments and the diversity of movements, tracking failures caused by the misidentification of landmarks may still occur. Hence, it is necessary to detect false pose estimates in real time. The residual chi-square test is an effective fault detection method based on the statistical property of the residual in the Kalman filter, especially for abrupt changes of the system state, and thus it is very suitable for the proposed visual-inertial positioning system. Based on the EKF framework in section 2.1, the residual of the filter is given by

r_{k+1} = z_{k+1} − H x̂_{k+1|k}

When there is no false pose estimate, the residual r_{k+1} follows the Gaussian distribution with mean µ = 0 and variance A_{k+1} = H P_{k+1|k} H^T + R_c^{k+1}. When a false pose estimate occurs, the mean of the residual is no longer zero. Hence, the fault detection variable ξ is defined as

ξ = r_{k+1}^T A_{k+1}⁻¹ r_{k+1}

and ξ follows a chi-square distribution with six degrees of freedom. The false pose estimate can then be judged according to the following criterion: if ξ > T_d, a false pose estimate is detected; otherwise the pose estimate is accepted. Here T_d is the preset threshold that is determined by the false alarm rate and the chi-square distribution. When a false pose estimate is detected, only inertial measurements are retained for state prediction in a short period, until the visual positioning returns to normal. However, if the correct pose estimation is not obtained for a long time, due to the rapid accumulation of inertial measurement error, the initialization process needs to be repeated.
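A sketch of the test follows; the default threshold T_d = 16.81 is the 99% chi-square quantile for six degrees of freedom (i.e. a 1% false alarm rate), and the function name is illustrative.

```python
import numpy as np

def residual_chi_square(r, H, P_pred, R_c, T_d=16.81):
    """Residual chi-square fault detection for a 6-DoF pose residual.
    Returns (xi, fault): xi = r^T A^{-1} r with A = H P H^T + R_c,
    and fault = True when xi exceeds the threshold T_d."""
    A = H @ P_pred @ H.T + R_c              # residual covariance
    xi = float(r @ np.linalg.solve(A, r))   # Mahalanobis-style statistic
    return xi, xi > T_d
```

When `fault` is True the visual update would be skipped and only the inertial prediction retained, as described above.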

Experiment
In order to evaluate the performance of the proposed visual-inertial positioning method, experiments are conducted under different conditions. First, the hardware used in the experiments, including the HMMD and the uncoded LED landmarks, is introduced.
As shown in figure 5, the HMMD mainly consists of three positioning cameras, one measuring camera, an IMU, a projector and a control unit. The positioning camera is a Basler dart camera (daA1600-60um) with 2 MP resolution. The field of view (FOV) of the camera is 50° with an 8 mm lens, and the three cameras have no common FOV. The IMU is an Xsens MTi-10 MEMS IMU which provides 3D acceleration and 3D rate of turn. In the experiments, the sampling frequencies of the camera and the IMU are set to 20 Hz and 200 Hz respectively, and these sensors are synchronized by the hardware trigger signals from the control unit.
Two different LED landmarks are shown in figure 6, and they can be applied to different working environments. The spherical landmark I is compatible with the spherically mounted retroreflector (SMR) of the laser tracker in dimension (1.5 in). The landmark II can cooperate with the photogrammetry system or the total station to obtain its own 3D coordinates. Both landmarks use infrared LEDs, and their parameters are accurately calibrated with the TESA-VISIO 200 video measuring machine.

Landmark identification experiment
An important goal of the proposed visual-inertial positioning method is to solve the landmark identification problem under fast movements. The experimental scenario shown in figure 7 is set up to test the reliability of the proposed method. Twenty landmarks are arranged in the experimental site, and their 3D coordinates are calibrated using the Leica Absolute Tracker AT960. The operator wears the HMMD to complete a series of tests.
First, the visual-inertial positioning system is initialized following the steps described in section 2.2. In the visual initialization part, the eight-point algorithm is employed to sequentially implement the relative orientation, triangulation and landmark registration. And two key initial values for the visual-inertial fusion, the gravity vector in the world frame g_W and the gyroscope bias ω_b, are estimated through the inertial initialization and listed in table 1.
Then, the landmark identification problem under fast movements is analyzed. The operator turns his head quickly, and the visual measurements and inertial measurements are collected during the movement. The landmark identification process is then conducted without and with inertial information separately. The operator constantly changes his position and attitude to acquire comprehensive experimental data. Ultimately, the landmark identification results in 80 different states are obtained and shown in table 2. And a typical example of landmark identification is illustrated in figure 8.
The movement data of the operator's head in the tests is also analyzed. A representative set of angular velocities measured by the IMU when the operator turns his head is shown in figure 9. It can be seen that the head reaches a high angular velocity regardless of yawing, pitching or rolling. In particular, the maximum angular velocity during the yawing motion reaches 364° s−1, which is consistent with the statistics mentioned in the Introduction.
The above results show that the inertial measurements effectively recover the motion information and greatly improve the accuracy of landmark identification under fast movements. The proposed visual-inertial positioning method is thus demonstrated to avoid misidentification and to achieve continuous tracking.

Circular trajectory experiment
In order to quantitatively evaluate the performance of the proposed adaptive EKF, the experimental scenario is designed as shown in figure 10. The HMMD is mounted on the turntable, which has an angular accuracy of 1 arcsecond. The landmarks are deployed about 4 m in front of the turntable, and their 3D coordinates are calibrated using the V-STARS photogrammetry system. When the turntable rotates around its vertical axis, the motion trajectory of the HMMD should be a circular trajectory. However, due to the positioning error, the actual measured positions of the HMMD do not lie on the same circle. Hence, the positioning accuracy of the system can be evaluated by calculating the circle fitting error (including the radial error and the planar error) of the actual trajectory.
Limited by the installation conditions, the turning radius of the HMMD is small, which leads to a large uncertainty in the circle fitting. From another perspective, if the HMMD is assumed to be stationary, the positions of a landmark in the HMMD frame (−R_W^I·p) form a circle with a larger radius. Consequently, the circle fitting error of the world frame origin in the HMMD frame is calculated to compare the different positioning approaches.
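The circle-fitting metric can be sketched as a two-stage least-squares problem: fit the best plane to the trajectory points via SVD, then fit a circle to the in-plane coordinates. The planar residual is the distance to the plane and the radial residual is the distance to the fitted circle. The Kåsa algebraic fit below is one standard choice; the paper does not state which fitting method it uses.

```python
import numpy as np

def circle_fit_errors(pts):
    """Fit a 3-D circle to an (N, 3) array of trajectory points and
    return (radial residuals, planar residuals, radius).
    Sketch: plane via SVD, in-plane circle via the Kasa fit."""
    c0 = pts.mean(axis=0)
    q = pts - c0
    # principal directions: the last one (least variance) is the plane normal
    _, _, Vt = np.linalg.svd(q)
    n, e1, e2 = Vt[2], Vt[0], Vt[1]
    planar = q @ n                       # signed distance to the best plane
    x, y = q @ e1, q @ e2                # in-plane coordinates
    # Kasa fit: x^2 + y^2 = 2*a*x + 2*b*y + c  solved by linear least squares
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r = np.sqrt(c + a**2 + b**2)
    radial = np.sqrt((x - a)**2 + (y - b)**2) - r
    return radial, planar, r
```

The RMS and maximum of these residuals correspond to the circle fitting errors reported in table 3.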
The static positioning accuracy of the HMMD is first tested. The turntable rotates 0.5° each time, and a set of images is then captured. The rotation range with valid observations is about 135°, and 273 sets of images are acquired. For the dynamic positioning situations, the turntable is set to rotate at angular velocities of 10° s⁻¹ (slow movement) and 40° s⁻¹ (fast movement), and 287 sets and 85 sets of valid images are obtained, respectively. The trajectories of the world frame origin O in the HMMD frame are then calculated using the visual positioning, the visual-inertial positioning based on the common EKF and the visual-inertial positioning based on the proposed adaptive EKF. The trajectory fitting process is illustrated in figure 11, and the corresponding circle fitting errors, including the RMS errors and the maximum errors, are listed in table 3.
It can be seen from the experimental results that the HMMD provides a high static positioning accuracy of 0.681 mm. The inertial information effectively improves the dynamic performance of the positioning system and compensates for the large errors of the visual measurements, especially in the fast movement situations. In the test with an angular velocity of 40° s⁻¹, the maximum fitting error is reduced by 32% after adding the inertial measurements. Compared with the common EKF, the proposed adaptive EKF further reduces the fluctuation of the dynamic measurement and improves the robustness of the visual-inertial positioning. In the fast movement situation, the positioning accuracy of the adaptive EKF is about 9% higher than that of the common EKF.
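The distinction between the common and the adaptive EKF can be illustrated with a single measurement update in which the measurement noise covariance R is inflated according to the current visual positioning uncertainty. The scalar inflation factor below is an illustrative stand-in for the paper's adaptation rule, which is derived from the visual positioning uncertainty model.

```python
import numpy as np

def adaptive_ekf_update(x, P, z, h, H, R0, uncertainty_scale):
    """One EKF measurement update with an adapted noise covariance.
    uncertainty_scale inflates R0 when the visual pose is uncertain
    (e.g. few landmarks, fast motion); scale = 1 gives the common EKF.
    Illustrative sketch, not the paper's exact adaptation law."""
    R = R0 * uncertainty_scale            # adapt R to visual uncertainty
    y = z - h(x)                          # innovation
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

With a large inflation factor the gain shrinks, so an unreliable visual pose pulls the state estimate less, which is why the adaptive filter fluctuates less than the common EKF during fast movements.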

On-site application in ship pipe unit assembly
The above experiments are conducted in the laboratory, and the working environments are relatively simple. In order to demonstrate the validity and applicability of the proposed positioning method in complex environments, an on-site application is carried out in the ship pipe unit assembly, as shown in figure 12. Pipe unit assembly is an important process of ship outfitting, and its task is to assemble outfitting parts such as pipes, valves and accessories into a pipe unit in the workshop. The assembled pipe unit is then hoisted to the designated position in the ship cabin for installation, and it is necessary to conduct on-site measurement of the assembly accuracy to ensure that the pipes are smoothly closed. The pipe units are densely placed at the assembly site, and there are interferences such as vibration and glare, which makes the working environment very complex. Hence, the human-machine integrated coordinate measurement at the pipe unit assembly site can effectively test the feasibility of the proposed positioning method.
The landmarks are quickly deployed at the assembly site with the built-in magnetic bases, and their 3D coordinates are calibrated using the V-STARS photogrammetry system. The operator wears the HMMD and performs coordinate measurements with the handheld probe. As shown in figure 13, the handheld probe has six infrared LEDs, and the coordinate of the probe tip can be calculated when the measuring camera of the HMMD captures these LEDs.
In the experiment, a laser tracker is used to evaluate the accuracy of the human-machine integrated coordinate measurement. Ten SMR target holders are placed at the assembly site, and their coordinates are measured with the laser tracker as reference values. Then another set of measurements is obtained using the HMMD and the handheld probe. With the hemispherical target shown in figure 13, the probe directly measures the coordinate of the target's center. After the best-fit transformation, the human-machine integrated coordinate measurement errors are listed in table 4. It can be seen from the results that the human-machine integrated coordinate measurement gives an RMS error of 1.329 mm, which also reflects the high accuracy of the proposed visual-inertial positioning method. Furthermore, the operator moves flexibly without being interrupted by occlusions or fast movements, which demonstrates the applicability of the positioning method in complex environments.
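The best-fit transformation used to compare the two point sets can be sketched with the standard Kabsch/SVD rigid alignment: find the rotation and translation that best map the HMMD measurements onto the laser-tracker references, then compute the RMS of the residuals. This is a generic sketch of the evaluation step, since the paper does not detail its fitting procedure.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Rigid best-fit alignment (Kabsch/SVD) of (N, 3) point sets.
    Returns R, t with dst ~= src @ R.T + t, and the post-fit RMS error."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)        # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # reflection guard keeps R a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    aligned = src @ R.T + t
    rms = np.sqrt(np.mean(np.sum((aligned - dst)**2, axis=1)))
    return R, t, rms
```

The reported 1.329 mm corresponds to the rms value of this alignment over the ten SMR target positions.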

Conclusion
In this paper, an accurate and robust visual-inertial positioning method is proposed for the human-machine integrated coordinate measurement. The inertial information improves the dynamic performance of the positioning system and avoids the misidentification of uncoded landmarks. Compared with the common EKF, the proposed adaptive EKF obtains more accurate state estimates by adjusting the measurement noise covariance matrix based on the visual positioning uncertainty. The visual-inertial initialization procedure can be accomplished by smoothly moving the HMMD over a distance, and the visual initialization, consisting of the relative orientation, triangulation and NDT algorithms, is presented for the initial registration of landmarks. The high static positioning accuracy and dynamic positioning robustness are demonstrated by the experiments.
However, when the image features are missing for a long time due to occlusion, the inertial measurement error accumulates rapidly, resulting in positioning failure, and the frequent re-initialization affects the work efficiency and user experience. Future research will focus on the natural features in the working environment and will improve the robustness of the positioning method by establishing a multi-source information fusion model based on artificial control points, natural features and inertial information.
Furthermore, the proposed positioning method requires a laser tracker or photogrammetry system to build the control field, which limits the work efficiency. An important future task is to learn from the photogrammetry system and design coded targets and reference rulers, so that the landmarks can be globally calibrated with the proposed positioning system itself.

Figure 1. The human-machine integrated coordinate measurement method.

Figure 4. The initialization procedure of the visual-inertial positioning system.

Figure 6. Two different LED landmarks for the visual-inertial positioning method.

Figure 7. The scenario of the landmark identification experiment.

Figure 8. A typical example of the landmark identification, showing the differences between the actual image points and the predicted image points.

Figure 9. The angular velocities of the operator's head.

Figure 10. The scenario of the circular trajectory experiments.

Figure 11. The circle fitting process for the trajectory of the world frame origin in the HMMD frame under the slow movement (10° s⁻¹), using the visual-inertial positioning based on the proposed adaptive EKF.

Figure 12. The scenario of the on-site application in the ship pipe unit assembly.

Figure 13. The handheld probe and hemispherical target for the human-machine integrated coordinate measurement.

Table 2. The landmark identification results with different positioning methods.

Table 3. The circle fitting errors of the world frame origin in the HMMD frame at different angular velocities (0° s⁻¹, 10° s⁻¹ and 40° s⁻¹), based on different positioning approaches.

Table 4. The human-machine integrated coordinate measurement errors.