This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

This paper presents a novel approach for estimating the ego-motion of a vehicle in dynamic and unknown environments using tightly-coupled inertial and visual sensors. To improve accuracy and robustness, we exploit a combination of point and line features to aid navigation. The mathematical framework is based on trifocal geometry among image triplets, which is simple and unified for point and line features. For the fusion algorithm, we employ the Extended Kalman Filter (EKF) for error-state prediction and covariance propagation, and the Sigma Point Kalman Filter (SPKF) for robust measurement updating in the presence of strong nonlinearities. Outdoor and indoor experiments show that the combination of point and line features improves estimation accuracy and robustness compared with an algorithm using point features alone.

Reliable navigation in dynamic and unknown environments is a key requirement for many applications, particularly for autonomous ground, underwater and aerial vehicles. The most common sensor modality used to tackle this problem is the Inertial Measurement Unit (IMU). However, inertial navigation systems (INS) are known to drift over time due to error accumulation [

The simplest fusion scheme for a vision-aided inertial navigation system (VINS) uses separate INS and visual blocks, and fuses information in a loosely-coupled approach [

Vision-aided inertial navigation has been extensively studied, and a considerable amount of work has been dedicated to processing visual observations of point features [

On the other hand, points remain crucial, as they provide more information than lines. For instance, line correspondences between two views impose no pose constraints, whereas point correspondences between two views are governed by the well-known epipolar geometry constraints [

In this paper, we propose a method that combines point and line features for navigation aiding in a simple and unified framework. Our algorithm can handle any mixed combination of point and line correspondences by exploiting trifocal geometry across two stereo views. In the implementation, the inertial sensors are tightly coupled with feature tracking to improve robustness and tracking speed. Meanwhile, the drift of the inertial sensors is greatly reduced by the constraints imposed by the tracked features. Leveraging the complementary characteristics of the inertial and visual sensors, as well as those of point and line features, the proposed algorithm demonstrates improved performance and robustness.

The remainder of this paper is organized as follows: we describe the mathematical model of the VINS in

We denote scalars in italic lower case letters (e.g.,

In the remaining Sections, unit quaternions are also used to describe the relative orientation of two reference frames, e.g.,

Finally, to represent projective geometry, it is simpler and more symmetric to use homogeneous coordinates, which provide a scale-invariant representation for points and lines in the Euclidean plane. In this paper, vectors in homogeneous coordinate form are expressed by an underline, e.g.,
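As a small illustration (with hypothetical coordinates, using NumPy), the cross product yields both the line through two homogeneous points and, dually, the intersection of two homogeneous lines, and scaling a homogeneous vector leaves the represented entity unchanged:

```python
import numpy as np

def homogeneous(p):
    """Lift a 2D Euclidean point into homogeneous coordinates."""
    return np.append(np.asarray(p, dtype=float), 1.0)

p1, p2 = homogeneous([0.0, 0.0]), homogeneous([2.0, 2.0])
line = np.cross(p1, p2)               # line through both points (here: y = x)
q = homogeneous([5.0, 5.0])
on_line = np.isclose(line @ q, 0.0)   # incidence test: l^T q = 0

# Scale invariance: q and 3*q represent the same Euclidean point.
same_point = np.allclose(q[:2] / q[2], (3 * q)[:2] / (3 * q)[2])
```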

The evolving IMU state is described by the vector:

In this work, we model the biases

In this Section, we consider the standard perspective camera model, which is commonly used in computer vision applications. Let

A trifocal tensor is a 3 × 3 × 3 array of numbers that describes the geometric relations among three views. It depends only on the relative motion between the views and is independent of the scene structure. Assuming that the camera matrices of the three views are written in the canonical form $P_1 = [\mathbf{I} \mid \mathbf{0}]$, $P_2 = [\mathbf{A} \mid \mathbf{a}_4]$ and $P_3 = [\mathbf{B} \mid \mathbf{b}_4]$, the three 3 × 3 slices of the tensor are

$$\mathbf{T}_i = \mathbf{a}_i \mathbf{b}_4^{T} - \mathbf{a}_4 \mathbf{b}_i^{T}, \qquad i = 1, 2, 3,$$

where $\mathbf{a}_i$ and $\mathbf{b}_i$ denote the $i$-th columns of $P_2$ and $P_3$, respectively.

Once the trifocal tensor is computed, we can use it to transfer a pair of matched points

Compute the epipolar line

Compute the line

The transferred point is

Similarly, it is possible to transfer a pair of matched lines
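Both transfer operations can be checked numerically. The sketch below is a minimal illustration (not the paper's implementation): it builds the trifocal tensor from two hypothetical random camera matrices in canonical form ($P_1 = [I \mid 0]$, $P_2 = [A \mid a_4]$, $P_3 = [B \mid b_4]$), transfers a matched point pair from views 1 and 2 to view 3, and transfers a matched line pair from views 2 and 3 back to view 1:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical camera geometry in canonical form; P1 = [I|0] is implicit.
P2 = rng.standard_normal((3, 4))   # P2 = [A | a4]
P3 = rng.standard_normal((3, 4))   # P3 = [B | b4]
A, a4 = P2[:, :3], P2[:, 3]
B, b4 = P3[:, :3], P3[:, 3]

# Trifocal tensor slices T_i = a_i b4^T - a4 b_i^T.
T = np.stack([np.outer(A[:, i], b4) - np.outer(a4, B[:, i]) for i in range(3)])

# A 3D point and its projections in the three views.
X = np.append(rng.standard_normal(3), 1.0)
x1, x2, x3 = X[:3], P2 @ X, P3 @ X

# Point transfer to view 3: choose any line l' through x2 (other than the
# epipolar line), then x3_k = sum_{i,j} x1_i l'_j T_i^{jk}.
lp = np.cross(x2, rng.standard_normal(3))     # a generic line through x2
x3_t = sum(x1[i] * (T[i].T @ lp) for i in range(3))

# Line transfer to view 1: l1_i = l2^T T_i l3 for a matched line pair.
Y = np.append(rng.standard_normal(3), 1.0)    # a second 3D point on the line
l2 = np.cross(P2 @ X, P2 @ Y)                 # projected line in view 2
l3 = np.cross(P3 @ X, P3 @ Y)                 # projected line in view 3
l1_t = np.array([l2 @ T[i] @ l3 for i in range(3)])
l1 = np.cross(X[:3], Y[:3])                   # true projected line in view 1

# Both transfers agree with the true projections up to scale.
```

Because homogeneous quantities are defined only up to scale, the transferred point and line should be compared with the true projections by checking parallelism, not equality.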

In this Section, we exploit the trifocal geometry of stereo vision to deduce the measurement model. We depict the stereo camera configuration of two consecutive frames in

For simplicity, we assume that the IMU frame of reference coincides with the camera frame of reference. Thus, the terms in Equation (12) can be expressed as follows:

Two trifocal tensors,

From the corresponding point set

For line measurements, we also need a formulation to compare the transferred lines with the measured lines. Because of the aperture problem [
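The paper's exact error definition is truncated in this excerpt; a common formulation, sketched below as an assumption, compares a detected line segment with the transferred infinite line through the perpendicular distances of the segment's endpoints. This is consistent with the aperture problem: displacement along the line is unobservable, so only the orthogonal components are penalized.

```python
import numpy as np

def line_residual(l_transferred, p_start, p_end):
    """Signed perpendicular distances of the measured segment's two
    endpoints (homogeneous 3-vectors) to the transferred infinite line.
    Normalizing by the line normal's length expresses the residual in
    pixels; motion along the line contributes nothing (aperture problem)."""
    l = np.asarray(l_transferred, dtype=float)
    n = np.linalg.norm(l[:2])
    return np.array([l @ p_start, l @ p_end]) / n

# Example: the line x - y = 0, with one endpoint on and one off the line.
l = np.array([1.0, -1.0, 0.0])
r = line_residual(l, np.array([1.0, 1.0, 1.0]), np.array([2.0, 0.0, 1.0]))
```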

Similarly, the line observation function concerning the first, second, and fourth views is defined as:

As we process the point and line measurements in a unified manner after defining the corresponding error, we define the observation model in a single function:

As can be seen in the previous Section, the measurement models are implicit relative-pose measurements, which relate the system state at two different time instants (

The standard additive error definition is used for the position, velocity and biases, while for the orientation error
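As a small illustration of the multiplicative orientation error (the scalar-first Hamilton quaternion convention below is an assumption, since the paper's convention is not shown in this excerpt), the true attitude is modeled as a small error quaternion composed with the nominal one:

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product, scalar-first convention (assumed here)."""
    pw, pv = p[0], p[1:]
    qw, qv = q[0], q[1:]
    return np.concatenate((
        [pw * qw - pv @ qv],
        pw * qv + qw * pv + np.cross(pv, qv)))

def small_angle_quat(dtheta):
    """First-order error quaternion for a small rotation vector dtheta."""
    return np.concatenate(([1.0], 0.5 * np.asarray(dtheta)))

q_nom = np.array([np.cos(0.3), np.sin(0.3), 0.0, 0.0])  # nominal attitude
dtheta = np.array([0.0, 1e-3, 0.0])                     # small orientation error
q_true = quat_mul(small_angle_quat(dtheta), q_nom)      # multiplicative error

# Recovering the error: since q_nom is a unit quaternion, its inverse is
# its conjugate, and dq = q_true * conj(q_nom) gives back dtheta/2.
q_conj = np.concatenate((q_nom[:1], -q_nom[1:]))
dq = quat_mul(q_true, q_conj)
```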

The continuous-time IMU error-state model may be given as a single matrix error equation:

Since the past pose is unchanged during the filter prediction step, its corresponding derivatives are zero:

Combining Equations (29) and (34), the continuous-time augmented error state equation is given by:

Each time a new IMU measurement is received, the nominal state prediction is performed by numerical integration of the kinematic Equations (3) and (33). In order to obtain the error covariance, we compute the discrete-time state transition matrix:

The elements of

The noise covariance matrix

The predicted covariance is then obtained as:
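For illustration, the sketch below uses a toy two-state error model (position error driven by velocity error) in place of the full INS error-state matrix; the discretization and covariance-prediction pattern is the same:

```python
import numpy as np

dt = 0.01  # IMU sampling interval (100 Hz)

# Toy continuous-time error dynamics: position error driven by velocity
# error, standing in for the full INS error-state matrix F.
F = np.array([[0.0, 1.0],
              [0.0, 0.0]])

# For this nilpotent F the matrix-exponential series terminates, so the
# second-order truncation of Phi = expm(F*dt) is exact.
Fd = F * dt
Phi = np.eye(2) + Fd + 0.5 * (Fd @ Fd)

Qc = np.diag([0.0, 1e-4])      # continuous-time process noise density
Qd = Phi @ Qc @ Phi.T * dt     # a simple discretization of the noise

P = np.eye(2) * 1e-2           # current error covariance
P_pred = Phi @ P @ Phi.T + Qd  # predicted covariance
```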

Since the measurement model is highly nonlinear, we employ statistical linearization for measurement updating, which is generally more accurate than the first order Taylor series expansion [

The predicted measurement vector is determined by propagating each sigma point through the nonlinear observation function

The mean and covariance are computed as:

The filter gain is given as follows:

Then, the error state and error covariance are updated using the normal Kalman filter equation:
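The update steps above can be sketched generically. The code below is an illustration of a sigma-point measurement update with standard unscented-transform weights, not the paper's exact implementation; the observation function, tuning parameters, and sanity-check values are placeholders. For a linear observation function, the sigma-point update reproduces the linear Kalman update exactly, which makes it easy to verify:

```python
import numpy as np

def ut_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    """Standard unscented-transform weights (tuning values are placeholders)."""
    lam = alpha**2 * (n + kappa) - n
    wm = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    return lam, wm, wc

def sigma_points(x, P, lam):
    """2n+1 sigma points from the state mean and covariance."""
    n = len(x)
    S = np.linalg.cholesky((n + lam) * P)
    pts = [x] + [x + S[:, i] for i in range(n)] + [x - S[:, i] for i in range(n)]
    return np.array(pts)

def spkf_update(x, P, z, h, R):
    """Statistically linearized (sigma point) measurement update."""
    n = len(x)
    lam, wm, wc = ut_weights(n)
    X = sigma_points(x, P, lam)
    Z = np.array([h(xi) for xi in X])        # propagate each sigma point
    z_pred = wm @ Z                          # predicted measurement mean
    dZ = Z - z_pred
    dX = X - x
    Pzz = (wc[:, None] * dZ).T @ dZ + R      # innovation covariance
    Pxz = (wc[:, None] * dX).T @ dZ          # state-measurement cross covariance
    K = Pxz @ np.linalg.inv(Pzz)             # filter gain
    return x + K @ (z - z_pred), P - K @ Pzz @ K.T

# Sanity check with a linear h: the result must match the linear KF.
H = np.array([[1.0, 0.0]])
x0, P0 = np.zeros(2), np.eye(2)
R = np.array([[0.5]])
z = np.array([1.0])
x1, P1 = spkf_update(x0, P0, z, lambda x: H @ x, R)
```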

After measurement update, the estimated state

Finally, the old state is replaced by the current state, and the corresponding error covariance is revised accordingly:

We evaluate the proposed method using the publicly available KITTI Vision Benchmark Suite [

The announced gyroscope and accelerometer bias specifications are 36 deg/h (1

For point features, the Features from Accelerated Segment Test (FAST) corner detector [

We extract line segments using the EDLines detector [

In order to reject mismatched features and features located on independently moving objects (e.g., the running car), we employ a chi-square test [
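A minimal sketch of such a gate is shown below (illustrative only; the threshold is the 95th percentile of the chi-square distribution with 2 degrees of freedom, appropriate for a 2D image residual):

```python
import numpy as np

def chi_square_gate(residual, S, threshold=5.991):
    """Mahalanobis gating of a measurement residual.
    threshold = 95th percentile of the chi-square distribution with 2
    degrees of freedom. Mismatched features and features on independently
    moving objects produce residuals inconsistent with the predicted
    covariance S, so their Mahalanobis distance exceeds the gate."""
    d2 = residual @ np.linalg.solve(S, residual)  # squared Mahalanobis distance
    return d2 <= threshold

S = np.eye(2) * 4.0                               # predicted residual covariance
ok = chi_square_gate(np.array([1.0, 1.0]), S)     # d^2 = 0.5  -> accepted
bad = chi_square_gate(np.array([10.0, 0.0]), S)   # d^2 = 25.0 -> rejected
```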

Sample image with extracted point (red) and line (green) features.

In this Section, we compare the performance of our algorithm with the following methods: (1) the GPS/IMU localization unit, with open-sky localization errors of less than 5 cm; (2) VINS using only point features; (3) inertial-only navigation; and (4) pure stereo visual odometry [

The trajectory estimation results of different algorithms with the ground truth data are shown in

The motion trajectory plot on Google Maps. The initial position is denoted by a red square.

3D Position Errors of different solutions.

The overall RMSE of the outdoor experiment.

Methods | Position RMSE (m) | Orientation RMSE (deg) |
---|---|---|
VINS (points and lines) | 10.6338 | 0.8313 |
VINS (points only) | 16.4150 | 0.9126 |
Pure INS | 2149.9 | 2.0034 |
Pure stereo odometry | 72.6399 | 8.1809 |

We demonstrate the velocity and attitude deviations of the proposed method with the corresponding 3

The velocity estimation errors and 3

The attitude estimation errors and 3

Finally, the estimates of the gyroscope and accelerometer biases are depicted in

Estimated gyroscope and accelerometer bias.

To demonstrate the robustness of our algorithm in a textureless structured environment, we perform indoor experiments in a corridor scenario with textureless walls, which lead to very few tracked points in some frames. The test rig consists of a PointGrey Bumblebee2 stereo pair, an Xsens MTi unit, and a laptop for data recording (

The accuracy specifications and sampling rates of the sensors.

Sensors | Accuracies | Sampling Rates |
---|---|---|
IMU | Gyro bias stability (1 ^{2} | 100 Hz |
Stereo Camera | Resolution: 640 × 480 pixels; Focal length: 3.8 mm; Field of view: 70°; Baseline: 12 cm | 12 Hz |

In

Performance in low-textured indoor environment: (

This paper presents a tightly-coupled vision-aided inertial navigation algorithm that exploits point and line features to aid navigation in a simple and unified framework. The measurement models of the point and line features are derived and incorporated into a single estimator. The outdoor experimental results show that the proposed algorithm performs well in cluttered urban environments: the overall RMSE of position and orientation is about 10.6 m and 0.83°, respectively, over a path of about 4 km in length. The indoor experiment demonstrates the improved performance and robustness obtained by combining point and line features in textureless structured environments. The proposed approach, which combines both feature types, can handle different types of environments with only a slight increase in computational cost.

As part of future work, we aim to improve the proposed approach by taking advantage of the structural regularity of man-made environments, such as Manhattan-world scenes,

This work was supported by Research Fund for the Doctoral Program of Higher Education of China (Grant No. 2012.4307.110006) and the New Century Excellent Talents in University of China (Grant No. NCET-07-0225).

Xianglong Kong, Wenqi Wu and Lilian Zhang conceived the idea and designed the experiments; Xianglong Kong and Yujie Wang performed the experiments; Xianglong Kong analyzed the data and wrote the paper.

The authors declare no conflict of interest.