The first step in building an autonomous vehicle: 3D object tracking and sensor fusion


Some courier companies have begun piloting package delivery by drone or unmanned ground vehicle; some automakers are developing vehicles that can drive themselves and have already achieved promising results; even some consumer electronic devices now come equipped with laser ranging sensors such as LiDAR. The futuristic life we have long been promised seems within reach!

But aren't you curious how all of this works? Most importantly, how do these autonomous vehicles make sense of their surroundings?

It all starts with 3D object tracking. This is a huge topic, so this article focuses on how to use the 3D point cloud labeling capability of Amazon SageMaker Ground Truth: we convert 3D point cloud data into the required format and then label it for 3D object tracking with SageMaker Ground Truth. For details on other features, see the AWS News Blog.

3D object tracking based on SageMaker Ground Truth

Autonomous vehicle (AV) manufacturers usually use LiDAR sensors to generate 3D images of the environment around the vehicle. For example, they mount a LiDAR unit on the roof of the vehicle to continuously capture point-in-time snapshots of the surrounding 3D environment. The LiDAR sensor outputs a sequence of 3D point cloud frames (a common capture rate is 10 frames per second). To build a perception system that can automatically track key objects around the vehicle (such as other vehicles and pedestrians), AV manufacturers typically first label the objects in the 3D point cloud frames manually and then use the labeled frames to train machine learning (ML) models.

When building a perception system, the most common approach is to combine inputs from multiple sensors to compensate for the limitations of any single sensor. For example, a camera can provide important contextual information, such as whether the current traffic signal is red, yellow, or green, but its perception is very limited in dark conditions. A LiDAR sensor, on the other hand, cannot provide this kind of context (such as the color of a traffic light), but it collects depth information in 360 degrees regardless of external lighting conditions.

SageMaker Ground Truth makes it easy to label objects across a sequence of 3D point cloud frames to build ML training datasets, and it supports fusing LiDAR sensor data with input from up to eight cameras. SageMaker Ground Truth requires the video frames and 3D point cloud frames to be synchronized in advance for sensor fusion. Once sensor fusion is enabled, labelers can view the 3D point cloud frames alongside the synchronized video frames. Besides providing more visual context for labeling, sensor fusion also projects any label drawn in the 3D point cloud onto the video frames, so an adjustment made in one view appears accurately in the other.

In this article, we demonstrate how to prepare 3D point cloud data and synchronized video data for SageMaker Ground Truth. We start with the KITTI Vision Benchmark Suite, a popular existing autonomous driving dataset*. In addition to video data, the dataset contains 3D point cloud data generated by a Velodyne LiDAR sensor. The steps are as follows:

  • Review the input data format and requirements of SageMaker Ground Truth 3D point cloud labeling jobs.
  • Convert the KITTI dataset from its local coordinate system to a world coordinate system.
  • Associate video data with the synchronized LiDAR data for sensor fusion.
  • Prepare a manifest file as input for SageMaker Ground Truth.
  • Create a labeling job for 3D point cloud object tracking across a sequence of frames.
  • Use the assistive labeling tools in the worker labeling UI.

To complete this walkthrough, use a notebook instance and open the 3D point cloud input data processing notebook (.ipynb) found under Ground Truth Labeling Jobs in the Amazon SageMaker Examples tab. The notebook is also available on GitHub.

Input data for 3D point cloud labeling jobs

In this section, we introduce the concepts behind Ground Truth input data and the basic requirements for 3D point cloud labeling.

3D point cloud frames and 3D point cloud sequences

A point cloud frame is the collection of all 3D points in a scene at a given moment in time. Each point is described by three coordinates: x, y, and z. A point can also carry additional attributes, such as an intensity value i or 8-bit values for the red (r), green (g), and blue (b) color channels. All position coordinates (x, y, z) are in meters. When you create a Ground Truth 3D point cloud labeling job, you can use point cloud data in ASCII or binary format.
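As a concrete illustration, the sketch below (with made-up coordinate values) writes a few points to an ASCII frame file, one point per line as x, y, z followed by an optional intensity, matching the format written later in this walkthrough:

import csv

# A minimal sketch with illustrative values: each line of an ASCII frame file
# holds one point as "x y z i", with coordinates in meters and an optional intensity.
points = [
    (12.531, -3.207, 0.842, 0.29),
    (12.540, -3.190, 0.845, 0.31),
    (12.548, -3.174, 0.848, 0.27),
]
with open("0.txt", "w") as out:
    writer = csv.writer(out, delimiter=' ')
    for x, y, z, i in points:
        writer.writerow((x, y, z, i))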

A sequence is a time-ordered collection of 3D point cloud frames captured while the LiDAR is moving (for example, a LiDAR mounted on top of a vehicle). SageMaker Ground Truth defines the sequence file format as an ordered list of frames, where each frame is associated with a timestamp.

In this article, we demonstrate how to create a SageMaker Ground Truth sequence file from the KITTI dataset and use it as the basis for a 3D object tracking labeling job.

World coordinate system

In object tracking, you track the motion of objects (such as a pedestrian on the sidewalk) while your reference point (the autonomous vehicle) is itself moving. The reference point is equipped with sensors to perceive its surroundings, which is the typical setup for an autonomous vehicle.

When performing object tracking, we need a world coordinate system (also called a global frame of reference), because the autonomous vehicle itself moves through the world. In general, SageMaker Ground Truth requires the point cloud data to be converted to the reference coordinate system of your choice. You can usually perform this conversion by multiplying each point in a 3D frame by the extrinsic matrix of the LiDAR sensor. A sensor's extrinsic matrix is a homogeneous transformation matrix that converts data from the sensor's own coordinate system to the world coordinate system. A homogeneous transformation combines rotations about the three axes (x, y, and z) with a translation of the origin; the rotation matrix is the 3x3 matrix that encodes these combined rotations.
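As a minimal sketch (with made-up rotation and translation values), the snippet below assembles a 4x4 homogeneous transform from a 3x3 rotation R and a translation t, and applies it to one point expressed in homogeneous coordinates:

import numpy as np

# Illustrative values only: a 90-degree rotation about the z-axis and a small translation.
R = np.array([[0, -1, 0],
              [1,  0, 0],
              [0,  0, 1]], dtype=np.float64)
t = np.array([1.0, 2.0, 0.5])

# Assemble the 4x4 homogeneous (extrinsic) transform [R t; 0 0 0 1]
T = np.eye(4)
T[0:3, 0:3] = R
T[0:3, 3] = t

# A point in the sensor's own coordinate system, in homogeneous coordinates
p_sensor = np.array([3.0, 0.0, 0.0, 1.0])
p_world = T @ p_sensor
print(p_world[:3])  # the same point expressed in the world coordinate system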

This article shows how to use the extrinsic matrix to convert KITTI 3D frames from the local coordinate system to a world coordinate system. The KITTI dataset provides an extrinsic matrix for each 3D point cloud frame. We can derive this extrinsic matrix from the GPS data of the autonomous vehicle and use the NumPy matrix multiplication function to multiply the matrix with each point in a frame, converting it into the world coordinate system used by the KITTI dataset.

With a custom setup, you can also compute the extrinsic transformation matrix from the relative position and orientation (latitude/longitude/altitude/roll/pitch/yaw) between the GPS/IMU and the LiDAR sensor on the autonomous vehicle. For example, you can compute the vehicle pose from the raw KITTI data with pose = convertOxtsToPose(oxts), which converts the OxTS data into local Euclidean poses specified by 4x4 rigid transformation matrices. This pose transformation matrix can then be converted into a global reference frame using the reference frame transformation matrix of the world coordinate system.

Sensor fusion

The LiDAR sensor and each camera have their own extrinsic matrices, and SageMaker Ground Truth uses them to implement sensor fusion. To project a label from the 3D point cloud onto a camera image plane, SageMaker Ground Truth converts 3D points from the LiDAR coordinate system to the camera coordinate system. This is typically done by first converting the 3D points from the LiDAR's own coordinates to the world coordinate system using the LiDAR extrinsic matrix, and then converting those world-coordinate 3D points to the camera image plane using the inverse of the camera extrinsic matrix (world to camera). If the 3D point cloud data has already been converted to the world coordinate system, the first conversion is not needed, and the 3D-to-2D conversion only requires the camera extrinsic matrix.

If you have a rotation matrix (made up of the axis rotations) and a translation vector (or origin) in the world coordinate system instead of a single extrinsic transformation matrix, you can use the rotation and translation directly to compute the pose. See the following code:

import numpy as np
from scipy.spatial.transform import Rotation as R

rotation = [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03],
            [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01]]
origin = [1.71104606e+00, 5.80000039e-01, 9.43144935e-01]

# position is the origin
position = origin
r = R.from_matrix(np.asarray(rotation))
# heading in WCS as a quaternion (qx, qy, qz, qw) using scipy
heading = r.as_quat()

If you have a 4x4 extrinsic matrix of the form [R t; 0 0 0 1], where R is the rotation matrix and t is the origin translation vector, you can extract the rotation matrix and translation vector from it as follows:

import numpy as np
from scipy.spatial.transform import Rotation as R

transformation = [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03, 1.71104606e+00],
                  [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02, 5.80000039e-01],
                  [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01, 9.43144935e-01],
                  [ 0,               0,               0,              1]]
transformation = np.array(transformation)
rotation = transformation[0:3, 0:3]
translation = transformation[0:3, 3]
# position is the origin translation
position = translation
r = R.from_matrix(np.asarray(rotation))
# heading in WCS as a quaternion (qx, qy, qz, qw) using scipy
heading = r.as_quat()
print(f"position: {position}\nheading: {heading}")

In sensor fusion, you provide each sensor extrinsic in the form of a sensor pose: an origin position (the translation) and a heading given as a quaternion (the rotation about the x, y, and z axes). The following example shows the JSON used in the input manifest file:

{
 "position": {
 "y": -152.77584902657554,
 "x": 311.21505956090624,
 "z": -10.854137529636024
 },
 "heading": {
 "qy": -0.7046155108831117,
 "qx": 0.034278837280808494,
 "qz": 0.7070617895701465,
 "qw": -0.04904659893885366
 }
}

All position coordinates (x, y, z) are in meters, and all headings (qx, qy, qz, qw) are expressed as quaternions describing the orientation in space. For each camera, you provide pose data extracted from that camera's extrinsic matrix.
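For reference, here is a short sketch of how the position and heading computed in the earlier snippet could be packed into this JSON structure (SciPy's as_quat returns the quaternion in x, y, z, w order):

pose = {
    "position": {"x": float(position[0]), "y": float(position[1]), "z": float(position[2])},
    "heading": {"qx": float(heading[0]), "qy": float(heading[1]),
                "qz": float(heading[2]), "qw": float(heading[3])}
}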

Camera calibration: intrinsics and distortion

Geometric camera calibration (also known as camera resectioning) estimates the parameters of a camera's lens and image sensor. You can use these parameters to correct lens distortion, measure the real-world size of an object, or determine the position of the camera in the scene. In addition, if the camera images are distorted, you can provide extra calibration parameters (intrinsics and distortion coefficients) to correct the images.
Camera parameters include the intrinsic matrix parameters, the extrinsic matrix parameters, and the distortion coefficients; see the documentation to learn more. The parameters are the following:

# Intrinsic parameters
fx (float) - focal length in x direction.
fy (float) - focal length in y direction.
cx (float) - x coordinate of principal point.
cy (float) - y coordinate of principal point.
# Radial distortion parameters
k1 (float) - Radial distortion coefficient.
k2 (float) - Radial distortion coefficient.
k3 (float) - Radial distortion coefficient.
k4 (float) - Radial distortion coefficient.
# Tangential distortion parameters
p1 (float) - Tangential distortion coefficient.
p2 (float) - Tangential distortion coefficient.

The camera intrinsic transformation is defined by the following formula, where * denotes matrix multiplication.

|x|   |f_x,   0, c_x|   |X|
|y| = |  0, f_y, c_y| * |Y|
|z|   |  0,   0,   1|   |Z|
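To make the formula concrete, here is a small sketch (with illustrative intrinsic values, not taken from KITTI) that projects a 3D point expressed in the camera coordinate system to pixel coordinates, ignoring distortion:

import numpy as np

def project_to_image(point_cam, fx, fy, cx, cy):
    # Pinhole projection: apply the intrinsic matrix to a camera-frame point (z > 0),
    # then divide by depth to obtain pixel coordinates. Distortion is ignored here.
    X, Y, Z = point_cam
    K = np.array([[fx, 0,  cx],
                  [0,  fy, cy],
                  [0,  0,  1]])
    x, y, z = K @ np.array([X, Y, Z])
    return x / z, y / z  # pixel coordinates (u, v)

# Illustrative intrinsics only
u, v = project_to_image((2.0, 0.5, 10.0), fx=700.0, fy=700.0, cx=640.0, cy=360.0)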

Input manifest file

Ground Truth takes an input manifest file in which each line describes a unit of task to be completed by annotators (or by automated data labeling for some built-in task types). The format of the input manifest file depends on your task type:

  • 3D point cloud object detection or semantic segmentation labeling job - each line in the input manifest file contains information about a single 3D point cloud frame and its sensor fusion data. This manifest file is called a 3D point cloud frame input manifest.
  • 3D point cloud object tracking labeling job - each line in the input manifest file contains a link to a sequence file, which defines a series of 3D point cloud frames and their associated sensor fusion data. Each such series is called a 3D point cloud sequence, and this manifest is called a point cloud sequence input manifest.

In this walkthrough, we create a point cloud sequence input manifest file; a sketch of its line format follows below. You can also modify the solution to create a point cloud frame input manifest instead. More details are provided in the sample notebook.
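For orientation, a point cloud sequence input manifest is typically just one JSON object per line, each pointing to a sequence file in Amazon S3. A sketch (with a placeholder bucket and path) might look like the following:

{"source-ref": "s3://my-bucket/kitti-sequences/mykitti-seq2.json"}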

Convert the KITTI dataset to a world coordinate system

You can run the code snippets in this section in the notebook.

The world coordinate system depends on the dataset. Some datasets use the LiDAR position in the first frame as the origin, and all other 3D frames in the dataset, including the vehicle's driving direction and position, are referenced to that first frame. Other datasets choose a device position different from the origin as the starting point. The KITTI dataset uses the position of its first frame as the reference for its world coordinate system.

This article shows how to convert the KITTI dataset into a global reference frame relative to the LiDAR sensor origin in the first frame, so that SageMaker Ground Truth can consume it. The raw KITTI dataset includes a rotation matrix and translation vector for each frame, which we can use to compute the extrinsic matrix of each frame. Working with the raw data can be cumbersome, so we recommend using the pykitti Python module to lower the barrier to processing the KITTI dataset.

In the dataset, data.oxts[i].T_w_imu gives the LiDAR extrinsic transform of frame i. Using the NumPy matrix multiplication function, np.matmul(lidar_extrinsic_matrix, point), we can multiply it with each point in the frame to convert the frame into the world coordinate system. In general, multiplying each point in a LiDAR frame by the LiDAR extrinsic matrix converts it into the world coordinate system. The T_w_imu convention denotes the transform from the IMU frame to the world coordinate system (hence the naming T_destinationFrame_originFrame).

The following example code illustrates how to convert a KITTI point cloud frame to a world coordinate system:

import pykitti
import numpy as np

basedir = '/Users/nameofuser/kitti-data'
date = '2011_09_26'
drive = '0079'

# The 'frames' argument is optional - default: None, which loads the whole dataset.
# Calibration, timestamps, and IMU data are read automatically.
# Camera and velodyne data are available via properties that create generators
# when accessed, or through getter methods that provide random access.
data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))

# i is the frame number
i = 0

# customer computes lidar extrinsics
lidar_extrinsic_matrix = data.oxts[i].T_w_imu

# velodyne raw point cloud in the lidar scanner's own coordinate system
points = data.get_velo(i)

# transform points from lidar to global frame using lidar_extrinsic_matrix
def generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix):
    tps = []
    for point in points:
        transformed_points = np.matmul(lidar_extrinsic_matrix, np.array([point[0], point[1], point[2], 1], dtype=np.float32).reshape(4,1)).tolist()
        if len(point) > 3 and point[3] is not None:
            tps.append([transformed_points[0][0], transformed_points[1][0], transformed_points[2][0], point[3]])
    return tps

# customer transforms points from lidar to global frame using lidar_extrinsic_matrix
transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)
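As an optional aside (not part of the original notebook), here is a vectorized sketch of the same transformation, assuming points is the N x 4 NumPy array of x, y, z, and intensity returned by data.get_velo:

def transform_points_vectorized(points, lidar_extrinsic_matrix):
    # Append a homogeneous coordinate of 1 to every point, apply the 4x4 extrinsic
    # in a single matrix product, and keep the original intensity column.
    xyz1 = np.hstack([points[:, :3], np.ones((points.shape[0], 1), dtype=np.float32)])
    transformed = (lidar_extrinsic_matrix @ xyz1.T).T[:, :3]
    return np.hstack([transformed, points[:, 3:4]])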

Associate video data with LiDAR data for sensor fusion

KITTI provides both the LiDAR extrinsic matrix and the camera extrinsic matrices. We use these matrices to extract pose data and then convert it into the JSON format required by the 3D point cloud sequence input manifest.

For the KITTI dataset, you can use the pykitti Python module to load the data. In the dataset, data.oxts[i].T_w_imu gives the LiDAR extrinsic transform of frame i. We convert it into a translation and a heading quaternion, which represent the position and orientation of the LiDAR in the JSON format. See the following code:

import numpy as np
from scipy.spatial.transform import Rotation as R

# utility to convert an extrinsic matrix into a pose (heading quaternion and position)
def convert_extrinsic_matrix_to_trans_quaternion_mat(lidar_extrinsic_transform):
    position = lidar_extrinsic_transform[0:3, 3]
    rot = np.linalg.inv(lidar_extrinsic_transform[0:3, 0:3])
    quaternion = R.from_matrix(np.asarray(rot)).as_quat()
    trans_quaternions = {
        "translation": {
            "x": position[0],
            "y": position[1],
            "z": position[2]
        },
        "rotation": {
            "qx": quaternion[0],
            "qy": quaternion[1],
            "qz": quaternion[2],
            "qw": quaternion[3]
        }
    }
    return trans_quaternions

Similarly, we can use the camera extrinsics to extract camera pose data. We compute the camera_extrinsic_transform of cam0 in frame i as inv(matmul(data.calib.T_cam0_velo, inv(data.oxts[i].T_w_imu))), and then convert it into the heading and position of cam0. See the following code:

def convert_camera_inv_extrinsic_matrix_to_trans_quaternion_mat(camera_extrinsic_transform):
    position = camera_extrinsic_transform[0:3, 3]
    rot = np.linalg.inv(camera_extrinsic_transform[0:3, 0:3])
    quaternion = R.from_matrix(np.asarray(rot)).as_quat()
    trans_quaternions = {
        "translation": {
            "x": position[0],
            "y": position[1],
            "z": position[2]
        },
        "rotation": {
            "qx": quaternion[0],
            "qy": quaternion[1],
            "qz": quaternion[2],
            "qw": -quaternion[3]
        }
    }
    return trans_quaternions

Camera calibration: internal and distortion

The KITTI dataset provides calibration parameters for each camera. For example, data.calib.K_cam0 contains the following camera intrinsic matrix:

 fx 0 cx
 0 fy cy
 0 0 1
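The fx, fy, cx, and cy values listed earlier map directly onto the entries of this matrix; here is a short sketch of how they could be read from the pykitti calibration data:

K = data.calib.K_cam0
fx, fy = K[0][0], K[1][1]   # focal lengths
cx, cy = K[0][2], K[1][2]   # principal point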

Create the input manifest file

After converting the series of frames in the KITTI dataset to the world coordinate system and extracting pose information from the LiDAR and camera extrinsic matrices, we can create a 3D point cloud sequence manifest file that includes sensor fusion data. The following function automatically creates the sequence file referenced by the sequence input manifest:

import csv
import json
import os
from datetime import timezone
from shutil import copyfile

# output_base (the local output directory) is assumed to be defined earlier in the notebook.
def convert_to_gt():
    # The 'frames' argument is optional - default: None, which loads the whole dataset.
    # Calibration, timestamps, and IMU data are read automatically.
    # Camera and velodyne data are available via properties that create generators
    # when accessed, or through getter methods that provide random access.
    data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))
    image_paths = [data.cam0_files, data.cam1_files, data.cam2_files, data.cam3_files]
    camera_extrinsic_calibrations = [data.calib.T_cam0_velo, data.calib.T_cam1_velo, data.calib.T_cam2_velo, data.calib.T_cam3_velo]
    camera_intrinsics = [data.calib.K_cam0, data.calib.K_cam1, data.calib.K_cam2, data.calib.K_cam3]
    seq_json = {}
    seq_json["seq-no"] = 1
    seq_json["prefix"] = 's3://pdx-groundtruth-lidar-test-bucket/pdx-groundtruth-sequences/kittiseq2/frames/'
    seq_json["number-of-frames"] = len(data)
    seq_json["frames"] = []
    default_position = {"x": 0, "y": 0, "z": 0}
    default_heading = {"qx": 0, "qy": 0, "qz": 0, "qw": 1}
    for i in range(len(data)):
        # customer computes lidar extrinsics
        lidar_extrinsic_matrix = data.oxts[i].T_w_imu
        # velodyne raw point cloud in the lidar scanner's own coordinate system
        points = data.get_velo(i)
        # customer transforms points from lidar to global frame using lidar_extrinsic_matrix
        transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)
        # Customer computes rotation quaternion and translation from LiDAR extrinsic
        trans_quaternions = convert_extrinsic_matrix_to_trans_quaternion_mat(lidar_extrinsic_matrix)
        # Customer uses trans_quaternions to populate the GT ego vehicle pose
        ego_vehicle_pose = {}
        ego_vehicle_pose['heading'] = trans_quaternions['rotation']
        ego_vehicle_pose['position'] = trans_quaternions['translation']
        # open file to write the transformed pcl
        with open(output_base+"/"+str(i)+'.txt', "w") as out:
            writer = csv.writer(out, delimiter=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for point in transformed_pcl:
                writer.writerow((point[0], point[1], point[2], point[3]))
        frame = {}
        frame["frame-no"] = i
        frame["frame"] = str(i)+'.txt'
        frame["unix-timestamp"] = data.timestamps[i].replace(tzinfo=timezone.utc).timestamp()
        frame["ego-vehicle-pose"] = ego_vehicle_pose
        images = []
        image_dir_path = os.path.join(output_base, 'images')
        if not os.path.exists(image_dir_path):
            os.makedirs(image_dir_path)
        for j in range(len(image_paths)):
            # copy image
            image_path = image_paths[j][i]
            image_suffix_path = 'images/frame_'+str(i)+'_camera_'+str(j)+'.jpg'
            copyfile(image_path, os.path.join(output_base, image_suffix_path))
            # If the customer has the camera extrinsics then they use them to compute the camera transform
            camera_transform = np.linalg.inv(np.matmul(camera_extrinsic_calibrations[j], np.linalg.inv(data.oxts[i].T_w_imu)))
            # Customer computes rotation quaternion and translation from the camera inverse extrinsic
            cam_trans_quaternions = convert_camera_inv_extrinsic_matrix_to_trans_quaternion_mat(camera_transform)
            image_json = {}
            image_json["image-path"] = image_suffix_path
            image_json["unix-timestamp"] = frame["unix-timestamp"]
            image_json['heading'] = cam_trans_quaternions['rotation']
            image_json['position'] = cam_trans_quaternions['translation']
            image_json['fx'] = camera_intrinsics[j][0][0]
            image_json['fy'] = camera_intrinsics[j][1][1]
            image_json['cx'] = camera_intrinsics[j][0][2]
            image_json['cy'] = camera_intrinsics[j][1][2]
            image_json['k1'] = 0
            image_json['k2'] = 0
            image_json['k3'] = 0
            image_json['k4'] = 0
            image_json['p1'] = 0
            image_json['p2'] = 0
            image_json['skew'] = 0
            images.append(image_json)
        frame["images"] = images
        seq_json["frames"].append(frame)
    with open(output_base+'/mykitti-seq2.json', 'w') as outfile:
        json.dump(seq_json, outfile)

Create a labeling job

After the input manifest file is created, you can create a labeling job from the notebook. When creating the labeling job (follow the relevant instructions in the notebook), use a private work team so that you can view each worker task in the worker portal at any time and interact with the worker labeling UI.

A labeling job's preprocessing time is the time it takes for the tasks to appear in the worker portal. The exact duration depends on the size of the input data, the resolution of the point clouds, and the sensor fusion data (if any). Therefore, when you use the notebook, it may take several minutes for the labeling job to show up in the worker portal. When the task appears, select it and choose Start Working.

For more details about the worker UI, see 3D Point Cloud Object Tracking. The worker labeling UI provides the following assistive labeling tools:

  • Label autofill: when a worker adds a cuboid to a frame, a cuboid with the same label is automatically added to all frames in the sequence. When the worker manually adjusts the annotation for the same object in another frame, the autofilled labels are adjusted as well.
  • Label interpolation: after a worker labels a single object in two frames, Ground Truth uses those annotations to interpolate the object's movement between the two frames.
  • Snapping: workers can add a cuboid around an object and use a keyboard shortcut or menu option to have the Ground Truth autofit tool snap the cuboid tightly to the object's boundaries.
  • Fit to ground: after a worker adds a cuboid to the 3D scene, the worker can automatically snap the cuboid to the ground. For example, workers can use this feature to fit a cuboid to the road or sidewalk in the scene.
  • Multi-view labeling: after a worker adds a cuboid to the 3D scene, front and side perspective views are displayed in a side panel to help the worker adjust the cuboid tightly around the object. Workers can adjust the label in the 3D point cloud or in the side panel, and the adjustment appears in the other views in real time.
  • Sensor fusion: if sensor fusion data is provided, workers can adjust annotations in the 3D scene and in the 2D images, and the annotations are projected into the other views in real time.

The following video demonstrates interpolation. The worker only needs to add two cuboids, in the first and last frames of the sequence. These manually added cuboids are represented by red segments in the navigation bar at the lower left, and all the labels in between (shown in blue in the navigation bar) are interpolated by Ground Truth.

The following image demonstrates multi-view labeling and sensor fusion. After a cuboid is added in the 3D point cloud visualization, it can be adjusted in the three side views and in the camera images.

In addition to labels, you can add label attributes to collect additional metadata about each object. In the following video, you can see a cuboid with the label "Car". The label has two label attributes associated with it: Occlusion and Color. Workers can select a value for each label attribute.

Ground Truth automatically saves annotations every 15 minutes. When you finish the job, choose Submit. When the task is complete, the output data is saved in the Amazon Simple Storage Service (Amazon S3) bucket specified in HumanTaskConfig.

To learn more about the output data format, see the 3D point cloud object tracking section of the Ground Truth output data page.

Summary

In this walkthrough, we reviewed the input data requirements and options for Ground Truth 3D point cloud labeling jobs and created an object tracking labeling job. For the other task types you can implement with 3D point cloud labeling jobs, see 3D Point Cloud Task Types. We would also like to thank the KITTI team for providing this valuable dataset to demonstrate how to prepare 3D point cloud data and bring it into SageMaker Ground Truth.

*The KITTI dataset has its own license. Please ensure that your use of the dataset complies with its terms and requirements.
