Chen, Tao and Gu, Dongbing (2022) CSA6D: Channel-Spatial Attention Networks for 6D Object Pose Estimation. Cognitive Computation, 14 (2). pp. 702-713. DOI https://doi.org/10.1007/s12559-021-09966-y
Abstract
6D object pose estimation plays a crucial role in robotic manipulation and grasping tasks. Estimating the 6D object pose from RGB or RGB-D images means detecting objects and estimating their orientations and translations relative to given canonical models. RGB-D cameras provide two sensory modalities, RGB and depth images, which can benefit estimation accuracy, but exploiting the two different modality sources remains a challenging issue. In this paper, inspired by recent work on attention networks, which can focus on important regions and ignore unnecessary information, we propose a novel network, the Channel-Spatial Attention Network (CSA6D), to estimate the 6D object pose from an RGB-D camera. The proposed CSA6D includes a pre-trained 2D network to segment the objects of interest from the RGB image. It then uses two separate networks to extract appearance and geometrical features from the RGB and depth images for each segmented object. The two feature vectors for each pixel are stacked together as a fusion vector, which is refined by an attention module to generate an aggregated feature vector. The attention module includes a channel attention block and a spatial attention block, which together can effectively leverage the concatenated embeddings into accurate 6D pose prediction for known objects. We evaluate the proposed network on two benchmark datasets, the YCB-Video dataset and the LineMod dataset, and the results show that it outperforms previous state-of-the-art methods under the ADD and ADD-S metrics. The attention maps also demonstrate that the proposed network seeks out unique geometric information as the most likely features for pose estimation. From these experiments, we conclude that the proposed network can accurately estimate the object pose by effectively leveraging multi-modality features.
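The fusion scheme described in the abstract — stacking per-pixel appearance and geometry features, then refining the result with a channel attention block followed by a spatial attention block — can be sketched generically. The NumPy code below is a minimal illustration of that pattern (squeeze-and-excite-style channel gating followed by a spatial gate); the shapes, weight matrices `w1`/`w2`, and pooling choices are assumptions for illustration, not the authors' CSA6D implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Gate each channel of feat (C, N): C channels, N pixels.

    Global average pooling over pixels gives a channel descriptor,
    a small two-layer MLP (weights w1, w2 are illustrative) maps it
    to per-channel weights in (0, 1).
    """
    squeeze = feat.mean(axis=1)                          # (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0)) # (C,)
    return feat * excite[:, None]

def spatial_attention(feat):
    """Gate each pixel position of feat (C, N).

    Average- and max-pool across channels, then squash to a (0, 1)
    gate per position (a stand-in for a learned convolution).
    """
    avg = feat.mean(axis=0)          # (N,)
    mx = feat.max(axis=0)            # (N,)
    gate = sigmoid(avg + mx)         # (N,)
    return feat * gate[None, :]

def csa_block(rgb_feat, depth_feat, w1, w2):
    """Fuse per-pixel RGB and depth features, then refine with
    channel attention followed by spatial attention."""
    fused = np.concatenate([rgb_feat, depth_feat], axis=0)  # stack channels
    return spatial_attention(channel_attention(fused, w1, w2))
```

Because both gates lie in (0, 1), the module can only re-weight the fused embedding — suppressing uninformative channels and pixels — which matches the abstract's description of ignoring unnecessary information before pose regression.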
Item Type: | Article |
---|---|
Divisions: | Faculty of Science and Health; Faculty of Social Sciences; Faculty of Science and Health > Computer Science and Electronic Engineering, School of; Faculty of Social Sciences > Sociology and Criminology, Department of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 14 Dec 2021 17:25 |
Last Modified: | 16 May 2024 21:02 |
URI: | http://repository.essex.ac.uk/id/eprint/31761 |
Available files
Filename: Chen-Gu2021_Article_CSA6DChannel-SpatialAttentionN.pdf
Licence: Creative Commons: Attribution 3.0