Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Guiyu Liu1, Jiuchao Qian2, Fei Wen2, Xiaoguang Zhu2, Rendong Ying2, Peilin Liu2

  • 1Shanghai Jiao Tong University
  • 2Shanghai Jiao Tong University

Details

11:15 - 11:30 | Tue 5 Nov | L1-R7 | TuAT7.2

Session: Computer Vision and Applications I

Abstract

Action recognition has wide applications in assisted living, health monitoring, surveillance, and human-computer interaction. Among traditional action recognition methods, RGB video-based ones are effective but computationally inefficient, while skeleton-based ones are computationally efficient but do not exploit low-level detail information. This work considers action recognition based on multimodal fusion of the 3D skeleton and the RGB image. We design a neural network that takes a 3D skeleton sequence and a single middle frame from an RGB video as input. Specifically, our method selects one frame from a video and extracts spatial features from it using two attention modules: a self-attention module and a skeleton-attention module. Temporal features are then extracted from the skeleton sequence via a Bi-LSTM subnetwork. Finally, the spatial and temporal features are combined by a feature fusion network for action classification. A distinct feature of our method is that it uses only a single RGB frame rather than an entire RGB video; accordingly, it has a lightweight architecture and is more efficient than RGB video-based methods. Comparative evaluation on two public datasets, NTU-RGBD and SYSU, demonstrates that our method achieves competitive performance compared with state-of-the-art methods.
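The pipeline described above — attention-weighted spatial features from a single RGB frame, temporal features pooled from the skeleton sequence, then concatenation and a classification head — can be illustrated with a minimal NumPy sketch. All shapes, weight matrices, and the mean-pooling stand-in for the Bi-LSTM are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- toy inputs (shapes are assumptions, not the paper's) ---
T, J = 20, 25          # skeleton sequence: 20 frames, 25 joints in 3D
H, W, C = 8, 8, 32     # flattened RGB feature map from some CNN backbone
skeleton = rng.standard_normal((T, J * 3))
frame_feat = rng.standard_normal((H * W, C))

# Skeleton-attention (one possible interpretation): project the
# middle-frame skeleton into a query that weights spatial locations
# of the single-frame feature map.
W_q = rng.standard_normal((J * 3, C)) * 0.1
query = skeleton[T // 2] @ W_q            # (C,) query vector
attn = softmax(frame_feat @ query)        # (H*W,) attention over locations
spatial = attn @ frame_feat               # (C,) attended spatial feature

# Temporal branch: mean pooling over time stands in for the Bi-LSTM.
temporal = skeleton.mean(axis=0)          # (J*3,) temporal feature

# Fusion network stand-in: concatenate both modalities, linear head.
num_classes = 60                          # e.g. NTU-RGBD has 60 classes
fused = np.concatenate([spatial, temporal])
W_cls = rng.standard_normal((fused.size, num_classes)) * 0.1
logits = fused @ W_cls                    # (num_classes,) class scores
```

The key design point the sketch mirrors is that the spatial branch consumes only one frame, so the per-video cost of the RGB pathway is constant rather than proportional to video length.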