Multimodal Fusion-Based Semantic Transmission for Road Object Detection
ZHU Zengle1, WEI Zhiwei2, ZHANG Rongqing3, YANG Liuqing1
1. Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511455; 2. Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai 201210; 3. School of Software Engineering, Tongji University, Shanghai 201804
Abstract: In extreme, long-tail driving scenarios, collaborative perception across multiple vehicles and sensors can supply a vehicle with sensory information it could not obtain alone. However, heterogeneous sensor data, diverse data formats, and limited bandwidth make it difficult for vehicles to schedule and process this information in a unified, efficient way. To integrate multi-sensor information from different vehicles under limited communication bandwidth, this paper proposes a Transformer-based semantic communication framework for multimodal fusion object detection. Unlike traditional data transmission schemes, the framework uses self-attention to fuse data from different modalities, focusing on the semantic correlations and dependencies among them. This allows vehicles to transmit information and collaborate under limited communication resources, improving their understanding of complex road conditions. Experiments on the Teledyne FLIR Free ADAS Thermal dataset show that the proposed model performs well on multimodal object detection semantic communication tasks: detection accuracy improves significantly and transmission cost is reduced by half.
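To make the fusion idea concrete, the following is a minimal NumPy sketch of single-head self-attention applied to the concatenated tokens of two modalities. It is an illustration under stated assumptions, not the architecture of the paper: the function name `fuse_with_self_attention`, the token counts, the embedding width `d`, the random projection matrices, and the choice of RGB and thermal as the two modalities are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_with_self_attention(tokens_a, tokens_b, w_q, w_k, w_v):
    """Fuse two modalities with one head of scaled dot-product self-attention.

    tokens_a, tokens_b: (n_a, d) and (n_b, d) token embeddings (assumed shapes).
    w_q, w_k, w_v: (d, d) projection matrices (random here, learned in practice).
    """
    # Concatenate along the token axis so every token can attend to
    # tokens of both modalities.
    tokens = np.concatenate([tokens_a, tokens_b], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical toy inputs: 4 RGB tokens and 4 thermal tokens of width 8.
rng = np.random.default_rng(0)
d = 8
rgb_tokens = rng.normal(size=(4, d))
thermal_tokens = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = fuse_with_self_attention(rgb_tokens, thermal_tokens, w_q, w_k, w_v)
# Attention mass that RGB tokens (rows 0-3) place on thermal tokens (columns 4-7).
cross_modal_mass = weights[:4, 4:].sum(axis=1)
```

In this sketch, `cross_modal_mass` indicates how strongly each RGB token draws on thermal tokens, which is one simple way to inspect the cross-modal dependencies that self-attention captures. A semantic communication system would then transmit a compact representation derived from `fused` rather than the raw sensor data; that encoding step is not shown here.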