Beat-Aware Dance Generation Model Integrating Mamba-Transformer
HU Zhengping1,2, XU Chuanxin1, DONG Xiaoyun1, WU Yifan1
1. School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; 2. Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004
Abstract: To address the challenge of balancing dance motion quality and beat alignment in audio-driven dance generation, a beat-aware dance generation model integrating Mamba and Transformer (BeatDG) is proposed. First, an upper- and lower-limb motion feature encoding network is designed to learn a codebook of meaningful dance units in an unsupervised manner. Second, a beat feature extraction module is designed to strengthen music beat extraction, preserving computational efficiency while modeling the temporal relationship between music beats and dance motions. On this basis, a rhythm-gated temporal causal attention module is constructed to enable information interaction between music signals and upper- and lower-limb features. Finally, a hybrid generative architecture built from Dance Mamba and Transformer layers is designed to capture both continuous inter-frame features and global context. In this architecture, body and music information are fused, and dance motions conforming to spatial norms and paradigms are generated. Experiments on the AIST++ dataset demonstrate that BeatDG effectively improves the alignment between music beats and dance motions while preserving the quality of the generated dances.