Beat-Aware Dance Generation Model Integrating Mamba-Transformer
HU Zhengping1,2, XU Chuanxin1, DONG Xiaoyun1, WU Yifan1
1. School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; 2. Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004
Abstract: To address the challenge of balancing dance motion quality and beat alignment in audio-driven dance generation, a beat-aware dance generation model integrating Mamba and Transformer (BeatDG) is proposed. First, an upper- and lower-limb motion feature encoding network is designed to learn a codebook of meaningful dance units in an unsupervised manner. Second, a beat feature extraction module is designed to strengthen music beat extraction, preserving computational efficiency while modeling the temporal relationship between music beats and dance motions. On this basis, a rhythm-gated temporal causal attention module is constructed to enable information interaction between music signals and upper- and lower-limb features. Finally, a hybrid generative architecture built from Dance Mamba and Transformer layers is designed to capture both continuous inter-frame features and global context. In this architecture, body and music information are fused, and dance motions conforming to spatial norms and paradigms are generated. Experiments on the AIST++ dataset demonstrate that BeatDG effectively improves the alignment between music beats and dance motions while preserving the quality of the generated dances.