Multimodal Recommendation with User Semantic Embedding Refinement
XU Hao1, XIA Hongbin1,2, WANG Xiaofeng1,3
1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122; 2. Jiangsu Key University Laboratory of Software and Media Technology under Human-Computer Cooperation, Jiangnan University, Wuxi 214122; 3. Pengcheng Laboratory, Shenzhen 518055
Abstract: Existing multimodal recommendation methods typically extract the features of different modalities, such as images and texts, separately, and perform only shallow fusion during training, making it difficult to fully exploit cross-modal semantics. Moreover, mainstream methods mostly adopt randomly initialized user representations, resulting in insufficient discriminability among users. To address these issues, a multimodal recommendation method with user semantic embedding refinement (USERec) is proposed, which alleviates both problems from the item side and the user side. On the item side, a multimodal large language model guides visual feature extraction with textual information to achieve deep semantic fusion, yielding item representations better suited to the recommendation task. On the user side, positional encoding is introduced into the user representations to enhance the spectral diversity of the user index space; personalized local graphs are then constructed through degree-sensitive pruning, and the global awareness of users is augmented via a randomly sampled attention mechanism, thereby improving the discriminability of user representations. Experiments on four real-world datasets verify the effectiveness of USERec.
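To make the user-side design concrete, the following is a minimal PyTorch sketch of two of the ideas named in the abstract: adding a fixed positional encoding over user indices to randomly initialized ID embeddings, and augmenting each user's representation with attention over a randomly sampled subset of users. The module and parameter names, the choice of a Transformer-style sinusoidal encoding, and all hyperparameters are illustrative assumptions, not the paper's implementation; the degree-sensitive pruning of personalized local graphs and the item-side multimodal fusion are omitted.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_positional_encoding(num_users: int, dim: int) -> torch.Tensor:
    # Transformer-style sinusoidal encoding over user indices; dim is assumed even.
    position = torch.arange(num_users, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_users, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class UserEmbedding(nn.Module):
    # Hypothetical module: ID embeddings enriched with a fixed positional
    # encoding, plus attention over a random sample of users as a cheap
    # stand-in for the "global awareness" step described in the abstract.
    def __init__(self, num_users: int, dim: int = 64, num_sampled: int = 64):
        super().__init__()
        # Randomly initialized ID embeddings, as in mainstream methods.
        self.id_embedding = nn.Embedding(num_users, dim)
        # Fixed (non-trainable) encoding injects spectral diversity
        # into the user index space.
        self.register_buffer("pos_encoding",
                             sinusoidal_positional_encoding(num_users, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.num_sampled = num_sampled

    def forward(self, user_ids: torch.Tensor) -> torch.Tensor:
        # Query: each user's own refined embedding, shape (B, 1, d).
        q = (self.id_embedding(user_ids) + self.pos_encoding[user_ids]).unsqueeze(1)
        # Randomly sampled attention: attend to a random subset of all users
        # rather than the full user set.
        sample = torch.randint(0, self.id_embedding.num_embeddings,
                               (user_ids.size(0), self.num_sampled),
                               device=user_ids.device)
        kv = self.id_embedding(sample) + self.pos_encoding[sample]  # (B, S, d)
        out, _ = self.attn(q, kv, kv)
        return (q + out).squeeze(1)  # residual connection, shape (B, d)
```

Under these assumptions, `UserEmbedding(num_users=10000)(torch.tensor([3, 17, 256]))` returns a (3, 64) tensor; because the positional term is deterministic in the user index, two users with identical interaction histories still receive distinct representations.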
Received: 14 October 2025
Fund: National Natural Science Foundation of China (No. 61972182)
Corresponding Author:
XIA Hongbin, Ph.D., professor. His research interests include personalized recommendation, natural language processing, and computer networks.
About authors: XU Hao, Master's student. His research interests include recommendation systems and deep learning. WANG Xiaofeng, Ph.D., professor. His research interests include computer networks.