CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation
Junchen Fu, Yongxin Ni, Joemon M. Jose, Ioannis Arapakis, Kaiwen Zheng, Youhua Li, Xuri Ge
Published: 2025/4/14
Abstract
In this paper, we explore a less-studied yet practically important problem: how to efficiently and effectively adapt multiple (>2) multimodal foundation models (MFMs) for the sequential recommendation task. To this end, we propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN), which leverages a fully decoupled side adapter-based paradigm to achieve efficient and scalable adaptation. Compared to state-of-the-art efficient approaches, CROSSAN reduces training time by over 30%, GPU memory consumption by 20%, and trainable parameters by over 57%, while enabling effective cross-modal learning across diverse modalities. To further enhance multimodal fusion, we introduce the Mixture of Modality Expert Fusion (MOMEF) mechanism. Extensive experiments on public benchmarks demonstrate that CROSSAN consistently outperforms existing methods, achieving 6.7%--8.1% performance improvements when adapting four foundation models with raw modalities. Moreover, overall performance continues to improve as more MFMs are incorporated. We will release our code and datasets to facilitate future research.
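To make the fusion idea concrete, below is a minimal sketch of a mixture-of-modality-experts fusion layer in the spirit of MOMEF: each modality's embedding passes through its own lightweight expert, and a learned softmax gate weights the expert outputs before summation. All class and parameter names (e.g., `MixtureOfModalityExpertFusion`, `num_modalities`, `dim`) are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of mixture-of-modality-experts fusion; not the authors' exact design.
import torch
import torch.nn as nn


class MixtureOfModalityExpertFusion(nn.Module):
    """Fuses per-modality item embeddings with a learned softmax gate."""

    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        # One lightweight expert (here, a linear projection) per modality.
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        # The gate scores each modality from the concatenated modality embeddings.
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, modality_embs: list[torch.Tensor]) -> torch.Tensor:
        # modality_embs: list of [batch, dim] tensors, one per foundation model.
        expert_outs = torch.stack(
            [expert(x) for expert, x in zip(self.experts, modality_embs)], dim=1
        )  # [batch, num_modalities, dim]
        weights = torch.softmax(self.gate(torch.cat(modality_embs, dim=-1)), dim=-1)
        # Weighted sum over modalities -> fused item representation.
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)  # [batch, dim]


# Example: fusing outputs from four (frozen) multimodal foundation models.
fused = MixtureOfModalityExpertFusion(num_modalities=4, dim=64)(
    [torch.randn(8, 64) for _ in range(4)]
)
```

In this reading, adding a further MFM only requires appending one more expert and widening the gate, which is consistent with the abstract's claim that performance can keep improving as more foundation models are incorporated.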