Multi-Modal Learning (MML) aims to learn effective representations across modalities for accurate predictions. Existing methods typically focus on modality consistency and specificity to learn such representations. However, from a causal perspective, these criteria may yield representations that contain insufficient and unnecessary information. To address this, we propose that effective MML representations should be causally sufficient and necessary. Considering practical issues like spurious correlations and modality conflicts, we relax the exogeneity and monotonicity assumptions prevalent in prior work and explore a concept specific to MML, i.e., the Causal Complete Cause (C3). We begin by defining C3, which quantifies the probability of representations being causally sufficient and necessary. We then discuss the causal identifiability of C3 and introduce an instrumental variable to support identifying C3 under non-exogeneity and non-monotonicity. Building on this, we derive a measurement of C3, i.e., the C3 risk. We propose a twin network to estimate it via (i) a real-world branch that utilizes the instrumental variable for sufficiency, and (ii) a hypothetical-world branch that applies gradient-based counterfactual modeling for necessity. Theoretical analyses confirm its reliability. Based on these results, we propose C3 Regularization, a plug-and-play method that enforces the causal completeness of the learned representations by minimizing the C3 risk. Extensive experiments demonstrate its effectiveness.
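To make the two-branch idea above concrete, below is a minimal PyTorch sketch of a plug-and-play regularizer in the spirit of C3 Regularization: a real-world branch that scores sufficiency via the task loss on the fused representation, and a hypothetical-world branch that builds a gradient-based counterfactual and penalizes it for still supporting the original label (necessity). All names (C3Regularizer, lambda_suff, lambda_nec, eps) and the exact loss forms are illustrative assumptions, not the paper's definitions; in particular, the instrumental-variable machinery is omitted and replaced by a plain task loss.

```python
# Hypothetical sketch of a C3-style plug-and-play regularizer (PyTorch).
# The module, its arguments, and the loss forms are assumptions for
# illustration; they are not the paper's actual objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class C3Regularizer(nn.Module):
    def __init__(self, lambda_suff: float = 1.0, lambda_nec: float = 1.0, eps: float = 0.1):
        super().__init__()
        self.lambda_suff = lambda_suff  # weight of the sufficiency (real-world) term
        self.lambda_nec = lambda_nec    # weight of the necessity (hypothetical-world) term
        self.eps = eps                  # step size of the gradient-based counterfactual

    def forward(self, z: torch.Tensor, y: torch.Tensor, classifier: nn.Module) -> torch.Tensor:
        # Real-world branch (sufficiency): the fused multimodal representation z
        # should by itself carry enough information to predict the label.
        # (The paper additionally uses an instrumental variable here; this
        # sketch simplifies the branch to the task loss.)
        suff_loss = F.cross_entropy(classifier(z), y)

        # Hypothetical-world branch (necessity): construct a gradient-based
        # counterfactual that suppresses the label-relevant content of z.
        z_det = z.detach().clone().requires_grad_(True)
        cf_task_loss = F.cross_entropy(classifier(z_det), y)
        direction = torch.autograd.grad(cf_task_loss, z_det)[0].sign()
        z_cf = z + self.eps * direction  # move z away from the label evidence

        # If z is causally necessary, the counterfactual should no longer
        # support the original label, so we penalize the residual probability
        # the classifier still assigns to y under z_cf.
        p_cf = F.softmax(classifier(z_cf), dim=-1)
        nec_loss = p_cf.gather(1, y.unsqueeze(1)).mean()

        return self.lambda_suff * suff_loss + self.lambda_nec * nec_loss


# Usage: add the regularizer to an existing multimodal training objective,
# e.g. total_loss = task_loss + c3_reg(fused_representation, labels, classifier)
```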
@misc{wang2025causal,
title={Towards the Causal Complete Cause of Multi-Modal Representation Learning},
author={Jingyao Wang and Siyu Zhao and Wenwen Qiang and Jiangmeng Li and Changwen Zheng and Fuchun Sun and Hui Xiong},
year={2025},
eprint={2407.14058},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.14058},
}