Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability

Shu, Dong; Zhao, Haiyan; Hu, Jingyu; Liu, Weiru; Cheng, Lu; Du, Mengnan

arXiv:2501.01346 (cs)

[Submitted on 2 Jan 2025]

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and linguistic representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.

Comments:	16 pages, 3 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2501.01346 [cs.CV]
	(or arXiv:2501.01346v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.01346

Submission history

From: Dong Shu
[v1] Thu, 2 Jan 2025 16:53:50 UTC (431 KB)
