The Power of Context: How Multimodality Improves Image Super-Resolution

Mei, Kangfu; Talebi, Hossein; Ardakani, Mojtaba; Patel, Vishal M.; Milanfar, Peyman; Delbracio, Mauricio

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.14503 (cs)

[Submitted on 18 Mar 2025]

Abstract: Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate the hallucinations often introduced by text prompts by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.

Comments:	Accepted by CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2503.14503 [cs.CV]
	(or arXiv:2503.14503v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.14503

Submission history

From: Kangfu Mei
[v1] Tue, 18 Mar 2025 17:59:54 UTC (43,681 KB)
