Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression

Song, Juan; Yang, Lijie; Feng, Mingtao

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.00399 (cs)

[Submitted on 1 Mar 2025]

Abstract: Compressing images at ultra-low bitrates while achieving both semantic consistency and high perceptual quality remains a significant challenge. We propose a novel image compression framework, Semantically Disentangled Image Compression (SEDIC). SEDIC leverages large multimodal models (LMMs) to disentangle the image into several essential semantic components: an extremely compressed reference image, overall and object-level text descriptions, and semantic masks. A multi-stage semantic decoder progressively restores the transmitted reference image object by object, ultimately producing high-quality and perceptually consistent reconstructions. In each decoding stage, a pre-trained controllable diffusion model restores the object details on the reference image, conditioned on the text descriptions and semantic masks. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at ultra-low bitrates ($\le$ 0.05 bpp). Our code is available at this https URL.
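
To give a sense of how tight the 0.05 bpp budget in the abstract is, the following minimal sketch converts between bits per pixel and payload size. The 768x512 image size and the per-component byte counts are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def bits_per_pixel(payload_bytes: int, width: int, height: int) -> float:
    """Bitrate in bits per pixel (bpp) for a payload describing a width x height image."""
    return payload_bytes * 8 / (width * height)


def max_payload_bytes(bpp: float, width: int, height: int) -> int:
    """Largest whole number of bytes whose bitrate stays at or below `bpp`."""
    total_bits = bpp * width * height
    return int(total_bits // 8)


# Assumed image size, chosen only for illustration.
WIDTH, HEIGHT = 768, 512

# Total budget at the abstract's upper bound of 0.05 bpp.
budget = max_payload_bytes(0.05, WIDTH, HEIGHT)

# Hypothetical split across the three kinds of transmitted information
# named in the abstract; the byte counts are made up for this example.
components = {
    "reference_image": 1800,
    "text_descriptions": 400,
    "semantic_masks": 200,
}
total = sum(components.values())

print(f"budget at 0.05 bpp: {budget} bytes")
print(f"example payload: {total} bytes = {bits_per_pixel(total, WIDTH, HEIGHT):.4f} bpp")
```

Under these assumptions the whole image, including its text and masks, must fit in roughly 2.4 KB, which is why the reference image has to be compressed so aggressively and the detail is left to the decoder to regenerate.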

Comments:	Accepted to CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.00399 [cs.CV]
	(or arXiv:2503.00399v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.00399

Submission history

From: Lijie Yang [view email]
[v1] Sat, 1 Mar 2025 08:27:11 UTC (8,440 KB)
