PoM: Efficient Image and Video Generation with the Polynomial Mixer

Picard, David; Dufour, Nicolas

arXiv:2411.12663 (cs)

[Submitted on 19 Nov 2024]

Abstract: Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as both memory and compute requirements grow quadratically with the number of patches. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM), which encodes the entire sequence into an explicit state. PoM has linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames sequentially, minimizing memory and compute requirements, while still being able to train in parallel. We show that the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high-quality samples while using fewer computational resources. The code is available at this https URL.
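
The scaling argument in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the paper: the 256x256 resolution, the patch size of 16, and the helper names `num_tokens`, `pairwise_interactions`, and `state_updates` are all assumptions chosen for illustration. It only counts operations, comparing the number of token pairs a full attention pattern scores with the number of token contributions to a single shared state.

```python
def num_tokens(height, width, patch, frames=1):
    """Number of patch tokens for `frames` frames of size height x width.

    Hypothetical helper: assumes square, non-overlapping patches.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    return frames * (height // patch) * (width // patch)


def pairwise_interactions(n):
    """Token pairs scored by full self-attention: grows as n**2."""
    return n * n


def state_updates(n):
    """Token contributions to one shared state: grows as n."""
    return n


# Assumed setting: 256x256 frames cut into 16x16 patches.
for frames in (1, 2, 4):
    n = num_tokens(256, 256, 16, frames)
    print(frames, n, pairwise_interactions(n), state_updates(n))
```

Under these assumptions, doubling the number of frames doubles the token count, quadruples the pairwise count, and only doubles the state-update count, which is the gap the abstract refers to as quadratic versus linear complexity.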

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2411.12663 [cs.CV]
	(or arXiv:2411.12663v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.12663

Submission history

From: David Picard
[v1] Tue, 19 Nov 2024 17:16:31 UTC (28,238 KB)