L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

Zhang, Junxuan; Cheng, Zhengxue; Zhao, Yan; Wang, Shihao; Zhou, Dajiang; Lu, Guo; Song, Li


arXiv:2412.16642 (cs)

[Submitted on 21 Dec 2024 (v1), last revised 24 Dec 2024 (this version, v2)]



Abstract: Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making them the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances the learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves a 48% bit saving compared to the gzip compressor. In addition, L3TC offers compression performance comparable to other learned compressors with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, providing real-time decoding speeds of up to megabytes per second. Our code is available at this https URL.
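The abstract's opening claim, that a probabilistic model can be paired with an entropy coder, can be illustrated with a rough sketch. The code below is not the L3TC method: it swaps the RWKV predictor for a toy adaptive order-0 byte model (an assumption made purely for illustration) and sums the ideal code length of each symbol, which is the cost an arithmetic coder would approach. The function names `ideal_code_length_bits` and `bit_saving` are hypothetical, and reading "bit saving" as one minus the size ratio is an assumption about the metric.

```python
import math
from collections import Counter


def ideal_code_length_bits(text: str, alphabet_size: int = 256) -> float:
    """Bits an ideal entropy coder would spend on `text` under a toy model.

    The model is Laplace-smoothed byte counts (adaptive order-0), a stand-in
    for a learned predictor such as RWKV. It is not the paper's model.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in text.encode("utf-8"):
        # Probability of this byte given only the bytes coded so far.
        p = (counts[byte] + 1) / (seen + alphabet_size)
        # An ideal entropy coder spends -log2(p) bits on a symbol of probability p.
        total_bits += -math.log2(p)
        # Update after coding, so a decoder can mirror the same model state.
        counts[byte] += 1
        seen += 1
    return total_bits


def bit_saving(compressed_bits: float, baseline_bits: float) -> float:
    """Fraction of bits saved relative to a baseline (assumed definition)."""
    return 1.0 - compressed_bits / baseline_bits


if __name__ == "__main__":
    sample = "abracadabra " * 50
    raw_bits = 8 * len(sample.encode("utf-8"))
    model_bits = ideal_code_length_bits(sample)
    print(f"raw: {raw_bits} bits, model: {model_bits:.1f} bits")
    print(f"saving vs raw bytes: {bit_saving(model_bits, raw_bits):.1%}")
```

The better the model predicts the next symbol, the fewer bits the coder spends, which is why the choice of backbone governs both compression ratio and decoding speed.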

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multimedia (cs.MM)
Cite as:	arXiv:2412.16642 [cs.CL]
	(or arXiv:2412.16642v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.16642

Submission history

From: Yan Zhao
[v1] Sat, 21 Dec 2024 14:24:32 UTC (3,645 KB)
[v2] Tue, 24 Dec 2024 04:20:18 UTC (3,645 KB)
