CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Gu, Jiawei; Yang, Zacc; Ding, Chuanghao; Zhao, Rui; Tan, Fei

arXiv:2407.17467 (cs)

[Submitted on 24 Jul 2024 (v1), last revised 7 Oct 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) excel at diverse tasks but often underperform in specialized fields because domain-specific or proprietary corpora are limited. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus data to prevent catastrophic forgetting. The mixture ratio of general to domain-specific data, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we revisit the scaling behavior of LLMs under CPT and discover a power-law relationship among loss, mixture ratio, and training token scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking this balance, CMR maintains the model's general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Considering the balance between efficiency and effectiveness, CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.

Comments:	EMNLP 2024 main conference
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2407.17467 [cs.CL]
	(or arXiv:2407.17467v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.17467

Submission history

From: Jiawei Gu
[v1] Wed, 24 Jul 2024 17:59:02 UTC (9,321 KB)
[v2] Mon, 7 Oct 2024 05:16:25 UTC (9,327 KB)
