$FastDoc$: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Nandy, Abhilash; Kapadnis, Manav Nitin; Patnaik, Sohan; Butala, Yash Parag; Goyal, Pawan; Ganguly, Niloy

arXiv:2306.06190 (cs)

[Submitted on 9 Jun 2023 (v1), last revised 1 Nov 2024 (this version, v3)]

Abstract: In this paper, we propose $FastDoc$ (Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that, during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three domains, namely customer support, scientific, and legal, and compare performance on 6 downstream tasks and 9 datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around $1,000$, $4,500$, and $500$ times compared to masked language modeling (MLM) and/or next sentence prediction (NSP) in the customer support, scientific, and legal domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that $FastDoc$ either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the customer support, scientific, and legal domains. Moreover, reduced training helps mitigate the risk of catastrophic forgetting. Thus, unlike the baselines, $FastDoc$ shows a negligible drop in performance in the open domain.
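
The abstract's central idea, feeding the same encoder sentence-level embeddings during pre-training and token-level embeddings during fine-tuning, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy width `HIDDEN`, the random "embeddings", mean pooling as a stand-in sentence embedder, and the single-head attention in `encode` with identity projections. It only shows that an encoder operating on fixed-width vectors can consume either view, and that the sentence-level view is much shorter.

```python
import math
import random

HIDDEN = 8  # toy embedding width (assumption; real encoders use e.g. 768)


def make_document(num_sentences, tokens_per_sentence, seed=0):
    """Build a fake document: a list of sentences, each a list of token vectors."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(tokens_per_sentence)]
        for _ in range(num_sentences)
    ]


def mean_pool(vectors):
    """Average equal-length vectors into one vector (stand-in sentence embedder)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sentence_level_input(document):
    """Pre-training view: one embedding per sentence."""
    return [mean_pool(sentence) for sentence in document]


def token_level_input(document):
    """Fine-tuning view: one embedding per token, flattened across sentences."""
    return [token for sentence in document for token in sentence]


def encode(sequence):
    """Toy single-head self-attention with identity projections.

    It accepts any sequence of HIDDEN-wide vectors, so the same function
    handles both views above.
    """
    scale = math.sqrt(HIDDEN)
    output = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in sequence]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        output.append([
            sum(w * value[i] for w, value in zip(weights, sequence)) / total
            for i in range(HIDDEN)
        ])
    return output


def attention_pairs(sequence):
    """Number of query-key scores computed by full self-attention."""
    return len(sequence) ** 2


if __name__ == "__main__":
    doc = make_document(num_sentences=4, tokens_per_sentence=6)
    for name, view in [("sentence", sentence_level_input(doc)),
                       ("token", token_level_input(doc))]:
        encoded = encode(view)
        print(name, "positions:", len(encoded), "attention scores:", attention_pairs(view))
```

For a 4-sentence document with 6 tokens per sentence, the sentence-level view has 4 positions and the token-level view has 24, so full self-attention computes 16 versus 576 scores. This toy ratio only illustrates the sequence-length effect; it is not the paper's compute accounting and does not reproduce the reported reductions.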

Comments:	Accepted to Transactions on Machine Learning Research (TMLR), 36 pages, 8 figures
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
MSC classes:	68T50
ACM classes:	I.2.7
Cite as:	arXiv:2306.06190 [cs.CL]
	(or arXiv:2306.06190v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.06190

Submission history

From: Abhilash Nandy
[v1] Fri, 9 Jun 2023 18:42:19 UTC (2,092 KB)
[v2] Tue, 14 Nov 2023 21:51:21 UTC (4,321 KB)
[v3] Fri, 1 Nov 2024 07:53:10 UTC (4,147 KB)
