ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Wang, Yubo; Ma, Xueguang; Nie, Ping; Zeng, Huaye; Lyu, Zhiheng; Zhang, Yuxuan; Schneider, Benjamin; Lu, Yi; Yue, Xiang; Chen, Wenhu

arXiv:2504.00824v2 (cs)

[Submitted on 1 Apr 2025 (v1), last revised 3 Apr 2025 (this version, v2)]

Abstract: Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their ability to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency. Our model is built upon Qwen-2.5-7B and trained on 500K papers from arXiv. It achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality -- measured across relevance, coherence, academic rigor, completeness, and innovation -- significantly surpassing all existing models, including much larger ones like the Retrieval-Augmented Qwen2.5-72B-Instruct. Human studies further demonstrate that ScholarCopilot, despite being a 7B model, significantly outperforms ChatGPT, achieving 100% preference in citation quality and over 70% in overall usefulness.
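
The retrieve-on-[RET] loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the generator is a scripted stand-in for the language model, the citation database is a two-entry dictionary, and the query is simply the text written so far, scored by word overlap. The names `retrieve_top1`, `write_with_citations`, and `scripted_generator`, the `<eos>` stop marker, and the `[cite:...]` marker format are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

RET = "[RET]"   # retrieval token named in the abstract
EOS = "<eos>"   # hypothetical end-of-sequence marker for this toy


def bag_of_words(text: str) -> Dict[str, int]:
    """Count lowercase whitespace-separated words."""
    counts: Dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def overlap(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Number of word occurrences shared by two bags of words."""
    return sum(min(n, b.get(word, 0)) for word, n in a.items())


def retrieve_top1(query: str, database: Dict[str, str]) -> str:
    """Return the key of the entry with the highest word overlap.

    Assumption: a stand-in scorer, not the retrieval method of the paper.
    """
    q = bag_of_words(query)
    return max(database, key=lambda k: overlap(q, bag_of_words(database[k])))


def write_with_citations(next_token: Callable[[List[str]], str],
                         database: Dict[str, str],
                         max_steps: int = 50) -> List[str]:
    """Generate tokens; whenever the generator emits [RET], query the database."""
    tokens: List[str] = []
    for _ in range(max_steps):
        token = next_token(tokens)
        if token == EOS:
            break
        if token == RET:
            key = retrieve_top1(" ".join(tokens), database)
            # The citation marker joins the context the generator sees next.
            tokens.append(f"[cite:{key}]")
        else:
            tokens.append(token)
    return tokens


def scripted_generator(script: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a language model: replays a fixed token script."""
    remaining = iter(script)
    return lambda context: next(remaining, EOS)


database = {
    "rag2020": "retrieval augmented generation for knowledge intensive tasks",
    "bm25": "probabilistic relevance framework bm25 ranking",
}
script = ["retrieval", "augmented", "generation", "improves", "accuracy", RET, "."]
draft = write_with_citations(scripted_generator(script), database)
print(" ".join(draft))
```

In this toy the decision of when to cite belongs to the generator (it emits `[RET]`), while the decision of what to cite belongs to the retriever. That split is the only part taken from the abstract.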

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2504.00824 [cs.CL]
	(or arXiv:2504.00824v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.00824

Submission history

From: Yubo Wang
[v1] Tue, 1 Apr 2025 14:12:14 UTC (5,392 KB)
[v2] Thu, 3 Apr 2025 15:07:29 UTC (5,392 KB)
