CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Wei, Anjiang; Suresh, Tarun; Cao, Jiannan; Kannan, Naveen; Wu, Yuheng; Yan, Kai; Teixeira, Thiago S. F. X.; Wang, Ke; Aiken, Alex

arXiv:2503.23145 (cs)

[Submitted on 29 Mar 2025]

Abstract: Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.
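The interaction loop described in the abstract (observe input-output pairs, propose a candidate, receive a counterexample from a differential testing oracle, refine) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the CodeARC code base: `hidden_target`, `differential_oracle`, `run_episode`, the integer input range, and the trial count are assumptions for illustration. The oracle below uses plain random sampling as a stand-in for whatever differential testing tool the benchmark actually uses, and a fixed list of candidates stands in for the LLM agent.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Fn = Callable[[int], int]


def hidden_target(x: int) -> int:
    """Hypothetical hidden function; the agent sees only its outputs."""
    return abs(x) * 2


def differential_oracle(target: Fn, candidate: Fn,
                        trials: int = 200, seed: int = 0) -> Optional[int]:
    """Return an input where candidate and target disagree, else None.

    Assumption: random sampling over a small integer range stands in
    for a real differential testing tool.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-50, 50)
        try:
            got = candidate(x)
        except Exception:
            return x  # a crash counts as a disagreement
        if got != target(x):
            return x
    return None


def run_episode(target: Fn, proposals: List[Fn], initial_inputs: List[int],
                budget: int) -> Tuple[Optional[int], Dict[int, int]]:
    """Try candidates in order; return (attempt that passed, observations)."""
    observations = {x: target(x) for x in initial_inputs}
    for attempt, candidate in enumerate(proposals[:budget], start=1):
        # Skip candidates that already contradict known input-output pairs.
        if any(candidate(x) != y for x, y in observations.items()):
            continue
        counterexample = differential_oracle(target, candidate)
        if counterexample is None:
            return attempt, observations
        # Feedback: the counterexample becomes a new observation.
        observations[counterexample] = target(counterexample)
    return None, observations


# A fixed list of guesses stands in for the LLM agent's proposals.
proposals = [lambda x: 2 * x, lambda x: abs(x) * 2]
solved_at, seen = run_episode(hidden_target, proposals, [1, 2, 3], budget=5)
print(solved_at, sorted(seen.items()))
```

In this toy run the first guess agrees with every initial example but fails on a negative input, which the oracle reports as a counterexample. A static held-out test set would provide no such feedback, which is the contrast the abstract draws.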

Subjects:	Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2503.23145 [cs.PL]
	(or arXiv:2503.23145v1 [cs.PL] for this version)
	https://doi.org/10.48550/arXiv.2503.23145

Submission history

From: Anjiang Wei
[v1] Sat, 29 Mar 2025 16:50:39 UTC (234 KB)
