Themisto: Jupyter-Based Runtime Benchmark

Grotov, Konstantin; Titov, Sergey

Computer Science > Software Engineering

arXiv:2504.12365 (cs)

[Submitted on 16 Apr 2025]

Abstract: In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information to predict code output and generate code. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied area in the development of code-based models.

Comments:	Accepted to the third Deep Learning for Code (DL4C) workshop @ ICLR 2025
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2504.12365 [cs.SE]
	(or arXiv:2504.12365v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2504.12365

Submission history

From: Konstantin Grotov
[v1] Wed, 16 Apr 2025 16:07:18 UTC (385 KB)
