ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

Jin, Yiqiao; Petrangeli, Stefano; Shen, Yu; Wu, Gang

doi:10.1145/3701716.3718379

arXiv:2503.20978 (cs)

[Submitted on 26 Mar 2025]

Abstract: Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.

Comments:	Accepted to MM4SG Workshop at The Web Conference 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.20978 [cs.CL]
	(or arXiv:2503.20978v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.20978
Related DOI:	https://doi.org/10.1145/3701716.3718379

Submission history

From: Yiqiao Jin
[v1] Wed, 26 Mar 2025 20:41:24 UTC (3,262 KB)
