V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM

Rahman, Abdur; Chawla, Rajat; Kumar, Muskaan; Datta, Arkajit; Jha, Adarsh; NS, Mukunda; Bhola, Ishaan

Computer Science > Artificial Intelligence

arXiv:2405.15341 (cs)

[Submitted on 24 May 2024 (v1), last revised 21 Jul 2024 (this version, v2)]

Abstract: In the rapidly evolving landscape of AI research and application, Multimodal Large Language Models (MLLMs) have emerged as a transformative force, adept at interpreting and integrating information from diverse modalities such as text, images, and Graphical User Interfaces (GUIs). Despite these advancements, the nuanced interaction and understanding of GUIs pose a significant challenge, limiting the potential of existing models to enhance automation levels. To bridge this gap, this paper presents V-Zen, an innovative Multimodal Large Language Model (MLLM) meticulously crafted to revolutionise the domain of GUI understanding and grounding. Equipped with dual-resolution image encoders, V-Zen establishes new benchmarks in efficient grounding and next-action prediction, thereby laying the groundwork for self-operating computer systems. Complementing V-Zen is the GUIDE dataset, an extensive collection of real-world GUI elements and task-based sequences, serving as a catalyst for specialised fine-tuning. The successful integration of V-Zen and GUIDE marks the dawn of a new era in multimodal AI research, opening the door to intelligent, autonomous computing experiences. This paper extends an invitation to the research community to join this exciting journey, shaping the future of GUI automation. In the spirit of open science, our code, data, and model will be made publicly available, paving the way for multimodal dialogue scenarios with intricate and precise interactions.

Comments:	12 pages, 5 figures, 3 tables
Subjects:	Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.15341 [cs.AI]
	(or arXiv:2405.15341v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2405.15341

Submission history

From: Abdur Rahman
[v1] Fri, 24 May 2024 08:21:45 UTC (2,159 KB)
[v2] Sun, 21 Jul 2024 07:34:44 UTC (1,899 KB)
