PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

Shukla, Shreya; Sharma, Nakul; Gupta, Manish; Mishra, Anand

arXiv:2501.15074 (cs)

[Submitted on 25 Jan 2025]

Abstract: Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and to enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PatentDesc-355K, a novel large-scale dataset containing ~355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PatentLMM, a novel multimodal large language model specifically tailored to generate high-quality descriptions of patent figures. PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts performance, generating more coherent descriptions than fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.

Comments:	Accepted at AAAI 2025 (Main Track). Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.15074 [cs.CV]
	(or arXiv:2501.15074v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.15074

Submission history

From: Anand Mishra
[v1] Sat, 25 Jan 2025 04:45:32 UTC (10,281 KB)
