Computer Science > Computer Vision and Pattern Recognition
[Submitted on 1 Dec 2023 (v1), last revised 10 Jan 2024 (this version, v2)]
Title: Rethinking Detection Based Table Structure Recognition for Visually Rich Document Images
Abstract: Table Structure Recognition (TSR) is a widely discussed task that aims to transform unstructured table images into structured formats, such as HTML sequences, so that text-only models, such as ChatGPT, can further process these tables. One type of solution uses detection models to detect table components, such as columns and rows, and then applies a rule-based post-processing method to convert the detection results into HTML sequences. However, existing detection-based models usually do not perform as well as other types of solutions on cell-level TSR metrics, such as TEDS, and the underlying reasons limiting their performance on the TSR task are not well explored. Therefore, we comprehensively revisit existing detection-based models and explore the reasons hindering their performance, including the improper problem definition, the mismatch between detection and TSR metrics, the characteristics of detection models, and the impact of local and long-range feature extraction. Based on our analysis and findings, we apply simple methods to tailor a typical two-stage detection model, Cascade R-CNN, for the TSR task. The experimental results show that the tailored Cascade R-CNN-based model improves the base Cascade R-CNN model by 16.35% on the FinTabNet dataset in terms of structure-only TEDS, outperforming other types of state-of-the-art methods. This demonstrates that our findings can serve as a guideline for improving detection-based TSR models and that a purely detection-based solution is competitive with other types of solutions, such as graph-based and image-to-sequence solutions.
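The detect-then-post-process pipeline mentioned in the abstract can be illustrated with a small sketch. This is a hypothetical example, not the authors' post-processing method: it assumes axis-aligned `(x1, y1, x2, y2)` boxes for detected rows, columns, and (optionally) spanning cells, treats every row/column intersection as a grid cell, and merges grid cells whose row and column centers fall inside a spanning-cell box. The function name `detections_to_html` and the center-based assignment rule are illustrative assumptions. The output is a structure-only HTML string (no cell text), which is the kind of sequence that structure-only TEDS compares against the ground truth.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in image pixel coordinates.
Box = Tuple[float, float, float, float]


def _covered(lo: float, hi: float, centers: List[float]) -> List[int]:
    """Indices of bands whose center falls inside the interval [lo, hi]."""
    return [i for i, c in enumerate(centers) if lo <= c <= hi]


def detections_to_html(rows: List[Box], cols: List[Box],
                       spans: Optional[List[Box]] = None) -> str:
    """Hypothetical rule-based conversion of detections to structure-only HTML.

    Each row/column intersection is a grid cell; a spanning-cell box merges
    the grid cells whose row and column centers it covers.
    """
    row_c = sorted((b[1] + b[3]) / 2 for b in rows)  # rows top-to-bottom
    col_c = sorted((b[0] + b[2]) / 2 for b in cols)  # columns left-to-right
    n_rows, n_cols = len(row_c), len(col_c)

    # owner[r][c]: (rowspan, colspan) for a cell that emits a <td>,
    # or None for a grid cell absorbed into a spanning cell.
    owner: List[List[Optional[Tuple[int, int]]]] = [
        [(1, 1)] * n_cols for _ in range(n_rows)
    ]
    for x1, y1, x2, y2 in spans or []:
        rs = _covered(y1, y2, row_c)
        cs = _covered(x1, x2, col_c)
        if len(rs) * len(cs) <= 1:
            continue  # covers at most one grid cell: nothing to merge
        for r in rs:
            for c in cs:
                owner[r][c] = None
        # The top-left grid cell of the block carries the span attributes.
        owner[rs[0]][cs[0]] = (len(rs), len(cs))

    parts = ["<table>"]
    for r in range(n_rows):
        parts.append("<tr>")
        for c in range(n_cols):
            cell = owner[r][c]
            if cell is None:
                continue  # absorbed by a spanning cell
            rspan, cspan = cell
            attrs = ""
            if rspan > 1:
                attrs += f' rowspan="{rspan}"'
            if cspan > 1:
                attrs += f' colspan="{cspan}"'
            parts.append(f"<td{attrs}></td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```

For example, two row boxes, three column boxes, and one spanning box covering the last two columns of the first row yield a first row with a single `colspan="2"` cell. Assigning grid cells by band centers keeps the rule simple, but it is sensitive to slightly misaligned boxes, which hints at why small detection errors can translate into larger cell-level TEDS errors.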
Submission history
From: Bin Xiao
[v1] Fri, 1 Dec 2023 16:31:17 UTC (8,834 KB)
[v2] Wed, 10 Jan 2024 15:51:21 UTC (11,606 KB)