Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

Sebe, Clémence; Cohen-Boulakia, Sarah; Ferret, Olivier; Névéol, Aurélie

arXiv:2411.19295 (cs)

[Submitted on 28 Nov 2024 (v1), last revised 10 Mar 2025 (this version, v2)]

Abstract: Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
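As an illustration of the F-measure mentioned in the abstract, the following is a minimal sketch of an entity-level score computed from BIO-tagged token sequences. It assumes exact-match scoring (an entity counts as correct only if both its type and its boundaries match); the abstract does not state the evaluation protocol the authors used, so this may differ from theirs. The labels `Tool` and `Data` and the function names are hypothetical and chosen only for the example.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: a stray "I-" tag opens a new entity.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def f_measure(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match entity scoring."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: one of two gold entities is recovered.
gold_tags = ["B-Tool", "I-Tool", "O", "B-Data"]
pred_tags = ["B-Tool", "I-Tool", "O", "O"]
precision, recall, f1 = f_measure(gold_tags, pred_tags)
```

In this toy case the prediction finds one of the two gold entities and makes no spurious predictions, so precision is 1.0 and recall is 0.5. Scores such as the 70.4 reported in the abstract are conventionally this kind of value expressed on a 0 to 100 scale.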

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2411.19295 [cs.CL]
	(or arXiv:2411.19295v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.19295

Submission history

From: Clémence Sebe
[v1] Thu, 28 Nov 2024 18:04:31 UTC (155 KB)
[v2] Mon, 10 Mar 2025 14:00:23 UTC (157 KB)
