Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Możdżonek, Michał; Wróblewska, Anna; Tkachuk, Sergiy; Łukasik, Szymon

arXiv:2205.15712v1 (cs)

[Submitted on 31 May 2022 (this version), latest version 1 Jun 2022 (v2)]

Title: Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Authors: Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik

Abstract: Product matching is the task of identifying identical products across different data sources. It typically relies on the available product features, which are multimodal (i.e., composed of various data types) and may also be non-homogeneous and incomplete. The paper shows that pre-trained multilingual Transformer models, once fine-tuned, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The results show that these models perform comparably to the latest solutions tested on this dataset and, in some cases, better.
Additionally, we prepared a new dataset, entirely in Polish, based on offers in selected categories collected from several online stores for research purposes. It is the first open dataset for product matching in Polish and makes it possible to compare the effectiveness of pre-trained models. We also report baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
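
As an illustration of the setup the abstract describes, the following is a minimal sketch of product matching framed as binary classification over a pair of offer texts, using the Hugging Face transformers library. It is not the authors' code. The Offer container, the serialize helper, the field names, the 256-token limit, and the convention that class index 1 means "match" are all assumptions made for this example; "bert-base-multilingual-cased" is the public mBERT checkpoint and would need fine-tuning on labelled pairs before its scores mean anything.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical container for the textual features of one offer."""
    title: str
    description: str = ""


def serialize(offer: Offer) -> str:
    """Join the non-empty textual fields of an offer into one string."""
    parts = [offer.title.strip(), offer.description.strip()]
    return " ".join(p for p in parts if p)


def match_probability(left: Offer, right: Offer,
                      model_name: str = "bert-base-multilingual-cased") -> float:
    """Score one offer pair with a sequence-pair classifier.

    Assumption: model_name points to a checkpoint already fine-tuned for
    binary match / no-match classification. A raw pre-trained checkpoint
    loads with a randomly initialised classification head, so its output
    is meaningless until the model has been fine-tuned.
    """
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
    model.eval()

    # Encode both offers as one sequence pair; the tokenizer inserts the
    # model-specific separator tokens between the two texts.
    inputs = tokenizer(serialize(left), serialize(right),
                       truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 1 is assumed to be the "match" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```

Swapping model_name for "xlm-roberta-base" would exercise the other model family named in the abstract under the same assumptions, since both are loaded through the same Auto classes.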

Comments:	11 pages, 5 figures
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2205.15712 [cs.CL]
	(or arXiv:2205.15712v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.15712

Submission history

From: Anna Wróblewska
[v1] Tue, 31 May 2022 12:00:05 UTC (551 KB)
[v2] Wed, 1 Jun 2022 07:59:45 UTC (551 KB)
