MICE: Mining Idioms with Contextual Embeddings

Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko

doi:10.1016/j.knosys.2021.107606

arXiv:2008.05759 (cs)

[Submitted on 13 Aug 2020 (v1), last revised 10 Nov 2021 (this version, v2)]

Abstract: Idiomatic expressions can be problematic for natural language processing applications as their meaning cannot be inferred from their constituting words. A lack of successful methodological approaches and sufficiently large datasets prevents the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for that purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform much better than existing approaches, and are capable of detecting idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of developed models and analyze the size of the required dataset.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2008.05759 [cs.CL]
	(or arXiv:2008.05759v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2008.05759
Related DOI:	https://doi.org/10.1016/j.knosys.2021.107606

Submission history

From: Tadej Škvorc
[v1] Thu, 13 Aug 2020 08:56:40 UTC (587 KB)
[v2] Wed, 10 Nov 2021 11:20:28 UTC (641 KB)
