Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Herrera, Santiago; Corro, Caio; Kahane, Sylvain

arXiv:2403.17534 (cs)

[Submitted on 26 Mar 2024]

Abstract: Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. Formal grammars, for their part, can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. To do so, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information with each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
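The idea of using a sparse linear classifier to surface salient features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature templates, the solver, or the regularizer, so the L1 penalty, the proximal gradient solver, the hyperparameters, the feature names, and the toy data below are all assumptions made for illustration. Each example stands for a dependent-head pair, the label says whether the pair agrees, and the first feature is a conjunction of two properties (a "high-order" feature).

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_logreg(rows, labels, n_features, l1=0.1, lr=0.5, epochs=300):
    """L1-regularized logistic regression fitted by proximal gradient descent.

    rows[i] lists the indices of the binary features active in example i.
    The bias is left unpenalized. All hyperparameters are illustrative.
    """
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for active, y in zip(rows, labels):
            score = bias + sum(weights[j] for j in active)
            err = 1.0 / (1.0 + math.exp(-score)) - y  # predicted prob minus label
            grad_b += err / n
            for j in active:
                grad_w[j] += err / n
        bias -= lr * grad_b
        # Gradient step followed by soft-thresholding yields exact zeros.
        weights = [soft_threshold(w - lr * g, lr * l1)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical feature names; feature 0 is a conjunction (high-order) feature.
feature_names = [
    "deprel=nsubj & head_upos=VERB",
    "dep_upos=NOUN",
    "position=before_head",
]

# Toy data (invented): label 1 means the dependent agrees with its head.
rows = [[0, 1], [0, 2], [0, 1, 2], [0], [1], [2], [1, 2], []]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_sparse_logreg(rows, labels, n_features=len(feature_names))

# Rank features by absolute weight; zero-weight features are pruned candidates.
ranked = sorted(zip(feature_names, weights), key=lambda pair: -abs(pair[1]))
for name, weight in ranked:
    print(f"{weight:+.3f}  {name}")
```

In this toy setting the two features that are equally frequent in both classes keep a weight of exactly zero, while the conjunction feature receives a positive weight and tops the ranking. A ranking of this kind could then be compared against other statistical measures, as the abstract describes.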

Comments:	Published in LREC-Coling 2024 proceedings
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.17534 [cs.CL]
	(or arXiv:2403.17534v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.17534

Submission history

From: Caio Corro
[v1] Tue, 26 Mar 2024 09:39:53 UTC (865 KB)
