Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

Kamo, Naoyuki; Tawara, Naohiro; Ando, Atsushi; Kano, Takatomo; Sato, Hiroshi; Ikeshita, Rintaro; Moriya, Takafumi; Horiguchi, Shota; Matsuura, Kohei; Ogawa, Atsunori; Plaquet, Alexis; Ashihara, Takanori; Ochiai, Tsubasa; Mimura, Masato; Delcroix, Marc; Nakatani, Tomohiro; Asami, Taichi; Araki, Shoko

arXiv:2502.09859 (eess)

[Submitted on 14 Feb 2025]

Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system that we designed for the DASR task (Task 1) of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings and from two to eight speakers. As in the challenge baseline, we perform diarization first, followed by speech enhancement and then ASR. However, we introduce several key refinements. First, we developed a powerful speaker diarization system relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones, and we investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting the Whisper and WavLM speech foundation models. We present the results we submitted to the challenge and the updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the best challenge results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.
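
As a reading aid for the headline figure, the following minimal sketch shows one common way a "relative macro improvement" can be computed: average the error rate over scenarios (macro average), then take the relative reduction against the baseline. This is an assumption about the metric's aggregation, not a statement of the authors' exact procedure. The scenario names and error rates are hypothetical placeholders, not results from the paper, and the helper names `macro_average` and `relative_improvement` are illustrative only.

```python
def macro_average(per_scenario_error):
    """Unweighted mean of per-scenario error rates (each scenario counts equally)."""
    return sum(per_scenario_error.values()) / len(per_scenario_error)


def relative_improvement(baseline, system):
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


# Hypothetical placeholder values, NOT taken from the paper.
baseline_errors = {"scenario_a": 0.80, "scenario_b": 0.40}
system_errors = {"scenario_a": 0.30, "scenario_b": 0.18}

baseline_macro = macro_average(baseline_errors)
system_macro = macro_average(system_errors)
improvement = relative_improvement(baseline_macro, system_macro)

print(f"relative macro improvement: {improvement:.1%}")
```

Under this reading, a 63% relative improvement means the system's macro-averaged tcpWER is 37% of the baseline's, regardless of the absolute error level.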

Comments:	55 pages, 12 figures
Subjects:	Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Cite as:	arXiv:2502.09859 [eess.AS]
	(or arXiv:2502.09859v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2502.09859

Submission history

From: Naoyuki Kamo
[v1] Fri, 14 Feb 2025 01:46:54 UTC (1,152 KB)
