Cross-modal Contrastive Distillation for Instructional Activity Anticipation

Yang, Zhengyuan; Liu, Jingen; Huang, Jing; He, Xiaodong; Mei, Tao; Xu, Chenliang; Luo, Jiebo

arXiv:2201.06734 (cs)

[Submitted on 18 Jan 2022]

Abstract: In this study, we address the task of instructional activity anticipation: predicting plausible future action steps given an observation of the past. Unlike previous anticipation tasks that aim at action label prediction, our work targets generating natural language outputs that provide interpretable and accurate descriptions of future action steps. It is a challenging task due to the lack of semantic information extracted from instructional videos. To overcome this challenge, we propose a novel knowledge distillation framework that exploits related external textual knowledge to assist the visual anticipation task. Previous knowledge distillation techniques, however, generally transfer information within the same modality. To bridge the gap between the visual and text modalities during distillation, we devise a novel cross-modal contrastive distillation (CCD) scheme, which facilitates knowledge distillation between a teacher and a student in heterogeneous modalities through the proposed cross-modal distillation loss. We evaluate our method on the Tasty Videos dataset. CCD improves the anticipation performance of the visual-only student model by a relative 40.2% in BLEU4. Our approach also outperforms state-of-the-art approaches by a large margin.
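The abstract does not spell out the form of the cross-modal distillation loss. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style contrastive objective, which is an assumption and not necessarily the loss proposed in the paper. It pulls each visual student embedding toward the text teacher embedding of the same sample and pushes it away from the other teacher embeddings in the batch. The function name, the temperature value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def contrastive_distillation_loss(student_visual, teacher_text, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    Row i of `student_visual` is treated as the positive match for row i of
    `teacher_text`; every other teacher row in the batch acts as a negative.
    """
    # L2-normalize so that the dot product below is a cosine similarity.
    s = student_visual / np.linalg.norm(student_visual, axis=1, keepdims=True)
    t = teacher_text / np.linalg.norm(teacher_text, axis=1, keepdims=True)

    # Pairwise student-teacher similarities, scaled by the temperature.
    logits = (s @ t.T) / temperature

    # Numerically stable log-softmax over the teacher candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the matching (diagonal) pairs.
    return float(-np.mean(np.diag(log_probs)))


# Toy data: four mutually orthogonal "text teacher" embeddings in 8 dimensions.
teacher = np.eye(4, 8)
aligned_student = teacher.copy()                # student matches its own teacher row
mismatched_student = np.roll(teacher, 1, axis=0)  # student matches the wrong row

loss_aligned = contrastive_distillation_loss(aligned_student, teacher)
loss_mismatched = contrastive_distillation_loss(mismatched_student, teacher)
```

In this toy setup the aligned student yields a loss close to zero, while the mismatched student yields a much larger one, which is the behavior a contrastive distillation signal is meant to provide across modalities.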

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2201.06734 [cs.CV]
	(or arXiv:2201.06734v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2201.06734

Submission history

From: Zhengyuan Yang
[v1] Tue, 18 Jan 2022 04:20:33 UTC (1,161 KB)
