2021
DOI: 10.48550/arxiv.2110.06894
Preprint

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Abstract: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applicati…

Citations: cited by 1 publication (1 citation statement)
References: 19 publications (37 reference statements)
“…Table 3 shows our competition results for DSTC10. The table also shows the results of the baseline system by the organizers based on a Transformer encoder-decoder using I3D and Vggish (Shah et al 2021) and the subjective score for the ground truth answers. The subjective evaluation for our models examined only the fixed-frame model.…”
Section: Results (mentioning)
confidence: 99%