“…The key contributions of this paper are summarized as follows: • We introduce a semi-supervised SLU framework for learning semantics from speech to alleviate: (1) the need for a large amount of in-house, homogeneous data [2,7,8,17], (2) the limitation to intent classification only [8,9,13], by predicting text, slots, and intents, and (3) the need for any additional manipulation of labels or losses, such as label projection [26], output serialization [7,18,19], ASR n-best hypotheses, or ASR-robust training losses [13,27]. Figure 2 illustrates our approach.…”