“…Context is either modeled through a condensed vector of contextual cues (e.g., Chen et al., 2018; Colombo et al., 2020; Kumar et al., 2018; Raheja & Tetreault, 2019) or directly in the neural network structure, with a node for each utterance (e.g., Cerisara et al., 2018; Kalchbrenner & Blunsom, 2013; Ortega & Vu, 2017; Ribeiro et al., 2019b). Some studies that used deep learning for dialog act classification did not include contextual cues (e.g., Duran & Battle, 2018; Duran et al., 2023; Khanpour et al., 2016; Ribeiro et al., 2019a), but most encoded context by using a condensed surface encoding of the previous utterances (e.g., Chen et al., 2018; Kumar et al., 2018; Yano et al., 2021; Zhao & Kawahara, 2019). Most deep learning models have treated dialog act classification as the classification of a sequence of dialog acts, without attending to the speaker or to the grouping of utterances into turns.…”
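
To make the idea of a condensed surface encoding of previous utterances concrete, the following is a minimal sketch in Python. It is not taken from any of the cited studies: the function names (`bag_of_words`, `condensed_context`, `features_with_context`), the bag-of-words representation, the averaging step, and the `window` parameter are all assumptions chosen for illustration. The cited models typically use learned neural encoders instead.

```python
from collections import Counter


def bag_of_words(utterance, vocabulary):
    """Count how often each vocabulary word occurs in one utterance."""
    counts = Counter(utterance.lower().split())
    return [float(counts[word]) for word in vocabulary]


def condensed_context(previous, vocabulary, window=2):
    """Average the vectors of the last `window` utterances into one vector.

    Hypothetical helper: averaging is only one simple way to condense context.
    """
    recent = previous[-window:] if window > 0 else []
    if not recent:
        # No preceding utterances: use an all-zero context vector.
        return [0.0] * len(vocabulary)
    vectors = [bag_of_words(u, vocabulary) for u in recent]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def features_with_context(dialog, index, vocabulary, window=2):
    """Concatenate the current utterance vector with its condensed context."""
    current = bag_of_words(dialog[index], vocabulary)
    context = condensed_context(dialog[:index], vocabulary, window)
    return current + context
```

A classifier would receive the concatenated vector as input for each utterance. Note that this sketch, like most of the models described above, ignores who is speaking and how utterances group into turns.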