“…Additionally, a standard dropout layer was connected after the "transition" structure shown in Figure S1A(b) (Table S1), and pixels were randomly dropped to prevent strong correlation between the feature maps of successive frames [39]. In addition, the spatial dropout layer connected to the "transition" structure, shown in Figure S2C, was used to extract fine movement features of the lips, teeth, and tongue, which have strong spatial correlation [31][32][33][34][35][36][37][38][39][40][41][42][43]. Therefore, the proposed dense spatial-temporal CNN comprises one layer that represents a nonlinear transformation $H_l$, and the output of the $l$-th layer can be expressed as $x_l = H_l([x_0, x_1, \cdots, x_{l-1}])$ (3), where $x_0, x_1, \cdots, x_{l-1}$ denote the 3D feature volumes created in the preceding layers and $[\cdots]$ denotes a concatenation operation.…”
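The dense connectivity in Equation (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Volume` representation (a list of channels, each a flattened list of voxels), the helper names `concat`, `make_layer`, and `dense_block`, and the toy transformation standing in for $H_l$ (a ReLU of the channel mean) are all assumptions chosen only to show how each layer receives the concatenation of every earlier feature volume.

```python
from typing import Callable, List

# Hypothetical representation: a feature volume is a list of channels,
# and each channel is a flattened list of T*H*W voxel values.
Volume = List[List[float]]


def concat(volumes: List[Volume]) -> Volume:
    """Channel-wise concatenation [x_0, x_1, ..., x_{l-1}]."""
    return [channel for volume in volumes for channel in volume]


def make_layer(growth_rate: int) -> Callable[[Volume], Volume]:
    """Toy stand-in for H_l (assumption, not the paper's layer).

    It averages the input channels voxel by voxel, applies a ReLU,
    and emits `growth_rate` identical output channels.
    """
    def h(x: Volume) -> Volume:
        n_channels = len(x)
        n_voxels = len(x[0])
        mean = [
            max(0.0, sum(channel[i] for channel in x) / n_channels)
            for i in range(n_voxels)
        ]
        return [list(mean) for _ in range(growth_rate)]
    return h


def dense_block(x0: Volume, layers: List[Callable[[Volume], Volume]]) -> List[Volume]:
    """Apply x_l = H_l([x_0, ..., x_{l-1}]) for each layer in turn."""
    features = [x0]
    for h in layers:
        # Every layer sees all earlier feature volumes, not just the last one.
        features.append(h(concat(features)))
    return features


# Example: 2 input channels, 3 layers that each add 3 channels.
x0 = [[1.0, 2.0, 3.0, 4.0], [-3.0, -2.0, -1.0, 0.0]]
features = dense_block(x0, [make_layer(3) for _ in range(3)])
total_channels = len(concat(features))  # 2 + 3 * 3 channels
```

Because every layer's output is appended to the running list, the channel count grows linearly with depth. A "transition" structure is typically what reduces that count again between dense blocks.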