The present study explored the comparability of the computer-delivered and face-to-face modes of the two speaking tests in the Vietnamese Standardized Test of English Proficiency (VSTEP), namely the VSTEP.2 and VSTEP.3–5 Speaking tests. Comparability was examined in terms of performance scores based on Vietnam's Six-Level Foreign Language Proficiency Framework (VNFLPF) and in terms of test takers' experiences. Data were collected from 75 VSTEP.2 and 82 VSTEP.3–5 test takers, all English-major university students, under both computer-delivered and face-to-face conditions. A counterbalanced research design was adopted to minimise mode order effects. After completing the tests, 30 test takers (15 from each test) were interviewed in focus groups of three to four members. The results indicated mixed and selective effects of testing mode. Overall scores were comparable across modes for the VSTEP.2 Speaking test but significantly higher in the face-to-face mode for the VSTEP.3–5 Speaking test. At the level of the analytic criteria, however, a statistically significant difference emerged on only one criterion in each test (content development in the VSTEP.2 and pronunciation in the VSTEP.3–5), and the mode advantage was not consistent across the two. The interview data provided rich, fresh insights into how test takers viewed each testing mode relative to real-life communication. Their accounts further revealed a wide range of affective preferences linked to the inherent affordances and constraints of each mode and to test takers' own orientation towards communication or towards performance and outcomes. The findings have implications for extrapolation, test preparation and administration, and test taker and rater training, both for these two English speaking proficiency tests in Vietnam and possibly beyond.