“…One way to train a CNN for speech recognition is to use spectrograms as input images (Li et al., 2013). Various techniques have been investigated for building robust speech recognizers and/or catering for multiple languages (Seki et al., 2017; Wu et al., 2017; Kundu et al., 2016; Chen and Mak, 2015; Zhao et al., 2014); in this context, however, only a simple speech recognizer was required, able to categorize audio inputs as one of seven possibilities, namely “Yes”, “No”, “Okay”, “Don’t”, “Wait”, “Stop” and Negative. The first six categories included two positive answers and four negative answers, given that the monitoring system will ask the operator predefined yes/no questions when in need of further clarification; the Negative category consisted of examples of background noise.…”
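
As a rough illustration of the setup described in the excerpt, the sketch below turns an audio signal into a log-magnitude spectrogram that could serve as a CNN input image, and maps a predicted category to the answer it conveys. It is a minimal sketch under assumptions, not the authors' pipeline: the function names, the frame length and hop size, and the Hann window are all hypothetical choices, and the grouping of “Yes” and “Okay” as the two positive answers is inferred from the excerpt's description.

```python
import numpy as np

# The seven categories named in the excerpt.
CLASSES = ["Yes", "No", "Okay", "Don't", "Wait", "Stop", "Negative"]
# Assumed grouping, inferred from "two positive answers and four negative answers".
POSITIVE = {"Yes", "Okay"}
NEGATIVE = {"No", "Don't", "Wait", "Stop"}


def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames, frame_len // 2 + 1) log-magnitude spectrogram.

    frame_len and hop are illustrative values, not taken from the source.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of each frame, compressed with log(1 + x).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)


def interpret(label):
    """Map a predicted category to the answer it conveys."""
    if label in POSITIVE:
        return "positive"
    if label in NEGATIVE:
        return "negative"
    return "no answer"  # the Negative category, i.e. background noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)  # one second of noise at an assumed 16 kHz
    image = log_spectrogram(audio)
    print(image.shape, interpret("Okay"), interpret("Negative"))
```

The resulting two-dimensional array can be treated like a single-channel image; a classifier over it would output one of the seven entries in `CLASSES`, which `interpret` then reduces to a yes/no answer or to no answer at all.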