Scene text localization and recognition is a topic in computer vision that aims to delimit candidate regions in an input image containing incidental scene text elements. The main challenge lies in devising detectors that can cope with wide variability in font size, font style, color, background complexity, and language, among other factors. This work compares two strategies for building classification models, both based on a Convolutional Neural Network, to detect textual elements in multiple languages in images: (i) a classification model built on a multi-lingual training scenario; and (ii) a classification model built on a language-specific training scenario. The experiments in this work indicate that the language-specific model outperforms the model trained on the multi-lingual scenario, with improvements of 14.79%, 8.94%, and 11.43% in precision, recall, and F-measure, respectively.