Dialects are language variations that arise from differences among social groups or geographical regions. Dialect speech recognition aims to accurately transcribe spoken language that exhibits regional variation in vocabulary, syntax, and pronunciation, so models must be trained on diverse dialects to handle these linguistic differences effectively. Recent advances in automatic speech recognition (ASR) build on deep learning architectures such as recurrent neural networks (RNNs), deep neural networks (DNNs), and convolutional neural networks (CNNs). Nevertheless, multi-dialect speech recognition remains a challenge despite the progress of deep learning (DL) in speech recognition for many computing applications in environmental modeling and smart cities. Although dialect-specific acoustic models are known to perform well, they are not easy to maintain when the number of dialects across languages is large and dialect-specific data are limited. This paper presents an Automated Multi-Dialect Speech Recognition using Stacked Attention-based Deep Learning (MDSR-SADL) technique for environmental modeling and smart cities. The MDSR-SADL technique primarily applies a DL model to identify various dialects. Specifically, it uses a stacked long short-term memory with attention-based autoencoder (SLSTM-AAE) model, which integrates stack modeling with LSTM and an autoencoder (AE); the attention mechanism enables dialect identification by supplying dialect details for speech recognition. The MDSR-SADL model further employs the Fractals Harris Hawks Optimization (FHHO) algorithm for hyperparameter selection. A series of simulations was performed to illustrate the improved performance of the MDSR-SADL model. The experimental investigation shows that the MDSR-SADL technique achieves superior accuracy values of 99.52% and 99.55% over other techniques on the Tibetan and Chinese datasets, respectively.
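The abstract does not detail the SLSTM-AAE configuration, so the following is only a minimal PyTorch sketch of the general idea: a stacked LSTM encoder, an additive attention layer that pools frame-level encodings into a dialect summary, and an LSTM decoder that reconstructs the input features (the autoencoder branch). The class name `SLSTMAAE` and all sizes (`n_feats`, `hidden`, `n_layers`, `n_dialects`) are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of a stacked-LSTM autoencoder with additive attention
# for dialect identification. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SLSTMAAE(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_layers=3, n_dialects=2):
        super().__init__()
        # Stacked LSTM encoder over acoustic feature frames
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=n_layers,
                               batch_first=True)
        # Additive attention that scores each encoder time step
        self.attn = nn.Sequential(nn.Linear(hidden, hidden),
                                  nn.Tanh(),
                                  nn.Linear(hidden, 1))
        # Stacked LSTM decoder reconstructs the input (autoencoder branch)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=n_layers,
                               batch_first=True)
        self.recon = nn.Linear(hidden, n_feats)
        # Dialect classifier on the attention-pooled encoding
        self.classifier = nn.Linear(hidden, n_dialects)

    def forward(self, x):                      # x: (batch, time, n_feats)
        enc, _ = self.encoder(x)               # (batch, time, hidden)
        weights = torch.softmax(self.attn(enc), dim=1)
        context = (weights * enc).sum(dim=1)   # attention-pooled summary
        dec, _ = self.decoder(enc)
        return self.recon(dec), self.classifier(context)


# Joint objective: feature reconstruction plus dialect classification
model = SLSTMAAE()
x = torch.randn(8, 100, 40)                    # dummy batch of feature frames
y = torch.randint(0, 2, (8,))                  # dummy dialect labels
recon, logits = model(x)
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(logits, y)
loss.backward()
```

In this sketch the attention weights give the classifier a dialect-focused summary of the utterance, while the reconstruction loss regularizes the shared encoder, which is one plausible reading of how the attention-based autoencoder "offers dialect details for speech identification."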
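The fractal enhancement in FHHO is likewise not specified in the abstract. The sketch below is a plain, simplified Harris Hawks Optimization loop over a two-dimensional hyperparameter space (log learning rate and hidden width); the toy `fitness` function, the bounds, the population size, and the omission of HHO's soft-besiege and rapid-dive phases are all assumptions made for brevity, and in practice `fitness` would train and validate the model above.

```python
# Simplified Harris Hawks Optimization (HHO) loop for hyperparameter
# search. The fractal variant (FHHO) and the paper's real search space
# are not given in the abstract; everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Search space: [log10(learning rate), hidden units] (assumed bounds)
lb = np.array([-5.0, 32.0])
ub = np.array([-1.0, 512.0])


def fitness(pos):
    # Stand-in for validation loss of a model trained with these
    # hyperparameters; replace with real training and evaluation.
    return (pos[0] + 3) ** 2 + ((pos[1] - 128) / 128) ** 2


n_hawks, n_iters = 10, 30
hawks = rng.uniform(lb, ub, size=(n_hawks, 2))
scores = np.array([fitness(h) for h in hawks])
best_idx = scores.argmin()
best, best_score = hawks[best_idx].copy(), scores[best_idx]

for t in range(n_iters):
    for i in range(n_hawks):
        # Escaping energy decays over iterations and drives the
        # exploration/exploitation switch
        E = 2 * rng.uniform(-1, 1) * (1 - t / n_iters)
        if abs(E) >= 1:
            # Exploration: perch relative to a randomly chosen hawk
            # (one of HHO's two perch strategies)
            rand_hawk = hawks[rng.integers(n_hawks)]
            hawks[i] = rand_hawk - rng.random() * np.abs(
                rand_hawk - 2 * rng.random() * hawks[i])
        else:
            # Exploitation: hard besiege toward the best solution;
            # soft besiege and rapid dives are omitted for brevity
            hawks[i] = best - E * np.abs(best - hawks[i])
        hawks[i] = np.clip(hawks[i], lb, ub)
        scores[i] = fitness(hawks[i])
        if scores[i] < best_score:
            best, best_score = hawks[i].copy(), scores[i]

print("best hyperparameters: lr=%.1e, hidden=%d" % (10 ** best[0], round(best[1])))
```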