We investigate state-of-the-art automatic speech recognition (ASR) systems and thoroughly examine training methods for adapting them to low-resource electrolaryngeal (EL) speech datasets. Transfer learning is often sufficient to resolve low-resource problems; in EL speech, however, the domain shift between the pretraining and fine-tuning data is too large to overcome, limiting ASR performance. We propose reducing this domain shift during transfer learning between healthy and EL datasets by introducing an intermediate fine-tuning task that uses imperfectly synthesized EL speech. Although using imperfect synthetic speech is counterintuitive, we demonstrate the effectiveness of this method, decreasing the character error rate by up to 6.1% compared to baselines using naive transfer learning. To better understand the model's behavior, we analyze the latent spaces produced in each task through linguistic and speaker-identity proxy tasks and find that the intermediate fine-tuning focuses on identifying the voicing characteristics of EL speakers. Moreover, we compare a simulated EL speaker with a real EL speaker and find that simulated EL data differs from real EL data in pronunciation, highlighting the large domain gap between real EL speech and other speech data.