Labeling large amounts of speech is laborious and expensive. The scarcity of speech with a particular accent or from specific scenarios hinders further application of ASR systems in practice. In contrast, collecting unlabeled speech and a domain-related text corpus is far more achievable. In this work, we propose an end-to-end model called Spiker-Converter for low-resource speech recognition. It decomposes the ASR task by introducing additional acoustic supervision, dramatically reducing the demand for labeled samples. In addition, we provide a semi-supervised training method that consumes only a few labeled speech samples together with large amounts of unlabeled speech and domain-related text. Specifically, we use a Discriminator to produce learning signals for the ASR model when unlabeled speech is the input. Note that we apply adversarial training only to part of the ASR model, which ensures stability. Experiments show that our semi-supervised training method is significantly effective. For now, our method is limited to Chinese-like languages, but it points to a promising direction for low-resource speech recognition.
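To make the semi-supervised idea above concrete, the following is a minimal sketch of one training step in which a Discriminator scores ASR outputs on unlabeled speech against domain-related text, and only part of the ASR model receives the adversarial gradient. All names here (acoustic_encoder, linguistic_decoder, the Discriminator architecture, and the encoder/decoder split) are illustrative assumptions in PyTorch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores a sequence of token posteriors as 'domain text' vs 'ASR output'."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(vocab_size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_probs):            # (B, T, vocab_size)
        _, h = self.rnn(token_probs)
        return self.head(h[-1])                # (B, 1) real/fake logit

def semi_supervised_step(acoustic_encoder, linguistic_decoder, discriminator,
                         unlabeled_speech, domain_text_onehot,
                         opt_asr_part, opt_disc):
    """One adversarial semi-supervised update (hypothetical sketch)."""
    bce = nn.BCEWithLogitsLoss()

    # --- Discriminator update: real domain text vs. ASR posteriors ---
    with torch.no_grad():
        fake_probs = linguistic_decoder(acoustic_encoder(unlabeled_speech))
    d_real = discriminator(domain_text_onehot)
    d_fake = discriminator(fake_probs)
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # --- Adversarial update: only part of the ASR model (here, assumed to be
    #     the decoder; the encoder output is detached) gets the learning signal,
    #     mirroring the stability argument in the abstract ---
    fake_probs = linguistic_decoder(acoustic_encoder(unlabeled_speech).detach())
    g_logit = discriminator(fake_probs)
    g_loss = bce(g_logit, torch.ones_like(g_logit))
    opt_asr_part.zero_grad(); g_loss.backward(); opt_asr_part.step()
    return d_loss.item(), g_loss.item()
```

In this sketch the labeled-speech loss and the additional acoustic supervision mentioned in the abstract would be applied in separate steps; which sub-module of the ASR model receives the adversarial gradient is an assumption made here for illustration.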