Modern Automatic Speech Recognition (ASR) systems can accurately recognize which words are spoken. However, disfluencies, grammatical errors, and other phenomena in spontaneous speech make verbatim ASR transcriptions hard to read, and readability is crucial both for human comprehension and for downstream tasks that need to understand the meaning and intent of what is spoken. In this work, we formulate ASR post-processing for readability (APR) as a sequence-to-sequence text generation problem that aims to transform incorrect and noisy ASR output into text that is readable for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. To address the scarcity of training data, we propose a novel data augmentation method that synthesizes large-scale training data from a grammatical error correction dataset. We build our APR model on a pre-trained language model and train it with a two-stage training strategy to better exploit the augmented data. On the constructed test set, our approach outperforms the best baseline system by a large margin: 17.53 points on BLEU and 13.26 points on readability-aware WER (RA-WER). Human evaluation also shows that our model generates more readable transcripts than the baseline method.

INDEX TERMS automatic post-editing, ASR post-processing for readability, data augmentation, pre-trained language model, natural language processing