Automatic transcription converts audio signals into musical notes and has significant research value. In this paper, dual-channel constant-Q transform (CQT) spectra were extracted from piano audio as input features. In the design of the automatic transcription model, a convolutional neural network (CNN) was employed to extract local features, and a Transformer was then combined with it to capture global features, yielding a CNN-Transformer automatic transcription model with two CNN layers and three Transformer layers. Experiments were conducted on the MAPS and MAESTRO datasets. The results showed that the dual-channel CQT outperformed the short-time Fourier transform (STFT) and single-channel CQT in automatic transcription. On the MAPS dataset, the dual-channel CQT achieved the best frame-level transcription results, with a precision (P) of 0.9115, a recall (R) of 0.8055, and an F1 score of 0.8551, and a sliding window of seven frames yielded the best transcription results. Compared with deep neural network (DNN) and CNN models, the CNN-Transformer model demonstrated superior performance, achieving frame-level F1 scores of 0.8551 and 0.9042 on the MAPS and MAESTRO datasets, respectively. These findings confirm the reliability of the designed model for automatic piano transcription and highlight its practical applicability.
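For illustration, the following is a minimal PyTorch sketch of the architecture described above: two convolutional layers extract local features from the dual-channel CQT input, a three-layer Transformer encoder captures global temporal context, and a per-frame 88-key sigmoid output gives frame-level activations. All hyperparameters here (352 CQT bins, 32 convolutional channels, a 256-dimensional Transformer with 4 heads) are illustrative assumptions not given in the abstract, and "dual-channel" is read as the per-channel CQTs of stereo audio stacked as two input planes; the paper's actual configuration may differ.

```python
import torch
import torch.nn as nn

class CNNTransformerTranscriber(nn.Module):
    """Sketch of a CNN-Transformer frame-level transcription model:
    a two-layer CNN extracts local spectral features from dual-channel
    CQT input, and a three-layer Transformer encoder models global
    temporal context before an 88-key per-frame sigmoid output."""

    def __init__(self, n_bins=352, d_model=256, n_keys=88):
        super().__init__()
        # Two convolutional layers over (channel, time, frequency);
        # in_channels=2 for the dual-channel CQT (assumed stereo).
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Flatten the frequency axis and project to the model dimension.
        self.proj = nn.Linear(32 * n_bins, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=3)
        self.head = nn.Linear(d_model, n_keys)  # one logit per piano key

    def forward(self, x):
        # x: (batch, 2, time, n_bins) dual-channel CQT patch
        h = self.cnn(x)                          # (batch, 32, time, n_bins)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h = self.proj(h)                         # (batch, time, d_model)
        h = self.transformer(h)                  # global temporal context
        return torch.sigmoid(self.head(h))       # per-frame key activations


# Usage with the seven-frame sliding window reported as optimal:
model = CNNTransformerTranscriber()
window = torch.randn(1, 2, 7, 352)   # (batch, channels, frames, CQT bins)
activations = model(window)          # (1, 7, 88); e.g., take the center frame
```

Under this reading, the window slides over the CQT spectrogram one frame at a time, and the activations of the center frame (here frame 3 of 7) serve as that frame's multi-pitch prediction, which is then thresholded and scored against the ground truth with the frame-level P, R, and F1 metrics quoted above.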