Text classification is not only a prerequisite for natural language processing tasks such as sentiment analysis and natural language inference, but is also of great significance for screening massive amounts of information in daily life. However, the performance of classification algorithms is often degraded by the diversity of language expressions, ambiguous semantic information, colloquial expressions, and many other problems. To cope with these challenges, we identify three clues in this study: core relevance information, semantic location associations, and the differing capacities of deep and shallow networks to mine different kinds of information. Based on these three clues, two key insights about text are revealed: the key information relationship and the word group inline relationship. We propose a novel attention feature fusion network, Attention Pyramid Transformer (APTrans), which learns the core semantic and location information of sentences using these two key insights. Specifically, a hierarchical feature fusion module, Feature Fusion Connection (FFCon), is proposed to merge the semantic features of higher layers with the positional features of lower layers. A Transformer-based XLNet network is used as the backbone to first extract long-range dependencies from sentences. Comprehensive experiments show that APTrans achieves leading results on the Chinese THUCNews dataset and the English AG News and TREC-QA datasets, outperforming most state-of-the-art pre-trained models. Furthermore, extended experiments are carried out on a self-built Chinese dataset, a teachers' classroom corpus for theme analysis. We also provide visualizations, further demonstrating that APTrans has good potential for text classification.
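
To make the hierarchical fusion idea concrete, the sketch below shows one plausible way to merge higher-layer semantic features with lower-layer positional features on top of an XLNet backbone. The abstract does not specify FFCon's internals, so the gated-fusion mechanism, the choice of shallow layer (`low_layer=3`), the mean pooling, and the checkpoint name `xlnet-base-cased` are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an FFCon-style feature fusion head over XLNet.
# Assumptions (not from the paper): sigmoid gating, layer index 3 as the
# "shallow" source, mean pooling, and the xlnet-base-cased checkpoint.
import torch
import torch.nn as nn
from transformers import XLNetModel

class FFConSketch(nn.Module):
    """Fuses deep (semantic) and shallow (positional) XLNet features."""

    def __init__(self, model_name="xlnet-base-cased", low_layer=3, num_classes=2):
        super().__init__()
        # output_hidden_states=True exposes every layer's hidden states.
        self.backbone = XLNetModel.from_pretrained(
            model_name, output_hidden_states=True)
        self.low_layer = low_layer
        hidden = self.backbone.config.d_model
        self.gate = nn.Linear(2 * hidden, hidden)      # learned fusion gate (assumed)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        low = out.hidden_states[self.low_layer]        # shallow layer: positional cues
        high = out.hidden_states[-1]                   # deep layer: semantic cues
        # Token-wise gate decides how much of each level to keep.
        g = torch.sigmoid(self.gate(torch.cat([low, high], dim=-1)))
        fused = g * high + (1.0 - g) * low             # hierarchical feature fusion
        pooled = fused.mean(dim=1)                     # mean-pool over the sequence
        return self.classifier(pooled)                 # class logits
```

A gated sum is only one option; concatenation followed by a projection, or cross-attention from deep to shallow features, would serve the same purpose of letting the classifier draw on both semantic and positional information.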