Encoder-decoder frameworks built from Transformer blocks have become a dominant design for image deblurring. In this work, we critically reexamine this approach. Our analysis reveals that many current architectures attend only to limited local regions during feature extraction; this narrow focus reduces the richness and diversity of the features passed to the encoder-decoder, creating an information bottleneck. At the same time, these designs rely heavily on global features, which can cause crucial local details to be neglected and degrade deblurring quality. To address these issues, we propose a novel hierarchical patch aggregation Transformer (HPAT). In the initial feature extraction stage, we introduce cross-axis spatial Transformer blocks with linear complexity, complemented by an adaptive hierarchical attention fusion mechanism. These components enable the model to capture spatial relationships among features and integrate information from multiple hierarchical levels. We further redesign the feedforward network within the Transformer blocks of the encoder-decoder, yielding a Fusion Feedforward Network (F3N) that aggregates token information more effectively and strengthens the model's ability to capture and retain local details. Comprehensive experiments on a variety of publicly available datasets confirm the effectiveness of HPAT and show that it achieves state-of-the-art performance on image deblurring tasks.
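The exact design of the cross-axis spatial Transformer block is given in the paper body. As a rough, hypothetical illustration of why attending along each spatial axis separately yields linear complexity in the attended dimension (cost O(HW(H+W)) rather than O((HW)^2) for full 2-D self-attention), consider this minimal single-head NumPy sketch; all names and details here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    # x: feature map of shape (H, W, C).
    # Attend only along one spatial axis, treating the other as batch.
    if axis == 0:
        x = x.transpose(1, 0, 2)  # (W, H, C): attend over H per column
    # scores: (batch, len, len) where len is the attended axis length.
    scores = np.einsum('bic,bjc->bij', x, x) / np.sqrt(x.shape[-1])
    out = np.einsum('bij,bjc->bic', softmax(scores), x)
    if axis == 0:
        out = out.transpose(1, 0, 2)  # back to (H, W, C)
    return out

def cross_axis_attention(x):
    # Sequential height-axis then width-axis attention: each token can
    # still reach any other position in two hops, at far lower cost.
    return axis_attention(axis_attention(x, axis=0), axis=1)

x = np.random.rand(8, 8, 4)
y = cross_axis_attention(x)
print(y.shape)  # (8, 8, 4)
```

In this toy form the queries, keys, and values share one projection; a practical block would use learned per-head projections and feed the result through the (here, F3N-style) feedforward sublayer.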