Cracks are among the most significant threats to the safety of concrete bridges. However, detecting concrete bridge cracks remains challenging because cracks are slender, have low contrast, and are subject to background interference. Existing convolutional methods with square kernels struggle to capture crack features effectively, fail to perceive the long-range dependencies between crack regions, and suppress background noise only weakly, which leads to low detection precision for bridge cracks. To address this problem, this paper proposes a multi-stage feature aggregation and structure awareness network (MFSA-Net) for pixel-level concrete bridge crack detection. Specifically, in the encoding stage, a structure-aware convolution block is proposed that combines square convolution with strip convolution to perceive the linear structure of concrete bridge cracks. Square convolution captures detailed local information, whereas strip convolution interacts with the local features to establish long-range dependencies between discrete crack regions. Unlike the self-attention mechanism, strip convolution also suppresses background interference near crack regions. Meanwhile, a feature attention fusion block is presented to fuse features from the encoder and decoder at the same stage, which sharpens the edges of concrete bridge cracks. To fully utilize shallow detail features and deep semantic features, the features from different stages are aggregated to obtain fine-grained segmentation results. The proposed MFSA-Net was trained and evaluated on a publicly available concrete bridge crack dataset and achieved average results of 73.74%, 77.04%, 75.30%, and 60.48% for precision, recall, F1 score, and IoU, respectively, on three typical sub-datasets, outperforming the existing methods it was compared with.
MFSA-Net also achieved the best performance on two publicly available concrete pavement crack datasets, indicating its adaptability to crack detection across diverse scenarios.
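
The intuition behind combining square and strip convolution can be illustrated with a minimal sketch. It is not the MFSA-Net implementation: the helper names (`conv2d_same`, `mean_kernel`), the fixed averaging kernels, the strip length of 7, and the additive fusion are all assumptions made for illustration, whereas the network would learn its kernel weights. On a toy crack map with a gap, a 3x3 square window sees nothing at the centre of the gap, while a 1x7 strip window still reaches the crack segments on both sides.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size.

    Assumes odd kernel dimensions so the window has a centre pixel.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    # Pixels outside the image count as zero (zero padding).
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def mean_kernel(rows, cols):
    """Hypothetical fixed averaging kernel; a real network learns its weights."""
    return [[1.0 / (rows * cols)] * cols for _ in range(rows)]


# Toy 5x9 binary map: a horizontal crack on row 2, interrupted at columns 3-5.
image = [[0.0] * 9 for _ in range(5)]
for col in (0, 1, 2, 6, 7, 8):
    image[2][col] = 1.0

square = conv2d_same(image, mean_kernel(3, 3))  # local detail only
strip = conv2d_same(image, mean_kernel(1, 7))   # long, thin horizontal window

# Assumed fusion for this sketch: element-wise sum of the two responses.
fused = [[s + t for s, t in zip(row_s, row_t)]
         for row_s, row_t in zip(square, strip)]

print("square response at gap centre:", square[2][4])
print("strip response at gap centre:", strip[2][4])
```

A vertical strip (for example `mean_kernel(7, 1)`) would play the same role for cracks running in the other direction; how the strip orientations are chosen and combined in MFSA-Net is not specified in this abstract.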