In this research, an attention-based feature fusion network (AFFNet), built on a residual network (ResNet101) backbone enhanced with two attention mechanism modules, is proposed for automatic pixel-level detection of concrete cracks. The two attention modules, namely the vertical and horizontal compression attention module (VH-CAM) and the efficient channel attention upsample module (ECAUM), enable the network to concentrate selectively on crack features. The VH-CAM generates a feature map that integrates pixel-level information in the vertical and horizontal directions. The ECAUM, applied at each decoder layer, combines efficient channel attention (ECA) with feature fusion, providing rich contextual information as guidance to help low-level features recover crack localization. The proposed model is evaluated on the test dataset and reaches a mean intersection over union (MIoU) of 84.49%. Comparison with other state-of-the-art models demonstrates the high efficiency and accuracy of the proposed method.
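For readers unfamiliar with the reported metric, the following is a minimal sketch of how MIoU can be computed for a two-class (background and crack) segmentation mask. It is illustrative only: the function name `mean_iou`, the toy masks, and the choice to skip classes absent from both masks are assumptions, and the exact evaluation protocol used in this work (for example, per-image versus dataset-level accumulation) is not specified here.

```python
from typing import Sequence


def mean_iou(pred: Sequence[Sequence[int]],
             target: Sequence[Sequence[int]],
             num_classes: int = 2) -> float:
    """Mean intersection over union across classes for two label masks.

    Hypothetical helper for illustration; labels are integers in
    [0, num_classes), e.g. 0 = background and 1 = crack.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: counts toward intersection and union of that class.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: counts toward the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Assumption: classes that appear in neither mask are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x4 masks (made-up values, not data from the paper).
pred = [[0, 0, 1, 1],
        [0, 1, 1, 0]]
target = [[0, 0, 1, 0],
          [0, 1, 1, 0]]
score = mean_iou(pred, target)
print(f"MIoU = {score:.3f}")
```

In this toy case the background class has an IoU of 4/5 and the crack class 3/4, so the mean is 0.775. Because crack pixels are typically a small fraction of an image, averaging over both classes keeps the background from dominating the score the way it would in plain pixel accuracy.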