Sanitary sewer systems are major infrastructure in every modern city and are essential for preventing water pollution and urban waterlogging. Because the condition of sewer systems deteriorates continuously over time due to various defects and extrinsic factors, early intervention is necessary to prolong the service life of the pipelines. However, prior work on defect inspection is limited in accuracy, efficiency, and economic cost. In addition, the loss functions currently used in object detection approaches do not handle imbalanced data well. To address these drawbacks, this paper proposes an automatic defect detection framework, based on a deep neural network, that accurately identifies and localizes eight types of defects in closed-circuit television videos. First, an effective attention module is introduced into the backbone of the detector for better feature extraction. Then, a novel feature fusion mechanism is presented in the neck to alleviate the problem of feature dilution. Next, an efficient loss function that reasonably adjusts the weights of training samples is proposed to tackle the imbalanced data problem. In addition, a publicly available dataset is provided for defect detection tasks. The proposed detection framework is robust against imbalanced data and achieves a state-of-the-art mean average precision of 73.4%, which shows its potential for application in real-world sewer defect inspection.
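
To make the idea of adjusting the weight of training samples concrete, the sketch below uses the well-known binary focal loss as a stand-in. It is not the loss function proposed in this paper, whose form is not given here. The function names and the default values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration only.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross-entropy for one prediction.

    p: predicted probability of the positive class, in (0, 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (illustrative stand-in, not this paper's loss).

    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    samples, so hard samples dominate the total loss.
    alpha balances the positive and negative classes.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy positive (p = 0.9) versus a hard positive (p = 0.1).
    for p in (0.9, 0.1):
        print(f"p={p}: CE={cross_entropy(p, 1):.4f}, FL={focal_loss(p, 1):.6f}")
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the cross-entropy, so the two parameters control how strongly easy samples are down-weighted relative to the unweighted baseline.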