Inspection and condition monitoring of stormwater pipe networks have become increasingly crucial because of the networks' vast geographical span and complex structure. Unmanaged pipelines present significant risks, such as water leakage and flooding, that threaten urban infrastructure. However, only a small percentage of pipelines are inspected each year. The current practice of CCTV inspection is labor-intensive, time-consuming, and inconsistent in its judgments. This study therefore proposes a cost-effective and efficient semi-automated approach that integrates computer vision with Deep Learning (DL) algorithms. A DL model is developed using YOLOv8 with instance segmentation to identify six types of defects as described in the Water Services Association (WSA) Code of Australia. Using CCTV footage from Banyule City Council, the model achieved a mean average precision (mAP@0.5) of 0.92 for bounding boxes and 0.90 for masks. A cost–benefit analysis was conducted to assess the economic viability of the proposed approach. Despite the high initial development costs, ongoing annual costs were observed to decrease by 50%. The model delivers faster, more accurate, and more consistent results, enabling additional pipelines to be inspected each year. It can serve as a tool for local councils across Australia to conduct condition assessments of stormwater pipelines, ultimately supporting resilient and safe infrastructure asset management.
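
For readers unfamiliar with the mAP@0.5 metric quoted above, the following is a minimal sketch of the matching rule behind it: a predicted defect region counts as a correct detection only if its intersection over union (IoU) with the annotated region is at least 0.5. The function names and box coordinates are illustrative assumptions, not taken from the study's code or data. The same rule applies to masks, with pixel overlap in place of box overlap.

```python
from typing import Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels. This layout is an assumption
# made for illustration; it is not taken from the study.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction matches an annotation when IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold


# Hypothetical annotated defect and predicted defect, for illustration only.
truth_box: Box = (100.0, 100.0, 200.0, 200.0)
pred_box: Box = (120.0, 110.0, 210.0, 200.0)

if __name__ == "__main__":
    print(f"IoU = {box_iou(pred_box, truth_box):.3f}")
    print("counts as a detection at 0.5:", is_true_positive(pred_box, truth_box))
```

Average precision for each defect class is then computed from the precision and recall obtained under this matching rule, and mAP@0.5 is the mean of those values over the classes.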