Manual inspection of infrastructure damages such as building cracks is difficult due to the objectivity and reliability of assessment and high demands of time and costs. This can be automated using unmanned aerial vehicles (UAVs) for aerial imagery of damages. Numerous computer vision-based approaches have been applied to address the limitations of crack detection but they have their limitations that can be overcome by using various hybrid approaches based on artificial intelligence (AI) and machine learning (ML) techniques. The convolutional neural networks (CNNs), an application of the deep learning (DL) method, display remarkable potential for automatically detecting image features such as damages and are less sensitive to image noise. A modified deep hierarchical CNN architecture has been used in this study for crack detection and damage assessment in civil infrastructures. The proposed architecture is based on 16 convolution layers and a cycle generative adversarial network (CycleGAN). For this study, the crack images were collected using UAVs and open-source images of mid to high rise buildings (five stories and above) constructed during 2000 in Sydney, Australia. Conventionally, a CNN network only utilizes the last layer of convolution. However, our proposed network is based on the utility of multiple layers. Another important component of the proposed CNN architecture is the application of guided filtering (GF) and conditional random fields (CRFs) to refine the predicted outputs to get reliable results. Benchmarking data (600 images) of Sydney-based buildings damages was used to test the proposed architecture. The proposed deep hierarchical CNN architecture produced superior performance when evaluated using five methods: GF method, Baseline (BN) method, Deep-Crack BN, Deep-Crack GF, and SegNet. Overall, the GF method outperformed all other methods as indicated by the global accuracy (0.990), class average accuracy (0.939), mean intersection of the union overall classes (IoU) (0.879), precision (0.838), recall (0.879), and F-score (0.8581) values. Overall, the proposed CNN architecture provides the advantages of reduced noise, highly integrated supervision of features, adequate learning, and aggregation of both multi-scale and multilevel features during the training procedure along with the refinement of the overall output predictions.