This paper surveys the literature relating to the application of machine learning to fault management in cellular networks from an operational perspective. We summarise the main issues as 5G networks evolve, and their implications for fault management. We describe the relevant machine learning techniques through to deep learning, and survey the progress which has been made in their application, based on the building blocks of a typical fault management system. We review recent work to develop the abilities of deep learning systems to explain and justify their recommendations to network operators. We discuss forthcoming changes in network architecture which are likely to impact fault management and offer a vision of how fault management systems can exploit deep learning in the future. We identify a series of research topics for further study in order to achieve this. INDEX TERMS Cellular networks, self healing, cell outage, cell degradation, fault diagnosis, deep learning, explainable AI.