Over the past decade, machine learning based on deep neural networks has advanced rapidly, driven by continuous progress in HPC hardware and software and by practical applications. A number of organizations and enterprises already offer machine-learning-based services to the public, such as face recognition, speech recognition, and photo optimization. As deep neural networks demand ever greater computational power, the need for distributed neural network training systems is also growing. This paper studies collaborative optimization methods for distributed systems based on deep learning. Leveraging PyTorch's strengths in rapid neural network construction and parallel computing, we design and implement a new distributed deep learning training system that effectively combines cluster resource scheduling with distributed training.