Modern machine learning (ML) applications are often deployed in the cloud environment to exploit the computational power of clusters. However, this in-cloud computing scheme cannot satisfy the demands of emerging edge intelligence scenarios, including providing personalized models, protecting user privacy, adapting to real-time tasks and saving resource cost. In order to conquer the limitations of conventional in-cloud computing, it comes the rise of on-device learning, which makes the endto-end ML procedure totally on user devices, without unnecessary involvement of the cloud. In spite of the promising advantages of on-device learning, implementing a high-performance on-device learning system still faces with many severe challenges, such as insufficient user training data, backward propagation blocking and limited peak processing speed. Observing the substantial improvement space in the implementation and acceleration of on-device learning systems, we intend to present a comprehensive analysis of the latest research progress and point out potential optimization directions from the system perspective. This survey presents a software and hardware synergy of on-device learning techniques, covering the scope of model-level neural network design, algorithm-level training optimization and hardware-level instruction acceleration. We hope this survey could bring fruitful discussions and inspire the researchers to further promote the field of edge intelligence.