Handwritten Mathematical Expression Recognition (HMER) aims to convert images of complex mathematical expressions into LaTeX sequences, a task of great value for electronic documents and online education. Attention-based encoder-decoder architectures are the dominant approach to HMER. However, because no large- or medium-scale handwritten mathematical expression datasets exist, pre-training the encoder is infeasible, and training a strong encoder from scratch is difficult. Moreover, the decoder typically decodes directly from the high-level semantic features extracted by the encoder; low-level details are filtered out by the pooling layers, so the decoder pays little attention to fine-grained information. To alleviate these two problems, this paper proposes a new model, ClipMath. Its encoder is extended into a CLIP-enhanced formula-text matching network comprising a formula head and a text head, which perform cross-modal alignment between the formula image and its textual LaTeX label. Its decoder is the proposed Cascaded Multi-scale Decoder, which fuses features at different scales during decoding and strengthens attention to detail information. Experiments show that the ExpRate of the proposed method on CROHME 2014, CROHME 2016, and CROHME 2019 is 1.22%, 0.64%, and 2.01% higher than the state-of-the-art models, respectively.
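To make the formula-text matching idea concrete, the sketch below shows a generic CLIP-style contrastive alignment between formula images and LaTeX token sequences, with a projection "formula head" and "text head" trained by a symmetric InfoNCE loss. All module choices, dimensions, and names here are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of CLIP-style formula-text alignment (assumed design,
# not the paper's exact encoder). Matched (image, LaTeX) pairs in a batch
# are positives; all other pairings are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FormulaTextMatcher(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256, vocab_size=200):
        super().__init__()
        # Placeholder backbones for the formula image and the LaTeX label.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.txt_embed = nn.Embedding(vocab_size, txt_dim)
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(txt_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projection heads: the "formula head" and "text head".
        self.formula_head = nn.Linear(img_dim, embed_dim)
        self.text_head = nn.Linear(txt_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP

    def forward(self, images, latex_tokens):
        # images: (B, 1, H, W); latex_tokens: (B, L) LaTeX token ids.
        img_feat = self.formula_head(self.img_encoder(images))
        txt_feat = self.txt_encoder(self.txt_embed(latex_tokens)).mean(dim=1)
        txt_feat = self.text_head(txt_feat)
        # L2-normalize embeddings, then compute pairwise similarities.
        img_feat = F.normalize(img_feat, dim=-1)
        txt_feat = F.normalize(txt_feat, dim=-1)
        logits = self.logit_scale.exp() * img_feat @ txt_feat.t()
        # Symmetric InfoNCE loss over the diagonal (matched) pairs.
        targets = torch.arange(images.size(0), device=images.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```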
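Likewise, the following sketch illustrates the general idea of multi-scale decoding: at each step the decoder attends over both a high-resolution (low-level) and a low-resolution (high-level) feature map and fuses the two context vectors. It is a hypothetical illustration of the principle, not the paper's exact Cascaded Multi-scale Decoder.

```python
# Minimal sketch of one decoding step attending over two feature scales
# (assumed fusion scheme: concatenate-and-project).
import torch
import torch.nn as nn

class MultiScaleAttentionStep(nn.Module):
    def __init__(self, low_dim=128, high_dim=512, hidden=256):
        super().__init__()
        self.proj_low = nn.Linear(low_dim, hidden)    # fine-grained features
        self.proj_high = nn.Linear(high_dim, hidden)  # semantic features
        self.attn_low = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.attn_high = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, query, low_feat, high_feat):
        # query: (B, 1, hidden) decoder state at the current time step.
        # low_feat: (B, H1*W1, low_dim) flattened large feature map.
        # high_feat: (B, H2*W2, high_dim) flattened small feature map.
        kv_low = self.proj_low(low_feat)
        kv_high = self.proj_high(high_feat)
        ctx_low, _ = self.attn_low(query, kv_low, kv_low)
        ctx_high, _ = self.attn_high(query, kv_high, kv_high)
        # Fuse detail and semantic contexts before predicting the next token.
        return self.fuse(torch.cat([ctx_low, ctx_high], dim=-1))
```

The design intuition is that the high-level map supplies symbol-level semantics while the high-resolution map preserves stroke details (e.g., distinguishing subscripts from in-line symbols), which the plain pooled features would lose.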