Accurate and robust anatomical landmark localization is a crucial step in deformity diagnosis and treatment planning for patients with craniomaxillofacial (CMF) malformations. In this paper, we propose CMF-Net, a trainable end-to-end framework for cephalometric landmark localization on CBCT scans that combines a transformer-based appearance branch, a geometric constraint branch, and the adaptive wing (AWing) loss. More precisely: 1) We decompose the localization task into two branches: the appearance branch integrates transformers to identify the exact positions of candidate landmarks, while the geometric constraint branch operates at low resolution so that the implicit spatial relationships between landmarks can be learned effectively from the reduced training data. 2) We use the AWing loss to penalize differences between the pixel values of the target heatmaps and those of the predicted heatmaps. We evaluate CMF-Net by identifying the 24 most clinically relevant landmarks on 150 dental CBCT scans with complicated scenarios, collected from real-world clinics. Comprehensive experiments show that it outperforms state-of-the-art deep learning methods, with an average localization error of 1.108 mm (the clinically acceptable precision being 1.5 mm) and a correct landmark detection rate of 79.28%. CMF-Net is time-efficient and locates skull landmarks with high accuracy and significant robustness. This approach could be applied in 3D cephalometric measurement, analysis, and surgical planning.
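To make the role of the AWing loss concrete, the following is a minimal sketch of the adaptive wing loss applied to paired heatmap values. It is an illustration under stated assumptions, not the CMF-Net implementation: the function name `awing_loss`, the flat-list input format, and the hyperparameter defaults (omega = 14, theta = 0.5, epsilon = 1, alpha = 2.1, the values commonly used with AWing) are not given in this abstract.

```python
import math


def awing_loss(target, pred, omega=14.0, theta=0.5, epsilon=1.0, alpha=2.1):
    """Mean adaptive wing (AWing) loss over paired heatmap values.

    target, pred: flat sequences of floats of equal length; target values
    are assumed to lie in [0, 1], as in Gaussian landmark heatmaps.
    Hyperparameter defaults are assumptions, not taken from the abstract.
    """
    if len(target) != len(pred):
        raise ValueError("target and pred must have the same length")
    if len(target) == 0:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for y, y_hat in zip(target, pred):
        diff = abs(y - y_hat)
        # The exponent adapts to the target value: close to a landmark
        # (y near 1) the loss stays sensitive to small errors, while in
        # the background (y near 0) small errors are penalized less.
        power = alpha - y
        if diff < theta:
            # Non-linear (logarithmic) part for small errors.
            total += omega * math.log1p((diff / epsilon) ** power)
        else:
            # Linear part for large errors; a and c are chosen so that the
            # two pieces join continuously and smoothly at diff == theta.
            ratio = theta / epsilon
            a = (omega * (1.0 / (1.0 + ratio ** power)) * power
                 * ratio ** (power - 1.0) / epsilon)
            c = theta * a - omega * math.log1p(ratio ** power)
            total += a * diff - c
    return total / len(target)


if __name__ == "__main__":
    # Same absolute error (0.1), once at a landmark peak and once in background.
    print(awing_loss([1.0], [0.9]))
    print(awing_loss([0.0], [0.1]))
```

In this sketch, an error of a given size costs more at a heatmap peak than in the background, which is the property that makes the loss suited to heatmap regression. A training pipeline would normally compute the same expression on 3D tensors in a deep learning framework rather than on Python lists.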