Objective. Accurate segmentation of head and neck (H&N) tumors is critical in radiotherapy planning. However, existing methods lack effective strategies to integrate local and global information, strong semantic and contextual information, and spatial and channel features, all of which are important cues for improving the accuracy of tumor segmentation. In this paper, we propose a novel method, the dual modules convolution transformer network (DMCT-Net), for H&N tumor segmentation in fluorodeoxyglucose positron emission tomography/computed tomography (FDG-PET/CT) images.

Approach. DMCT-Net consists of the convolution transformer block (CTB), the squeeze-and-excitation (SE) pool module, and the multi-attention fusion (MAF) module. First, the CTB is designed to capture long-range dependencies and local multi-scale receptive-field information by combining standard convolution, dilated convolution, and transformer operations. Second, to extract features from different perspectives, we construct the SE pool module, which not only extracts strong semantic and contextual features simultaneously but also uses SE normalization to adaptively fuse features and adjust the feature distribution. Third, the MAF module is proposed to combine global context information, channel information, and voxel-wise local spatial information. In addition, we adopt up-sampling auxiliary paths to supplement multi-scale information.

Main results. Experimental results show that the proposed method achieves better or competitive segmentation performance compared with several advanced methods on three datasets. The best segmentation metric scores are a Dice similarity coefficient (DSC) of 0.781, 95th-percentile Hausdorff distance (HD95) of 3.044, precision of 0.798, and sensitivity of 0.857. Comparative experiments with bimodal and single-modality inputs indicate that bimodal input provides richer and more effective information for improving tumor segmentation performance. Ablation experiments verify the effectiveness and contribution of each module.

Significance. We propose a new network for 3D H&N tumor segmentation in FDG-PET/CT images that achieves high segmentation accuracy.
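To make the parallel local/global design described in the Approach more concrete, the following is a minimal PyTorch sketch of a convolution-transformer block in that spirit: a standard 3D convolution branch, a dilated convolution branch for a larger receptive field, and a multi-head self-attention branch for long-range dependencies, fused by a 1x1x1 convolution. The class name, channel counts, dilation rate, number of heads, and fusion strategy are illustrative assumptions and not the authors' exact DMCT-Net configuration.

```python
# Minimal, illustrative sketch of a CTB-style block (assumed configuration,
# not the paper's exact implementation).
import torch
import torch.nn as nn


class ConvTransformerBlockSketch(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, dilation: int = 2):
        super().__init__()
        # Local branch: standard 3x3x3 convolution.
        self.std_conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # Multi-scale local branch: dilated convolution enlarges the receptive field.
        self.dil_conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: multi-head self-attention over flattened voxels
        # captures long-range dependencies.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fuse the three branches back to the original channel count.
        self.fuse = nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        local_std = self.std_conv(x)
        local_dil = self.dil_conv(x)
        # Flatten spatial dimensions into a token sequence for self-attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, D*H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        global_feat = attn_out.transpose(1, 2).reshape(b, c, d, h, w)
        return self.fuse(torch.cat([local_std, local_dil, global_feat], dim=1))


if __name__ == "__main__":
    # Tiny synthetic PET/CT-like feature map: batch 1, 16 channels, 8^3 voxels.
    feat = torch.randn(1, 16, 8, 8, 8)
    block = ConvTransformerBlockSketch(channels=16)
    print(block(feat).shape)  # torch.Size([1, 16, 8, 8, 8])
```

The sketch only illustrates how convolutional and attention branches can be run in parallel and fused; the actual CTB, SE pool, and MAF modules are specified in the full paper.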