Cryogenic electron tomography (cryoET) directly visualizes subcellular structures in 3D at the nanometer scale. Quantitative analyses of cryoET data can reveal structural biomarkers of diseases, provide novel mechanistic insights, and clarify how treatments affect phenotype. However, existing automated annotation approaches primarily focus on localizing molecular features, and few methods accurately quantify complex structures such as organelles. We address this challenge with CryoViT, a paradigm shift from traditional convolutional neural networks. CryoViT leverages vision transformers to improve the segmentation of large pleomorphic structures, such as mitochondria, that can occupy almost the entire field of view in high-magnification images. It is powered by a large-scale vision foundation model and overcomes limitations of popular U-Net-based methods, particularly when training data are scarce. We demonstrate the efficacy of CryoViT on a large cryoET dataset comprising neurons differentiated from induced pluripotent stem cells (iPSCs) derived from Huntington disease (HD) patients and cultured neurons from an HD mouse model.