A capsule network encodes entity features into capsules and maps the spatial relationships from local features to overall features by dynamic routing. This structure allows the capsule network to capture rich feature information but leads to a lack of spatial relationship guidance, sensitivity to noise features, and susceptibility to local optima. We therefore propose a novel capsule network based on feature and spatial relationship coding (FSc-CapsNet). Feature and spatial relationship extractors are introduced to capture features and spatial relationships, respectively: the feature extractor abstracts feature information from bottom to top while attenuating interference from noise features, and the spatial relationship extractor provides spatial relationship guidance from top to bottom. Then, instead of dynamic routing, a feature and spatial relationship encoder is proposed to find the optimal combination of features and spatial relationships. The encoder abandons iterative optimization and instead folds the optimization process into backpropagation. Experimental results show that, compared with the capsule network and several of its derivatives, the proposed FSc-CapsNet achieves significantly better performance on both the Fashion-MNIST and CIFAR-10 datasets. In addition, compared with some mainstream deep learning frameworks, FSc-CapsNet performs quite competitively on Fashion-MNIST.

© The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.

Traditional convolutional neural networks (CNNs)1 have obvious limitations when exploring spatial relationships.
The conventional method for classifying images of the same object taken from different angles is to train multiple neurons to process features and then add a top-level detection neuron to detect the classification result. This approach tends to memorize the dataset rather than generalize a solution, and it requires large amounts of training data to cover the different variants and avoid overfitting. This characteristic also makes CNNs very vulnerable when dealing with tasks involving translated, rotated, or resized samples.

Unlike CNNs, capsule networks (CapsuleNet)2 use capsules3 to capture a series of features and their variants. In a capsule network, higher-layer capsules capture overall features, such as "face" or "car," while lower-layer capsules capture local entity features, such as "nose," "mouth," or "wheels," leading to a completely different approach from that of a convolutional network when abstracting overall features from local features. However, this is not enough: a complete identification process requires both bottom-up feature abstraction and top-down spatial relationship guidance. The capsule network defines a transformation matrix between adjacent capsule layers to implement feature abstraction. Then, dynamic routing...
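As background for the transformation matrices and routing mentioned above, the following is a minimal NumPy sketch of CapsuleNet-style dynamic routing (routing-by-agreement). It illustrates the baseline mechanism that FSc-CapsNet replaces, not the proposed encoder; the capsule counts, vector dimensions, and the `squash` helper shown here are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule nonlinearity: scales the vector norm into [0, 1) while keeping direction."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u, W, iters=3):
    """Illustrative routing-by-agreement between two capsule layers.

    u: (num_lower, in_dim) outputs of lower-layer capsules
    W: (num_lower, num_upper, out_dim, in_dim) transformation matrices
    Returns upper-layer capsule outputs of shape (num_upper, out_dim).
    """
    num_lower, num_upper = W.shape[0], W.shape[1]
    # Prediction vectors: u_hat[i, j] = W[i, j] @ u[i]
    u_hat = np.einsum('ijkl,il->ijk', W, u)
    b = np.zeros((num_lower, num_upper))  # routing logits
    for _ in range(iters):
        # Coupling coefficients: softmax over upper capsules for each lower capsule
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per upper capsule
        v = squash(s)
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement update
    return v
```

Note that the routing logits are refined by iterative agreement at inference time; this is exactly the iterative optimization that the proposed feature and spatial relationship encoder moves into backpropagation.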