By precisely controlling the distance between two train sets, virtual coupling (VC) enables flexible coupling and decoupling in urban rail transit. However, relying on train-to-train communication to obtain the inter-train distance poses a safety risk if the communication link malfunctions. This paper proposes a distance-estimation framework based on monocular vision. First, key structure features of the target train are extracted by an object-detection neural network. To improve detection performance for small objects, the network adopts three strategies: an additional detection head in the feature pyramid, labeling of object neighbor areas, and semantic filtering. Then, an optimization process based on multiple key structure features estimates the distance between the two train sets in VC. To validate and evaluate the proposed framework, experiments were conducted on Beijing Subway Line 11. The results show that, for train sets separated by 20 m to 100 m, the proposed framework estimates the distance with an absolute error below 1 m and a relative error below 1.5%, so it can serve as a reliable backup for communication-based VC operations.
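The idea of fusing several detected features into one distance estimate can be illustrated with a minimal sketch. It is not the optimization used in the paper. It assumes an ideal pinhole camera, a known focal length in pixels, and features whose physical sizes are known in advance. The function name `estimate_distance` and all numeric values are hypothetical. Under the pinhole model a feature of real size S at distance d appears with pixel size s = f·S/d, so fitting the inverse distance u = 1/d to all features by least squares has a closed-form solution.

```python
def estimate_distance(focal_px, features):
    """Least-squares distance estimate from several features (illustrative only).

    focal_px: assumed camera focal length, in pixels.
    features: iterable of (real_size_m, pixel_size) pairs, one per detected
              feature; the real sizes are assumed to be known beforehand.

    Pinhole model: pixel_size = focal_px * real_size_m / d.
    With u = 1/d, minimise sum_i (pixel_size_i - focal_px * real_size_i * u)^2,
    which gives u = sum(a_i * s_i) / sum(a_i^2), where a_i = focal_px * real_size_i.
    """
    num = 0.0
    den = 0.0
    for real_size_m, pixel_size in features:
        if real_size_m <= 0 or pixel_size <= 0:
            raise ValueError("sizes must be positive")
        a = focal_px * real_size_m
        num += a * pixel_size
        den += a * a
    if den == 0.0:
        raise ValueError("at least one feature is required")
    inv_distance = num / den
    return 1.0 / inv_distance


# Hypothetical example: two features of 3.0 m and 1.5 m observed at
# 120 px and 60 px by a camera with an assumed 2000 px focal length.
d = estimate_distance(2000.0, [(3.0, 120.0), (1.5, 60.0)])
print(f"estimated distance: {d:.2f} m")
```

Because every feature contributes to the same fit, a measurement error on one small feature is damped by the others, which is the motivation for using multiple features rather than a single bounding box.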