Intelligent positioning in indoor environments is an important component of building a smart city, and the demand for mutual positioning in unknown indoor environments is growing rapidly. In such an environment, however, outdoor radio signals are unavailable and indoor images cannot be collected in advance for online positioning, so achieving mutual positioning remains an open problem. In this paper, we propose a vision-based mutual positioning method for unknown indoor environments. First, two users each take images of the unknown indoor environment, use a semantic segmentation network to identify the semantic targets contained in the images, and upload the resulting semantic sequences to a shared user database in real time. Then, each time a user uploads a new semantic sequence after a change of location, the shared database is searched to check whether the other user has uploaded the same semantic sequence. A successful retrieval means that the two users have seen the same scene. Finally, the two users select a common target from their respective images of that scene, each establish a three-dimensional coordinate system based on this target, calculate their own position coordinates in this coordinate system, and realize mutual positioning by sharing these coordinates. Experimental results show that the proposed method successfully realizes mutual positioning between two users in an unknown indoor environment while maintaining high positioning accuracy.
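The retrieval and coordinate-sharing steps can be illustrated with a minimal sketch. This is not the implementation used in this paper: the class `SharedDatabase`, the function `relative_position`, the exact-match comparison of semantic sequences, and the example labels are all assumptions made for illustration. Semantic segmentation and the computation of each user's coordinates in the target-based frame are taken as given.

```python
import math
from typing import Dict, List, Optional, Tuple

SemanticSequence = Tuple[str, ...]
Position = Tuple[float, float, float]


class SharedDatabase:
    """Hypothetical in-memory stand-in for the shared user database."""

    def __init__(self) -> None:
        self._uploads: Dict[str, List[SemanticSequence]] = {}

    def upload(self, user: str, sequence: SemanticSequence) -> Optional[str]:
        """Store a sequence; return another user who uploaded the same one, if any.

        Assumption: two sequences match only when they are identical.
        """
        self._uploads.setdefault(user, []).append(sequence)
        for other, sequences in self._uploads.items():
            if other != user and sequence in sequences:
                return other
        return None


def relative_position(own: Position, other: Position) -> Tuple[Position, float]:
    """Return the displacement from `own` to `other` and their distance.

    Both positions are assumed to be expressed in the same
    three-dimensional coordinate system anchored at the shared target.
    """
    delta = (other[0] - own[0], other[1] - own[1], other[2] - own[2])
    return delta, math.dist(own, other)
```

In use, each user would call `upload` after every change of location; once a match is returned, the two users exchange their coordinates in the target-based frame and each calls `relative_position` to locate the other.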