This article introduces an indoor topological localization algorithm that uses vision and Wi-Fi signals. Its main contribution is a novel way of merging data from these sensors. The system requires neither a building plan nor the positions of the Wi-Fi access points. By adapting the Wi-Fi signature to the FABMAP algorithm, this work develops an early fusion framework that solves the global localization and kidnapped-robot problems. The resulting algorithm has been tested and compared with FABMAP visual localization on data acquired by a Pepper robot in three different environments: an office building, a middle school, and a private apartment. Numerous runs with different robots were carried out over several months, for a total covered distance of 6.4 km. Constraints were applied during acquisition so that the experiments reflect real use cases of Pepper robots. Without any tuning, our early fusion framework outperforms visual localization in all tested situations, with a significant margin in environments where vision faces problems such as moving objects or perceptual aliasing. In such conditions, 90.6% of estimated localizations lie less than 5 m from ground truth with our early fusion framework, compared with 77.6% with visual localization. Furthermore, compared with other classical fusion strategies, the early fusion framework produces the best localization results: in all tested situations, it improves visual localization results without degrading them where Wi-Fi signals carry little information.
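To make the early-fusion idea concrete, the sketch below illustrates one plausible way a Wi-Fi signature could be made compatible with a FABMAP-style bag-of-words observation: each (access point, quantized RSSI level) pair is treated as a "word" appended to the visual vocabulary. This is a minimal illustration, not the authors' implementation; the function names, bin edges, and vocabulary layout are all assumptions for exposition.

```python
# Minimal sketch (not the authors' code) of early fusion of Wi-Fi and
# visual words for a FABMAP-style recognizer. All names, thresholds, and
# the vocabulary layout are illustrative assumptions.

from typing import Dict, List

RSSI_BINS = [-85, -70, -55]  # assumed dBm thresholds -> 4 levels per AP


def rssi_to_level(rssi_dbm: float) -> int:
    """Map an RSSI value to a discrete signal-strength level (0..3)."""
    return sum(rssi_dbm > edge for edge in RSSI_BINS)


def wifi_signature_words(scan: Dict[str, float],
                         ap_index: Dict[str, int]) -> List[int]:
    """Turn one Wi-Fi scan {BSSID: RSSI} into word indices.

    Each (access point, quantized level) pair becomes one word, so an AP
    heard at a given strength behaves like a visual word for FABMAP.
    """
    n_levels = len(RSSI_BINS) + 1
    words = []
    for bssid, rssi in scan.items():
        if bssid in ap_index:  # unknown APs are simply ignored here
            words.append(ap_index[bssid] * n_levels + rssi_to_level(rssi))
    return words


def fused_observation(visual_words: List[int],
                      wifi_words: List[int],
                      visual_vocab_size: int,
                      wifi_vocab_size: int) -> List[int]:
    """Early fusion: one binary observation over both vocabularies."""
    z = [0] * (visual_vocab_size + wifi_vocab_size)
    for w in visual_words:
        z[w] = 1
    for w in wifi_words:
        z[visual_vocab_size + w] = 1
    return z  # fed as a single observation to a FABMAP-style recognizer


# Example: two APs indexed 0 and 1, heard at -60 dBm and -80 dBm.
ap_index = {"aa:bb:cc:dd:ee:01": 0, "aa:bb:cc:dd:ee:02": 1}
scan = {"aa:bb:cc:dd:ee:01": -60.0, "aa:bb:cc:dd:ee:02": -80.0}
print(fused_observation([3, 7], wifi_signature_words(scan, ap_index),
                        visual_vocab_size=10, wifi_vocab_size=8))
```

Because the fused vector is a single observation, the place-recognition model sees Wi-Fi and visual evidence jointly (early fusion) rather than combining two independent location estimates afterwards, which is what distinguishes this scheme from the classical late-fusion strategies mentioned above.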