Indoor localization of users is a common requirement in many applications. Among the available technologies, Wi-Fi fingerprinting has attracted much attention because it can be deployed easily with existing off-the-shelf mobile devices and wireless networks. However, collecting the Wi-Fi radio map is labor-intensive, which limits its potential for large-scale application. In this paper, a vision-based approach is proposed for constructing a radio map in anonymous indoor environments. The approach collects multi-sensor data (e.g., Wi-Fi signals, video frames, and inertial readings) while people walk through indoor environments holding smartphones. It then spatially recovers their trajectories using both visual and inertial information. Finally, it estimates the locations of the fingerprints from the trajectories and constructs a Wi-Fi radio map. Experimental results show that the average location error of the fingerprints is about 0.53 m. A weighted k-nearest neighbor method is also used to evaluate the constructed radio map. The average localization error is about 3.2 m, indicating that the quality of the constructed radio map is comparable to that of radio maps constructed by site surveying. At the same time, the approach greatly reduces human labor cost, which increases its potential for application in large indoor environments.
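To make the evaluation step concrete, the following is a minimal sketch of weighted k-nearest neighbor localization against a radio map. It is an illustration under stated assumptions, not the implementation used in this work: the function name `wknn_locate`, the radio map layout, the Euclidean distance in signal space, the inverse-distance weights, and the -100 dBm fill value for access points that were not heard are all hypothetical choices.

```python
import math


def wknn_locate(radio_map, query_rss, k=3, missing_rss=-100.0, eps=1e-6):
    """Estimate (x, y) as a weighted mean of the k nearest fingerprints.

    radio_map: list of ((x, y), {ap_id: rss_dbm}) entries (assumed layout).
    query_rss: {ap_id: rss_dbm} measured at the unknown position.
    missing_rss: assumed fill value for access points absent from a scan.
    """
    if not radio_map:
        raise ValueError("radio map is empty")

    scored = []
    for (x, y), rss in radio_map:
        # Compare over the union of access points seen in either scan.
        aps = set(rss) | set(query_rss)
        dist = math.sqrt(sum(
            (rss.get(ap, missing_rss) - query_rss.get(ap, missing_rss)) ** 2
            for ap in aps
        ))
        scored.append((dist, x, y))

    # Keep the k fingerprints closest to the query in signal space.
    nearest = sorted(scored, key=lambda item: item[0])[:k]

    # Inverse-distance weights; eps avoids division by zero on exact matches.
    weights = [1.0 / (dist + eps) for dist, _, _ in nearest]
    total = sum(weights)
    x_est = sum(w * x for w, (_, x, _) in zip(weights, nearest)) / total
    y_est = sum(w * y for w, (_, _, y) in zip(weights, nearest)) / total
    return x_est, y_est


# Toy radio map with made-up values, for illustration only.
example_map = [
    ((0.0, 0.0), {"ap1": -40.0, "ap2": -70.0}),
    ((4.0, 0.0), {"ap1": -70.0, "ap2": -40.0}),
    ((0.0, 10.0), {"ap1": -90.0, "ap2": -90.0}),
]
example_estimate = wknn_locate(example_map, {"ap1": -55.0, "ap2": -55.0}, k=2)
```

In this toy case the query is equally distant, in signal space, from the first two fingerprints, so the estimate falls midway between their positions. Other distance metrics or weighting schemes could be substituted without changing the overall structure.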