Spatially modulated grid cells have recently been found in the rat secondary visual cortex (V2) during active navigation. However, the computational mechanism and functional significance of V2 grid cells remain unknown, and a theory-driven conceptual model for experimentally observed visual grids is missing. To address this knowledge gap and make experimentally testable predictions, we trained a biologically inspired excitatory-inhibitory recurrent neural network (E/I-RNN) to perform a two-dimensional spatial navigation task with multisensory (e.g., velocity, acceleration, and visual) input. We found grid-like responses in both excitatory and inhibitory RNN units, and these grid responses were robust with respect to the choice of spatial cues, dimensionality of visual input, activation function, and network connectivity. Dimensionality reduction analysis of population responses revealed a low-dimensional torus-like manifold and attractor, showing the stability of grid patterns with respect to new visual input, new trajectories, and relative speed. We found that functionally similar receptive fields with strong excitatory-to-excitatory connections appeared within fully connected as well as structurally connected networks, suggesting a link between functional grid clusters and the structural network. Additionally, multistable torus-like attractors emerged with increasing sparsity in inter- and intra-subnetwork connectivity. Finally, irregular grid patterns were found in a convolutional neural network (CNN)-RNN architecture performing a visual sequence recognition task. Together, our results suggest new computational mechanisms of V2 grid cells in both spatial and non-spatial tasks.