Our recognition of an object is consistent across conditions, unaffected by motion, perspective, rotation, or corruption. This robustness is thought to be enabled by invariant object representations, but how the brain achieves such invariance remains unknown. In artificial neural networks (ANNs), learning to represent objects is simulated as an optimization process: the system reduces discrepancies between actual and desired outputs by updating specific connections through mechanisms such as error backpropagation. These operations are biologically implausible primarily because they require individual connections at all levels to be sensitive to errors computed at the late stages of the network. In contrast, learning in the nervous system occurs locally, and synaptic changes depend only on pre- and post-synaptic activities. It remains unclear how such local updates translate into coordinated changes across large populations of neurons and give rise to sophisticated cognitive functions. Here we demonstrate that robust and invariant object representations can be achieved in naturally observed network architectures using only biologically realistic local learning rules. Adopting operations fundamentally different from those of current ANN models, unsupervised recurrent networks can learn to represent and categorize objects through sensory experience without propagating or detecting errors. These white-box, fully interpretable networks can extract clean images from their corrupted forms and produce representations that are prospectively robust to unfamiliar perturbations. Continuous learning does not cause the catastrophic forgetting commonly observed in ANNs. Without explicit instruction, the networks can classify objects and represent the identity of 3D objects regardless of perspective, size, or position. These findings have substantial implications for understanding how biological brains achieve invariant object representations and for developing biologically realistic intelligent networks that are efficient and robust.
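The abstract does not specify the form of the local rule, so the following toy Python snippet is only a minimal sketch of the distinction being drawn: a Hebbian-style update, in which each synapse changes using nothing beyond the activities of the two neurons it connects, versus an error-driven (backprop-style) update, in which every synapse must access an error signal defined at the network's output. The single-layer setup, variable names, and learning rate are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-layer setup (illustrative, not the paper's architecture):
# pre-synaptic activity x, post-synaptic activity y = W x.
n_pre, n_post = 8, 4
W = rng.normal(scale=0.1, size=(n_post, n_pre))
x = rng.random(n_pre)          # pre-synaptic firing rates
y = W @ x                      # post-synaptic responses

eta = 0.01                     # assumed learning rate

# Local (Hebbian-style) update: the change to W[i, j] depends only on the
# post-synaptic activity y[i] and the pre-synaptic activity x[j].
W_local = W + eta * np.outer(y, x)

# Error-driven (backprop-style) update: the change to every synapse depends on
# an error computed at the late stage of the network, e.g. the gap between the
# output y and a desired target.
y_target = rng.random(n_post)
error = y - y_target
W_backprop = W - eta * np.outer(error, x)
```

In the local update each weight change is a product of quantities available at that synapse, whereas the error-driven update requires the output-level discrepancy to be delivered to every connection, which is the biological-implausibility point the abstract raises.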