Automatic scene recognition is still regarded as an open challenge, even though there are reports of methods outperforming human accuracy. This is especially true for indoor scenes, since they are well represented by their constituent objects, which are highly variable. Objects vary in angle, size, and texture, and are often partially occluded in crowded scenes. Even though Convolutional Neural Networks have shown remarkable performance on most image-related problems, for indoor scenes the top performances have come from approaches that add object-level information to the methodology and model the intricate relationships among objects. Although Recurrent Neural Networks were designed to model structure in a sequence of elements, only recently have researchers started exploiting their advantages for scene recognition. Such works usually fall below state-of-the-art performance, so there is still plenty of room to unravel the full potential of recurrent methodologies. Thus, this work proposes representing an image as a sequence of object-level information, extracting highly semantic features from models pre-trained on an object-centric dataset, in order to feed a bidirectional Long Short-Term Memory network trained for scene classification. We adopt a Many-to-Many training approach, in which each element of the input sequence produces a corresponding scene prediction, allowing us to use the individual predictions to boost recognition through weighted voting. To the best of our knowledge, neither our sequence representation nor our late fusion of predictions has been widely explored by recurrent approaches to scene recognition in the literature.
We evaluated our proposal on three widely known datasets for scene recognition: Scene15, MIT67 and SUN397. It outperformed recurrent-based methods on MIT67, a dataset entirely dedicated to indoor scenes, while the other two, which mix indoor and outdoor environments, posed a greater challenge for our approach. However, we were able to improve performance on all datasets over the most successful methods in the literature by pairing our work with a few of them in an ensemble of classifiers. This indicates that a joint strategy including our method is beneficial for the task of scene classification.
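The late fusion of per-element predictions described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `weighted_vote` is hypothetical, and weighting each sequence element by its own peak class probability is an assumption made only for illustration, since the weighting scheme is not specified here. Each inner list stands for the class probabilities that a Many-to-Many recurrent model might output for one element of the sequence.

```python
from typing import Optional, Sequence


def weighted_vote(step_probs: Sequence[Sequence[float]],
                  weights: Optional[Sequence[float]] = None) -> int:
    """Fuse per-element class probabilities into a single scene label.

    step_probs: one probability vector per sequence element.
    weights:    one weight per sequence element (hypothetical parameter).
    Returns the index of the class with the highest weighted score.
    """
    if not step_probs:
        raise ValueError("need at least one per-element prediction")
    n_classes = len(step_probs[0])
    if weights is None:
        # Assumption: trust each element in proportion to its peak confidence.
        weights = [max(probs) for probs in step_probs]
    if len(weights) != len(step_probs):
        raise ValueError("expected one weight per sequence element")

    # Accumulate the weighted probability mass for every class.
    scores = [0.0] * n_classes
    for probs, weight in zip(step_probs, weights):
        for cls, prob in enumerate(probs):
            scores[cls] += weight * prob

    # The fused prediction is the class with the largest accumulated score.
    return max(range(n_classes), key=scores.__getitem__)


# Three sequence elements voting over three scene classes.
predictions = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]
fused_label = weighted_vote(predictions)
```

Passing explicit `weights` lets other schemes be tried, for example uniform weights for plain averaging, without changing the fusion step itself.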