People can describe spatial scenes with language and, vice versa, create images based on linguistic descriptions. Current systems, however, do not come close to matching human performance when it comes to reconstructing a scene from a given text. Even the rapid progress of ever more capable Transformer-based models has not closed this gap so far. This task, the automatic generation of a 3D scene from an input text, is called text-to-3D scene generation. The key challenges, and the focus of this dissertation, relate to the following topics: (a) analyses of how well current language models understand spatial information, how static embeddings compare, and whether they can be improved by anaphora resolution; (b) automated resource generation for context expansion and grounding that can help in the creation of realistic scenes; (c) creation of a VR-based text-to-3D scene system that can be used as an annotation and active-learning environment, but can also be easily extended in a modular way with additional features to cover further use cases in the future; and (d) analysis of existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies required in the context of VR.

In the first part of this work, we show that static word embeddings do not benefit significantly from pronoun substitution. We explain this result by the loss of contextual information, the reduction in the relative occurrence of rare words, and the absence of pronouns to be substituted. However, we were able to show that both static and contextualized language models appear to encode object knowledge, but require a sophisticated apparatus to retrieve it. The models, in combination with the applied measures, differ greatly in the amount of knowledge they allow to be extracted. Classifier-based variants perform significantly better than the unsupervised methods from bias research, although this is partly due to overfitting. The resources generated for this evaluation later also form an important component of the third part.

In the second part, we present AffordanceUPT, a modularization of UPT trained on the HICO-DET dataset, which we have extended with Gibsonian/telic annotations. We show that AffordanceUPT can effectively make the Gibsonian/telic distinction and that the model learns other correlations in the data to make this distinction (e.g., the presence of hands in the image), which has important implications for grounding images to language.

The third part first presents a VR project to support spatial annotation based on IsoSpace. The direct spatial visualization and the immediate interaction with the 3D objects are intended to make the labeling more intuitive and thus easier. This project is later incorporated into the Semantic Scene Builder (SeSB). It relies on the Text2SceneVR presented here for generating spatial hypertext, which in turn is based on the VAnnotatoR. Finally, we introduce the Semantic Scene Builder (SeSB), a VR-based text-to-3D scene framework that uses the Semantic Annotation Framework (SemAF) as a scheme for annotating semantic relations. It integrates a wide range of tools and resources by utilizing SemAF and UIMA as a unified data structure to generate 3D scenes from textual descriptions, and it also supports annotation. In an evaluation of SeSB against another state-of-the-art tool, our approach not only performed better but also allowed a wider variety of scenes to be modeled.
The final part reviews existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies needed to make the most of technological opportunities in the future.