Live imaging techniques, such as two-photon imaging, promise novel insights into cellular activity patterns at high spatial and temporal resolution. While current deep learning approaches typically focus on specific supervised tasks in the analysis of such data, e.g., learning a segmentation mask as a basis for subsequent signal extraction steps, we investigate how unsupervised generative deep learning can be adapted to obtain interpretable models directly at the level of the video frames. Specifically, we consider variational autoencoders, which infer a compressed representation of the data in a low-dimensional latent space and thereby allow insight into what has been learned. Building on this approach, we illustrate how structural knowledge can be incorporated into the model architecture to improve model fitting and interpretability. In addition to standard convolutional neural network components, we propose an architecture that encodes the foreground and background of live imaging data separately. We exemplify the proposed approach with two-photon imaging data from hippocampal CA1 neurons in mice, where we can disentangle the neural activity of interest from the neuropil background signal. Subsequently, we illustrate how smoothness constraints can be imposed on the latent space to leverage knowledge about gradual temporal changes. As a starting point for adaptation to similar live imaging applications, we provide a Jupyter notebook with code for exploration. Taken together, our results illustrate how architecture choices for deep generative models, such as for spatial structure, foreground vs. background, and gradual temporal changes, facilitate a modeling approach that combines the flexibility of deep learning with the benefits of incorporating domain knowledge. Such a strategy enables interpretable, purely image-based models of activity signals from live imaging, such as two-photon data.
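To make the described architecture ideas concrete, the following is a minimal sketch, not the implementation from the accompanying notebook: a PyTorch variational autoencoder with two convolutional encoders, one for the foreground (activity) latent code and one for the background latent code, plus an optional penalty on frame-to-frame jumps of the foreground latents as a simple stand-in for a temporal smoothness constraint. The 64x64 frame size, latent dimensions, layer widths, and loss weights are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch setup, not the authors' code): VAE with
# separate foreground/background encoders and an optional temporal
# smoothness penalty on the foreground latent means.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """Small convolutional encoder mapping a 1x64x64 frame to (mu, log_var)."""

    def __init__(self, latent_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.fc_log_var = nn.Linear(32 * 16 * 16, latent_dim)

    def forward(self, x):
        h = self.conv(x)
        return self.fc_mu(h), self.fc_log_var(h)


class ForegroundBackgroundVAE(nn.Module):
    """VAE with separate latent codes for foreground (activity) and background."""

    def __init__(self, fg_dim: int = 8, bg_dim: int = 2):
        super().__init__()
        self.fg_encoder = ConvEncoder(fg_dim)
        self.bg_encoder = ConvEncoder(bg_dim)
        self.decoder = nn.Sequential(
            nn.Linear(fg_dim + bg_dim, 32 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    @staticmethod
    def reparameterize(mu, log_var):
        std = torch.exp(0.5 * log_var)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        fg_mu, fg_lv = self.fg_encoder(x)
        bg_mu, bg_lv = self.bg_encoder(x)
        z = torch.cat(
            [self.reparameterize(fg_mu, fg_lv), self.reparameterize(bg_mu, bg_lv)],
            dim=1,
        )
        return self.decoder(z), (fg_mu, fg_lv), (bg_mu, bg_lv)


def vae_loss(x, recon, fg, bg, fg_mu_seq=None, smooth_weight=1.0):
    """ELBO-style loss plus an optional penalty on frame-to-frame latent jumps."""
    loss = F.mse_loss(recon, x, reduction="sum")
    for mu, log_var in (fg, bg):
        loss = loss - 0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    if fg_mu_seq is not None:  # batch assumed to hold consecutive frames in order
        loss = loss + smooth_weight * torch.sum((fg_mu_seq[1:] - fg_mu_seq[:-1]) ** 2)
    return loss


if __name__ == "__main__":
    frames = torch.rand(10, 1, 64, 64)  # toy stand-in for consecutive frames
    model = ForegroundBackgroundVAE()
    recon, fg, bg = model(frames)
    print(vae_loss(frames, recon, fg, bg, fg_mu_seq=fg[0]).item())
```

In this sketch the separation of concerns is purely architectural: the decoder sees the concatenated codes, while downstream interpretation would inspect the foreground latents for activity-related structure and the background latents for slowly varying neuropil-like signal; the smoothness term is only one simple way to encode the assumption of gradual temporal change.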