Explainability for machine learning is becoming increasingly important in high-stakes decisions such as real estate appraisal. Traditional hedonic house pricing models rely on hard information in the form of housing attributes; more recently, soft information has also been incorporated to increase predictive performance. This soft information can be extracted from image data by complex models such as Convolutional Neural Networks (CNNs). However, these models are opaque, which precludes their use in high-stakes financial decisions. To overcome this limitation, we examine whether a two-stage modeling approach can provide explainability. We combine visual interpretability through Regression Activation Maps (RAM) for the CNN with a linear regression for the overall prediction. Our experiments are based on 62,000 family homes in Philadelphia. The results indicate that the CNN learns aspects related to vegetation and to the quality of the house from exterior images, improving the predictive accuracy of real estate appraisal by up to 5.4%.
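
The second stage of the two-stage approach can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the data are synthetic, the attribute columns are hypothetical, and `image_score` is a random stand-in for whatever scalar the stage-one CNN would output for an exterior image. The sketch only shows how a linear regression can combine hard attributes with one image-derived feature while keeping the image feature's contribution readable as a single coefficient.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return the coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def predict_ols(beta, X):
    """Predict with coefficients returned by fit_ols."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data only (assumption): two standardized hard attributes,
# e.g. living area and building age, plus a stand-in for the CNN output.
rng = np.random.default_rng(0)
n = 2000
hard = rng.normal(size=(n, 2))
image_score = rng.normal(size=n)  # hypothetical stage-one output
log_price = (
    12.0
    + 0.5 * hard[:, 0]
    - 0.2 * hard[:, 1]
    + 0.3 * image_score
    + rng.normal(scale=0.1, size=n)
)

train, test = slice(0, 1500), slice(1500, n)

# Baseline hedonic model: hard attributes only.
beta_hard = fit_ols(hard[train], log_price[train])
rmse_hard = rmse(log_price[test], predict_ols(beta_hard, hard[test]))

# Stage two: hard attributes plus the image-derived feature.
both = np.column_stack([hard, image_score])
beta_both = fit_ols(both[train], log_price[train])
rmse_both = rmse(log_price[test], predict_ols(beta_both, both[test]))

print(f"RMSE, hard attributes only: {rmse_hard:.3f}")
print(f"RMSE, with image feature:   {rmse_both:.3f}")
print(f"Coefficient on image feature: {beta_both[3]:.3f}")
```

Because the final predictor is linear, the last coefficient states directly how much the image-derived feature shifts the predicted log price. In this sketch, that is why a linear second stage is paired with the CNN rather than a second black-box model.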