In supervised learning, deep learning models demand a large corpus of annotated data for object detection and classification tasks, which constrains their utility in humanitarian emergency response. To overcome this problem, we propose an unsupervised approach to dwelling counting from very high-resolution satellite imagery that combines a Variational Autoencoder (VAE) with anomaly detection. When applying VAEs to dwelling localization and counting in earth observation, we observed two critical limitations: (1) the trade-off between reconstruction quality and a good latent code, where favouring good reconstruction of dwellings leads to weak anomaly score maps that fail to properly localize dwellings; and (2) the limited spatiotemporal invariance of the learned latent code, so that a model trained on datasets from different geographies and times fails to properly localize dwellings. To address the first problem, we introduce self-supervision by creating synthetic anomalies. To address the second, we introduce latent space conditioning. The approach is tested on nine very high-resolution images obtained from six Forcibly Displaced People settlement areas. Results indicate that combining a VAE with anomaly detection reaches AUC values ranging from 0.70 in complex settlements to 0.98 in relatively less complex settlement areas. Similarly, MAE values for dwelling counting range from 56.67 to 5.03. Joint training on the combined datasets with latent space conditioning and self-supervision yields better results than a classical VAE, with improved spatiotemporal transferability and crisper, stronger anomaly maps. The implementation code will be available at https://github.com/getch-geohum/SSL-VAE.
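As a rough illustration of the general idea of counting dwellings from an anomaly map, the following minimal sketch scores each pixel by its reconstruction error, thresholds the map, counts connected regions, and computes an MAE over counts. It is not the implementation from the repository above. The per-pixel squared-error score, the fixed threshold, the connected-component counting, and all function names (`anomaly_map`, `count_blobs`, `mean_absolute_error`) are assumptions made for this example, and the all-zero reconstruction is only a stand-in for a VAE output.

```python
from collections import deque

import numpy as np


def anomaly_map(image, reconstruction):
    """Per-pixel anomaly score: squared reconstruction error averaged over bands.

    Assumption: this error-based score is one common choice, not necessarily
    the score used in the paper.
    """
    diff = np.asarray(image, dtype=float) - np.asarray(reconstruction, dtype=float)
    return (diff ** 2).mean(axis=-1)


def count_blobs(mask):
    """Count 4-connected regions of True pixels with a breadth-first flood fill."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    height, width = mask.shape
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or seen[r, c]:
                continue
            # New unvisited anomalous pixel: start a new region.
            count += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
    return count


def mean_absolute_error(predicted_counts, reference_counts):
    """Mean absolute difference between predicted and reference counts."""
    predicted = np.asarray(predicted_counts, dtype=float)
    reference = np.asarray(reference_counts, dtype=float)
    return float(np.abs(predicted - reference).mean())


# Toy scene: flat background with two bright "dwellings".
image = np.zeros((8, 8, 3))
image[1:3, 1:3] = 1.0
image[5:7, 4:7] = 1.0

# Hypothetical stand-in for a VAE output that reproduces only the background.
reconstruction = np.zeros_like(image)

scores = anomaly_map(image, reconstruction)
predicted = count_blobs(scores > 0.5)  # 0.5 is an arbitrary illustrative threshold
```

In this toy scene the two unreconstructed patches are the only pixels above the threshold, so each forms one counted region. On real imagery the threshold and any post-processing of the anomaly map would need to be chosen per dataset.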