Visual sensors for monitoring people in their living environments are key to obtaining more accurate health measurements, but their use raises privacy concerns. Silhouettes generated from RGB video can alleviate these concerns to a considerable degree. However, silhouettes make it difficult to discriminate between subjects, preventing subject-tailored analysis of the data in a free-living, multi-occupancy home. This limitation can be overcome by fusing the silhouette video with data from wearable accelerometers, matching each video clip to the specific monitored subject. The proposed method simultaneously addresses person re-identification (ReID) from silhouettes and enables home monitoring systems to employ sensor fusion for data analysis. We develop a multimodal deep-learning framework that maps short video clips and acceleration signals into a shared latent space in which Euclidean distance is used to match the video and acceleration streams. We train our method on the SPHERE Calorie Dataset, achieving an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss that we show improves both performance and convergence speed.
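To illustrate the cross-modal matching idea described above, the following is a minimal sketch, not the paper's exact architecture: two hypothetical encoders (a small 3D-convolutional branch for silhouette clips and a GRU branch for accelerometer windows) project both modalities into a shared latent space, trained with PyTorch's standard triplet margin loss rather than the novel triplet loss proposed in the paper. The encoder layouts, embedding dimension, margin, and input shapes are all illustrative assumptions.

```python
# Sketch only: two-branch embedding model for matching silhouette clips to
# accelerometer streams via Euclidean distance in a shared latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Hypothetical silhouette-clip branch: 3D convs -> unit-norm embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, clips):                  # clips: (B, 1, T, H, W)
        z = self.conv(clips).flatten(1)        # (B, 32)
        return F.normalize(self.fc(z), dim=1)

class AccelEncoder(nn.Module):
    """Hypothetical accelerometer branch: GRU over (x, y, z) samples."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_size=3, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, accel):                  # accel: (B, T, 3)
        _, h = self.gru(accel)                 # h: (1, B, 64)
        return F.normalize(self.fc(h[-1]), dim=1)

# Standard triplet loss with Euclidean distance (p=2 by default); the paper's
# novel triplet loss is a refinement of this objective, not reproduced here.
triplet = nn.TripletMarginLoss(margin=0.2)

clip_enc, acc_enc = ClipEncoder(), AccelEncoder()
clips   = torch.randn(8, 1, 16, 64, 64)        # toy silhouette clips
acc_pos = torch.randn(8, 50, 3)                # matching acceleration windows
acc_neg = torch.randn(8, 50, 3)                # windows from other subjects

anchor = clip_enc(clips)
loss = triplet(anchor, acc_enc(acc_pos), acc_enc(acc_neg))
loss.backward()

# At test time, each clip is assigned to the wearer whose acceleration
# embedding lies closest in Euclidean distance.
dists = torch.cdist(anchor, acc_enc(acc_pos))  # (8, 8) pairwise distances
```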