In recent years, deep learning in medicine has advanced rapidly, especially in image interpretation. Using whole-slide images of histological lymph node sections, this thesis develops deep learning models and techniques for detecting and classifying breast cancer metastases. This clinically relevant task normally requires extensive microscopic assessment by pathologists, and in patients with breast cancer, lymph node metastases have therapeutic implications. An automated solution could therefore substantially reduce pathologists' workload while also reducing the subjectivity of their diagnoses. Several significant challenges, however, still stand in the way of developing and translating medical deep learning systems. First, building large, well-annotated datasets is expensive, and the resulting labels are often imbalanced. Second, domain shift makes it difficult to transfer the performance of deep learning algorithms from one dataset or setting to another. Finally, the outputs of deep learning systems must be comprehensible and applicable to clinical datasets.

To improve effectiveness in an unseen target domain and to increase generalization, this thesis evaluates ensemble learning that transfers prior knowledge from both non-medical and medical sources. Even successful deep learning methods may not perform well in a clinical workflow: many datasets are constructed from millions of image patches, introducing a data curation bias, while others contain only slide-level annotations, which makes local errors hard to detect as long as the slide-level result is correct. To alleviate class imbalance and biased training data, this thesis proposes a cluster-based sampling method for whole-slide histopathology image analysis. With the proposed ensemble learning and sampling methods, cutting-edge machine learning architectures can be extended and state-of-the-art performance achieved on both diagnostic test images and whole-slide images.
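The cluster-based sampling method is described only at a high level here. As a rough illustration, the minimal sketch below shows one plausible realization under assumed details: patch feature vectors are grouped with k-means, and an equal quota of patches is drawn from each cluster so that rare tissue appearances are not swamped by the majority class. The function name, cluster count, and per-cluster quota are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch of cluster-based patch sampling (assumed details, not the thesis's exact method).
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_sample(patch_features, patches_per_cluster=100, n_clusters=8, seed=0):
    """Group patches by feature similarity and draw evenly from each group,
    so the training set is not dominated by the majority appearance."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(patch_features)

    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)           # patch indices in this cluster
        take = min(patches_per_cluster, members.size)   # equal quota per cluster
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)

# Usage (hypothetical): indices = cluster_based_sample(features)  # features: (n_patches, n_dims)
```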
The sampling experiments show that, when training and evaluation use the same dataset, the results are roughly the same regardless of the sampling method. However, performance on an unseen dataset drops dramatically when patches are sampled randomly; sampling separately reduces the likelihood of such drops. Random selection yields a sensitivity of 0.82 on an unseen dataset, whereas separate sampling yields 0.93. Similarly, the Dice similarity index is 0.83 for random selection versus 0.90 for separate sampling. Furthermore, the slide-level results on the St. Michael's Hospital dataset show that no single model is clearly best: every model performs well on at least one slide. Therefore, combining the results from multiple configurations yields a more reliable and consistent outcome.
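As a rough illustration of this concluding point, the sketch below shows one simple way to combine probability maps from multiple configurations (per-pixel averaging) and to compute the Dice similarity index reported above. The averaging rule, the 0.5 threshold, and the function names are assumptions made for illustration rather than the thesis's exact procedure.

```python
# Illustrative sketch: combining configurations and computing the Dice index (assumed details).
import numpy as np

def ensemble_probability(prob_maps):
    """Average per-pixel tumor probabilities predicted by several configurations."""
    return np.mean(np.stack(prob_maps, axis=0), axis=0)

def dice_index(pred_mask, true_mask, eps=1e-7):
    """Dice similarity between a binary prediction and the ground-truth mask."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

# Usage (hypothetical): combined = ensemble_probability([p1, p2, p3]) > 0.5
#                       score = dice_index(combined, ground_truth)
```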