Tuberculosis is one of the most serious infectious diseases, and its treatment is highly dependent on early detection. Microscopy-based analysis of sputum images for bacilli identification is a common technique used for both diagnosis and treatment monitoring. However, it is a challenging process, since sputum analysis requires time and highly trained experts to avoid potentially fatal mistakes. Capturing fields of view (FOVs) from high-resolution whole slide images is a laborious procedure, since they are manually localized and then examined to determine the presence of bacteria. In the present paper we propose a method that automates the process, thus greatly reducing the amount of human labour. In particular, we (i) describe an image-processing-based method for the extraction of a FOV representation which emphasises salient bacterial content while suppressing confounding visual information, and (ii) introduce a novel deep-learning-based architecture which learns from coarsely labelled FOV images and the corresponding binary masks, and then classifies novel FOV images as salient (bacteria-containing) or not. Using a real-world data corpus, the proposed method is shown to outperform 12 state-of-the-art methods from the literature, achieving (i) an approximately 10% lower overall error rate than the next best model and (ii) perfect sensitivity (7% higher than the next best model).