The current state of the art in document image binarization, in terms of performance, is to train artificial neural networks on pre-labelled ground truth data. As such, it shares a problem with other, more conventional, classification tasks: it requires a large amount of training data. Unlike those conventional tasks, however, document image binarization requires the binarized ground truth to be either manually crafted or estimated, which is error-prone and time-consuming. This is where sample selection, the act of selecting training samples based on some method or metric, might help. By reducing the size of the training dataset in such a way that binarization performance is not affected, the time required to create the ground truth is also reduced. This thesis proposes a cluster-based sample selection method, building on previous work, that uses image similarity metrics and the relative neighbourhood graph to reduce the underlying redundancy of the dataset. The method is implemented with different clustering methods and similarity metrics for comparison; the best implementation is based on affinity propagation and the structural similarity index. This implementation reduces the training dataset by 46% while maintaining performance equal to that of the complete dataset. The performance of this method is shown not to be significantly different from that of randomly selecting the same number of samples. However, because random selection has limitations, such as unpredictable performance and uncertainty about how many samples to select, the use of sample selection in document image binarization still shows great promise.
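To make the building blocks named above more concrete, the following is a minimal sketch and not the thesis implementation. It assumes a simplified single-window structural similarity index (the usual formulation averages over local windows), builds the relative neighbourhood graph from the resulting distance matrix, and applies a purely hypothetical pruning rule in place of the affinity-propagation clustering step, which is not reproduced here. The function names `global_ssim`, `relative_neighbourhood_graph` and `select_samples`, the threshold `max_redundant_dist`, and the toy images are all illustrative assumptions.

```python
from itertools import combinations


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM of two equal-sized grayscale images given as flat lists.

    Simplification: one global window instead of averaged local windows.
    """
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = _mean(a), _mean(b)
    var_a = _mean((x - mu_a) ** 2 for x in a)
    var_b = _mean((y - mu_b) ** 2 for y in b)
    cov = _mean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def relative_neighbourhood_graph(dist):
    """Edges (i, j), i < j, such that no third point k is closer to both i and j."""
    n = len(dist)
    edges = set()
    for i, j in combinations(range(n), 2):
        blocked = any(
            max(dist[i][k], dist[j][k]) < dist[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


def select_samples(dist, max_redundant_dist):
    """Hypothetical rule: skip a sample if an already kept RNG neighbour is very close."""
    neighbours = {i: set() for i in range(len(dist))}
    for i, j in relative_neighbourhood_graph(dist):
        neighbours[i].add(j)
        neighbours[j].add(i)
    kept = []
    for i in range(len(dist)):
        redundant = any(
            j in neighbours[i] and dist[i][j] <= max_redundant_dist for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept


# Toy data: the second image is a near-duplicate of the first, the third differs.
images = [[0, 0, 255, 255], [0, 10, 255, 245], [255, 255, 0, 0]]
dist = [[1.0 - global_ssim(a, b) for b in images] for a in images]
selected = select_samples(dist, max_redundant_dist=0.05)
```

On this toy data the near-duplicate image is dropped and the other two are kept. In practice, a library implementation of the windowed structural similarity index and of affinity propagation would replace the simplified pieces used here.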