Text detection enables us to extract rich information from images. In this paper, we focus on how to generate bounding boxes that are appropriate to grasp text areas on books to help implement automatic text detection. We attempt not to improve a learning-based model by training it with an enough amount of data in the target domain but to leverage it, which has been already trained with another domain data. We develop algorithms that construct the bounding boxes by improving and leveraging the results of a learning-based method. Our algorithms can utilize different learning-based approaches to detect scene texts. Experimental evaluations demonstrate that our algorithms work well in various situations where books are roughly placed.There have been many studies of identifying books and text detection on them [3,4,5,6]. For example, [6] exploited line extractions based on the Hough transform. However, such work may fail to find text areas exactly when a book has a title of many words. Recently, text detection from natural scenes has intensely been developed as learning-based methods [7,8,9,10,11,12,13]. We consider leveraging learning-based methods for detecting texts in natural scene images in the text detection on books.Note that we do not attempt to improve the learning-based methods by training them with a good set of data but adopt them with additional refinement processes. This intention comes from the fact that it is often difficult to collect enough amount of data in the target domain and make them to be used for training. Recently, transfer learning has intensely been studied [14,15]. Transfer learning is the technology that re-learns only a part of the parameters of a model trained in another domain. In this paper, we exploit trained models instead of re-learning. We develop algorithms that construct bounding boxes for text areas by improving and leveraging the results of a learning-based method.This paper is based on our previous work [1], in which a convolutional neural network (CNN)-based method called R 2 CNN [8] was used to develop concrete discussions. In this paper, we additionally use another learning-based method called EAST [7], which is based on a fully convolutional network model and achieves a fast and accurate scene text detection pipeline. Figure 1 shows an overview of our study where an image is processed first by a learning-based method, i.e., R 2 CNN or EAST, and the result is then processed by our proposal to get the final result of text detection.The main contributions of this paper are as follows: