Recent advances in machine learning techniques for image classification have led to the development of robust approaches to both object detection and extraction. Traditional CNN architectures, such as LeNet, AlexNet and CaffeNet, usually use as input images of fixed sizes taken from objects and attempt to assign labels to those images. Another possible approach is the Fast Region-based CNN (or Fast R-CNN), which works by using two models: (i) a Region Proposal Network (RPN) which generates a set of potential Regions of Interest (RoI) in the image; and (ii) a traditional CNN which assigns labels to the proposed RoI. As an alternative, this study proposes an approach to automatic object extraction from aerial images similar to the Fast R-CNN architecture, the main difference being the use of the Simple Linear Iterative Clustering (SLIC) algorithm instead of an RPN to generate the RoI. The dataset used is composed of high-resolution aerial images and the following classes were considered: house, sport court, hangar, building, swimming pool, tree, and street/road. The proposed method can generate RoI with different sizes by running a multi-scale SLIC approach. The overall accuracy obtained for object detection was 89% and the major advantage is that the proposed method is capable of semantic segmentation by assigning a label to each selected RoI. Some of the problems encountered are related to object proximity, in which different instances appeared merged in the results.