Image composition extracts the content of interest (COI) from a source image and blends it into a target image to generate a new image. In the majority of existing works, the COI is manually extracted and then overlaid on top of the target image. However, in practice, it is often necessary to deal with situations in which the COI is partially occluded by the target image content. In this regard, both tasks of extracting the COI and cropping its occluded part require intensive user interactions, which are laborious and seriously reduce the composition efficiency. This paper addresses the aforementioned challenges by proposing an efficient image composition method. First, we extract the semantic contents of the images by using state‐of‐the‐art deep learning methods. Therefore, the COI can be selected with clicks only, which can greatly reduce the demanded user interactions. Second, according to the user's operations (such as translation or scale) on the COI, we can effectively infer the occlusion relationships between the COI and the contents of the target image. Thus, the COI can be adaptively embedded into the target image without concern about cropping its occluded part. Therefore, the procedures of content extraction and occlusion handling can be significantly simplified, and work efficiency is remarkably improved. Experimental results show that compared to existing works, our method can reduce the number of user interactions to approximately one‐tenth and increase the speed of image composition by more than ten times.