Real-world scenes are typically complex and cluttered, but we are able to explore them and recognize individual objects rapidly and effortlessly. A growing body of evidence indicates that semantic relations present in scenes play a key role in the efficient recognition of objects. However, to what extent semantic relations are able to guide spatial attention in an automatic manner remains a matter of debate. Considering that the operation of spatial attention can be understood as a sequence of shifts, engagements, and disengagements, semantic relations might affect each stage of this cycle differently. Therefore, the present study was designed to investigate whether objects that violate semantic rules engage and hold attention for a longer time than objects that are expected in a given context. To this end, we used a paradigm involving a central presentation of a distractor scene, which contained either a semantically congruent or an incongruent object, and a peripheral presentation of a small target letter. Importantly, the experiment included a “positive control” condition, in which disgust-evoking scenes (which had been shown to robustly hold attention) were presented in the same way and compared to happiness-evoking scenes. We found that semantically incongruent scenes did not delay responses to the peripheral target relative to semantically congruent ones, which indicates that they did not hold attention for a longer time. At the same time, we did find an attention-hold effect caused by disgusting scenes, which confirms that the procedure was sensitive enough to detect such an effect. Therefore, by providing evidence that objects violating the semantic composition of a scene do not hold spatial attention automatically, our study contributes to a better understanding of how attention operates in complex, naturalistic settings.