We present an approach for automatically generating object annotations in images. Given a set of images known to contain a common object, our goal is to find, for each image, a bounding box that tightly encloses the object. In contrast to standard object detection, we assume no prior manual annotations except binary image-level labels. We first initialize our algorithm with very coarse bounding box estimates obtained from a discriminative color model. We then narrow down these boxes using visual words computed from HOG features. Finally, we apply an iterative algorithm that trains an SVM model on bag-of-visual-words histograms. In each iteration, the model is used to find better bounding boxes, which can be done efficiently by branch and bound; the new boxes are then used to retrain the model. We evaluate our approach on several object classes from publicly available datasets and show that we obtain promising results.
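The iterative refinement step can be illustrated with a minimal sketch. Everything here is a hypothetical simplification, not the paper's implementation: images are toy 2D grids of visual-word ids, a centroid-difference linear scorer stands in for the SVM, and exhaustive subwindow search replaces branch and bound (which finds the same maximum-scoring box, only faster).

```python
# Minimal sketch of the alternating train/localize loop. All names are
# illustrative; the linear "model" below is a crude stand-in for an SVM,
# and best_box() searches subwindows exhaustively rather than by
# branch and bound.

NUM_WORDS = 4  # size of the toy visual-word vocabulary


def box_histogram(grid, top, left, bottom, right):
    """Bag-of-visual-words histogram over a subwindow (inclusive bounds)."""
    hist = [0] * NUM_WORDS
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            hist[grid[r][c]] += 1
    return hist


def best_box(grid, weights, min_size=2):
    """Exhaustive max-score subwindow search (branch and bound in the paper)."""
    h, w = len(grid), len(grid[0])
    best, best_score = None, float("-inf")
    for top in range(h):
        for bottom in range(top + min_size - 1, h):
            for left in range(w):
                for right in range(left + min_size - 1, w):
                    hist = box_histogram(grid, top, left, bottom, right)
                    score = sum(wt * x for wt, x in zip(weights, hist))
                    if score > best_score:
                        best, best_score = (top, left, bottom, right), score
    return best


def retrain(grids, boxes):
    """Stand-in for SVM training: weight each visual word by its relative
    frequency inside the current boxes minus its frequency outside them."""
    inside = [0.0] * NUM_WORDS
    outside = [0.0] * NUM_WORDS
    for grid, (t, l, b, r) in zip(grids, boxes):
        box = box_histogram(grid, t, l, b, r)
        full = box_histogram(grid, 0, 0, len(grid) - 1, len(grid[0]) - 1)
        for wi in range(NUM_WORDS):
            inside[wi] += box[wi]
            outside[wi] += full[wi] - box[wi]
    si, so = sum(inside) or 1.0, sum(outside) or 1.0
    return [inside[wi] / si - outside[wi] / so for wi in range(NUM_WORDS)]


def localize(grids, init_boxes, iterations=3):
    """Alternate model retraining and bounding box re-localization."""
    boxes = init_boxes
    for _ in range(iterations):
        weights = retrain(grids, boxes)                 # fit model on boxes
        boxes = [best_box(g, weights) for g in grids]   # re-localize
    return boxes
```

Starting from deliberately coarse boxes (mimicking the color-model initialization), the loop shrinks the boxes toward regions whose word statistics differ most from the background, then refits the model on those tighter boxes.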