Abstract. Traditional Hough transform-based methods detect objects by casting votes from object patches to object centroids. Because an image patch carries only partial information about the object, it is difficult for a classifier to disambiguate object patches from the background without contextual information. To leverage the contextual information among image patches, we model their contextual relationships with a conditional random field (CRF) whose latent variables are represented by locality-constrained linear coding (LLC). The strength of the pairwise energy in the CRF is measured with a Gaussian kernel. In the training stage, we modulate the visual codebook by learning the CRF model iteratively. In the test stage, the binary labels of image patches are jointly estimated by the CRF model. Image patches labeled as the object category then cast weighted votes for object centroids in an image according to their LLC coefficients. Experimental results on the INRIA pedestrian, TUD-Brussels, and Caltech pedestrian datasets demonstrate the effectiveness of the proposed method compared with other Hough transform-based methods.
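To make the voting step concrete, the following minimal Python sketch illustrates the test-stage pipeline summarized above: a Gaussian kernel measures the pairwise energy strength between patch descriptors, and patches the CRF labels as object cast centroid votes weighted by their LLC coefficients. The function names, the per-codeword displacement representation, and all parameters are hypothetical illustrations under simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def gaussian_pairwise_strength(f_i, f_j, sigma=1.0):
    """Strength of the CRF pairwise energy between two patch
    descriptors, measured with a Gaussian (RBF) kernel."""
    return np.exp(-np.sum((f_i - f_j) ** 2) / (2.0 * sigma ** 2))

def hough_vote_map(patches, labels, llc_coeffs, displacements, image_shape):
    """Accumulate LLC-weighted votes for object centroids.

    patches       : (N, 2) array of patch-center coordinates (x, y)
    labels        : (N,) binary CRF labels (1 = object patch)
    llc_coeffs    : (N, K) LLC codes over a K-word visual codebook
    displacements : (K, 2) offsets from a patch to the object centroid,
                    one per codeword (a hypothetical representation,
                    assumed to be learned in the training stage)
    image_shape   : (height, width) of the voting map
    """
    h, w = image_shape
    votes = np.zeros((h, w), dtype=np.float64)
    for (x, y), lab, code in zip(patches, labels, llc_coeffs):
        if lab != 1:                    # only object-labeled patches vote
            continue
        for k in np.nonzero(code)[0]:   # LLC codes are sparse
            cx = int(round(x + displacements[k, 0]))
            cy = int(round(y + displacements[k, 1]))
            if 0 <= cx < w and 0 <= cy < h:
                votes[cy, cx] += code[k]  # vote weighted by LLC coefficient
    return votes
```

Object hypotheses would then correspond to local maxima of the returned voting map; the sketch omits the CRF inference and codebook learning steps.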