Fetching objects that a human requests by voice is an important task for a robot. Accomplishing this task requires object recognition that uses both speech and images. Ozasa et al. proposed a method for object recognition that integrates speech and image information. This method requires both speech (word) models and image models. The speech models are constructed automatically by combining phonemic acoustic models according to a dictionary, whereas the image models have to be constructed manually in advance. In this paper, we propose a method that replaces this manual step by constructing the image models automatically from Web images. We verify the effectiveness of the proposed method in object recognition that integrates speech and image information.