Variation in lighting conditions is a major cause of performance degradation in pattern recognition when using optical imaging. In this study, infrared (IR) and depth images were considered as possible robust alternatives against variations in illumination, particularly for improving the performance of automatic lip-reading. The variations due to lighting conditions were quantitatively analyzed for optical, IR, and depth images. Then, deep neural network (DNN)-based lip-reading rules were built for each image modality. Speech recognition techniques based on IR or depth imaging required an additional light source that emitted light in the IR range, along with a special camera. To mitigate this problem, we propose a method that does not use an IR/depth image directly, but instead estimates images based on the optical RGB image. To this end, a modified U-net was adopted to estimate the IR/depth image from an optical RGB image. The results show that the IR and depth images were rarely affected by the lighting conditions. The recognition rates for the optical, IR, and depth images were 48.29%, 95.76%, and 92.34%, respectively, under various lighting conditions. Using the estimated IR and depth images, the recognition rates were 89.35% and 80.42%, respectively. This was significantly higher than for the optical RGB images.