Computer vision provides a real-time, non-destructive, and indirect way to estimate horticultural crop yield, and deep learning improves the accuracy of such estimates. However, the accuracy of current estimation models based on RGB (red, green, blue) images does not meet the standard of a soft sensor. By enriching the training data and improving the structure of the RGB-based convolutional neural network (CNN) estimation model, this paper increased the coefficient of determination (R2) by 0.0284 and decreased the normalized root mean squared error (NRMSE) by 0.0575. After introducing a novel loss function, the mean squared percentage error (MSPE), which emphasizes reducing the mean absolute percentage error (MAPE), the MAPE decreased by 7.58%. This paper further develops a lettuce fresh weight estimation method through the multimodal fusion of RGB and depth (RGB-D) images. With multimodal fusion based on calibrated RGB and depth images, R2 increased by 0.0221, NRMSE decreased by 0.0427, and MAPE decreased by 3.99%. With the novel loss function, MAPE further decreased by 1.27%. The final MAPE of 8.47% supports the development of a soft sensor for lettuce fresh weight estimation.
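The abstract does not spell out the MSPE formula. A plausible reading, consistent with its stated goal of emphasizing MAPE, is the mean of squared relative errors, so that the loss penalizes the same percentage deviations that MAPE measures. A minimal NumPy sketch under that assumption (the function names and exact formulation are illustrative, not the paper's implementation):

```python
import numpy as np

def mspe(y_true, y_pred):
    """Assumed mean squared percentage error:
    mean of ((y_true - y_pred) / y_true) ** 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel = (y_true - y_pred) / y_true  # relative (percentage) error per sample
    return float(np.mean(rel ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, reported in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Example: fresh weights of 100 g and 200 g, predicted as 110 g and 180 g.
# Both samples are off by 10%, so MAPE is 10.0% and MSPE is 0.01.
print(mape([100, 200], [110, 180]))  # → 10.0
print(mspe([100, 200], [110, 180]))  # → 0.01
```

Because squaring grows faster than the absolute value for relative errors above 1 (100%), such a loss penalizes large percentage deviations more heavily than MAPE itself would, which is one reason to prefer it as a training objective.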