A visual system is a key component of automatic fruit harvesting. In the field, it must cope with varied occlusion and illumination, which make fruit recognition and picking difficult. Many existing studies lack a comprehensive analysis of how the environment affects harvesting. This study proposes an object–environment fusion visual system comprising three modules: object perception, environment perception, and picking pose estimation. The object perception module identifies and locates pears. The environment perception module analyzes the three-dimensional (3D) information of objects and obstacles. The picking pose estimation module then fuses the object and environment information to calculate a collision-free picking position and orientation. Each module has a specific implementation: three networks are compared for pear identification in object perception, a voxel-based representation simplifies the point clouds in environment perception, and a sampler and an evaluator are applied in picking pose estimation. To evaluate the proposed method, the S1 and S2 datasets were acquired from a laboratory pear tree model and the orchard of Zhejiang Academy of Agricultural Sciences, respectively. On the S2 dataset, the success rate of picking pose estimation reached 87.11% within a distance range of 30–50 cm. The results demonstrate that the proposed method can be applied to visual perception for automatic pear harvesting.
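
To make the fusion step concrete, the following is a minimal sketch of how a voxel-based obstacle map, a pose sampler, and an evaluator could fit together. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the 1 cm voxel size, the 15 cm standoff distance, the yaw/pitch sampling grid, and the scoring rule (prefer the direction closest to a nominal +x approach axis) are all hypothetical. The obstacle point cloud is assumed to exclude the points of the target fruit itself.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by the points."""
    idx = np.floor(np.asarray(points, dtype=float) / voxel_size).astype(int)
    return {tuple(v) for v in idx}

def sample_approach_directions(yaw_deg=(-30, 0, 30), pitch_deg=(-30, 0, 30)):
    """Sampler (hypothetical): unit approach directions on a yaw/pitch grid
    around the nominal +x axis."""
    dirs = []
    for yaw in np.radians(yaw_deg):
        for pitch in np.radians(pitch_deg):
            dirs.append(np.array([np.cos(pitch) * np.cos(yaw),
                                  np.cos(pitch) * np.sin(yaw),
                                  np.sin(pitch)]))
    return dirs

def is_collision_free(start, end, occupied, voxel_size):
    """Check points along the straight segment start -> end, spaced at most
    half a voxel apart, against the occupied voxel set."""
    length = np.linalg.norm(end - start)
    n_steps = max(2, int(np.ceil(length / (0.5 * voxel_size))) + 1)
    for t in np.linspace(0.0, 1.0, n_steps):
        p = start + t * (end - start)
        if tuple(np.floor(p / voxel_size).astype(int)) in occupied:
            return False
    return True

def estimate_picking_pose(fruit_center, obstacle_points,
                          voxel_size=0.01, standoff=0.15):
    """Evaluator (hypothetical): among collision-free candidates, keep the
    direction with the largest cosine to the nominal +x approach axis.
    Returns (start_position, direction), or None if every candidate collides."""
    fruit = np.asarray(fruit_center, dtype=float)
    occupied = voxelize(obstacle_points, voxel_size)
    best = None
    for d in sample_approach_directions():
        start = fruit - standoff * d  # gripper starts one standoff away
        if not is_collision_free(start, fruit, occupied, voxel_size):
            continue
        score = d[0]  # cosine of the angle to the +x axis
        if best is None or score > best[0]:
            best = (score, start, d)
    return None if best is None else (best[1], best[2])
```

For example, calling `estimate_picking_pose([0.4, 0.0, 0.0], np.array([[0.305, 0.005, 0.005]]))` places a single obstacle point on the straight-ahead path, so the sketch rejects the nominal direction and returns a tilted approach instead. A real system would also need to account for the gripper volume and the arm's reachability, which this sketch ignores.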