The detection and pose estimation of objects in human demonstrations remain challenging yet crucial tasks. The increasing availability of red-green-blue and depth (RGB-D) sensors makes it possible to synthesize local features of color and three-dimensional (3D) geometry, which are useful for processing a wider range of objects. However, existing methods fail to combine the inherent advantages of these two feature types. Moreover, pose refinement methods based on whole point clouds are often affected by occlusion and background noise. In this paper, feature points of the speeded-up robust features (SURF) and the fast point feature histogram (FPFH) were transformed into the same 3D space. After the two feature types were matched separately, the multimodal feature points were jointly used to estimate a coarse pose. The coarse pose was then refined by aligning point clouds composed of the feature points' neighboring patches; during the iterative closest point (ICP) process, corresponding points were selected only within matched local patches. In the first and second comparative experiments, F1 scores increased by 0.1349 and 0.1633, respectively, verifying the validity of the proposed method. A third qualitative experiment showed that the method is applicable to the detection and pose estimation of manipulated objects.
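To make the coarse-to-fine flow concrete, the following is a minimal sketch of the geometric branch alone, implemented with the Open3D library: FPFH descriptors drive a RANSAC-based coarse alignment, which ICP then refines. It is an illustration under stated assumptions, not the paper's full method; in particular, it omits the SURF color branch, the joint multimodal pose estimation, and the patch-restricted correspondence selection, and the file names and voxel size are placeholders.

```python
import open3d as o3d


def preprocess(pcd, voxel):
    """Downsample, estimate normals, and compute FPFH descriptors."""
    down = pcd.voxel_down_sample(voxel)
    down.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        down,
        o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
    return down, fpfh


voxel = 0.005  # placeholder: 5 mm voxels for a tabletop-scale scene
model = o3d.io.read_point_cloud("model.pcd")  # placeholder object model
scene = o3d.io.read_point_cloud("scene.pcd")  # placeholder RGB-D scene
model_down, model_fpfh = preprocess(model, voxel)
scene_down, scene_fpfh = preprocess(scene, voxel)

# Coarse pose: RANSAC over FPFH feature correspondences.
coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
    model_down, scene_down, model_fpfh, scene_fpfh,
    mutual_filter=True,
    max_correspondence_distance=voxel * 1.5,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(False),
    ransac_n=3,
    checkers=[
        o3d.pipelines.registration.CorrespondenceCheckerBasedOnEdgeLength(0.9),
        o3d.pipelines.registration.CorrespondenceCheckerBasedOnDistance(voxel * 1.5),
    ],
    criteria=o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))

# Refinement: point-to-plane ICP seeded with the coarse pose. The paper
# instead restricts ICP correspondences to matched local patches, which
# is what gives it robustness to occlusion and background noise.
fine = o3d.pipelines.registration.registration_icp(
    model_down, scene_down, voxel * 0.4, coarse.transformation,
    o3d.pipelines.registration.TransformationEstimationPointToPlane())
print(fine.transformation)  # estimated 4x4 object pose
```

Whole-cloud ICP, as in the last step above, is the baseline the paper improves on: when the scene contains clutter or the object is partially occluded, many nearest-neighbor correspondences are spurious, which motivates restricting them to the matched feature-point patches.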