“…based on various input modalities, including stereo images [3,24,32,41,51,40], RGB-D pairs [31,39,45,33], and LiDAR points [28,18,54,56,38,55,12,7,11,52]. These methods, however, require either strict sensor calibration (e.g., stereo-based methods) or expensive devices (e.g., RGB-D or LiDAR-based methods) to achieve satisfactory performance, which restricts their widespread application.…”