“…Even simple architectures have been shown to solve this task with a low error rate [15]. Recent advances are due to combining the 2D and 3D tasks into a joint estimation [4], [32], [33], [34] and using weakly [35], [36], [37], [38], [39] or self-supervised losses [40], [41], [42], [43], [44] or mixing 2D and 3D data for training [6], [42], [45], [46].…”