“…While previous multi-person methods perform well in constrained experimental settings, they struggle with severe occlusion, diverse body size and appearance, the ambiguity of monocular depth, and in-the-wild cases [10,21,34,39,41]. These challenges lead to unsatisfactory performance in crowded scenes, including detection misses, similar predictions for overlapping people, and all predictions having a similar height.…”