Natural behavior is hierarchical, yet large-scale behavioral benchmarks that capture this structure are scarce. To address this gap, we create a novel synthetic basketball-playing benchmark (Shot7M2). Beyond synthetic data, we extend BABEL into a hierarchical action segmentation benchmark (hBABEL). We then develop a masked autoencoder framework (hBehaveMAE) to elucidate the hierarchical nature of motion capture data in an unsupervised fashion. We find that hBehaveMAE learns interpretable latents on Shot7M2 and hBABEL: lower encoder levels are better at representing fine-grained movements, while higher encoder levels capture complex actions and activities. Additionally, we evaluate hBehaveMAE on MABe22, a representation learning benchmark with short- and long-term behavioral states, where it achieves state-of-the-art performance without domain-specific feature extraction. Together, these benchmarks and models contribute towards unveiling the hierarchical organization of natural behavior. Models and benchmarks are available at https://github.com/amathislab/BehaveMAE.
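To make the core idea concrete, the sketch below illustrates hierarchical masked autoencoding over pose sequences: frame tokens are randomly masked, a lower encoder level attends within short windows (fine-grained movements), a higher level attends across pooled windows (longer-term actions), and a decoder reconstructs the masked frames. This is a minimal toy illustration, not the authors' hBehaveMAE implementation; the class name, window size, pooling scheme, and dimensions are all assumptions.

```python
# Toy sketch of hierarchical masked autoencoding on pose sequences (illustrative only).
import torch
import torch.nn as nn


class ToyHierarchicalMAE(nn.Module):
    def __init__(self, pose_dim=36, d_model=64, window=5, mask_ratio=0.75):
        super().__init__()
        self.window, self.mask_ratio = window, mask_ratio
        self.embed = nn.Linear(pose_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        # Level 1: attends within short windows (fine-grained movements).
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        # Level 2: attends across window-pooled tokens (longer-term actions).
        self.globl = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.decode = nn.Linear(d_model, pose_dim)

    def forward(self, poses):                       # poses: (B, T, pose_dim), T divisible by window
        B, T, _ = poses.shape
        x = self.embed(poses)
        # Randomly mask a fraction of frame tokens before encoding.
        mask = torch.rand(B, T, device=poses.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        # Level-1 latents: per-frame tokens, context limited to each short window.
        low = self.local(x.reshape(B * (T // self.window), self.window, -1)).reshape(B, T, -1)
        # Level-2 latents: one token per window, context across the whole sequence.
        high = self.globl(low.reshape(B, T // self.window, self.window, -1).mean(dim=2))
        # Reconstruct poses; the loss is computed only on the masked frames.
        recon = self.decode(low + high.repeat_interleave(self.window, dim=1))
        loss = ((recon - poses) ** 2).mean(-1)[mask].mean()
        return loss, low, high


poses = torch.randn(2, 20, 36)                      # 2 clips, 20 frames, 36-D keypoints
loss, low, high = ToyHierarchicalMAE()(poses)
print(loss.item(), low.shape, high.shape)           # fine-grained vs. coarse latents
```

In this toy setup, the level-1 latents (`low`) play the role of the fine-grained movement representations, while the level-2 latents (`high`) correspond to the coarser action-level representations evaluated on Shot7M2, hBABEL, and MABe22.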