Abnormal falls in public places pose significant safety hazards and can easily lead to serious consequences, such as crowd trampling. Vision-driven fall event detection has the major advantage of being non-invasive. In real scenes, however, fall behavior is highly diverse, which makes detection unstable. Based on a study of the stability of human body dynamics, we propose a new human posture representation of fall behavior, called the “five-point inverted pendulum model”, and use an improved two-branch multi-stage convolutional neural network (M-CNN) to extract and construct the inverted pendulum structure of human posture in complex real-world scenes. Furthermore, we consider the temporal continuity of the fall event, use multimedia analytics to observe how the human inverted pendulum structure changes over time, and construct a spatio-temporal evolution map of human posture movement. Finally, based on the integrated results of computer vision and multimedia analytics, we reveal the visual characteristics of the spatio-temporal evolution of human posture in a potentially unstable state, and explore two key features of human fall behavior: motion rotational energy and generalized force of motion. Experimental results in real scenes show that the method is robust, generalizes widely, and achieves high detection accuracy.