In this paper, we present a video-based method for detecting falls of elderly people living alone. We propose using measurements of a person's height and occupied area to distinguish three typical human states: standing, sitting, and lying. Two roughly orthogonal views are used, which simplifies the estimation of the occupied area to the product of the widths of the same person observed in the two cameras. However, features estimated from silhouette sizes vary across the viewing window because of camera perspective. To address this, we suggest using Local Empirical Templates (LET), defined as the sizes of standing people in local image patches. LET have two important characteristics: (i) they can be extracted automatically in unknown scenes, and (ii) by their nature, they carry the perspective information needed for feature normalization. The normalization not only cancels the perspective effect but also takes the features of standing people as baselines. We observe that standing people are taller than sitting and lying people and occupy smaller areas. Thus, the three states fall into three separable regions of the proposed feature space, which is composed of normalized height and normalized occupied area. Fall incidents can then be inferred from time-series analysis of human state transitions. We test our method on 24 video samples from the Multi-view Fall Dataset (1) and obtain high detection rates and low false-alarm rates, outperforming the state-of-the-art methods (2) (3) tested on the same benchmark dataset.
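The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class and function names, the averaging of the two per-view height ratios, the threshold values, and the frame window used to flag a fall are all hypothetical choices made for illustration. Only the overall idea comes from the text: features are normalized by the local template of a standing person, the occupied area is the product of the two widths, and falls are inferred from state transitions.

```python
from dataclasses import dataclass


@dataclass
class Silhouette:
    """Silhouette size (pixels) of one person seen in two cameras, A and B."""
    height_a: float
    width_a: float
    height_b: float
    width_b: float


def normalized_features(obs: Silhouette, let: Silhouette) -> tuple[float, float]:
    """Normalize an observation by the local template (LET) of a standing person.

    `let` is the template for the image patches where `obs` was observed.
    Averaging the two height ratios is an assumption of this sketch.
    """
    height = 0.5 * (obs.height_a / let.height_a + obs.height_b / let.height_b)
    # Occupied area is approximated by the product of the widths in the two views.
    area = (obs.width_a * obs.width_b) / (let.width_a * let.width_b)
    return height, area


def classify_state(height: float, area: float) -> str:
    """Map normalized features to a state using illustrative thresholds only."""
    if height >= 0.8:
        return "standing"
    if height <= 0.45 and area >= 2.0:
        return "lying"
    return "sitting"


def detect_fall(states: list[str], max_transition_frames: int = 15) -> bool:
    """Flag a fall when 'lying' follows 'standing' within a short frame window.

    The window length is a hypothetical parameter, not a value from the paper.
    """
    last_standing = None
    for i, state in enumerate(states):
        if state == "standing":
            last_standing = i
        elif state == "lying":
            if last_standing is not None and i - last_standing <= max_transition_frames:
                return True
            # Slow transition (e.g. sitting, then lying down): not a fall.
            last_standing = None
    return False
```

A standing person whose silhouette matches the local template yields a normalized height and area of 1, which is the baseline the normalization is meant to provide. Sitting and lying people move away from that point toward lower heights and larger areas.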