In recent years, with increasing concern about public safety and security, human movements and action sequences have become highly valuable when dealing with suspicious and criminal activities. Estimating the position and orientation associated with human movements requires depth information, which is conventionally obtained by fusing data from multiple cameras at different viewpoints. In practice, whenever occlusion occurs in a surveillance environment, there may be no pixel-to-pixel correspondence between the two images captured by the two cameras and, as a result, the depth information may be inaccurate. Moreover, the use of more than one camera adds burden to the surveillance infrastructure. In this study, we present a mathematical model for acquiring object depth information using a single camera by capturing the in-focus portion of an object in a single image. With the camera in focus and the focal length fixed with reference to the lens center, the object distance is varied for each aperture setting. For each aperture reading at the corresponding distance, the object distance (or depth) is estimated by relating three parameters: the lens aperture radius, the object distance, and the object size in the image plane. The results show that the distance computed from this relationship approximates the actual distance with a standard error of estimate of 2.39 to 2.54 when tested on Nikon and Canon cameras, with an accuracy of 98.1% at the 95% confidence level.
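The abstract does not state the fitted relationship itself; as one illustration of the underlying geometry, the sketch below assumes the standard thin-lens magnification relation u = f(1 + S/s), where f is the focal length, S the physical object size, and s the object size in the image plane. This is a minimal sketch under that assumption, not the paper's calibrated model (which additionally involves the lens aperture radius); the function name and usage values are hypothetical.

```python
# Minimal sketch of single-camera depth estimation from the in-focus
# image of an object of known physical size. Assumes the standard
# thin-lens magnification relation; the paper's fitted relationship
# involving the aperture radius is not reproduced here.

def estimate_depth_mm(focal_length_mm: float,
                      object_size_mm: float,
                      image_size_mm: float) -> float:
    """Estimate object distance u from the lens center.

    Thin-lens magnification: m = f / (u - f), and the image size
    s = m * S for an object of physical size S, which rearranges to
        u = f * (1 + S / s).
    """
    if image_size_mm <= 0:
        raise ValueError("image size must be positive")
    return focal_length_mm * (1.0 + object_size_mm / image_size_mm)


# Hypothetical usage: a 50 mm lens imaging a 1700 mm tall person whose
# image on the sensor spans 8.5 mm yields an estimated depth of about 10 m.
if __name__ == "__main__":
    depth_mm = estimate_depth_mm(50.0, 1700.0, 8.5)
    print(f"estimated depth: {depth_mm / 1000.0:.2f} m")
```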