This dissertation proposes a novel framework for recovering relative depth maps from a video.
The framework is composed of two parts: a depth estimator and a sparse label interpolator. The two stages are fully decoupled and can operate independently. Prior methods have tended to couple the interpolation stage tightly with the depth estimation, which can aid automation at the expense of flexibility, and the loss of that flexibility can outweigh any advantage gained by coupling the two stages. This dissertation shows that treating the two stages separately makes it easy to adjust the quality of the results with little effort, and the separation also leaves room for further adjustments.
The depth estimator is based upon well-established computer vision principles, with the sole restriction that the camera must be moving in order to obtain depth estimates. By starting from first principles, this dissertation develops a new approach for quickly estimating relative depth; that is, it answers the question, “Is this feature closer than another?”, with relatively little computational overhead. The estimator is designed as a pipeline so that it produces sparse depth estimates in an online fashion; i.e., a depth estimate is automatically available for each new frame presented to the estimator.
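As a concrete illustration of this kind of online estimator, the following minimal Python sketch (an illustration, not the dissertation's actual pipeline) exploits motion parallax: under a translating camera, tracked features with larger frame-to-frame optical flow are generally closer, so per-frame flow magnitudes give a sparse relative-depth ordering. OpenCV is assumed for tracking, and the function name and all parameter values are illustrative.

```python
# Minimal sketch (illustrative, not the dissertation's estimator): under a
# translating camera, features with larger frame-to-frame optical flow are
# generally closer (motion parallax), so flow magnitudes at tracked features
# provide a sparse relative-depth ordering, one estimate per incoming frame.
import cv2
import numpy as np

def sparse_relative_depth(frames):
    """Yield (points, scores) per frame; a higher score means 'closer'."""
    prev_gray, prev_pts = None, None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_pts is not None and len(prev_pts) >= 8:
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                      prev_pts, None)
            ok = status.ravel() == 1
            new = nxt.reshape(-1, 2)[ok]
            old = prev_pts.reshape(-1, 2)[ok]
            flow = np.linalg.norm(new - old, axis=1)
            yield new, flow                  # larger parallax => closer
            prev_pts = new.reshape(-1, 1, 2).astype(np.float32)
        else:
            # (Re-)detect features; no estimate is available for this frame.
            yield np.empty((0, 2)), np.empty(0)
            prev_pts = cv2.goodFeaturesToTrack(gray, maxCorners=500,
                                               qualityLevel=0.01, minDistance=7)
        prev_gray = gray
```

Note that the parallax-as-depth reading holds only when the camera motion is dominated by translation; compensating for camera rotation is beyond this sketch.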
Finally, the interpolator applies an existing method based upon edge-aware filtering to generate the final depth maps. When temporal filters are used, the interpolation stage readily handles frames that contain no depth information, such as those captured while the camera was stationary. Unlike the prior work, however, this dissertation establishes the theoretical background for this type of interpolation, addresses some of the associated numerical problems, and provides strategies for dealing with those issues.
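To make the sparse-to-dense step concrete, the sketch below (an illustration, not the dissertation's actual interpolator) densifies sparse samples with joint-bilateral weights: each sample's influence falls off with spatial distance and with intensity differences in the guide frame, so depth does not bleed across image edges. A simple temporal blend then carries depth through frames that produced no samples at all. The function names, the parameters sigma_s, sigma_r, and alpha, and the brute-force formulation are all illustrative.

```python
import numpy as np

def edge_aware_interpolate(guide, pts, depths, sigma_s=25.0, sigma_r=0.1):
    """Densify sparse depth samples with joint-bilateral weights.

    guide  : (H, W) grayscale frame in [0, 1]; supplies the edges
    pts    : (N, 2) sample positions as (x, y)
    depths : (N,) sparse relative-depth values at those positions
    Brute force for clarity: O(N * H * W).
    """
    h, w = guide.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dense = np.zeros((h, w))
    weights = np.full((h, w), 1e-8)          # avoids divide-by-zero
    for (px, py), d in zip(pts, depths):
        spatial = ((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma_s ** 2)
        rng = (guide - guide[int(py), int(px)]) ** 2 / (2.0 * sigma_r ** 2)
        w_k = np.exp(-(spatial + rng))       # large intensity jumps kill the weight
        dense += w_k * d
        weights += w_k
    return dense / weights

def temporal_blend(prev_dense, cur_dense, alpha=0.7):
    """Exponential blend across frames; a frame that produced no samples
    (cur_dense is None) simply reuses the previous dense map."""
    if cur_dense is None:
        return prev_dense
    if prev_dense is None:
        return cur_dense
    return alpha * cur_dense + (1.0 - alpha) * prev_dense
```

In this scheme a stationary-camera frame yields no sparse samples, so cur_dense is None and the previous dense map is carried forward unchanged, which is one simple way a temporal filter can bridge such gaps.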