Kenichiro FUKUSHI†a) and Itsuo KUMAZAWA††b), Members
SUMMARY    In this paper, we present a computer vision-based human tracking system that uses multiple stereo cameras. Many widely used methods, such as the KLT tracker, update trackers "frame-to-frame," so that features extracted from one frame are used to update the current state. In contrast, we propose a novel optimization technique for a "multi-frame" approach that computes the resulting trajectories directly from video sequences, in order to achieve a high level of robustness against severe occlusion, which is known to be a challenging problem in computer vision. We developed a heuristic optimization technique to estimate human trajectories, instead of using dynamic programming (DP) or an iterative approach, which makes our method computationally efficient enough to operate in real time. Six video sequences in which one to six people walk in a narrow laboratory space were processed with our system. The results confirm that our system is capable of tracking cluttered scenes in which severe occlusion occurs and people are frequently in close proximity to each other. Moreover, only minimal information, rather than full camera images, needs to be communicated over the network for tracking. Hence, commonly used network devices are sufficient to construct our tracking system.
key words: human tracking, multi-view, multi-frame, stereo vision, depth camera, occlusion robust
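To make the contrast in the summary concrete, the sketch below illustrates the conventional "frame-to-frame" update with a KLT-style tracker, here implemented via OpenCV's pyramidal Lucas-Kanade optical flow. The file name and parameter values are placeholders, and the multi-frame trajectory optimization proposed in this paper is not reproduced here; this is only a minimal baseline sketch.

```python
# Minimal sketch of a "frame-to-frame" tracker (KLT / pyramidal Lucas-Kanade)
# using OpenCV. Each new frame only updates the state estimated at the
# previous frame; no information from later frames is used.
import cv2

cap = cv2.VideoCapture("walking.avi")   # hypothetical input sequence
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Features extracted from the first frame initialize the tracker state.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                   qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or prev_pts is None or len(prev_pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Frame-to-frame update: points from the previous frame are propagated
    # to the current frame. Under full occlusion the status flags drop and
    # the track is lost, which is what motivates a multi-frame formulation.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                   prev_pts, None)
    prev_pts = next_pts[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```

Because each update relies solely on the previous frame, a fully occluded target leaves no features to propagate; the multi-frame approach instead estimates whole trajectories from a video sequence.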
Introduction

Computer vision-based human tracking has received increasing attention recently. Applications include ambient intelligence, human-computer interaction, human behavior analysis, and security. Computer vision (CV) allows tracking systems to operate without sensors such as RFID, GPS, or smartphones. However, CV-based tracking suffers from the problem of "occlusion," which occurs when a person being tracked passes behind other people or objects. Full occlusion removes all cues for tracking, and partial occlusion changes the appearance of the person, making tracking difficult.

Earlier work has identified several promising strategies. First, multi-view tracking approaches are employed to reduce the blind areas caused by occlusion. Tracking or feature extraction is conducted for each camera, and the final tracking result is then produced by fusing the evidence from all of the views. The problem is how to match regions observed from different viewpoints. Researchers proposed