While egocentric video is becoming increasingly popular, browsing it is very difficult. In this paper we present a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities enables us to temporally segment (index) long and unstructured egocentric videos. Existing methods for this task are based on hand-tuned features derived from visible objects, the location of hands, and optical flow. Given a sparse optical flow volume as input, our CNN classifies the camera wearer's activity. We obtain a classification accuracy of 89%, outperforming the current state-of-the-art by 19%. Additional evaluation is performed on an extended egocentric video dataset, classifying twice as many categories as the current state-of-the-art. Furthermore, our CNN recognizes whether a video is egocentric or not with 99.2% accuracy, up 24% from the current state-of-the-art. To better understand what the network actually learns, we propose a novel visualization of CNN kernels as flow fields.
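To make the idea concrete, below is a minimal sketch of a compact 3D CNN that maps a sparse optical flow volume to activity logits. The layer sizes, kernel shapes, and the `num_classes` value are illustrative assumptions; the abstract does not specify the architecture, and PyTorch is used here only as a convenient framework.

```python
# Sketch: a compact 3D CNN over a sparse optical flow volume.
# All architectural choices below are illustrative assumptions.
import torch
import torch.nn as nn

class FlowActivityCNN(nn.Module):
    def __init__(self, num_classes: int = 14):  # class count is hypothetical
        super().__init__()
        # Input shape: (batch, 2, T, H, W) -- x/y flow components over T frames.
        self.features = nn.Sequential(
            nn.Conv3d(2, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global pooling keeps the model compact
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, flow_volume: torch.Tensor) -> torch.Tensor:
        x = self.features(flow_volume)
        return self.classifier(x.flatten(1))

# Usage: classify one 16-frame sparse flow volume at 32x32 resolution.
model = FlowActivityCNN()
logits = model(torch.randn(1, 2, 16, 32, 32))
```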
While egocentric cameras like GoPro are gaining popularity, the videos they capture are long, boring, and difficult to watch from start to end. Fast forwarding (i.e., frame sampling) is a natural choice for faster video browsing. However, this accentuates the shake caused by natural head motion, making the fast-forwarded video useless. We propose EgoSampling, an adaptive frame sampling that gives more stable fast-forwarded videos. Adaptive frame sampling is formulated as an energy minimization problem, whose optimal solution can be found in polynomial time. In addition, egocentric video taken while walking suffers from the left-right movement of the head as the body weight shifts from one leg to another. We turn this drawback into a feature: stereo video can be created by sampling the frames from the leftmost and rightmost head positions of each step, forming approximate stereo pairs.
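The polynomial-time claim follows because frames ordered in time form a directed acyclic graph, over which the minimum-energy frame selection is a shortest path. The sketch below shows this with dynamic programming; the specific cost terms and weights are illustrative assumptions, not the paper's exact energy function.

```python
# Sketch: adaptive frame sampling as shortest path in a temporal DAG.
# shake_cost and the quadratic skip penalty are assumed, illustrative terms.
import numpy as np

def sample_frames(shake_cost: np.ndarray, target_skip: int = 8,
                  window: int = 16, lam: float = 1.0) -> list:
    """shake_cost[i, k] = instability of jumping from frame i to i + k + 1."""
    n = shake_cost.shape[0]
    best = np.full(n, np.inf)           # cheapest energy ending at each frame
    prev = np.full(n, -1, dtype=int)    # backpointers for path recovery
    best[0] = 0.0
    for i in range(n - 1):
        for k in range(min(window, n - 1 - i)):
            j = i + k + 1
            # Energy = accumulated cost + shake term + deviation from target skip.
            cost = best[i] + shake_cost[i, k] + lam * (k + 1 - target_skip) ** 2
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Backtrack from the last frame to recover the selected frame indices.
    path, j = [], n - 1
    while j != -1:
        path.append(j)
        j = prev[j]
    return path[::-1]
```

Because every edge points forward in time, a single O(n × window) pass finds the globally optimal sampling, which is what makes the optimization tractable in polynomial time.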
The possibility of sharing one's point of view makes the use of wearable cameras compelling. These videos are often long, boring, and coupled with extreme shake, as the camera is worn on a moving person. Fast forwarding (i.e., frame sampling) is a natural choice for quick video browsing. However, this accentuates the shake caused by natural head motion in an egocentric video, making the fast-forwarded video useless. We propose EgoSampling, an adaptive frame sampling that gives stable, fast-forwarded, hyperlapse videos. Adaptive frame sampling is formulated as an energy minimization problem, whose optimal solution can be found in polynomial time. We further turn the camera shake from a drawback into a feature, enabling an increase in the field of view of the output video. This is obtained when each output frame is mosaiced from several input frames. The proposed technique also enables the generation of a single hyperlapse video from multiple egocentric videos, allowing even faster video consumption.
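For the field-of-view expansion, one way to mosaic several input frames onto an enlarged output canvas is sketched below. The feature-based alignment (ORB + RANSAC homographies via OpenCV) and the fill-in blending are illustrative assumptions; the abstract does not specify the paper's actual alignment or stitching pipeline.

```python
# Sketch: widen the field of view by mosaicing neighboring frames onto a
# padded canvas around a reference frame. Alignment method is assumed.
import cv2
import numpy as np

def mosaic_frames(frames, pad=100):
    """Mosaic color (BGR) frames around frames[0] on a padded canvas."""
    h, w = frames[0].shape[:2]
    # Translation placing the reference frame at the center of the canvas.
    T = np.array([[1, 0, pad], [0, 1, pad], [0, 0, 1]], dtype=np.float64)
    canvas = cv2.warpPerspective(frames[0], T, (w + 2 * pad, h + 2 * pad))
    orb = cv2.ORB_create()
    k0, d0 = orb.detectAndCompute(frames[0], None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    for frame in frames[1:]:
        k, d = orb.detectAndCompute(frame, None)
        matches = matcher.match(d, d0)
        src = np.float32([k[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k0[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        # Compose: frame -> reference frame (H), then reference -> canvas (T).
        warped = cv2.warpPerspective(frame, T @ H, canvas.shape[1::-1])
        # Fill only the pixels the mosaic has not covered yet.
        empty = canvas.sum(axis=2) == 0
        canvas[empty] = warped[empty]
    return canvas
```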