We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.
In this paper, a modified adaptive K-means (MAKM) method is proposed to extract the region of interest (ROI) from local and public datasets. The local image dataset was collected from Bethezata General Hospital (BGH), and the public dataset comes from the Mammographic Image Analysis Society (MIAS). The same number of images is used for both datasets: 112 abnormal and 208 normal. Two texture feature sets (GLCM and Gabor) extracted from the ROIs and one set of CNN-based features are considered in the experiment. The CNN features are extracted with the Inception-V3 pre-trained model after simple preprocessing and cropping. The quality of the features is evaluated both individually and after fusing features with one another, and five classifiers (SVM, KNN, MLP, RF, and NB) are used to measure the descriptive power of the features under cross-validation. The proposed approach was first evaluated on the local dataset and then applied to the public dataset. The results of the classifiers are measured using accuracy, sensitivity, specificity, kappa, computation time, and AUC. The experimental analysis of GLCM features from the two datasets indicates that GLCM features from the BGH dataset outperformed those from the MIAS dataset with all five classifiers, whereas Gabor features from the two datasets scored best with two classifiers (SVM and MLP). For BGH and MIAS respectively, SVM achieved an accuracy of 99% and 97.46%, a sensitivity of 99.48% and 96.26%, and a specificity of 98.16% and 100%, while MLP achieved an accuracy of 97% and 87.64%, a sensitivity of 97.40% and 96.65%, and a specificity of 96.26% and 75.73%. The relatively best fusion performance is achieved by combining Gabor and CNN-based features with the MLP classifier. However, the KNN, MLP, RF, and NB classifiers achieved almost 100% performance on GLCM texture features, and SVM scored an accuracy of 96.88%, a sensitivity of 97.14%, and a specificity of 96.36%. Compared to the other classifiers, NB had the lowest computation time in all experiments.
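The abstract above describes a general workflow (texture and CNN feature extraction from ROIs, feature fusion, and cross-validated classification) rather than a concrete implementation, so the following Python sketch is only an illustration under stated assumptions: the MAKM ROI-extraction step and the BGH/MIAS data loading are not reproduced, ROIs are assumed to already be cropped 8-bit grayscale arrays, the GLCM and Gabor parameters are placeholder choices, and `load_roi_images` is a hypothetical helper. It uses scikit-image for the texture features, a pre-trained Inception-V3 from Keras for the CNN features, and scikit-learn classifiers with cross-validation.

```python
# Hypothetical sketch of the texture / CNN feature extraction and
# cross-validated classification outlined in the abstract.
# Assumptions: ROIs are already extracted (the MAKM step is not shown),
# images are 8-bit grayscale NumPy arrays, and load_roi_images() is a
# placeholder for dataset-specific loading code.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.filters import gabor
from skimage.transform import resize
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def glcm_features(roi):
    """Haralick-style statistics from a gray-level co-occurrence matrix."""
    glcm = graycomatrix(roi, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "correlation", "energy", "homogeneity"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])


def gabor_features(roi, frequencies=(0.1, 0.2, 0.3)):
    """Mean/variance of Gabor filter responses at placeholder frequencies."""
    feats = []
    for f in frequencies:
        real, imag = gabor(roi, frequency=f)
        feats += [real.mean(), real.var(), imag.mean(), imag.var()]
    return np.array(feats)


def cnn_features(rois):
    """Global-average-pooled Inception-V3 features for grayscale ROIs."""
    model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
    batch = np.stack([resize(r, (299, 299)) for r in rois])   # Inception input size, values in [0, 1]
    batch = np.repeat(batch[..., None], 3, axis=-1) * 255.0   # grayscale -> 3 channels, back to 0-255
    return model.predict(preprocess_input(batch), verbose=0)


def evaluate(features, labels, folds=10):
    """Cross-validated accuracy for the five classifiers named in the abstract."""
    classifiers = {
        "SVM": SVC(kernel="rbf"),
        "KNN": KNeighborsClassifier(),
        "MLP": MLPClassifier(max_iter=1000),
        "RF": RandomForestClassifier(),
        "NB": GaussianNB(),
    }
    return {name: cross_val_score(clf, features, labels, cv=folds).mean()
            for name, clf in classifiers.items()}


# rois, labels = load_roi_images(...)   # hypothetical dataset-specific loader
# texture = np.stack([np.hstack([glcm_features(r), gabor_features(r)]) for r in rois])
# fused = np.hstack([texture, cnn_features(rois)])   # simple concatenation-based fusion
# print(evaluate(fused, labels))
```

The fusion shown here is plain feature concatenation before classification; the abstract does not specify the fusion mechanism, so this is an assumption rather than the authors' exact method.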