Background and Objectives: Current approaches to surgical skills assessment employ virtual reality simulators, motion sensors, and task-specific checklists. Although accurate, these methods generate performance measures that can be difficult to interpret. The aim of this study is to propose an alternative methodology for skills assessment and classification, based on video annotation of laparoscopic tasks. Methods: Two groups of 32 trainees (students and residents) performed two laparoscopic tasks: peg transfer (PT) and knot tying (KT). Each task was annotated in video analysis software using a vocabulary of eight surgical gestures (surgemes) that denote the elementary gestures required to perform a task. The extracted metrics included the duration and count of each surgeme, penalty events, and counts of sequential surgemes (transitions). Our analysis focused on comparing the trainees' skill levels and classifying them using a nearest neighbor approach. The classification was assessed via accuracy, sensitivity, and specificity. Results: For PT, almost all metrics showed a significant performance difference between the two groups (p < 0.001). Residents were able to complete the task with fewer, shorter surgemes and fewer penalty events. Moreover, residents performed significantly fewer transitions (p < 0.05). For KT, residents performed two surgemes in significantly shorter time (p < 0.05). The metrics derived from the video annotations were also able to recognize the trainees' skill level with 0.71–0.86 accuracy, 0.80–1.00 sensitivity, and 0.60–0.80 specificity. Conclusion: The proposed technique provides a tool for skills assessment and experience classification of surgical trainees, as well as an intuitive way of describing what surgemes are performed and how.
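The metrics and evaluation named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the annotation format (a list of label and duration pairs), the surgeme names, the Euclidean distance, the leave-one-out protocol, and the choice of "resident" as the positive class are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def surgeme_metrics(annotation):
    """Summarise one annotated trial.

    `annotation` is a list of (surgeme_label, duration_seconds) tuples in
    the order the gestures occurred (hypothetical format).
    """
    labels = [label for label, _ in annotation]
    counts = Counter(labels)
    durations = Counter()
    for label, seconds in annotation:
        durations[label] += seconds
    # A transition is a pair of consecutive surgemes.
    transitions = Counter(zip(labels, labels[1:]))
    return counts, durations, transitions


def loo_nearest_neighbor(features, labels):
    """Leave-one-out 1-nearest-neighbor predictions (Euclidean distance)."""
    predictions = []
    for i, x in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        nearest = min(others, key=lambda j: dist(x, features[j]))
        predictions.append(labels[nearest])
    return predictions


def binary_scores(y_true, y_pred, positive):
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy data, invented purely to exercise the functions.
trial = [("reach", 1.0), ("grasp", 2.0), ("reach", 1.5),
         ("grasp", 0.5), ("transfer", 3.0)]
counts, durations, transitions = surgeme_metrics(trial)

features = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [1.1, 5.0]]
levels = ["resident", "resident", "student", "student", "resident"]
predicted = loo_nearest_neighbor(features, levels)
scores = binary_scores(levels, predicted, positive="resident")
```

In practice each trainee's feature vector would hold the surgeme durations, counts, penalty events, and transition counts, and features on different scales would usually be normalized before computing distances.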
Background: We present an artificial intelligence framework for vascularity classification of the gallbladder (GB) wall from intraoperative images of laparoscopic cholecystectomy (LC). Methods: A two-stage Multiple Instance Convolutional Neural Network is proposed. First, a convolutional autoencoder is trained to extract feature representations from 4585 patches of GB images. The second model includes a multi-instance encoder that samples random patches from a GB region and outputs an equal number of embeddings. These embeddings feed a multi-input classification module, which employs pooling and self-attention mechanisms to perform the prediction. Results: The evaluation was performed on 234 GB images of low and high vascularity from 68 LC videos. The framework was compared thoroughly with various state-of-the-art multi-instance and single-instance learning algorithms on two experimental tasks: image-level and video-level classification. The proposed framework shows the best performance, with accuracy of 92.6%–93.2% and F1 of 93.5%–93.9%, close to the agreement of two expert evaluators (94%). Conclusions: The proposed technique provides a novel approach to classifying LC operations with respect to the vascular pattern of the GB wall.
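To make the idea of pooling a bag of patch embeddings concrete, the following Python sketch shows attention-based pooling, a common choice in multiple-instance learning. It is an illustration only and not necessarily the module used in the paper: the scoring form (a `tanh` projection followed by a dot product), the parameters `V` and `w`, and the toy embeddings are assumptions, and in a real model `V` and `w` would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, V, w):
    """Pool a bag of embeddings into one vector with attention weights.

    score_k  = w . tanh(V h_k)          (hypothetical scoring form)
    weight_k = softmax over all scores
    pooled   = sum_k weight_k * h_k
    """
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, embeddings))
              for d in range(dim)]
    return pooled, weights


# Three toy patch embeddings from one image region (invented values).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for a learned projection
w = [1.0, 1.0]                 # stand-in for a learned scoring vector

pooled, weights = attention_pool(bag, V, w)
uniform_pooled, uniform_weights = attention_pool(bag, V, [0.0, 0.0])
```

The pooled vector would then be passed to a classifier that predicts low or high vascularity for the whole bag. With a zero scoring vector the weights become uniform and the operation reduces to mean pooling.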
In this study, we propose a deep learning framework and a self-supervision scheme for video-based surgical gesture recognition. The proposed framework is modular. First, a 3D convolutional network extracts feature vectors from video clips, encoding spatial and short-term temporal features. Second, the feature vectors are fed into a transformer network that captures long-term temporal dependencies. Two main models are proposed, based on the backbone framework: C3DTrans (supervised) and SSC3DTrans (self-supervised). The dataset consisted of 80 videos from two basic laparoscopic tasks: peg transfer (PT) and knot tying (KT). To examine the potential of self-supervision, the models were trained on 60% and 100% of the annotated dataset. In addition, the best-performing model was evaluated on the JIGSAWS robotic surgery dataset. The best model (C3DTrans) achieves accuracies of 88.0% and 95.2% (clip level) and 97.5% and 97.9% (gesture level) for PT and KT, respectively. SSC3DTrans performed similarly to C3DTrans when trained on 60% of the annotated dataset (about 84% and 93% clip-level accuracy for PT and KT, respectively). The performance of C3DTrans on JIGSAWS was close to 76% accuracy, which was similar to or higher than that of prior techniques based on a single video stream, no additional video training, and online processing.
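The abstract reports accuracy at both clip level and gesture level but does not state how the gesture-level figure is derived. As a hedged illustration of why the two can differ, the Python sketch below assumes that consecutive clips sharing the same ground-truth label form one gesture segment and that the segment's prediction is the majority vote of its clip predictions. The labels `G1` and `G2` and the voting rule are hypothetical.

```python
from collections import Counter
from itertools import groupby


def clip_accuracy(true_clips, pred_clips):
    """Fraction of clips whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_clips, pred_clips))
    return correct / len(true_clips)


def gesture_predictions(true_clips, pred_clips):
    """Majority-vote prediction per gesture segment (assumed aggregation).

    A segment is a run of consecutive clips with the same true label.
    Returns a list of (true_label, voted_label) pairs, one per segment.
    """
    results = []
    start = 0
    for label, run in groupby(true_clips):
        length = len(list(run))
        votes = Counter(pred_clips[start:start + length])
        results.append((label, votes.most_common(1)[0][0]))
        start += length
    return results


def gesture_accuracy(true_clips, pred_clips):
    """Fraction of gesture segments whose voted label is correct."""
    segments = gesture_predictions(true_clips, pred_clips)
    return sum(t == p for t, p in segments) / len(segments)


# Invented clip labels for one video: three gesture segments.
true_clips = ["G1"] * 4 + ["G2"] * 3 + ["G1"] * 3
pred_clips = ["G1", "G1", "G2", "G1", "G2", "G2", "G2", "G2", "G1", "G1"]

clip_acc = clip_accuracy(true_clips, pred_clips)
gesture_acc = gesture_accuracy(true_clips, pred_clips)
```

Under this assumption, isolated clip errors inside a segment are outvoted, which is one plausible reason a gesture-level score can exceed the clip-level score.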