LAEO-Net: Revisiting People Looking at Each Other in Videos

Marín-Jiménez, Manuel J.; Kalogeiton, Vicky; Medina-Suarez, Pablo; Zisserman, Andrew

doi:10.1109/cvpr.2019.00359

Cited by 52 publications

(35 citation statements)

References 25 publications


“…However, these results highlight Fig. 7: Comparison between detections found with the model in the work by Marín-Jiménez et al [28] (left) and the Viola-Jones detector used by Patacchiola et al (right) Notice the difference in precision, making the CNN-based head detector much more suitable for this task than the Viola-Jones detector (best viewed on digital format).…”

Section: Comparison To Prior Work (mentioning)

confidence: 87%

“…For our testing procedure, we have used one of the models provided by the authors in the Hopenet repository [40] (300W-LP, alpha 1, robust to image quality); this model has been chosen as it should be the most suitable to be applied over real-world pictures as the ones appearing in the AFLW dataset. The input images correspond to the portion of the AFLW dataset used to test our model; they have been obtained using the head detector in the work by Marín-Jiménez et al [28], but they have been resized to 224 × 224; also, as the ResNet50 model uses color pictures as input, the pictures have not been converted to grayscale.…”

Section: Comparison To Prior Work (mentioning)

confidence: 99%

“…Human head pose estimation is useful in many situations: for instance in vehicles (detecting if the driver of a vehicle is paying attention to the road [31]), human-computer interaction (detecting where the user's attention is being drawn [44]), social interaction understanding (detecting if people is looking at each other [28]), video surveillance systems [18,36] or to aid various aerial cinematography tasks [33].…”

Section: Introduction (mentioning)

confidence: 99%


RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild

Berral-Soler, Madrid-Cuevas, Muñoz-Salinas, et al., 2020

Neural Comput & Applic


Human head pose estimation in images has applications in many fields such as human-computer interaction or video surveillance tasks. In this work, we address this problem, defined here as the estimation of both vertical (tilt/pitch) and horizontal (pan/yaw) angles, through the use of a single Convolutional Neural Network (ConvNet) model, trying to balance precision and inference speed in order to maximize its usability in real-world applications. Our model is trained over the combination of two datasets: 'Pointing'04' (aiming at covering a wide range of poses) and 'Annotated Facial Landmarks in the Wild' (in order to improve robustness of our model for its use on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing purposes. As a result of this work, we have obtained a trained ConvNet model, coined RealHePoNet, that given a low-resolution grayscale input image, and without the need of using facial landmarks, is able to estimate with low error both tilt and pan angles (4.4° average error on the test partition). Also, given its low inference time (6 ms per head), we consider our model usable even when paired with medium-spec hardware (i.e. GTX 1060 GPU).


“…A convolutional neural network (CNN) single-shot detector (SSD) [80] is used for head detection [81] in the images (Figure 2). The model adopted in this research was developed by the authors of LAEO-Net [82]. The model's suitability for the task was evaluated by manually revising the head detection results on the collected dataset of 19 videos.…”

Section: Head Detection (mentioning)

confidence: 99%

“…The numbers on the boxes specify the shape of the feature layers. The model adopted in this research is developed by the authors of LAEO-Net [82].…”

Section: Head Detection (mentioning)

confidence: 99%

Three-Dimensional Human Head Reconstruction Using Smartphone-Based Close-Range Video Photogrammetry

Matuzevičius, Serackis, 2021

Applied Sciences


Creation of head 3D models from videos or pictures of the head by using close-range photogrammetry techniques has many applications in clinical, commercial, industrial, artistic, and entertainment areas. This work aims to create a methodology for improving 3D head reconstruction, with a focus on using selfie videos as the data source. Then, using this methodology, we seek to propose changes for the general-purpose 3D reconstruction algorithm to improve the head reconstruction process. We define the improvement of the 3D head reconstruction as an increase of reconstruction quality (which is lowering reconstruction errors of the head and amount of semantic noise) and reduction of computational load. We proposed algorithm improvements that increase reconstruction quality by removing image backgrounds and by selecting diverse and high-quality frames. Algorithm modifications were evaluated on videos of the mannequin head. Evaluation results show that baseline reconstruction is improved 12 times due to the reduction of semantic noise and reconstruction errors of the head. The reduction of computational demand was achieved by reducing the frame number needed to process, reducing the number of image matches required to perform, reducing an average number of feature points in images, and still being able to provide the highest precision of the head reconstruction.


Transformers and Visual Transformers

et al. 2023


Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual transformers), termed attention. The cost grows quadratically with the number of tokens. For image classification, the most common transformer architecture uses only the transformer encoder in order to transform the various input tokens. However, there are also numerous other applications in which the decoder part of the traditional transformer architecture is also used. Here, we first introduce the attention mechanism (Subheading 1) and then the basic transformer block including the vision transformer (Subheading 2). Next, we discuss some improvements of visual transformers to account for small datasets or less computation (Subheading 3). Finally, we introduce visual transformers applied to tasks other than image classification, such as detection, segmentation, generation, and training without labels (Subheading 4) and other domains, such as video or multimodality using text or audio data (Subheading 5).
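The pairwise nature of attention described in this abstract can be illustrated with a minimal sketch of scaled dot-product self-attention. This is not code from the cited chapter: the function names and the toy input are hypothetical, and the sketch assumes identity query/key/value projections, whereas real transformers learn those projections. It shows that one score is computed per pair of tokens, giving an n × n weight matrix for n tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of n vectors of size d.

    Assumption for illustration: queries, keys and values are the tokens
    themselves (identity projections). Returns (outputs, weights).
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    # One score per (query, key) pair: an n x n matrix.
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in tokens]
        for q in tokens
    ]
    # Each row is normalised so the weights for one query sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output is the weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy input: three tokens with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

With three tokens the weight matrix has 3 × 3 = 9 entries; doubling the number of tokens quadruples that count, which is why long token sequences are costly for this form of attention.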
