Autoencoder-based background reconstruction and foreground segmentation with background noise estimation

Sauvalle, Bruno; Fortelle, Arnaud de La

doi:10.48550/arxiv.2112.08001

Cited by 1 publication (2 citation statements)

References 36 publications (45 reference statements)


“…It appears however that the interaction between these two models during training is a very challenging issue, because of the competition between them to reconstruct the image. We handle this problem by training the background model before the foreground model: We observe that the AE-NE model [40], which is dedicated to dynamic background reconstruction and segmentation, is trained without any foreground reconstruction module, and is able to perform an accurate background reconstruction and segmentation not only on videos, but also on frame sequences which are not organized as videos. We then use this model as a pre-trained separate module: This background model is first trained independently from the other parts of the model, and the weights of this background model are then frozen during the training of the foreground model which is described below.…”

Section: A Separate Pre-trained Model For Background Reconstruction (mentioning)

confidence: 99%

“…3 on ObjectsRoom, 6 on ShapeStacks and 10 on CLEVRTEX). On CLEVR, which shows a fixed background, we reduce the number of background training iterations from 500 000 to 2500, as recommended in the AE-NE paper [40], and decrease the fixed background accuracy threshold τ since the background reconstruction is far more accurate when the background is fixed. We use isotropic scaling since all objects have similar widths and heights in these datasets.…”

Section: Quantitative Evaluation On Synthetic Benchmarks (mentioning)

confidence: 99%


Unsupervised Multi-object Segmentation Using Attention and Soft-argmax

Sauvalle, Fortelle

2022

Preprint

Self Cite


We introduce a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation, which uses an attention mechanism to associate a feature vector to each object present in the scene and to predict the coordinates of these objects using soft-argmax. A transformer encoder handles occlusions and redundant detections, and a separate pre-trained background model is in charge of background reconstruction. We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks and provide examples of applications to real-world traffic videos. Preprint. Under review.

