Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild

Wu, Weijia; Ning, Lu; Xie, Enze; Wang, Yuxing; Yu, Wenwen; Yang, Cheng; Zhou, Hong

doi:10.1007/978-3-030-69535-4_18

Cited by 15 publications

(8 citation statements)

References 32 publications

“…Recent methods [35,36,32,43,42,40,41] based on deep learning have been made tremendous progress for image-level text detection. CTPN [32] adopted Faster RCNN [25] and modified RPN to detect horizontal text.…”

Section: Text Detection and Tracking (mentioning)

confidence: 99%

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Wu, Li, Li et al. 2022

Preprint

Self Cite

Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in video. Existing video text spotting methods typically rely on sophisticated pipelines and multiple models, which is not friendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (text detection, tracking, and recognition) in a real-time end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, with 10.5% and 32.0 FPS improvement over the previous best method. The code can be found at github.com/weijiawu/CoText.

“…Domain adaptation aims to reduce the domain gap between training and testing data. There are also some methods [19,20,21,22] to solve the domain adaptation problem in scene text detection. GA-DAN [21] converts a source-domain image into multiple images of different spatial views as in target domain.…”

Section: Related Work (mentioning)

confidence: 99%

“…GA-DAN [21] converts a source-domain image into multiple images of different spatial views as in target domain. Wu et al [22] aims at the serious domain difference between synthetic data and real-world data, and proposes a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data to real-world data. In this work, we focus on how to use unlabeled real-world data to improve the pre-trained model to obtain better initialization and final performance during finetuning.…”

Section: Related Work (mentioning)

confidence: 99%

UNITS: Unsupervised Intermediate Training Stage for Scene Text Detection

Guo, Zhou, Qin et al. 2022

Preprint

Self Cite

Recent scene text detection methods are almost all based on deep learning and are data-driven. Synthetic data is commonly adopted for pre-training because annotation is expensive. However, there are obvious domain discrepancies between synthetic data and real-world data. Directly adopting a model initialized on synthetic data in the fine-tuning stage may lead to suboptimal performance. In this paper, we propose a new training paradigm for scene text detection, which introduces an UNsupervised Intermediate Training Stage (UNITS) that builds a buffer path to real-world data and can alleviate the gap between the pre-training stage and the fine-tuning stage. Three training strategies are further explored to perceive information from real-world data in an unsupervised way. With UNITS, scene text detectors are improved without introducing any parameters or computation during inference. Extensive experimental results show consistent performance improvements on three public datasets.

“…Instead of adapting data, it is possible to learn features that are resistant to the differences between domains [13,57]. Wu et al [71] mix real and synthetic data through a domain classifier to learn domain-invariant features for text detection, and Saleh et al [56] exploit the observation that shape is less affected by the domain gap than appearance for scene semantic segmentation.…”

Section: Training With Synthetic Data (mentioning)

confidence: 99%

Fake It Till You Make It: Face analysis in the wild using synthetic data alone

Wood, Baltrušaitis, Hewitt et al. 2021

Preprint

We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. Researchers have tried to bridge this gap with data mixing, domain adaptation, and domain-adversarial training, but we show that it is possible to synthesize data with minimal domain gap, so that models trained on synthetic data generalize to real in-the-wild datasets. We describe how to combine a procedurally-generated parametric 3D face model with a comprehensive library of hand-crafted assets to render training images with unprecedented realism and diversity. We train machine learning systems for face-related tasks such as landmark localization and face parsing, showing that synthetic data can both match real data in accuracy as well as open up new approaches where manual labeling would be impossible. https://microsoft.github.io/FaceSynthetics
