2021
DOI: 10.48550/arxiv.2110.15023
Preprint

Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Abstract: Machine translation (MT) systems aim to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types o…

Cited by 4 publications (5 citation statements)
References 29 publications
“…As the experiments used Korean text, additional Korean text images were deployed to fine-tune the STR model for Korean. Additional Korean text images were acquired from AI Hub 30, which is a South Korean national platform disclosing high-quality and high-capacity data for artificial intelligence research and development 31. The Korean text image dataset contained printed text, handwritten text, and text in real-world scenarios.…”
Section: Experimental Methods (mentioning)
confidence: 99%
“…Among the various images provided by AI Hub, book cover images, a subset of real-world text in the Korean text image dataset, were employed for training. Text in the dataset encompassing real-world scenarios is classified into four categories: book covers, goods, signboards, and traffic-sign images 31. Among them, book cover images were selected to train the STR model, as other categories often have unique fonts or vivid colors, unlike defect tags.…”
Section: Experimental Methods (mentioning)
confidence: 99%
“…For the general model (Not domain-specialized model), we utilized the Korean-English parallel corpora from the following data sources: subtitles corpus from OpenSubtitles, the AI Hub Korean-English parallel corpus [34], and the IWSLT 2017 Korean-English parallel corpus [35]. From these data sources, we constructed 2.7M training corpora.…”
Section: Brute Cc (mentioning)
confidence: 99%
“…The overall statistics including the number of datasets, and the minimum, maximum, and average length of a sentence are presented in Table 4. Unlabeled data for augmenting the TOEIC data was obtained from AI Hub [27,28], where quality is guaranteed by human evaluation. The English side texts were leveraged from 1,602,708 Korean-English parallel corpus.…”
Section: Experiments, 4.1 Dataset Details (mentioning)
confidence: 99%