Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Park, Chanjun; Shim, Midan; Eo, Sugyeong; Lee, Seolhwa; Seo, Jaehyung; Moon, Hyeonseok; Lim, Heuiseok

doi:10.48550/arxiv.2110.15023

Cited by 4 publications

(5 citation statements)

References 29 publications

“…As the experiments used Korean text, additional Korean text images were deployed to fine-tune the STR model for Korean. Additional Korean text images were acquired from AI Hub 30 , which is a South Korean national platform disclosing high-quality and high-capacity data for artificial intelligence research and development 31 . The Korean text image dataset contained printed text, handwritten text, and text in real-world scenarios.…”

Section: Experimental Methods (mentioning)

confidence: 99%

“…Among the various images provided by AI Hub, book cover images, a subset of real-world text in the Korean text image dataset, were employed for training. Text in the dataset encompassing real-world scenarios is classified into four categories: book covers, goods, signboards, and traffic-sign images 31 . Among them, book cover images were selected to train the STR model, as other categories often have unique fonts or vivid colors, unlike defect tags.…”

Section: Experimental Methods (mentioning)

confidence: 99%

Automated hand-marked semantic text recognition from photographs

Suh, Lee, Gil et al. 2023, Sci Rep

Automated text recognition techniques have made significant advancements; however, certain tasks still present challenges. This study is motivated by the need to automatically recognize hand-marked text on construction defect tags among millions of photographs. To address this challenge, we investigated three methods for automating hand-marked semantic text recognition (HMSTR)—a modified scene text recognition-based (STR) approach, a two-step HMSTR approach, and a lumped approach. The STR approach involves locating marked text using an object detection model and recognizing it using a competition-winning STR model. Similarly, the two-step HMSTR approach first localizes the marked text and then recognizes the semantic text using an image classification model. By contrast, the lumped approach performs both localization and identification of marked semantic text in a single step using object detection. Among these approaches, the two-step HMSTR approach achieved the highest F1 score (0.92) for recognizing circled text, followed by the STR approach (0.87) and the lumped approach (0.78). To validate the generalizability of the two-step HMSTR approach, subsequent experiments were conducted using check-marked text, resulting in an F1 score of 0.88. Although the proposed methods have been tested specifically with tags, they can be extended to recognize marked text in reports or books.

“…For the general model (Not domain-specialized model), we utilized the Korean-English parallel corpora from the following data sources: subtitles corpus from OpenSubtitles, 1 the AI Hub Korean-English parallel corpus [34], 2 and the IWSLT 2017 Korean-English parallel corpus [35]. From these data sources, we constructed 2.7M training corpora.…”

Section: Brute CCM (mentioning)

confidence: 99%

Mimicking Infants’ Bilingual Language Acquisition for Domain Specialized Neural Machine Translation

Park, Go et al. 2022, IEEE Access

Existing methods of training domain-specialized neural machine translation (DS-NMT) models are based on the pretrain-finetuning approach (PFA). In this study, we reinterpret existing methods based on the perspective of cognitive science related to cross language speech perception. We propose the cross communication method (CCM), a new DS-NMT training approach. Inspired by the learning method of infants, we perform DS-NMT training by configuring and training DC and GC concurrently in batches. Quantitative and qualitative analysis of our experimental results show that CCM can achieve superior performance compared to the conventional methods. Additionally, we conducted an experiment considering the DS-NMT service to meet industrial demands. Index Terms: Domain-specialized neural machine translation, cross communication method, deep learning, neural machine translation.

“…The overall statistics including the number of datasets, and the minimum, maximum, and average length of a sentence are presented in Table 4. Unlabeled data for augmenting the TOEIC data was obtained from AI Hub [27,28], where quality is guaranteed by human evaluation. The English side texts were leveraged from 1,602,708 Korean-English parallel corpus.…”

Section: Experiments, 4.1 Dataset Details (mentioning)

confidence: 99%

BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

et al. 2022

Self Cite

Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is one of the representatives, and several related datasets have been opened. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people’s English proficiency, and research for further advancement is not being actively conducted. We consider that the reason why deep learning research for TOEIC is difficult is due to the data scarcity problem, so we therefore propose two data augmentation methods to improve the model in a low resource environment. Considering the attributes of the semantic and grammar problem type in TOEIC, the proposed methods can augment the data similar to the real TOEIC problem by using POS-tagging and Lemmatizing. In addition, we confirmed the importance of understanding semantics and grammar in TOEIC through experiments on each proposed methodology and experiments according to the amount of data. The proposed methods address the data shortage problem of TOEIC and enable an acceptable human-level performance.
