2019
DOI: 10.1093/llc/fqy074

Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus

Cited by 16 publications
(16 citation statements)
References 3 publications
“…We got data from three sources to train Language Model (LM) and appreciate their word vectors. We have received more than 188 million words from the AsoSoft corpus [30] collected from various sources such as websites, textbooks, and magazines. Muhammad Azizi and AramRafeq have had a lot of difficulties collecting data on Kurdish websites1 which is about 60 million tokens.…”
Section: Central Kurdish Text Corpus For Language Model; citation type: mentioning
confidence: 99%
“…Collecting existing dictionaries Pronunciation dictionaries were collected from various sources, including online dictionaries for individual language varieties [17,24,25,26,27], and dictionary collections such as ipa-dict [28] and Wikipron [23]. Dictionaries from the same variety was merged.…”
Section: Multilingual Pronunciation Dictionaries; citation type: mentioning
confidence: 99%
“…The normalization is much more important when it comes to the Kurdish Language since Kurdish writers and publishers utilize a variety of encoding schemes and orthographic standards [47]. To develop a text-to-speech system, we do text normalization, and the details of normalization completed on the text corpus are presented in [48,49] in our preprocessing stage. Some Kurdish writing includes a variety of numerical forms, such as date, time, and amounts.…”
Section: Pre-processing; citation type: mentioning
confidence: 99%