2022
DOI: 10.48550/arxiv.2201.01956
Preprint

HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

Abstract: Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This pap…

Cited by 3 publications (3 citation statements)
References 9 publications
“…This collection of keywords was revised and compiled into a keyword-to-topic mapping dictionary. Here, we used textacy [14] and the latest HuSpaCy (Orosz et al., 2022) to collect key terms for the text set. The main steps of the “SZTAKI” annotation approach are as follows: Remove special audio transcript notations (e.g.…”
Section: Methods (mentioning)
confidence: 99%
“…Perplexity remains a common choice in practical applications. However, some research suggests that coherence is the most effective method for measuring topic quality [47], with increased usage of this metric in recent studies. Despite the guidance provided by the above model evaluation methods, issues such as mixed topics, illogical topics, and indistinguishable topics can still arise.…”
Section: B Optimal Topics Number Selection (mentioning)
confidence: 99%
“…For this purpose, we used the transformer-based pipeline developed for the HuSpaCy [93] natural language processing toolkit for Hungarian. Then, the emotion labels that the sentence has been annotated with at the clause level are determined for each sentence.…”
Section: A Data Selection and Corpus Statistics (mentioning)
confidence: 99%