Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering

Skorkovská, Lucie

doi:10.1007/978-3-642-32790-2_23

Cited by 8 publications

(6 citation statements)

References 12 publications


“…For the experiments the smaller collection containing the articles from the news server ČeskéNoviny.cz separated from the whole corpus was used [17]. The collection contains 31 419 articles, divided into 27 000 training and 4 419 testing articles.…”

Section: Discussion (mentioning)

confidence: 99%

“…Lemmatization has been shown to improve the results when dealing with sparse data in the area of information retrieval [15] and spoken term detection [16] in highly inflected languages, on that account the experiments on the effects of lemmatization in the field of topic identification was performed [17]. As a result of these experiments the automatic text lemmatization is also applied in our work.…”

Section: System For Acquisition and Storing Data (mentioning)

confidence: 99%


Dynamic Threshold Selection Method for Multi-label Newspaper Topic Identification

Skorkovská

2013

Text, Speech, and Dialogue

Self Cite


Abstract. Nowadays, the multi-label classification is increasingly required in modern categorization systems. It is especially essential in the task of newspaper article topics identification. This paper presents a method based on general topic model normalisation for finding a threshold defining the boundary between the "correct" and the "incorrect" topics of a newspaper article. The proposed method is used to improve the topic identification algorithm which is a part of a complex system for acquisition and storing large volumes of text data. The topic identification module uses the Naive Bayes classifier for the multiclass and multi-label classification problem and assigns to each article the topics from a defined quite extensive topic hierarchy: it contains about 450 topics and topic categories. The results of the experiments with the improved topic identification algorithm are presented in this paper.
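The idea of a threshold that separates "correct" from "incorrect" topics can be illustrated with a small sketch. This is not the method from the paper: it assumes toy unigram topic models, a hypothetical background ("general") model, and a fixed threshold of zero on the difference of log scores. All names (`log_score`, `select_topics`, the probabilities) are invented for illustration.

```python
import math

def log_score(tokens, word_probs, floor=1e-6):
    """Naive Bayes-style unigram log-likelihood of the tokens (no class prior)."""
    return sum(math.log(word_probs.get(t, floor)) for t in tokens)

def select_topics(tokens, topic_models, general_model, threshold=0.0):
    """Keep every topic whose score, normalised by the general model, exceeds the threshold."""
    general = log_score(tokens, general_model)
    selected = []
    for topic, model in topic_models.items():
        normalised = log_score(tokens, model) - general  # log-likelihood ratio
        if normalised > threshold:
            selected.append(topic)
    return sorted(selected)

# Hypothetical word probabilities, chosen only to make the example readable.
topic_models = {
    "sport":   {"goal": 0.30, "match": 0.30, "budget": 0.01},
    "economy": {"goal": 0.05, "match": 0.01, "budget": 0.40},
    "culture": {"goal": 0.01, "match": 0.01, "budget": 0.01},
}
general_model = {"goal": 0.05, "match": 0.05, "budget": 0.05}

article = ["goal", "match", "budget"]
topics = select_topics(article, topic_models, general_model)
print(topics)
```

In this toy setting an article can receive several topics at once, which is the multi-label behaviour the abstract describes; how the actual threshold is derived is left to the paper.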


“…Although other methods could be used (i.e. Cosine Similarity, see Rezvani and Hashemi, 2012), when we employ pre-processing techniques such as lemmatisation, the text data is normalised, and the Jaccard Index efficiently captures similarities of concepts (Skorkovská, 2012).…”

Section: Methods (mentioning)

confidence: 99%

“…The text was extracted from the documents and statements and pre-processed to standardise the words to enhance the effectiveness of the analysis. The pre-processing techniques were employed using natural language processing procedures: remove stopwords (connectives); tokenisation, which separates each word from the text; and, most important, lemmatisation, which transforms each inflected form of the words to its lemma, so words like “changing” or “changes” are transformed into “change,” and therefore the Jaccard index can capture more clearly the concepts expressed in the words (Skorkovská, 2012).…”

Section: Methods (mentioning)

confidence: 99%
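The effect described in the statement above (lemmatisation making the Jaccard index capture shared concepts) can be shown with a minimal sketch. It assumes a tiny hand-written stopword list and lemma table rather than a real lemmatiser; `STOPWORDS`, `LEMMAS`, `preprocess` and `jaccard` are hypothetical names, not taken from the cited papers.

```python
import re

# Toy resources, assumed for illustration only.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "are"}
LEMMAS = {"changing": "change", "changes": "change", "changed": "change"}

def preprocess(text, lemmatise=True):
    """Tokenise, drop stopwords and optionally map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    if lemmatise:
        kept = [LEMMAS.get(t, t) for t in kept]
    return set(kept)

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc_a = "The policy changes are changing the climate targets"
doc_b = "Climate targets and policy change"

raw = jaccard(preprocess(doc_a, lemmatise=False), preprocess(doc_b, lemmatise=False))
lemmatised = jaccard(preprocess(doc_a), preprocess(doc_b))
print(raw, lemmatised)
```

Without lemmatisation, "changes", "changing" and "change" count as three different words and lower the overlap; after mapping them to one lemma the two toy texts share all their content words.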

Do Non-State Actors Influence Climate Change Policy? Evidence from the Brazilian Nationally Determined Contributions for COP21

Alves

Albuquerque

Ferreira

et al. 2021

Journal of Politics in Latin America


Participation in democratic regimes has been a central issue in foreign policy (FP) studies. This article seeks to contribute to the empirical discussion about FP participation through the analysis of the public consultation process conducted by the Brazilian Ministry of Foreign Affairs with non-state actors in the context of the preparations for the Paris Climate Agreement (2015). We employed automated text analysis using Python and R qualifying open responses submitted to the questionnaire launched at the first round of the consultations process and comparing them to the official document presented by Brazil establishing its own carbon emission targets. We found that the Brazilian academia members had a relevant influence on the content of the final document presented by Brazil, strengthening the literature on the importance of the epistemic community to environmental politics and raising new questions on the paths of foreign policy influence.


“…For the experiments a smaller collection containing 31 419 articles from the news server ČeskéNoviny.cz separated from the whole corpus was used [13]. The collection contains articles published in the year 2011 (January to October) and is divided into 27 000 training and 4 419 testing articles.…”

Section: Test Data (mentioning)

confidence: 99%

Score Normalization Methods Applied to Topic Identification

Skorkovská

Zajíc

2014

Text, Speech and Dialogue

Self Cite

