2012
DOI: 10.1007/978-3-642-32790-2_23

Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering

Abstract: The paper presents experiments with the topic identification module which is a part of a complex system for acquisition and storing large volumes of text data. The topic identification module processes each acquired data item and assigns it topics from a defined topic hierarchy. The topic hierarchy is quite extensive - it contains about 450 topics and topic categories. It can easily happen that for some narrowly focused topic there is not enough data for the topic identification training. Lemmatizatio…

Cited by 8 publications (6 citation statements)
References 12 publications
“…For the experiments the smaller collection containing the articles from the news server ČeskéNoviny.cz separated from the whole corpus was used [17]. The collection contains 31 419 articles, divided into 27 000 training and 4 419 testing articles.…”
Section: Discussion (mentioning)
confidence: 99%
“…Lemmatization has been shown to improve the results when dealing with sparse data in the area of information retrieval [15] and spoken term detection [16] in highly inflected languages, on that account the experiments on the effects of lemmatization in the field of topic identification was performed [17]. As a result of these experiments the automatic text lemmatization is also applied in our work.…”
Section: System For Acquisition and Storing Data (mentioning)
confidence: 99%
“…Although other methods could be used (i.e. Cosine Similarity, see Rezvani and Hashemi, 2012), when we employ pre-processing techniques such as lemmatisation, the text data is normalised, and the Jaccard Index efficiently captures similarities of concepts (Skorkovská, 2012).…”
Section: Methods (mentioning)
confidence: 99%
“…The text was extracted from the documents and statements and pre-processed to standardise the words to enhance the effectiveness of the analysis. The pre-processing techniques were employed using natural language processing procedures: remove stopwords (connectives); tokenisation, which separates each word from the text; and, most important, lemmatisation, which transforms each inflected form of the words to its lemma, so words like “changing” or “changes” are transformed into “change,” and therefore the Jaccard index can capture more clearly the concepts expressed in the words (Skorkovská, 2012).…”
Section: Methods (mentioning)
confidence: 99%
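
The pre-processing steps described in the statement above (stopword removal, tokenisation, lemmatisation, then the Jaccard index) can be illustrated with a minimal sketch. It is not the citing authors' implementation: the `STOPWORDS` set, the `LEMMAS` lookup table, and the two sample sentences are hypothetical stand-ins, and a real pipeline would presumably use a full stopword list and a proper lemmatizer.

```python
import re

# Hypothetical toy resources, for illustration only; a real pipeline would use
# a full stopword list and a proper lemmatizer instead of a lookup table.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}
LEMMAS = {"changing": "change", "changes": "change", "policies": "policy"}


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenise, drop stopwords, and map each remaining token to its lemma."""
    return {LEMMAS.get(t, t) for t in tokenise(text) if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc1 = "The policy is changing"
doc2 = "Changes in the policies"

# Similarity on raw tokens versus similarity after normalisation.
raw_score = jaccard(set(tokenise(doc1)), set(tokenise(doc2)))
lemma_score = jaccard(preprocess(doc1), preprocess(doc2))
print(raw_score, lemma_score)
```

On these two toy sentences the raw token sets share only the stopword "the", whereas after normalisation both reduce to the same set of lemmas, which is the effect the statement attributes to lemmatisation.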
“…For the experiments a smaller collection containing 31 419 articles from the news server ČeskéNoviny.cz separated from the whole corpus was used [13]. The collection contains articles published in the year 2011 (January to October) and is divided into 27 000 training and 4 419 testing articles.…”
Section: Test Data (mentioning)
confidence: 99%