2022
DOI: 10.14569/ijacsa.2022.01306109

On the Role of Text Preprocessing in BERT Embedding-based DNNs for Classifying Informal Texts

Abstract: Because the data are highly unstructured and noisy, analyzing society reports in written text is very challenging. Classifying informal text data is still considered a difficult task in natural language processing, since the texts may contain abbreviated words, repeated characters, typos, slang, et cetera. Therefore, text preprocessing is commonly performed to remove the noise and make the texts more structured. However, we argue that most preprocessing tasks are no longer required if suitable word embeddings a…
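The abstract's claim can be illustrated with a short sketch: a pretrained BERT tokenizer breaks noisy, informal tokens into known subwords, so a downstream DNN can consume them without aggressive cleaning. This is a minimal sketch assuming the Hugging Face transformers library; the bert-base-uncased checkpoint is an illustrative choice, not one named in the abstract:

```python
# Minimal sketch: feeding raw informal text to BERT without heavy preprocessing.
# Assumes the Hugging Face `transformers` library; the checkpoint is an
# illustrative choice, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Informal text with repeated characters, slang, and abbreviations.
text = "omg this is sooooo gr8, thx!!!"

# WordPiece splits out-of-vocabulary words into known subwords instead of
# failing, which is why many cleaning steps become optional.
print(tokenizer.tokenize(text))
# e.g. ['om', '##g', 'this', 'is', 'soo', '##oo', '##o', 'gr', '##8', ...]

# Contextual embeddings for the raw sentence; the [CLS] vector (or a pooled
# variant) would feed the classification DNN.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```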

Cited by 8 publications (4 citation statements)
References: 23 publications

Citation statements:
“…It uses a cluster-based approach, and this algorithm overcomes the problem of determining the number of clusters. The results demonstrated that this approach proved effective in handling morphology [47].…”
Section: Different Preprocessing Techniques Provide Different Classif... (mentioning)
confidence: 89%
“…In light of this, it was necessary to build different pre-processing pipelines for our experiments, depending on the class of models considered. Indeed, transformer-based models require few pre-processing steps (e.g., data cleaning), since they are pre-trained over large text corpora and thus already provide an initial word embedding for most words [24]. On the other hand, baseline ML models such as XGBoost and LSTM benefit from typical NLP pre-processing pipelines, such as stemming and stopword removal, which reduce the number of features.…”
Section: Data Pre-processing (mentioning)
confidence: 99%
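The division this statement describes can be made concrete by contrasting the two pipelines. This is a minimal, self-contained sketch: the stopword list and suffix rules are toy stand-ins (a real pipeline would use, e.g., NLTK's stopword corpus and PorterStemmer), and none of it is taken from the cited work:

```python
import re

# Illustrative stand-ins; a real pipeline would use a full stopword list
# and a proper stemmer (e.g. NLTK's PorterStemmer).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "and"}
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def transformer_pipeline(text: str) -> str:
    """Light cleaning only: the pretrained model handles the rest."""
    text = re.sub(r"http\S+", " ", text)      # drop URLs
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

def naive_stem(token: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def baseline_pipeline(text: str) -> list[str]:
    """Typical NLP pipeline for baseline models like XGBoost: lowercase,
    tokenize, remove stopwords, stem -- all to shrink the feature space."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

text = "The reports were arriving quickly and flooding the queue"
print(transformer_pipeline(text))  # near-raw text for BERT-style models
print(baseline_pipeline(text))     # ['report', 'arriv', 'quick', 'flood', 'queue']
```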
“…The collected data were unstructured, so the next step is to preprocess them with a Natural Language Processing (NLP) machine learning model [21]. Natural Language Processing has several procedures, namely case folding, word normalization, cleansing, filtering, stemming, and tokenizing.…”
Section: Text Preprocessing (mentioning)
confidence: 99%
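The procedures this statement lists can be chained into a single function. This is a minimal sketch; the slang dictionary is an illustrative toy, not drawn from the cited paper, and stemming is omitted here since the previous sketch already shows a stand-in stemmer:

```python
import re

# Illustrative slang/abbreviation dictionary; a real system would use a
# curated resource, not this toy mapping.
NORMALIZATION = {"gr8": "great", "thx": "thanks", "u": "you"}

def preprocess(text: str) -> list[str]:
    # 1. Case folding: lowercase everything.
    text = text.lower()
    # 2. Cleansing: strip URLs, mentions, and punctuation noise.
    text = re.sub(r"http\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Tokenizing: split on whitespace.
    tokens = text.split()
    # 4. Word normalization: expand slang and abbreviations.
    tokens = [NORMALIZATION.get(t, t) for t in tokens]
    # 5. Filtering: drop very short leftover tokens.
    tokens = [t for t in tokens if len(t) > 1]
    return tokens

print(preprocess("Thx @admin, the new portal is GR8! http://x.io"))
# ['thanks', 'the', 'new', 'portal', 'is', 'great']
```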