2022
DOI: 10.14569/ijacsa.2022.01306109

On the Role of Text Preprocessing in BERT Embedding-based DNNs for Classifying Informal Texts

Abstract: Because the data are highly unstructured and noisy, analyzing society reports in written text is very challenging. Classifying informal text data is still considered a difficult task in natural language processing, since the texts may contain abbreviated words, repeated characters, typos, slang, et cetera. Therefore, text preprocessing is commonly performed to remove the noise and make the texts more structured. However, we argue that most preprocessing tasks are no longer required if suitable word embeddings a…
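The abstract's claim can be illustrated with a short sketch: a pretrained BERT tokenizer breaks noisy, informal tokens into known subwords, so a downstream DNN can consume them without aggressive cleaning. This is a minimal sketch assuming the Hugging Face transformers library; the bert-base-uncased checkpoint is an illustrative choice, not one named in the abstract:

```python
# Minimal sketch: feeding raw informal text to BERT without heavy preprocessing.
# Assumes the Hugging Face `transformers` library; the checkpoint is an
# illustrative choice, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Informal text with repeated characters, slang, and abbreviations.
text = "omg this is sooooo gr8, thx!!!"

# WordPiece splits out-of-vocabulary words into known subwords instead of
# failing, which is why many cleaning steps become optional.
print(tokenizer.tokenize(text))
# e.g. ['om', '##g', 'this', 'is', 'soo', '##oo', '##o', 'gr', '##8', ...]

# Contextual embeddings for the raw sentence; the [CLS] vector (or a pooled
# variant) would feed the classification DNN.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```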

Cited by 8 publications (4 citation statements)
References: 23 publications

Citation statements:
“…It uses a cluster-based approach, and this algorithm overcomes the problem of determining the number of clusters. The results demonstrated that this approach proved effective in handling morphology [47].…”
Section: Different Preprocessing Techniques Provide Different Classif... (mentioning)
confidence: 89%
“…In light of this, it was necessary to build different pre-processing pipelines for our experiments, depending on the class of models considered. Indeed, transformer-based models require few pre-processing steps (e.g., data cleaning), since they are pre-trained over large text corpora and thus already provide an initial word embedding for most words [24]. On the other hand, baseline ML models such as XGBoost and LSTM benefit from typical NLP pre-processing pipelines, such as stemming and stopword removal, which reduce the number of features.…”
Section: Data Pre-processing (mentioning)
confidence: 99%
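The division this statement describes can be made concrete by contrasting the two pipelines. This is a minimal, self-contained sketch: the stopword list and suffix rules are toy stand-ins (a real pipeline would use, e.g., NLTK's stopword corpus and PorterStemmer), and none of it is taken from the cited work:

```python
import re

# Illustrative stand-ins; a real pipeline would use a full stopword list
# and a proper stemmer (e.g. NLTK's PorterStemmer).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "and"}
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def transformer_pipeline(text: str) -> str:
    """Light cleaning only: the pretrained model handles the rest."""
    text = re.sub(r"http\S+", " ", text)      # drop URLs
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

def naive_stem(token: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def baseline_pipeline(text: str) -> list[str]:
    """Typical NLP pipeline for baseline models like XGBoost: lowercase,
    tokenize, remove stopwords, stem -- all to shrink the feature space."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

text = "The reports were arriving quickly and flooding the queue"
print(transformer_pipeline(text))  # near-raw text for BERT-style models
print(baseline_pipeline(text))     # ['report', 'arriv', 'quick', 'flood', 'queue']
```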
“…The collected data were unstructured, so the next step is to preprocess them with a Natural Language Processing (NLP) machine learning model [21]. Natural Language Processing has several procedures, namely case folding, word normalization, cleansing, filtering, stemming, and tokenizing.…”
Section: Text Preprocessing (mentioning)
confidence: 99%
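The procedures this statement lists can be chained into a single function. This is a minimal sketch; the slang dictionary is an illustrative toy, not drawn from the cited paper, and stemming is omitted here since the previous sketch already shows a stand-in stemmer:

```python
import re

# Illustrative slang/abbreviation dictionary; a real system would use a
# curated resource, not this toy mapping.
NORMALIZATION = {"gr8": "great", "thx": "thanks", "u": "you"}

def preprocess(text: str) -> list[str]:
    # 1. Case folding: lowercase everything.
    text = text.lower()
    # 2. Cleansing: strip URLs, mentions, and punctuation noise.
    text = re.sub(r"http\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Tokenizing: split on whitespace.
    tokens = text.split()
    # 4. Word normalization: expand slang and abbreviations.
    tokens = [NORMALIZATION.get(t, t) for t in tokens]
    # 5. Filtering: drop very short leftover tokens.
    tokens = [t for t in tokens if len(t) > 1]
    return tokens

print(preprocess("Thx @admin, the new portal is GR8! http://x.io"))
# ['thanks', 'the', 'new', 'portal', 'is', 'great']
```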