In this study, we present a Bangla multi-domain sentiment analysis dataset, named as SentiGOLD, developed using 70,000 samples, which was compiled from a variety of sources and annotated by a gender-balanced team of linguists. This dataset was created in accordance with a standard set of linguistic conventions that were established after multiple meetings between the Government of Bangladesh and a nationally recognized Bangla linguistics committee. Although there are standard sentiment analysis datasets available for English and other rich languages, there are not any such datasets in Bangla, especially because, there was no standard linguistics framework agreed upon by national stakeholders. Senti-GOLD derives its raw data from online video comments, social media posts and comments, blog posts and comments, news and numerous other sources. Throughout the development of this dataset, domain distribution and class distribution were rigorously maintained. SentiGOLD was created using data from a total of 30 domains (e.g. politics, entertainment, sports, etc.) and was labeled using 5 classes (e.g. strongly negative, weakly negative, neutral, weakly positive, and strongly positive). In order to maintain annotation quality, the national linguistics committee approved an annotation scheme to ensure a rigorous Inter Annotator Agreement (IAA) in a multi-annotator annotation scenario. This procedure yielded an IAA score of 0.88 using Fleiss' kappa method, which is elaborated upon in the paper. A protocol for intra-and crossdataset evaluation was utilized in our efforts to develop a classification system as a standard. The cross-dataset evaluation was performed on the SentNoB dataset, which contains noisy Bangla text samples, thereby establishing a demanding test scenario. We Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. KDD
Feature extraction is an essential step of Optical Character Recognition. Accurate and distinguishable feature plays a significant role to leverage the performance of a classifier. The complexity level of feature identification algorithm differs for alphabet sets of different languages. Apart from generic algorithms to find features of different alphabet sets, these algorithms take care of individual characteristic common for a particular alphabet set. Dominant features of one alphabet set might completely differ from that of another set. Since there always remains the chance that inaccurate features may cause inefficient recognition, special attention should be given to identify the set of optimal features of a character set. Bengali characters also have some specific issues apart from the existing issues of other character sets. For example, there are about 300 basic, modified, and compound character shapes in the script, the characters in a word are topologically connected, and Bengali is an inflectional language. Literature survey shows that several authors have used different features and classification algorithms. The authors have extensively reviewed all these feature sets. In order to identify an optimal feature set, variability analysis has been proposed here. They focus on the specific peculiarities of Bengali alphabet sets, its different usage as vowel and consonant signs, compound, complex, and touching characters. The authors also took care to generate easily computable features that take less time for generation. However, more attention needs to be given in order to choose an efficient classifier.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.