“…The raw corpus text comes from several domains: Aesthetics (Culture, Cinema, Literature, Biographies, and Folklore), Commerce, Mass Media (Classified, Discussion, Editorial, Sports, General News, Health, Weather, and Social), Science and Technology (Agriculture, Environmental Science, Textbook, Astrology, and Mechanical Engineering), and Social Sciences (Economics, Education, Political Science, Linguistics, Health and Family Welfare, History, Textbook, Law, etc.). We also acquired another corpus from Narzary et al. (2022). The final consolidated corpus has 1.6 million tokens and 191k sentences.…”