“…Nonetheless, challenges persist, particularly in representing all languages accurately within tokenizer frameworks, owing to variations in character sets and other linguistic factors. Commonly applied preprocessing techniques, and the studies that use them, include:

- Data cleaning: [35], [49], [53], [58], [62], [65], [67], [68], [74], [75], [76], [77], [80], [81], [82], [83], [88], [89], [92], [94], [95], [96], [100], [103], [104], [106], [107], [108], [109], [110], [111], [113], [114], [115], [116], [117]
- Stemming: [51], [67], [104], [108], …”
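To make the stemming step concrete, the following is a minimal illustrative sketch of naive suffix-stripping in Python. It is not Porter's algorithm or any specific stemmer from the cited studies; the function name and suffix list are assumptions chosen for illustration, and it shows why stemming is language-dependent (the suffixes are English-specific, echoing the tokenizer-coverage challenge above).

```python
def simple_stem(word: str) -> str:
    """Naive English suffix-stripping stemmer (illustrative only).

    Strips the first matching suffix, provided a stem of at least
    three characters remains; real stemmers (e.g. Porter) apply
    ordered, condition-guarded rewrite rules instead.
    """
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Naive stripping conflates or mangles forms a real stemmer handles:
print(simple_stem("running"))  # "runn" (Porter would yield "run")
print(simple_stem("cats"))     # "cat"
```

A rule set like this works only for one language's morphology, which is one reason the preprocessing pipelines in the cited studies vary so widely across languages.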