“…Their rules were also implemented in a recent MA toolkit, Juman++ (Tolmachev et al., 2020). For English and Chinese, various classification methods for the normalization of informal words (Li and Yarowsky, 2008; Wang et al., 2013; Han and Baldwin, 2011; Jin, 2015; van der Goot, 2019) have been developed based on, for example, string, phonetic, or semantic similarity, or on co-occurrence frequency. Qian et al. (2015) proposed a transition-based method with append(x), separate(x), and separate_and_substitute(x,y) operations for the joint word segmentation, POS tagging, and normalization of Chinese microblog text. Dekker and van der Goot (2020) automatically generated pseudo training data from raw English tweets using noise insertion operations, achieving performance comparable to an existing LN system without manually annotated data.…”
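
The idea of generating pseudo training data by noise insertion can be illustrated with a minimal Python sketch. The excerpt does not say which noise operations Dekker and van der Goot (2020) used, so the three operations below (`drop_vowels`, `repeat_last_char`, `swap_adjacent`), the `make_pseudo_pair` helper, and the `noise_prob` parameter are all hypothetical and chosen only for illustration. The sketch corrupts clean tokens and keeps the original tokens as the normalization target, which yields aligned (noisy, clean) pairs without manual annotation.

```python
import random


# NOTE: all operations below are illustrative assumptions, not the
# operations of any published system.
def drop_vowels(word):
    """Remove every vowel except a word-initial one."""
    return word[0] + "".join(c for c in word[1:] if c.lower() not in "aeiou")


def repeat_last_char(word, times=3):
    """Lengthen the word so its final character occurs `times` times in a row."""
    return word + word[-1] * (times - 1)


def swap_adjacent(word, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def make_pseudo_pair(tokens, rng, noise_prob=0.3):
    """Return (noisy_tokens, clean_tokens), aligned one-to-one.

    Only alphabetic tokens longer than three characters are corrupted,
    each with probability `noise_prob`.
    """
    noisy = []
    for tok in tokens:
        if tok.isalpha() and len(tok) > 3 and rng.random() < noise_prob:
            op = rng.choice(["drop", "repeat", "swap"])
            if op == "drop":
                tok = drop_vowels(tok)
            elif op == "repeat":
                tok = repeat_last_char(tok)
            else:
                tok = swap_adjacent(tok, rng)
        noisy.append(tok)
    return noisy, list(tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = "see you tomorrow at the station".split()
    noisy, target = make_pseudo_pair(clean, rng, noise_prob=1.0)
    for n, t in zip(noisy, target):
        print(f"{n}\t{t}")
```

Because the noisy and clean sequences stay aligned token by token, the pairs could be fed directly to a token-level normalization classifier. A realistic setup would presumably derive the noise operations and their frequencies from observed informal text rather than hand-picking them as done here.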