“…inferred based on its semantic contexts, current solutions still face several challenges: 1) the rulebased approaches [60,19] with limited linguistic knowledge or the neural models [56,34,42] trained on limited data suffer from significant performance degradation on out-of-domain text datasets; 2) the neural network-based models [6,5,34] learn the grapheme-to-phoneme (G2P) mapping in an end-to-end manner without explicit semantics modeling, which hinders their pronunciation accuracy in real-life applications. 3) based on the above two points, a reliable polyphone disambiguation module is usually based on a combination of hand-crafted rules, structured G2P-oriented lexicons, and neural models [16], which requires substantial phonemes labels and external knowledge from language experts.…”