“…Capturing pronunciations from raw text is challenging for end-to-end text-to-speech (TTS) systems [2,28,38,41,51,24,14,37], since raw text is full of words that are not covered by general pronunciation rules [4,21,46]. Therefore, polyphone disambiguation (one of the biggest challenges in converting text into phonemes [33,56,42]) plays an important role in the construction of high-quality neural TTS systems [18,34]. However, since the exact pronunciation of a polyphone must be … [Figure caption: Illustration of a dictionary entry containing information on a character's or word's definitions, usages, and pronunciations.]…”