Homoglyphs are pairs of visual representations of Unicode characters that look similar to the human eye. Identifying homoglyphs is extremely useful for building a strong defence mechanism against many phishing and spoofing attacks, ID imitation, profanity abusing, etc. Although there is a list of discovered homoglyphs published by Unicode consortium, regular expansion of Unicode character scripts necessitates a robust and reliable algorithm that is capable of identifying all possible new homoglyphs. In this article, we first show that shallow Convolutional Neural Networks are capable of identifying homoglyphs. We propose two variations, both of which obtain very high accuracy (99.44%) on our benchmark dataset. We also report that adoption of transfer learning allows for another model to achieve 100% recall on our dataset. We ensemble these three methods to obtain 99.72% accuracy on our independent test dataset. These results illustrate the superiority of our ensembled model in detecting homoglyphs and suggest that our model can be used to detect new homoglyphs when increasing Unicode characters are added. As a by-product, we also prepare a benchmark dataset based on the currently available list of homoglyphs.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.