Background: Identifying orthologous genes plays a pivotal role in comparative genomics, because orthologous genes remain less diverged over the course of evolution. However, identifying them is often difficult, slow, and idiosyncratic, especially in the presence of multi-domain proteins, evolutionary dynamics, multiple paralogous genes, and incomplete genome data, and for distantly related species.

Results: We present NORTH, a novel, automated, highly accurate, and scalable machine-learning-based method for predicting orthologous gene clusters. NORTH builds on the biological basis of orthologous genes and incorporates appropriate ideas from machine learning (ML) and natural language processing (NLP). On the OrthoBench benchmark, NORTH outperforms the frequently used orthologous clustering algorithms, not only quantitatively by a wide margin but also qualitatively under the challenging scenarios. Furthermore, we studied 1,255,877 genes in the 250 largest orthologous clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life. NORTH clusters them with 98.48% precision, 98.43% recall, and a 98.44% F1 score.

Conclusions: This is the first study to map orthology identification to a text classification problem, and it achieves remarkable accuracy and scalability. NORTH thus advances the state of the art in orthologous gene prediction and has the potential to serve as an alternative to the existing phylogenetic-tree-based and BLAST-based methods.