Experimental protein function annotation does not scale with the fast-growing sequence databases. Only a tiny fraction (<0.1%) of protein sequences in UniProtKB has experimentally determined functional annotations. Computational methods may predict protein function in a high-throughput way, but its accuracy is not very satisfactory. Based upon recent breakthroughs in protein structure prediction and protein language models, we develop GAT-GO, a graph attention network (GAT) method that may substantially improve protein function prediction by leveraging predicted inter-residue contact graphs and protein sequence embedding.
Our experimental results show that GAT-GO greatly outperforms the latest sequence- and structure-based deep learning methods. On the PDB-mmseqs testset where the train and test proteins share <15% sequence identity, GAT-GO yields Fmax(maximum F-score) 0.508, 0.416, 0.501, and AUPRC(area under the precision-recall curve) 0.427, 0.253, 0.411 for the MFO, BPO, CCO ontology domains, respectively, much better than homology-based method BLAST (Fmax 0.117,0.121,0.207 and AUPRC 0.120, 0.120, 0.163). On the PDB-cdhit testset where the training and test proteins share higher sequence identity, GAT-GO obtains Fmax 0.637, 0.501, 0.542 for the MFO, BPO, CCO ontology domains, respectively, and AUPRC 0.662, 0.384, 0.481, significantly exceeding the just-published graph convolution method DeepFRI, which has Fmax 0.542, 0.425, 0.424 and AUPRC 0.313, 0.159, 0.193.