“…Annotations from all three databases are used to assemble 27 metrics for the neural network classifier. Briefly, the metrics are as follows: [1] total proteins, [2] total KEGG annotations, [3] sum of KEGG v-scores, [4] total Pfam annotations, [5] sum of Pfam v-scores, [6] total VOG annotations, [7] sum of VOG v-scores, [8] total KEGG integration related annotations (e.g., integrase), [9] total KEGG annotations with a v-score of zero, [10] total Pfam integration related annotations (e.g., integrase), [11] total Pfam annotations with a v-score of zero, [12] total VOG redoxin (e.g., glutaredoxin) related annotations, [13] total VOG non-integrase integration related annotations, [14] total VOG integrase annotations, [15] total VOG ribonucleotide reductase related annotations, [16] total VOG nucleotide replication (e.g., DNA polymerase) related annotations, [17] total KEGG nuclease (e.g., restriction endonuclease) related annotations, [18] total KEGG toxin/anti-toxin related annotations, [19] total VOG hallmark protein (e.g., capsid) annotations, [20] total proteins annotated by KEGG, Pfam and VOG, [21] total proteins annotated by Pfam and VOG only, [22] total proteins annotated by Pfam and KEGG only, [23] total proteins annotated by KEGG and VOG only, [24] total proteins annotated by KEGG only, [25] total proteins annotated by Pfam only, [26] total proteins annotated by VOG only, and [27] total unannotated proteins. Non-annotation features such as gene density, average gene length and strand switching were not used because they were found to decrease performance of the neural network classifier despite being differentiating features between bacteria/archaea and viruses; viruses tend to have shorter genes, less intergenic space and strand switch less frequently.…”
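
As an illustration of how the count-based metrics in this passage could be derived, the following is a minimal sketch rather than the tool's actual implementation. It covers only metrics [1] to [7] and the database-overlap counts [20] to [27]. The input layout (one dict per protein mapping a database name to the v-score of its hit), the assumption of at most one hit per database per protein, the feature names, and the example v-score values are all hypothetical and chosen only for this example.

```python
# Hypothetical feature names for metrics [20]-[27], keyed by the exact set of
# databases that annotated a protein. These names are illustrative only.
OVERLAP_FEATURES = {
    frozenset({"KEGG", "Pfam", "VOG"}): "kegg_pfam_vog",  # [20]
    frozenset({"Pfam", "VOG"}): "pfam_vog_only",          # [21]
    frozenset({"Pfam", "KEGG"}): "pfam_kegg_only",        # [22]
    frozenset({"KEGG", "VOG"}): "kegg_vog_only",          # [23]
    frozenset({"KEGG"}): "kegg_only",                     # [24]
    frozenset({"Pfam"}): "pfam_only",                     # [25]
    frozenset({"VOG"}): "vog_only",                       # [26]
    frozenset(): "unannotated",                           # [27]
}


def overlap_counts(proteins):
    """Count proteins by which combination of databases annotated them."""
    counts = {name: 0 for name in OVERLAP_FEATURES.values()}
    for hits in proteins:
        # frozenset(hits) is the set of database names with a hit.
        counts[OVERLAP_FEATURES[frozenset(hits)]] += 1
    return counts


def summary_features(proteins):
    """Assemble metrics [1]-[7] and [20]-[27] for one sequence.

    Assumes (hypothetically) at most one hit per database per protein,
    stored as {database name: v-score}.
    """
    features = {"total_proteins": len(proteins)}          # [1]
    for db in ("KEGG", "Pfam", "VOG"):
        scores = [hits[db] for hits in proteins if db in hits]
        features[f"total_{db.lower()}"] = len(scores)     # [2], [4], [6]
        features[f"sum_{db.lower()}_vscore"] = sum(scores)  # [3], [5], [7]
    features.update(overlap_counts(proteins))
    return features


# Made-up example: four proteins with invented v-scores.
example = [
    {"KEGG": 1.0, "Pfam": 0.5, "VOG": 2.0},
    {"Pfam": 0.0},
    {},
    {"KEGG": 0.0, "VOG": 1.5},
]
features = summary_features(example)
```

Because every protein falls into exactly one overlap category, metrics [20] to [27] always sum to metric [1], which gives a simple consistency check on any such feature vector.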