Bacterial species with large sequence diversity enable studies focused on comparative genomics, population genetics and pan-genome evolution. In such analyses, it is key to determine whether sequences (e.g. genes) from different strains are the same or different. This is often achieved by clustering orthologous genes based on sequence similarity. Importantly, one limitation of existing pan-genome clustering methods is that they do not assign a confidence score to the identified clusters. Given that clustering ground truth is unavailable when working with pan-genomes, the absence of confidence scores makes performance evaluation on real data an open challenge. Moreover, most pan-genome clustering solutions do not accommodate cluster augmentation, which is the addition of new sequences to an already clustered set of sequences. Finally, the pan-genome size of many organisms prevents direct application of powerful clustering techniques that do not scale to large datasets. Here, we present Boundary-Forest Clustering (BFClust), a method that addresses these challenges in three main steps: 1) the approximate-nearest-neighbor retrieval method Boundary-Forest is used as a representative selection step; 2) downstream clustering of the representatives is performed using Markov Clustering (MCL); 3) consensus clustering is applied across the Boundary-Forest, improving clustering accuracy and enabling confidence score calculation. First, MCL is favorably benchmarked against 6 powerful clustering methods. To explore the strengths of the entire BFClust approach, it is applied to 4 different datasets of the bacterial pathogen Streptococcus pneumoniae, and compared against 4 other pan-genome clustering tools. Unlike existing approaches, BFClust is fast, accurate, robust to noise and allows augmentation.
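The consensus step (step 3) can be illustrated with a small Python sketch. This is not BFClust's implementation: it assumes that each tree in the forest yields one cluster label per sequence, and it uses a co-association matrix, a common consensus clustering technique, to merge those labelings. The function names, the 0.5 threshold and the per-sequence score are hypothetical choices made only for illustration.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n_items = len(partitions[0])
    counts = {pair: 0 for pair in combinations(range(n_items), 2)}
    for labels in partitions:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


def consensus_clusters(partitions, threshold=0.5):
    """Link pairs that co-cluster in more than `threshold` of the partitions,
    then return the connected components as consensus labels."""
    n_items = len(partitions[0])
    parent = list(range(n_items))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(partitions).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0..k-1 in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n_items)]


def item_confidence(partitions, labels):
    """Hypothetical score: mean co-association of an item with its cluster mates."""
    co = co_association(partitions)
    scores = []
    for i, lab in enumerate(labels):
        mates = [j for j, other in enumerate(labels) if other == lab and j != i]
        if not mates:
            # Assumption: a singleton cluster is treated as unambiguous.
            scores.append(1.0)
        else:
            total = sum(co[(min(i, j), max(i, j))] for j in mates)
            scores.append(total / len(mates))
    return scores


# Three hypothetical labelings of five sequences, one per tree.
partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
]
labels = consensus_clusters(partitions)
scores = item_confidence(partitions, labels)
```

In this toy input the first two sequences co-cluster in every labeling, so they receive the highest score, while the remaining three are grouped less consistently and score lower. A score of this kind is one way that disagreement between trees could flag an ambiguous cluster.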
Moreover, BFClust uniquely identifies low-confidence clusters in each dataset, which can negatively impact downstream analyses and interpretation of pan-genomes. Being the first tool that outputs confidence scores both when clustering de novo and during cluster augmentation, BFClust offers a way of automatically evaluating and eliminating ambiguity in pan-genomes.

Author Summary

Clustering of biological sequences is a critical step in studying bacterial species with large sequence diversity. Existing clustering approaches group sequences together based on similarity.
However, these approaches do not offer a way of evaluating the confidence of their output. This makes it impossible to determine whether the clustering output reflects biologically relevant clusters. Most existing methods also do not allow cluster augmentation, which is the quick incorporation and clustering of newly available sequences with an already clustered set. We present Boundary-Forest Clustering (BFClust) as a method that can generate cluster confidence scores, as well as allow cluster augmentation. In addition to having these additional key functio...