Clustering approaches are pivotal for handling the many sequence variants
obtained in DNA metabarcoding datasets and have therefore become a key
step of metabarcoding analysis pipelines. Clustering often relies on a
sequence similarity threshold to gather sequences in Molecular
Operational Taxonomic Units (MOTUs) that ideally each represent a
homogeneous taxonomic entity, e.g. a species or a genus. However, the
choice of the clustering threshold is rarely justified, and its impact
on MOTU over-splitting or over-merging is even less often tested. Here, we
evaluated clustering threshold values for several metabarcoding markers
under different criteria: limitation of MOTU over-merging, limitation of
MOTU over-splitting, and trade-off between over-merging and
over-splitting. We extracted sequences from a public database for eight
markers, ranging from generalist markers targeting Bacteria or
Eukaryota, to more specific markers targeting a class or a subclass
(e.g. Insecta, Oligochaeta). Based on the distributions of pairwise
sequence similarities within species and within genera, and on the rates
of over-splitting and over-merging across different clustering
thresholds, we propose threshold values that minimize the risk of
over-splitting, minimize the risk of over-merging, or offer a trade-off
between the two risks. For generalist markers, high similarity thresholds
(0.96-0.99) are generally appropriate, while more specific markers
require lower values (0.85-0.96). These results do not support the use
of a fixed clustering threshold (e.g. 0.97). Instead, we advocate a
careful examination of the most appropriate threshold based on the
research objectives, the potential costs of over-splitting and
over-merging, and the features of the studied markers.
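
To make the two error types concrete, the following is a minimal sketch and not the procedure used in the study. It assumes pre-aligned sequences of equal length, a simple per-position identity as the similarity measure, and single-linkage clustering. It also assumes simple working definitions: the over-splitting rate is the share of species spread over more than one MOTU, and the over-merging rate is the share of MOTUs containing more than one species. The function names (`identity`, `cluster`, `error_rates`) and the toy sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Single-linkage clustering (assumed method): link every pair whose
    identity is >= threshold, then return one MOTU label per sequence."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def error_rates(seqs, species, threshold):
    """Return (over_splitting_rate, over_merging_rate) under the assumed
    definitions given in the text above."""
    motus = cluster(seqs, threshold)
    motus_per_species = {}
    species_per_motu = {}
    for motu, sp in zip(motus, species):
        motus_per_species.setdefault(sp, set()).add(motu)
        species_per_motu.setdefault(motu, set()).add(sp)
    over_split = sum(len(m) > 1 for m in motus_per_species.values()) / len(motus_per_species)
    over_merged = sum(len(s) > 1 for s in species_per_motu.values()) / len(species_per_motu)
    return over_split, over_merged


# Toy data (invented for illustration): three species, five sequences.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC", "CCCCCGGGGG"]
species = ["sp_A", "sp_A", "sp_B", "sp_C", "sp_C"]

for threshold in (0.95, 0.9, 0.5):
    split, merged = error_rates(seqs, species, threshold)
    print(f"threshold={threshold}: over-splitting={split:.2f}, over-merging={merged:.2f}")
```

On this toy set, raising the threshold increases over-splitting and lowering it increases over-merging, which is the trade-off the threshold choice has to balance.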