“…There exists a plethora of algorithms for de novo clustering of generic nucleotide sequences [13][14][15][16] and protein sequences [17,14,18,19]. Several algorithms have also been proposed for clustering specific types of nucleotide data, such as barcode sequences [20], EST sequences [21][22][23], full-length cDNA [24], RAD-seq [25], genomic or metagenomic short reads [26][27][28][29][30][31], UMI-tagged reads [32], full genomes and metagenomes [33], and contigs from RNA-seq assemblies [34]. However, our clustering problem has unique distinguishing characteristics: transcripts from the same gene have large indels due to alternative splicing, and the error rate and profile differ both between [2] and within [35] reads.…”