Motivation: Deep immune receptor sequencing, Repseq, provides unprecedented opportunities to identify condition-associated T-cell clones, represented by T-cell receptor (TCR) CDR3 sequences. TCR profiling can improve immunopathological understanding of various diseases and holds considerable clinical relevance. However, due to the immense diversity of the immune repertoire, identification of condition-relevant TCR CDR3s from total repertoires has so far been limited either to mostly "public" CDR3 sequences, which are shared across unrelated individuals, or to comparisons of CDR3 frequencies across multiple samples from the same individual. A methodology for identifying condition-associated TCR CDR3s by population-level comparison of groups of Repseq samples is currently lacking.
Results: We implemented a computational pipeline that allows population-level comparison of Repseq sample groups at the level of the immune repertoire sub-units that are shared across individuals. These sub-units (or sub-repertoires) represent shared immuno-genomic features across individuals that potentially encode common signatures in the immune response to antigens. The method first performs unsupervised clustering of CDR3 sequences within each sample based on their similarity in nucleotide or amino acid subsequence frequency. Next, it matches clusters across samples to define the immune sub-repertoires, and performs statistical differential abundance testing at the level of the identified sub-repertoires. We applied the method to total TCR CDR3β Repseq datasets of celiac disease patients in gluten-exposed and unexposed conditions, as well as to a public dataset of yellow fever vaccination volunteers before and after immunization. The method successfully identified condition-associated CDR3β sequences, as evidenced by considerable agreement of TRBV-gene and positional amino acid usage patterns in the detected CDR3β sequences with previously known CDR3β species relevant to celiac disease. The method also recovered significantly more previously known CDR3β sequences relevant to each condition than would be expected by chance. We conclude that immune sub-repertoires of similar immuno-genomic features, shared across unrelated individuals, encode common immunological information. Moreover, they can serve as viable units of population-level immune repertoire comparison, acting as a proxy for identification of condition-associated CDR3 sequences.
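As a rough illustration of what "similarity in amino acid subsequence frequency" could mean in practice, the sketch below turns each CDR3 sequence into a profile of overlapping k-mer frequencies and compares two profiles by cosine similarity. This is not the pipeline's implementation: the function names, the choice of k = 3, the use of cosine similarity, and the example sequences are all assumptions made for illustration, and the abstract does not specify the clustering algorithm applied on top of such features.

```python
from collections import Counter
from math import sqrt


def kmer_profile(cdr3: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in one CDR3 sequence.

    k = 3 is an assumed value, not one taken from the paper.
    """
    counts = Counter(cdr3[i:i + k] for i in range(len(cdr3) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p: dict[str, float], q: dict[str, float]) -> float:
    """Cosine similarity between two sparse k-mer frequency profiles."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = sqrt(sum(value * value for value in p.values()))
    norm_q = sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical CDR3 amino acid strings, used only to exercise the functions.
seq_a = "CASSLGQAYEQYF"
seq_b = "CASSLGQGYEQYF"  # differs from seq_a at a single position
seq_c = "CAWSVGTDTQYF"   # shares little with seq_a beyond the C-terminal end

sim_ab = cosine_similarity(kmer_profile(seq_a), kmer_profile(seq_b))
sim_ac = cosine_similarity(kmer_profile(seq_a), kmer_profile(seq_c))
print(f"similarity a-b: {sim_ab:.3f}")
print(f"similarity a-c: {sim_ac:.3f}")
```

Pairwise similarities of this kind could feed any standard unsupervised clustering step within a sample; matching clusters across samples and testing them for differential abundance would be separate stages that this sketch does not cover.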
Supplementary Materials