A fundamental challenge in immunology is diagnostic classification based on repertoire sequence. We used the principle of maximum entropy (MaxEnt) to build compact representations of antibody (IgH) and T-cell receptor (TCRβ) CDR3 repertoires based on the statistical biophysical patterns latent in the frequency and ordering of repertoires' constituent amino acids. This approach results in substantial advantages in quality, dimensionality, and training speed compared to MaxEnt models based solely on the standard 20-letter amino-acid alphabet. Descriptor-based models learn patterns that pure amino-acid-based models cannot. We demonstrate the utility of descriptor models by successfully classifying influenza vaccination status (AUC=0.97, p=4×10⁻³), requiring only 31 samples from 14 individuals. Descriptor-based MaxEnt modeling is a powerful new method for dissecting, encoding, and classifying complex repertoires.

Arora et al. (2019) Repertoire-Based Diagnostics Using Statistical Biophysics

Introduction

A major challenge in systems immunology is determining how to describe the sequence-level heterogeneity of antibody (immunoglobulin; Ig) and T-cell receptor (TCR) repertoires in ways that facilitate the identification of meaningful patterns. Sequence-frequency distributions (for example, counts of unique IgH or TCRβ CDR3s) are commonly used but not ideal for interpersonal comparisons, since repertoires from different people are largely disjoint (Robins et al., 2010; Arnaout et al., 2011). Motif-frequency distributions, which count how often each of the 20ⁿ possible n-mers appears in a repertoire (for some choice of n), are more likely to overlap between individuals, but may fail to detect probabilistic or higher-order patterns and are subject to sampling-related bias unless n is small.
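To make the contrast between sequence-level and motif-level comparison concrete, the following is a minimal sketch of a motif-frequency distribution. It rests on assumptions that are not taken from the paper: the function name `motif_frequencies`, the pooling of n-mers across all sequences in a repertoire, and the two toy repertoires are hypothetical and purely illustrative.

```python
from collections import Counter

# The standard 20-letter amino-acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_frequencies(repertoire, n):
    """Relative frequency of each contiguous n-mer, pooled over all sequences.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    counts = Counter()
    for seq in repertoire:
        # Slide a window of width n along the sequence.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: c / total for motif, c in counts.items()}


# Toy CDR3-like strings (invented for this example, not real data).
rep_a = ["CARDYW", "CARGGW"]
rep_b = ["CAKDYW", "CTRGGW"]

# Sequence-level comparison: which full sequences do the repertoires share?
shared_sequences = set(rep_a) & set(rep_b)

# Motif-level comparison with n = 2: which 2-mers appear in both?
freq_a = motif_frequencies(rep_a, 2)
freq_b = motif_frequencies(rep_b, 2)
shared_motifs = set(freq_a) & set(freq_b)

# Size of the motif space for this n (20**n).
n_possible = len(AMINO_ACIDS) ** 2

print("shared sequences:", sorted(shared_sequences))
print("shared 2-mers:", sorted(shared_motifs))
print("possible 2-mers:", n_possible)
```

In this toy case the two repertoires share no full sequence but do share several 2-mers, which is the kind of overlap the motif representation is meant to expose. Because the motif space grows as 20ⁿ, the number of observations per motif falls quickly as n increases.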
Comparisons of frequency distributions between repertoires from different individuals have yielded important insights (Parameswaran et al., 2013; Kaplinsky et al., 2014; Emerson et al., 2017; Sun et al., 2017), but the limitations of this approach suggest a need for complementary methods. One such method is maximum-entropy (MaxEnt) modeling (Fig. 1).

MaxEnt models, which were first developed for statistical physics and information theory (Jaynes, 1957), can be used to describe repertoires (or other complex ensembles of proteins, nucleic acids, etc.) in terms of constraints called biases that determine the ways in which a given repertoire differs from a uniform distribution of sequences (Yeo and Burge, 2004; Russ et al., 2005; Seno et al., 2008; Mora et al., 2010; Marks et al., 2011). Given a set of features (for example, the frequencies of the 20 amino acids and the 20² = 400 nearest-neighbor amino-acid pairs, "neighbors" being defined as contiguous N-to-C-terminus amino acids), a MaxEnt model describes the degree to which each feature is biased away from its value in a uniform repertoire, taking all the other biases into ac...