Real-time selective sequencing of individual DNA fragments, or 'Read Until', allows Oxford Nanopore Technology sequencing to be focused on pre-selected genomic regions. This can lead to large improvements in DNA sequencing performance in the many scenarios where only part of the DNA content of a sample is of interest. The approach rests on deciding whether to sequence a fragment completely after having sequenced only a small initial part of it. If, based on this small part, the fragment is not deemed of (sufficient) interest, it is rejected and sequencing continues on a new fragment. To date, only simple decision strategies based on location within a genome have been proposed to determine which fragments are of interest. We present a new mathematical model and algorithm for the real-time assessment of the value of prospective fragments. Our decision framework is based not only on which genomic regions are a priori interesting, but also on which fragments have been sequenced so far, and hence on the information currently available about the genome being sequenced. As such, our strategy can adapt dynamically during each run, focusing sequencing efforts on areas of highest uncertainty (typically areas currently at low coverage). We show that our approach

For each position $i$ of a reference genome of length $N$, we denote by $\pi_i(g)$ the location-specific prior on genotypes $g \in G$ before any data have been observed. In all applications below, when considering a haploid genome, we define the prior of the reference nucleotide $b_R$ at position $i$ as $\pi_i(b_R) = 1 - \theta$, with $\theta$ the genetic diversity of the considered population. Conversely, $\pi_i(g) = \theta/3$ if $g \neq b_R$.
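As an illustration of the haploid prior just defined, the following minimal Python sketch builds $\pi_i(g)$ for a single position. The function name `haploid_prior`, the constant `NUCLEOTIDES`, and the example value of $\theta$ are assumptions made for this sketch and are not taken from the original text.

```python
# Minimal sketch (illustrative names): haploid per-position genotype prior.
NUCLEOTIDES = "ACGT"


def haploid_prior(ref_base: str, theta: float) -> dict[str, float]:
    """Return {g: pi_i(g)} for one position with reference nucleotide ref_base.

    The reference nucleotide gets prior 1 - theta; each of the three
    other nucleotides gets theta / 3, so the prior sums to one.
    """
    if ref_base not in NUCLEOTIDES:
        raise ValueError(f"unknown reference nucleotide: {ref_base!r}")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return {
        g: (1.0 - theta) if g == ref_base else theta / 3.0
        for g in NUCLEOTIDES
    }


if __name__ == "__main__":
    # theta = 0.01 is an arbitrary example value, not one from the paper.
    print(haploid_prior("A", theta=0.01))
```

Because the three non-reference nucleotides share $\theta$ equally, the prior depends only on the reference nucleotide and on $\theta$, which is why the same function can be reused at every position.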
When considering diploid sequenced genomes, we still assume a haploid reference genome, with the reference nucleotide at a given position denoted $b_R$. In the case of a diploid unphased genome being sequenced, we define $\pi_i(\{b_R, b_R\}) = 1 - \theta$, and $\pi_i(\{g, g\}) = p_{\mathrm{homo}}\,\theta/3$ if $g \neq b_R$, with $p_{\mathrm{homo}}$ being the proportion of site differences from a reference that are expected to be homozygous, and $\pi_i(\{g, b_R\}) = (1 - p_{\mathrm{homo}})\,\theta/3$ for $g \neq b_R$. We ignore the possibility of a heterozygous genome being sequenced with both alleles different from the reference genome. These prior probability definitions also ignore differences in mutation rates across nucleotides and genome positions, and do not use prior knowledge of SNP locations derived from the population; when available, these aspects could, however, easily be included in the definition of $\pi_i(g)$.
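The diploid unphased prior can be sketched in the same way. In the minimal Python example below, genotypes are represented as sorted pairs of nucleotides; this representation, the function name `diploid_prior`, and the example parameter values are assumptions made for illustration only. Heterozygous genotypes with both alleles different from the reference are omitted, matching the simplification stated above (they implicitly receive prior zero).

```python
# Minimal sketch (illustrative names): diploid unphased per-position prior.
NUCLEOTIDES = "ACGT"


def diploid_prior(
    ref_base: str, theta: float, p_homo: float
) -> dict[tuple[str, str], float]:
    """Return {genotype: pi_i(genotype)} for one position.

    Genotypes are unordered, so each is stored as a sorted 2-tuple.
    Only the seven genotypes with non-zero prior are returned.
    """
    if ref_base not in NUCLEOTIDES:
        raise ValueError(f"unknown reference nucleotide: {ref_base!r}")
    if not (0.0 <= theta <= 1.0 and 0.0 <= p_homo <= 1.0):
        raise ValueError("theta and p_homo must lie in [0, 1]")

    # Homozygous reference genotype {b_R, b_R}.
    prior = {(ref_base, ref_base): 1.0 - theta}
    for g in NUCLEOTIDES:
        if g == ref_base:
            continue
        # Homozygous non-reference genotype {g, g}.
        prior[(g, g)] = p_homo * theta / 3.0
        # Heterozygous genotype {g, b_R} carrying one reference allele.
        het = (min(g, ref_base), max(g, ref_base))
        prior[het] = (1.0 - p_homo) * theta / 3.0
    return prior


if __name__ == "__main__":
    # Example values are arbitrary and not taken from the paper.
    for genotype, p in sorted(diploid_prior("C", 0.01, 0.3).items()):
        print(genotype, p)
```

The seven probabilities sum to $(1 - \theta) + p_{\mathrm{homo}}\theta + (1 - p_{\mathrm{homo}})\theta = 1$, so no renormalisation is needed under this simplification.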