Over the last several years, metagenomics has enabled the assembly of millions of new viral sequences that have vastly expanded our knowledge of Earth's viral diversity. However, these sequences range from small fragments to complete genomes, and no tools currently exist for estimating their quality. To address this problem, we developed CheckV, an automated pipeline for estimating the completeness of viral genomes and for identifying and removing non-viral regions found on integrated proviruses. After validating the approach on mock datasets, CheckV was applied to large and diverse viral genome collections, including IMG/VR and the Global Ocean Virome, revealing that the majority of viral sequences were small fragments, with just 3.6% classified as high-quality (i.e., >90% completeness) or complete genomes. Additionally, we found that removal of host contamination significantly improved identification of auxiliary metabolic genes and interpretation of viral-encoded functions. We expect CheckV will be broadly useful for all researchers studying and reporting viral genomes assembled from metagenomes. CheckV is freely available at: http://bitbucket.org/berkeleylab/CheckV.

longer than this length. However, this "one-size-fits-all" approach fails to account for the large variability in viral genome sizes, which range from 2 kb in Circoviridae [18] up to 2.5 Mbp in Megaviridae [12], and thus gathers sequences representing a broad range of genome completeness. Complete, circular genomes are commonly identified from the presence of direct terminal repeats [5-7], and sometimes from mapping paired-end sequencing reads [19], but tend to be rare. VIBRANT [11] is a recently published tool that categorizes sequences into high-, medium-, or low-quality tiers based on circularity and the presence of viral hallmark proteins, but does not estimate genome completeness per se.
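The two points above, that a fixed length cutoff says little about completeness and that direct terminal repeats can flag a complete genome, can be illustrated with a minimal Python sketch. This is not CheckV's implementation: the function names, the 20 bp minimum repeat length, the 1,000 bp search cap, and the 10 kb cutoff are all assumptions chosen for illustration, and only exact repeats are considered.

```python
def terminal_repeat_length(seq: str, min_len: int = 20, max_len: int = 1000) -> int:
    """Length of the longest exact repeat shared by both ends of seq, or 0.

    min_len and max_len are illustrative assumptions, not published thresholds.
    """
    seq = seq.upper()
    # A terminal repeat cannot exceed half the sequence without overlapping itself.
    longest = min(max_len, len(seq) // 2)
    # Search from the longest candidate down so the first hit is the longest repeat.
    for k in range(longest, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def length_fraction(contig_len: int, genome_len: int) -> float:
    """Contig length as a percentage of a reference genome length, capped at 100."""
    return min(100.0, 100.0 * contig_len / genome_len)


if __name__ == "__main__":
    # Toy contig: a 25 bp repeat at both ends around a filler region.
    repeat = "ATGCGTACGTTAGCCGATAGCTAGG"
    contig = repeat + "T" * 100 + repeat
    print("terminal repeat length:", terminal_repeat_length(contig))

    # A hypothetical 10 kb cutoff compared with the genome size extremes cited above.
    cutoff = 10_000
    for name, size in [("2 kb genome", 2_000), ("2.5 Mbp genome", 2_500_000)]:
        print(name, length_fraction(cutoff, size))
```

Under these assumptions, a contig that just passes a 10 kb cutoff could be a full-length genome for the smallest viruses or well under 1% of the largest, which is why length alone is a poor proxy for completeness.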
With regard to contamination, existing approaches either remove viral contigs containing a high fraction of microbial genes [5] or predict host-virus boundaries on proviruses [10, 11, 20, 21]. The former approach tolerates a small number of microbial genes, while the latter may misidentify the true host-virus boundary. Other approaches detect viral signatures but do not account for the presence of microbial regions at all [9]. With the diversity of available viral prediction pipelines and protocols, there is a need for a standalone tool to ensure that viral contigs are free of host contamination, and to remove it when present.

Here, we present CheckV, a new tool to automatically estimate completeness and contamination for metagenome-assembled viral genomes. By collecting an extended database of complete viral genomes from both isolates and environmental samples, CheckV was able to estimate the completeness for the vast majority of contigs in the IMG/VR database, illustrating its broad applicability to newly assembled g...