TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.
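The clustering step described above can be illustrated with a minimal sketch. It assumes the pairwise similarity search has already produced a list of sequence pairs judged similar, and it groups them by single-linkage (union-find). The function name `cluster_by_pairs` and the input format are hypothetical and are not taken from TGICL itself, whose actual clustering criteria are not described in this abstract.

```python
from collections import defaultdict


def cluster_by_pairs(seq_ids, similar_pairs):
    """Group sequences into clusters given pairs that passed a similarity test.

    Hypothetical helper: single-linkage clustering via union-find.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["est1", "est2", "est3", "est4", "est5"]
    pairs = [("est1", "est2"), ("est2", "est3")]
    for cluster in cluster_by_pairs(ids, pairs):
        print(cluster)
```

Each resulting cluster would then be passed to an assembler on its own, which is what allows the assembly stage to be spread across several CPUs.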
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the African-descended populations, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes while the rest appear to be intergenic.
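As a quick arithmetic check on the figures above, the sketch below computes the mean novel contig length and the fraction of extra sequence relative to the reference. The reference length of about 3.1 billion bp is an assumption made for illustration; it is not stated in the abstract.

```python
# Figures quoted in the abstract.
NOVEL_BP = 296_485_284
NOVEL_CONTIGS = 125_715

# Assumption (not from the abstract): approximate reference genome length.
ASSUMED_REFERENCE_BP = 3_100_000_000

mean_contig_bp = NOVEL_BP / NOVEL_CONTIGS
extra_fraction = NOVEL_BP / ASSUMED_REFERENCE_BP

print(f"mean novel contig length: {mean_contig_bp:.0f} bp")
print(f"novel sequence relative to assumed reference: {extra_fraction:.1%}")
```

Under that assumed reference length, the ratio comes out a little under one tenth, which is in line with the "~10%" stated in the abstract.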
Comparative genomics promises to rapidly accelerate the identification and functional classification of biologically important human genes. We developed the TIGR Orthologous Gene Alignment (TOGA; 〈http://www.tigr.org/tdb/toga/toga.shtml〉) database to provide a cross-reference between fully and partially sequenced eukaryotic transcribed sequences. Starting with the assembled expressed sequence tag (EST) and gene sequences that comprise the 28 TIGR Gene Indices, we used high-stringency pair-wise sequence searches and a reflexive, transitive closure process to associate sequence-specific best hits, generating 32,652 tentative ortholog groups (TOGs). This has allowed us to identify putative orthologs and paralogs for known genes, as well as those that exist only as uncharacterized ESTs, and to provide links to additional information including genome sequence and mapping data. TOGA provides an important new resource for the analysis of gene function in eukaryotes. In addition, an analysis of the most widely represented sequences can begin to provide insight into eukaryotic biological processes.
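One way to picture the grouping step is sketched below. It reads "reflexive" as meaning that two sequences are linked only when each is a best hit of the other, and it reads "transitive closure" as merging all sequences connected through such links. Both readings, the function name `tentative_ortholog_groups`, and the input format are assumptions made for illustration; the abstract does not specify TOGA's actual procedure or thresholds.

```python
def tentative_ortholog_groups(best_hits):
    """Build groups from best-hit relationships.

    best_hits maps a sequence ID to the set of its best hits in other
    species (hypothetical input format). Only reciprocal links are kept,
    then connected components are collected as groups.
    """
    # Step 1: keep a link only if it holds in both directions.
    adjacency = {}
    for a, hits in best_hits.items():
        for b in hits:
            if a != b and a in best_hits.get(b, ()):
                adjacency.setdefault(a, set()).add(b)
                adjacency.setdefault(b, set()).add(a)

    # Step 2: transitive closure, computed as connected components.
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    hits = {
        "human|A": {"mouse|A", "rat|A"},
        "mouse|A": {"human|A"},
        "rat|A": {"human|A"},
        "human|B": {"mouse|B"},
        "mouse|B": {"human|B"},
        "mouse|C": {"human|A"},  # one-way hit, so it is not linked
    }
    for group in tentative_ortholog_groups(hits):
        print(group)
```

In this toy input, the mouse and rat sequences end up in the same group through their shared human partner even though neither is a best hit of the other, which is the effect of the closure step.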
Introduction
While lung cancer is generally thought to be environmentally provoked, anecdotal familial clustering has been reported, suggesting there may be genetic susceptibility factors. We systematically tested whether germline mutations in eight candidate genes may be risk factors for lung adenocarcinoma.
Methods
We studied lung adenocarcinoma cases for whom germline sequence data had been generated as part of The Cancer Genome Atlas (TCGA) project but had not previously been analyzed. We selected eight genes, ATM, BRCA2, CHEK2, EGFR, PARK2, TERT, TP53, and YAP1, based on prior anecdotal association with lung cancer or on genome-wide association studies.
Results
Among 555 lung adenocarcinoma cases, we detected 14 pathogenic mutations in five genes; they occurred at a frequency of 2.5% and represented an odds ratio of 66 (95% confidence interval, 33 to 125; P<0.0001, chi-square test). The mutations fell most commonly in ATM (50%), followed by TP53, BRCA2, EGFR and PARK2. The majority (86%) of these variants had been reported in other familial cancer syndromes. Another 12 cases (2%) carried ultra-rare variants that were predicted to be deleterious by three protein prediction programs; these most frequently involved ATM and BRCA2.
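The percentages quoted above follow from the case counts, as the short check below shows. The count of 12 previously reported variants is inferred from the stated 86% of 14 and is an assumption, not a figure given in the abstract. The odds ratio is not recomputed because the control data it depends on are not given here.

```python
# Counts quoted in the abstract.
TOTAL_CASES = 555
PATHOGENIC_MUTATIONS = 14
ULTRA_RARE_CASES = 12

# Assumption: 86% of 14 variants corresponds to 12 variants.
ASSUMED_PREVIOUSLY_REPORTED = 12

pathogenic_pct = 100 * PATHOGENIC_MUTATIONS / TOTAL_CASES
ultra_rare_pct = 100 * ULTRA_RARE_CASES / TOTAL_CASES
reported_pct = 100 * ASSUMED_PREVIOUSLY_REPORTED / PATHOGENIC_MUTATIONS

print(f"pathogenic mutation frequency: {pathogenic_pct:.1f}%")
print(f"ultra-rare variant carriers:   {ultra_rare_pct:.1f}%")
print(f"previously reported variants:  {reported_pct:.0f}%")
```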
Conclusions
A subset of lung adenocarcinoma patients, at least 2.5% to 4.5%, carries germline variants that have been linked to cancer risk in Mendelian syndromes. The genes fall most frequently in DNA repair pathways. Our data indicate that lung adenocarcinoma, similar to other solid tumors, contains a subset of patients with inherited susceptibility.