Genes encoding cell-surface proteins control nervous system development and are implicated in neurological disorders. These genes produce alternative mRNA isoforms which remain poorly characterized, impeding understanding of how disease-associated mutations cause pathology. Here we introduce a strategy to define complete portfolios of full-length isoforms encoded by individual genes. Applying this approach to neural cell-surface molecules, we identify thousands of unannotated isoforms expressed in retina and brain. By mass spectrometry we confirm expression of newly-discovered proteins on the cell surface in vivo. Remarkably, we discover that the major isoform of a retinal degeneration gene, CRB1, was previously overlooked. This CRB1 isoform is the only one expressed by photoreceptors, the affected cells in CRB1 disease. Using mouse mutants, we identify a function for this isoform at photoreceptor-glial junctions and demonstrate that loss of this isoform accelerates photoreceptor death. Therefore, our isoform identification strategy enables discovery of new gene functions relevant to disease.
The intrinsic DNA sequence preferences and cell-type specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell-type specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species-specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results demonstrate that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.
The intrinsic DNA sequence preferences and cell-type specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell-type specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species-specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results demonstrate that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.