Although long-read sequencing can often enable chromosome-level reconstruction of genomes, it is still unclear how one can routinely obtain gapless assemblies. In the model plant Arabidopsis thaliana, apart from the reference accession Col-0, every accession assembled de novo from long reads to date has relied on PacBio continuous long reads (CLR). Although these assemblies sometimes achieved chromosome-arm level contigs, they inevitably broke near the centromeres, excluding megabases of DNA from analysis in pan-genome projects. Since PacBio high-fidelity (HiFi) reads circumvent the high error rate of CLR technologies, albeit at the expense of read length, we compared a CLR assembly of accession Eyach15-2 to HiFi assemblies of the same sample. The use of five different assemblers starting from subsampled data allowed us to evaluate the impact of coverage and read length. We found that centromeres and rDNA clusters are responsible for 71% of contig breaks in the CLR scaffolds, while relatively short stretches of GA/TC repeats are at the core of >85% of the unfilled gaps in our best HiFi assemblies. Since the HiFi technology consistently enabled us to reconstruct gapless centromeres and 5S rDNA clusters, we demonstrate the value of the approach by comparing these previously inaccessible regions of the genome between the Eyach15-2 accession and the reference accession Col-0.
The Short‐chain Dehydrogenases/Reductases Engineering Database (SDRED) covers one of the largest known protein families (168 150 proteins). Assignment to the superfamilies of Classical and Extended SDRs was achieved by global sequence similarity and by identification of family‐specific sequence motifs. Two standard numbering schemes were established for Classical and Extended SDRs that allow for the determination of conserved amino acid residues, such as cofactor specificity determining positions or superfamily specific sequence motifs. The comprehensive sequence dataset of the SDRED facilitates the refinement of family‐specific sequence motifs. The glycine‐rich motifs for Classical and Extended SDRs were refined to improve the precision of superfamily classification. In each superfamily, the majority of sequences formed a tightly connected sequence network and belonged to a large homologous family. Despite their different sequence motifs and their different sequence lengths, the two sequence networks of Classical and Extended SDRs are not separate, but connected by edges at a threshold of 40% sequence similarity, indicating that all SDRs belong to a large, connected network. The SDRED is accessible at https://sdred.biocatnet.de/.
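The network observation in the SDRED abstract (two superfamily networks that are joined once edges at 40% sequence similarity are allowed) can be illustrated with a minimal sketch. This is not the SDRED pipeline: the sequence names and similarity values below are invented for illustration, and a plain union-find stands in for whatever network tooling the authors actually used.

```python
from collections import defaultdict


def connected_components(nodes, similarities, threshold):
    """Group nodes that are linked by a pairwise similarity >= threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        if sim >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sequences and similarity fractions (assumptions, not SDRED data).
nodes = ["classical_1", "classical_2", "extended_1", "extended_2"]
similarities = {
    ("classical_1", "classical_2"): 0.72,
    ("extended_1", "extended_2"): 0.68,
    ("classical_2", "extended_1"): 0.41,  # weak cross-superfamily link
}

at_40 = connected_components(nodes, similarities, 0.40)
at_60 = connected_components(nodes, similarities, 0.60)
```

In this toy data, the stricter 60% cutoff leaves the two superfamilies as separate components, while the 40% cutoff merges them through the single weak edge, which mirrors the qualitative finding reported in the abstract.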
Multicopper oxidases (MCOs) use copper ions as cofactors to oxidize a variety of substrates while reducing oxygen to water. MCOs have been identified in various taxa, with notable occurrences in fungi. The role of these fungal MCOs in lignin degradation sparked interest due to their potential for application in biofuel production and various other industries. MCOs consist of different numbers of protein domains, which led to their classification into two‐, three‐, and six‐domain MCOs. The previously established Laccase and Multicopper Oxidase Engineering Database (https://lcced.biocatnet.de) was updated and now includes 51 058 sequences and 229 structures of MCOs. Sequences and structures of all MCOs were systematically compared. All MCOs consist of cupredoxin‐like domains. Two‐domain MCOs are formed by the N‐ and C‐terminal domains (domains N and C), while three‐domain MCOs have an additional domain (M) in between, homologous to domain C. The six‐domain MCOs consist of three alternating repeats of domains N and C. Two standard numbering schemes were developed for the copper‐binding domains N and C, which facilitated the identification of conserved positions and a comparison to previously reported results from mutagenesis studies. Two sequence motifs for the copper binding sites were identified per domain. Their modularity, depending on the placement of the T1‐copper binding site, was demonstrated. Protein sequence networks showed relationships between two‐ and three‐domain MCOs, allowing for family‐specific annotation and inference of evolutionary relationships.
The ω-Transaminase Engineering Database (oTAED) was established as a publicly accessible resource on sequences and structures of the biotechnologically relevant ω-transaminases (ω-TAs) from Fold types I and IV. The oTAED integrates sequence and structure data, provides a classification based on fold type and sequence similarity, and applies a standard numbering scheme to identify equivalent positions in homologous proteins. The oTAED includes 67 210 proteins (114 655 sequences), which are divided into 169 homologous families based on global sequence similarity. The 44 and 39 highly conserved positions that were identified in Fold type I and IV, respectively, include the known catalytic residues and a large fraction of glycines and prolines in loop regions, which might have a role in protein folding and stability. However, for most of the conserved positions the function is still unknown. Literature information on positions that mediate substrate specificity and stereoselectivity was systematically examined. The standard numbering schemes revealed that many positions that have been described in different enzymes are structurally equivalent. For some positions, multiple functional roles have been suggested based on experimental data in different enzymes. The proposed standard numbering schemes for Fold type I and IV ω-TAs assist with analysis of literature data, facilitate annotation of ω-TAs, support prediction of promising mutation sites, and enable navigation in ω-TA sequence space. Thus, the oTAED is a useful tool for enzyme engineering and the selection of novel ω-TA candidates with desired biochemical properties.
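The idea behind a standard numbering scheme, as described in the oTAED abstract, is that residues in different homologous proteins receive the same position number when they are equivalent. The abstract does not say how the numbering is computed, so the following is only a sketch under the assumption that a gapped pairwise alignment between a reference sequence and a query is already available. The function name and both toy sequences are hypothetical.

```python
def standard_numbering(aligned_ref, aligned_query):
    """Map each query residue (1-based) to its reference position (1-based).

    Query residues that align to a gap in the reference have no equivalent
    reference position and are mapped to None.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")

    mapping = {}
    ref_pos = 0
    query_pos = 0
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char != "-":
            ref_pos += 1  # advance the reference counter on every residue
        if query_char != "-":
            query_pos += 1
            mapping[query_pos] = ref_pos if ref_char != "-" else None
    return mapping


# Toy alignment (invented sequences, not taken from the oTAED).
aligned_ref = "MKG-LPK"
aligned_query = "M-GALPR"

mapping = standard_numbering(aligned_ref, aligned_query)
```

With such a mapping, a mutation reported at query position 2 would be discussed as standard position 3, which is the kind of cross-enzyme comparison the abstract attributes to the numbering schemes. Query position 3 is an insertion relative to the reference and therefore has no standard number in this sketch.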
Obverse Cover: The cover image is based on the Research Article "The Short‐chain Dehydrogenase/Reductase Engineering Database (SDRED): A classification and analysis system for a highly diverse enzyme family" by Maike Gräff et al., DOI: https://doi.org/10.1002/prot.25694.