The evolutionary speed hypothesis (ESH) suggests that molecular evolutionary rates are higher among species inhabiting warmer environments. Previously, the ESH has been investigated using small numbers of latitudinally separated sister lineages; in animals, these studies typically focused on subsets of Chordata and yielded mixed support for the ESH. This study analyzed public DNA barcode sequences from the cytochrome c oxidase subunit I (COI) gene for six of the largest animal phyla (Arthropoda, Chordata, Mollusca, Annelida, Echinodermata, and Cnidaria) and paired latitudinally separated taxa informatically. Of 8037 lineage pairs, just over half (51.6%) displayed a higher molecular rate in the lineage inhabiting latitudes closer to the equator, while the remainder (48.4%) displayed a higher rate in the higher-latitude lineage. To date, this study represents the most comprehensive analysis of latitude-related molecular rate differences across animals. While a statistically significant pattern was detected from our large sample size, our findings suggest that the ESH may not serve as a strong universal mechanism underlying the latitudinal diversity gradient and that COI molecular clocks may generally be applied across latitudes. This study also highlights the merits of using automation to analyze large DNA barcode datasets.
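To illustrate how a split this close to 50:50 can still be statistically significant at this sample size, the sketch below runs an exact two-sided sign test on the reported figures. It is only an illustration: the abstract does not say which test the authors used, the count of 4147 pairs is back-calculated here from the rounded 51.6% rather than taken from the paper, and the function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(k, n):
    """Exact two-sided binomial sign test of k successes in n trials against p = 0.5."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    hi = max(k, n - k)            # the null is symmetric, so fold onto the upper tail
    term = comb(n, hi)            # C(n, hi) as an exact integer
    tail = 0
    for i in range(hi, n + 1):
        tail += term
        term = term * (n - i) // (i + 1)   # exact update: C(n, i + 1) from C(n, i)
    return min(1.0, 2 * tail / 2 ** n)


# ASSUMPTION: 4147 is approximated from the reported 51.6% of 8037 pairs.
n_pairs = 8037
n_faster_near_equator = round(0.516 * n_pairs)
p_value = sign_test_two_sided(n_faster_near_equator, n_pairs)
print(f"{n_faster_near_equator}/{n_pairs} pairs, two-sided p = {p_value:.4f}")
```

The test only asks whether the direction of the rate difference departs from a coin flip; it says nothing about the size of the difference, which is why a significant result here is compatible with the abstract's conclusion that the effect is weak.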
Myriad environmental and biological traits have been investigated for their roles in influencing the rate of molecular evolution across various taxonomic groups. However, most studies have focused on a single trait, while controlling for additional factors in an informal way, generally by excluding taxa. This study utilized a dataset of cytochrome c oxidase subunit I (COI) barcode sequences from over 7000 ray-finned fish species to test the effects of 27 traits on molecular evolutionary rates. Environmental traits such as temperature were considered, as were traits associated with effective population size, including body size and age at maturity. It was hypothesized that these traits would demonstrate significant correlations with substitution rate in a multivariable analysis due to their associations with mutation and fixation rates, respectively. A bioinformatics pipeline was developed to assemble and analyze sequence data retrieved from the Barcode of Life Data System (BOLD) and trait data obtained from FishBase. For use in phylogenetic regression analyses, a maximum likelihood tree was constructed from the COI sequence data using a multi-gene backbone constraint tree covering 71% of the species. A variable selection method that included both single- and multivariable analyses was used to identify traits that contribute to rate heterogeneity estimated from different codon positions. Our analyses revealed that molecular rates were most significantly associated with latitude, body size, and habitat type. Overall, this study presents a novel and systematic approach for integrative data assembly and variable selection methodology in a phylogenetic framework.
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Imputation offers an alternative to removing cases with missing values from datasets. Imputation techniques that incorporate phylogenetic information into their estimations have demonstrated improved accuracy over standard techniques. However, previous studies of phylogenetic imputation tools are largely limited to simulations of numerical trait data, with categorical data not evaluated. It also remains to be explored whether the type of genetic data used affects imputation accuracy. We conducted a real data-based simulation study to compare the performance of imputation methods using a mixed-type trait dataset (lizards and amphisbaenians; order: Squamata). Selected methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Known values were removed from a complete-case dataset to simulate different missingness scenarios: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each method (with and without phylogenetic information derived from mitochondrial and nuclear gene trees) was used to impute the removed values. The performances of the methods were evaluated for each trait and in each missingness scenario. A random forest method supplemented with a nuclear-derived phylogeny performed best overall, and this method was used to impute missing values in the original squamate dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to the complete-case data. However, phylogeny did not always improve performance for every trait and in every missingness scenario, and caution should be taken when imputing trait data, particularly in cases of extreme bias. 
Ultimately, these results support the use of a real data-based simulation procedure to select a suitable imputation strategy for a given mixed-type trait dataset. Moreover, they highlight the potential biases that complete-case usage may introduce into analyses.
Author summary
The issue of missing data is problematic in trait datasets as observations for rare or threatened species are often missing disproportionately. When only complete cases are used in an analysis, derived results may be biased. Imputation is an alternative to complete-case analysis and entails filling in the missing values using known observations. It has been demonstrated that including phylogenetic information in the imputation process improves accuracy of predicted values. However, most previous evaluations of imputation methods for trait datasets are limited to numerical, simulated data, with categorical traits not considered. Using a reptile dataset comprising both numerical and categorical trait data, we employed a real data-based simulation strategy to select an optimal imputation method for the dataset. We evaluated the performance of four different imputation methods across different missingness scenarios (e.g. missing completely at random, values missing disproportionately for smaller species). Results indicate that imputed data better reflected the original dataset characteristics compared to complete-case data; however, the optimal imputation strategy depended on the missingness scenario and trait type. As imputation performance varies depending on the properties of a given dataset, a real data-based simulation strategy can be used to provide guidance on best imputation practices.
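The three missingness scenarios named above differ in what drives the chance that a value is removed. The following is a minimal sketch of one way to induce them in a complete-case column, not the authors' procedure: the rank-based weighting, the function names, and the use of a second trait (for example body size) as the MAR driver are all assumptions made for illustration.

```python
import random


def rank_biased_probs(driver, rate):
    """Per-row removal probabilities averaging `rate`; smaller driver values are likelier to go missing."""
    n = len(driver)
    order = sorted(range(n), key=lambda i: driver[i])   # row indices, ascending by driver
    mean_weight = (n + 1) / 2                           # mean of the weights 1..n
    probs = [0.0] * n
    for rank, i in enumerate(order):
        weight = n - rank                               # smallest driver value gets weight n
        probs[i] = min(1.0, rate * weight / mean_weight)
    return probs


def induce_missingness(target, covariate, rate, mechanism, seed=0):
    """Return a copy of `target` with values replaced by None under the chosen mechanism."""
    rng = random.Random(seed)
    if mechanism == "MCAR":      # removal is unrelated to any trait value
        probs = [rate] * len(target)
    elif mechanism == "MAR":     # removal depends on another, fully observed trait
        probs = rank_biased_probs(covariate, rate)
    elif mechanism == "MNAR":    # removal depends on the value being removed
        probs = rank_biased_probs(target, rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if rng.random() < p else v for v, p in zip(target, probs)]


# Hypothetical toy data standing in for two numerical traits.
demo_rng = random.Random(1)
body_size = [demo_rng.lognormvariate(3, 1) for _ in range(2000)]
clutch_size = [demo_rng.gauss(10, 3) for _ in range(2000)]
masked_mar = induce_missingness(clutch_size, body_size, rate=0.3, mechanism="MAR")
```

Because the removed values are known, any imputation method can then be scored against them, which is what makes a real data-based simulation possible.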
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single-gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missingness mechanism by determining the mean squared error for numerical traits and the proportion falsely classified for categorical traits. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. 
Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
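The two error measures named above can be sketched as follows, using mean/mode imputation as the baseline method. This is an illustration under assumptions rather than the study's code: the helper names are hypothetical, missing values are represented as None, and the abstract does not say whether the mean squared error was scaled or normalized, so the plain form is shown.

```python
from statistics import mean, mode


def impute_mean(values):
    """Fill None entries of a numerical trait with the mean of the observed values."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Fill None entries of a categorical trait with the most common observed value."""
    fill = mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def mean_squared_error(truth, masked, imputed):
    """MSE over the entries that were removed (None in `masked`)."""
    removed = [i for i, v in enumerate(masked) if v is None]
    return mean([(imputed[i] - truth[i]) ** 2 for i in removed])


def proportion_falsely_classified(truth, masked, imputed):
    """Share of removed categorical entries whose imputed value differs from the truth."""
    removed = [i for i, v in enumerate(masked) if v is None]
    return sum(imputed[i] != truth[i] for i in removed) / len(removed)


# Hypothetical toy traits: known values, then the same columns with entries removed.
size_true = [1.0, 2.0, 3.0, 4.0]
size_masked = [1.0, None, 3.0, None]
diet_true = ["a", "a", "b", "a", "b"]
diet_masked = ["a", "a", None, None, "b"]

size_mse = mean_squared_error(size_true, size_masked, impute_mean(size_masked))
diet_pfc = proportion_falsely_classified(diet_true, diet_masked, impute_mode(diet_masked))
```

Scoring only the removed entries keeps the comparison fair across methods, since the values that were never removed are identical in every imputed dataset.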