Zeros (i.e. events that do not happen) are the source of two common phenomena in count data: overdispersion and zero‐inflation. Zeros have multiple origins in a dataset: false zeros occur due to errors in the experimental design or by the observer; structural zeros are related to the ecological or evolutionary restrictions of the system under study; and random zeros are the result of sampling variability. Identifying the type of zeros and their relation to overdispersion and/or zero inflation is key to selecting the most appropriate statistical model. Here we review the different modelling options in relation to the presence of overdispersion and zero inflation, tested through the dispersion and zero inflation indices. We then examine the theory of zero‐inflated (ZI) models and the use of score tests to assess overdispersion and zero inflation in a fitted model. In order to choose an adequate model when analysing count data we suggest the following protocol: Step 1) classify the zeros and minimize the presence of false zeros; Step 2) identify suitable covariates; Step 3) test the data for overdispersion and zero‐inflation; and Step 4) choose the most adequate model based on the results of Step 3 and use score tests to determine whether more complex models should be implemented. We applied the recommended protocol to a real dataset on plant–herbivore interactions to evaluate the suitability of six different models (Poisson, NB and their zero‐inflated versions, ZIP and ZINB). Our data were overdispersed and zero‐inflated, and the ZINB was the model with the best fit, as predicted. Ignoring overdispersion and/or zero inflation during data analyses caused biased estimates of the statistical parameters and serious errors in the interpretation of the results. Our results are a clear example of how the conclusions drawn about an ecological hypothesis can change depending on the model applied.
Understanding how zeros arise in count data, for example by identifying the potential sources of structural zeros, is essential to selecting the best statistical design. A good model not only fits the data correctly but also takes into account the idiosyncrasies of the biological system.
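Step 3 of the protocol above relies on the dispersion and zero inflation indices. The following is a minimal Python sketch, not the authors' code. It assumes the common definitions d = variance/mean and zi = 1 + log(p0)/mean, where p0 is the observed proportion of zeros; both indices take the reference values 1 and 0, respectively, under a Poisson distribution. The function names and the example counts are hypothetical.

```python
import math
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (equals 1 for a Poisson)."""
    return variance(counts) / mean(counts)


def zero_inflation_index(counts):
    """1 + log(p0)/mean, with p0 the observed proportion of zeros.

    Assumed definition: the index is 0 for a Poisson, because then
    p0 = exp(-mean). Positive values suggest an excess of zeros.
    """
    p0 = sum(1 for c in counts if c == 0) / len(counts)
    if p0 == 0:
        return float("-inf")  # no zeros observed: log(0) is undefined
    return 1 + math.log(p0) / mean(counts)


# Hypothetical counts with many zeros and a few large values.
example_counts = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 8, 12]
d = dispersion_index(example_counts)
zi = zero_inflation_index(example_counts)
```

Values of d well above 1 together with zi well above 0 would point towards the zero‐inflated and negative binomial candidates in Step 4, but the indices are descriptive and do not replace the score tests.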
In this work, we deal with correlated under-reported data through INAR(1)-hidden Markov chain models. These models are very flexible and can be identified through their autocorrelation function, which has a very simple form. A naïve method of parameter estimation is proposed alongside the maximum likelihood method, which is based on a revised version of the forward algorithm. The most probable unobserved time series is reconstructed by means of the Viterbi algorithm. Several examples of application in the field of public health are discussed, illustrating the utility of the models. Copyright © 2016 John Wiley & Sons, Ltd.
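To make the idea of an under-reported INAR(1) series concrete, here is a minimal simulation sketch in Python. It is an illustration under stated assumptions, not the estimation method of the abstract: the hidden series follows X_t = alpha∘X_{t-1} + W_t with binomial thinning and Poisson(lam) innovations, and at each time point the observation is under-reported with probability omega, in which case only a binomially thinned fraction q of the true count is seen. The parameter names alpha, lam, omega and q and all helper functions are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Poisson variate via Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def thin(rng, n, prob):
    """Binomial thinning: each of n units survives with probability prob."""
    return sum(1 for _ in range(n) if rng.random() < prob)


def simulate_underreported_inar1(n_steps, alpha, lam, omega, q, seed=0):
    """Return (hidden, observed) series under the assumed model."""
    rng = random.Random(seed)
    # Start from the Poisson stationary law, which has mean lam / (1 - alpha).
    x = poisson_draw(rng, lam / (1 - alpha))
    hidden, observed = [], []
    for _ in range(n_steps):
        x = thin(rng, x, alpha) + poisson_draw(rng, lam)
        # With probability omega the count is under-reported at this time.
        y = thin(rng, x, q) if rng.random() < omega else x
        hidden.append(x)
        observed.append(y)
    return hidden, observed


hidden, observed = simulate_underreported_inar1(
    n_steps=200, alpha=0.5, lam=3.0, omega=0.4, q=0.3, seed=1
)
```

In a hidden Markov formulation the series `hidden` plays the role of the latent chain and `observed` that of the emissions, which is the setting in which forward and Viterbi recursions apply.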
Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user. Results: Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that this test yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that it accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies.
Finally, we compared the results with those obtained from microarrays in order to check for reproducibility. Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org.
We present a model based on second-order integer-valued autoregressive time series to analyze the number of hospital emergency service arrivals caused by diseases that present seasonal behavior. We also introduce a method to describe this seasonality, on the basis of Poisson innovations with monthly means. We describe parameter estimation by maximum likelihood and model validation, and present several methods for forecasting, on the basis of long-time means and short-time and long-time prediction regions. We analyze an application to model the number of hospital admissions per week caused by influenza.
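As an illustration of forecasting with such a model, the sketch below iterates the conditional mean of a second-order INAR process in Python. It assumes independent binomial thinnings of the two previous counts, so that E[X_t | X_{t-1}, X_{t-2}] = alpha1·X_{t-1} + alpha2·X_{t-2} + lambda_t, with lambda_t the innovation mean of the month containing time t. This is only a point forecast of the mean, not the prediction regions mentioned in the abstract, and the function and parameter names are hypothetical.

```python
def forecast_means(x_last, x_before_last, alpha1, alpha2, innovation_means):
    """Iterated conditional means for an assumed INAR(2) model.

    innovation_means holds one Poisson innovation mean per future step,
    for example the monthly mean of the month each future week falls in.
    """
    prev1, prev2 = float(x_last), float(x_before_last)
    forecasts = []
    for lam in innovation_means:
        m = alpha1 * prev1 + alpha2 * prev2 + lam
        forecasts.append(m)
        # Shift the window: the new forecast becomes the most recent value.
        prev1, prev2 = m, prev1
    return forecasts


# Hypothetical values: last two weekly counts 10 and 8, constant monthly mean 5.
two_week_forecast = forecast_means(10, 8, alpha1=0.3, alpha2=0.2,
                                   innovation_means=[5.0, 5.0])
```

Because the recursion is linear, forecasts far ahead settle towards lambda / (1 - alpha1 - alpha2) when the innovation mean is held constant, which corresponds to a long-time mean.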
Oliveira, María and Einbeck, Jochen and Higueras, Manuel and Ainsbury, Elizabeth and Puig, Pedro and Rothkamm, Kai (2016) 'Zero-inflated regression models for radiation-induced chromosome aberration data: A comparative study.', Biometrical journal, 58 (2). pp. 259-279. http://dx.doi.org/10.1002/bimj.201400233
Within the field of cytogenetic biodosimetry, Poisson regression is the classical approach for modelling the number of chromosome aberrations as a function of radiation dose. However, it is common to find data that exhibit overdispersion. In practice, the assumption of equidispersion may be violated due to unobserved heterogeneity in the cell population, which will render the variance of observed aberration counts larger than their mean, and/or the frequency of zero counts greater than expected for the Poisson distribution.
This phenomenon is observable for both full and partial body exposure, but more pronounced for the latter. In this work, different methodologies for analysing cytogenetic chromosomal aberrations datasets are compared, with special focus on zero-inflated Poisson and zero-inflated negative binomial models. A score test for testing for zero-inflation in Poisson regression models under the identity link is also developed.
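For orientation, a score test for zero-inflation can be written down in closed form in the simplest case. The Python sketch below covers only an intercept-only Poisson model (no dose covariate, so the choice of link plays no role) and uses the classical statistic S = (n0 - n·p0)² / (n·p0·(1 - p0) - n·ȳ·p0²), with p0 = exp(-ȳ) and n0 the number of observed zeros, referred to a chi-squared distribution with one degree of freedom. It is not the regression test under the identity link developed in the paper, and the function name and example counts are hypothetical.

```python
import math


def zero_inflation_score_test(counts):
    """Score test of Poisson against zero-inflated Poisson, no covariates.

    Returns (statistic, p_value); the p-value uses the chi-squared
    distribution with 1 degree of freedom.
    """
    n = len(counts)
    y_bar = sum(counts) / n
    if y_bar == 0:
        raise ValueError("all counts are zero; the test is undefined")
    p0 = math.exp(-y_bar)                      # Poisson probability of a zero
    n0 = sum(1 for c in counts if c == 0)      # observed number of zeros
    numerator = (n0 - n * p0) ** 2
    denominator = n * p0 * (1 - p0) - n * y_bar * p0 ** 2
    statistic = numerator / denominator
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical aberration counts per cell with a clear excess of zeros.
stat, p_value = zero_inflation_score_test([0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 8, 12])
```

The statistic only needs the fit under the null Poisson model, which is the usual appeal of score tests when the zero-inflated alternative is more costly to fit.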