Repeated double cross validation (rdCV) is a strategy for (a) optimizing the complexity of regression models and (b) realistically estimating prediction errors when the model is applied to new cases (drawn from the same population as the data used). This strategy is suited for small data sets and is complementary to bootstrap methods. rdCV is a formal, partly new combination of known procedures and methods, and has been implemented as a function for the programming environment R, providing several types of plots for model evaluation. The current version of the software is dedicated to regression models obtained by partial least-squares (PLS). The methods applied for repeated splits of the data into test sets and calibration sets, as well as for estimating the optimum number of PLS components, are described. The relevance of some parameters (number of segments in CV, number of repetitions) is investigated. rdCV is applied to two data sets from chemistry: (1) determination of glucose concentrations from near-infrared (NIR) data in mash samples from bioethanol production; (2) modeling the gas chromatographic retention indices of polycyclic aromatic compounds from molecular descriptors. Models using all original variables and models using a small subset of the variables, selected by a genetic algorithm (GA), are compared by rdCV.
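The nested structure of rdCV described above (repeated random splits into test and calibration sets, with an inner CV on the calibration set to choose model complexity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the R implementation the abstract refers to: a one-dimensional k-nearest-neighbour regressor stands in for PLS, the number of neighbours `k` plays the role of the number of PLS components, and the optimum is taken as the plain minimum of the inner CV error, which may differ from the selection rule used in the paper. The function names, default parameters, and toy data are all hypothetical.

```python
import random
import statistics


def split_segments(indices, n_segments, rng):
    """Shuffle indices and deal them into n_segments roughly equal parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::n_segments] for i in range(n_segments)]


def knn_predict(train, x, k):
    """Stand-in model (assumption, not PLS): mean y of the k nearest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)


def mse(train, test, k):
    """Mean squared prediction error of the stand-in model on a test list."""
    return statistics.fmean((y - knn_predict(train, x, k)) ** 2 for x, y in test)


def rdcv(data, complexities, n_rep=5, n_outer=4, n_inner=5, seed=0):
    """Repeated double CV: returns (spread of test residuals, chosen complexities)."""
    rng = random.Random(seed)
    all_idx = range(len(data))
    residuals = []  # test-set residuals pooled over all repetitions
    chosen = []     # complexity selected for each outer test segment
    for _ in range(n_rep):
        # Outer loop: each segment serves once as the test set.
        for test_idx in split_segments(all_idx, n_outer, rng):
            test_set = set(test_idx)
            calib_idx = [i for i in all_idx if i not in test_set]
            inner = split_segments(calib_idx, n_inner, rng)
            # Inner loop: CV on the calibration set only, per complexity.
            cv_err = {}
            for k in complexities:
                errs = []
                for val_idx in inner:
                    val_set = set(val_idx)
                    train = [data[i] for i in calib_idx if i not in val_set]
                    val = [data[i] for i in val_idx]
                    errs.append(mse(train, val, k))
                cv_err[k] = statistics.fmean(errs)
            # Simplification: pick the complexity with minimum inner CV error.
            best = min(complexities, key=lambda k: cv_err[k])
            chosen.append(best)
            # Refit on the whole calibration set and predict the test set.
            calib = [data[i] for i in calib_idx]
            for i in test_idx:
                x, y = data[i]
                residuals.append(y - knn_predict(calib, x, best))
    return statistics.pstdev(residuals), chosen


# Toy data (hypothetical): noisy quadratic, 40 samples.
noise = random.Random(1)
data = [(x / 10, (x / 10) ** 2 + noise.gauss(0, 0.2)) for x in range(40)]
sep, chosen = rdcv(data, complexities=[1, 3, 5, 9])
print("spread of test-set residuals:", round(sep, 3))
print("selected complexities:", chosen)
```

Because the test objects never take part in choosing the complexity, the pooled residuals give an error estimate for new cases, and the list of selected complexities shows how stable that choice is across the repeated splits.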
Cometary ices are rich in CO₂, CO and organic volatile compounds, but the carbon content of cometary dust was only measured for the Oort Cloud comet 1P/Halley, during its flyby in 1986. The COmetary Secondary Ion Mass Analyzer (COSIMA)/Rosetta mass spectrometer analysed dust particles with sizes ranging from 50 to 1000 μm, collected over 2 yr, from 67P/Churyumov-Gerasimenko (67P), a Jupiter family comet. Here, we report 67P dust composition focusing on the elements C and O. It has a high carbon content (atomic C/Si = 5.5 (+1.4/−1.2) on average) close to the solar value and comparable to the 1P/Halley data. From COSIMA measurements, we conclude that 67P particles are made of nearly 50 per cent organic matter in mass, mixed with mineral phases that are mostly anhydrous. The whole composition, rich in carbon and non-hydrated minerals, points to a primitive matter that likely preserved its initial characteristics since the comet accretion in the outer regions of the protoplanetary disc.
This paper presents an analysis of entropy-based molecular descriptors. Specifically, we use real chemical structures, as well as synthetic isomeric structures, and statistically analyze the properties of individual descriptors, and the relationships among them, with respect to the data set used. Our numerical results provide evidence that synthetic chemical structures are notably different from real chemical structures and, hence, should not be used to investigate molecular descriptors. Instead, an analysis based on real chemical structures is preferable. Further, we find strong hints that molecular descriptors can be partitioned into distinct classes capturing complementary information.