Abstract-We analyze the colon cancer data available from the SEER program with the aim of developing accurate survival prediction models for colon cancer. Carefully designed preprocessing steps resulted in removal of several attributes and applying several supervised classification methods. We also adopt synthetic minority over-sampling technique (SMOTE) to balance the survival and non-survival classes we have. In our experiments, ensemble voting of the three of the top performing classifiers was found to result in the best prediction performance in terms of prediction accuracy and area under the ROC curve. We evaluated multiple classification schemes to estimate the risk of mortality after 1 year, 2 years and 5 years of diagnosis, on a subset of 65 attributes after the data clean up process, 13 attribute carefully selected using attribute selection techniques, and SMOTE balanced set of the same 13 attributes, while trying to retain the predictive power of the original set of attributes. Moreover, we demonstrate the importance of balancing the classes of the data set to yield better results.
Organic Solar Cells are a promising technology for solving the clean energy crisis in the world. However, generating candidate chemical compounds for solar cells is a time-consuming process requiring thousands of hours of laboratory analysis. For a solar cell, the most important property is the power conversion efficiency which is dependent on the highest occupied molecular orbitals (HOMO) values of the donor molecules. Recently, machine learning techniques have proved to be very useful in building predictive models for HOMO values of donor structures of Organic Photovoltaic Cells (OPVs). Since experimental datasets are limited in size, current machine learning models are trained on data derived from calculations based on density functional theory (DFT). Molecular line notations such as SMILES or InChI are popular input representations for describing the molecular structure of donor molecules. The two types of line representations encode different information, such as SMILES defines the bond types while InChi defines protonation. In this work, we present an ensemble deep neural network architecture, called SINet, which harnesses both the SMILES and InChI molecular representations to predict HOMO values and leverage the potential of transfer learning from a sizeable DFT-computed dataset-Harvard CEP to build more robust predictive models for relatively smaller HOPV datasets. Harvard CEP dataset contains molecular structures and properties for 2.3 million candidate donor structures for OPV while HOPV contains DFT-computed and experimental values of 350 and 243 molecules respectively. Our results demonstrate significant performance improvement from the use of transfer learning and leveraging both molecular representations.
We present a deep learning approach to the indexing of electron backscatter diffraction (EBSD) patterns. We design and implement a deep convolutional neural network architecture to predict crystal orientation from the EBSD patterns. We design a differentiable approximation to the disorientation function between the predicted crystal orientation and the ground truth; the deep learning model optimizes for the mean disorientation error between the predicted crystal orientation and the ground truth using stochastic gradient descent. The deep learning model is trained using 374,852 EBSD patterns of polycrystalline nickel from simulation and evaluated using 1,000 experimental EBSD patterns of polycrystalline nickel. The deep learning model results in a mean disorientation error of 0.548° compared to 0.652° using dictionary based indexing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.