BackgroundConvolutional Neural Networks can be effectively used only when data are endowed with an intrinsic concept of neighbourhood in the input space, as is the case of pixels in images. We introduce here Ph-CNN, a novel deep learning architecture for the classification of metagenomics data based on the Convolutional Neural Networks, with the patristic distance defined on the phylogenetic tree being used as the proximity measure. The patristic distance between variables is used together with a sparsified version of MultiDimensional Scaling to embed the phylogenetic tree in a Euclidean space.ResultsPh-CNN is tested with a domain adaptation approach on synthetic data and on a metagenomics collection of gut microbiota of 38 healthy subjects and 222 Inflammatory Bowel Disease patients, divided in 6 subclasses. Classification performance is promising when compared to classical algorithms like Support Vector Machines and Random Forest and a baseline fully connected neural network, e.g. the Multi-Layer Perceptron.ConclusionPh-CNN represents a novel deep learning approach for the classification of metagenomics data. Operatively, the algorithm has been implemented as a custom Keras layer taking care of passing to the following convolutional layer not only the data but also the ranked list of neighbourhood of each sample, thus mimicking the case of image data, transparently to the user.
Developers have a lot of freedom in writing comments as well as in choosing identifiers and method names. These are intentional in nature and provide a different relevance of information to understand what a software system implements, and in particular the role of each source file.In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular we explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information, namely: class, attribute, method and parameter names, comments, and source code statements. Their relevance has been weighted by means of a probabilistic model, whose parameters have been estimated by the Expectation-Maximization algorithm. To group source files accordingly we used a hierarchical clustering algorithm. The investigation has been conducted on a dataset of 13 open source Java software systems.
Artificial Intelligence is exponentially increasing its impact on healthcare. As deep learning is mastering computer vision tasks, its application to digital pathology is natural, with the promise of aiding in routine reporting and standardizing results across trials. Deep learning features inferred from digital pathology scans can improve validity and robustness of current clinico-pathological features, up to identifying novel histological patterns, e.g ., from tumor infiltrating lymphocytes. In this study, we examine the issue of evaluating accuracy of predictive models from deep learning features in digital pathology, as an hallmark of reproducibility. We introduce the DAPPER framework for validation based on a rigorous Data Analysis Plan derived from the FDA’s MAQC project, designed to analyze causes of variability in predictive biomarkers. We apply the framework on models that identify tissue of origin on 787 Whole Slide Images from the Genotype-Tissue Expression (GTEx) project. We test three different deep learning architectures (VGG, ResNet, Inception) as feature extractors and three classifiers (a fully connected multilayer, Support Vector Machine and Random Forests) and work with four datasets (5, 10, 20 or 30 classes), for a total of 53, 000 tiles at 512 × 512 resolution. We analyze accuracy and feature stability of the machine learning classifiers, also demonstrating the need for diagnostic tests ( e.g ., random labels) to identify selection bias and risks for reproducibility. Further, we use the deep features from the VGG model from GTEx on the KIMIA24 dataset for identification of slide of origin (24 classes) to train a classifier on 1, 060 annotated tiles and validated on 265 unseen ones. The DAPPER software, including its deep learning pipeline and the Histological Imaging—Newsy Tiles (HINT) benchmark dataset derived from GTEx, is released as a basis for standardization and validation initiatives in AI for digital pathology.
In this paper, we present the results of a manual assessment on the coherence\ud between the comments and the implementation of 3636 methods in three open source soft-\ud ware applications (for one of these applications, we considered two different subsequent\ud versions) implemented in Java. The results of this assessment have been collected in a\ud dataset we made publicly available on the Web. The creation of this dataset is based on a\ud protocol that is detailed in this paper. We present that protocol to let researchers evaluate\ud the goodness of our dataset and to ease its future possible extensions. Another contribution\ud of this paper consists in preliminarily investigating on the effectiveness of adopting a Vec-\ud tor Space Model (VSM) with the tf-idf schema to discriminate coherent and non-coherent\ud methods. We observed that the lexical similarity alone is not sufficient for this distinc-\ud tion, while encouraging results have been obtained by applying an Support Vector Machine\ud (SVM) classifier on the whole vector space
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.