In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
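As a rough illustration of the kind of repository metadata discussed above, the following sketch summarizes one repository record and derives the "modified after publication" flag. It is not the authors' pipeline: it assumes a dictionary shaped like the repository object returned by the GitHub REST API (fields `full_name`, `language`, `stargazers_count`, `forks_count`, `pushed_at`), it uses `pushed_at` as a proxy for the last code change, and the helper names and sample values are hypothetical.

```python
from datetime import date, datetime, timezone


def parse_github_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp in the form used by GitHub, e.g. '2018-05-14T12:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def modified_after_publication(repo: dict, published: date) -> bool:
    """Return True if the last push happened after the article's publication date.

    Assumption: 'pushed_at' is used as a stand-in for "code was modified";
    the paper's own definition may differ.
    """
    last_push = parse_github_timestamp(repo["pushed_at"]).date()
    return last_push > published


def summarize_repo(repo: dict, published: date) -> dict:
    """Collect a few metadata fields alongside the post-publication flag."""
    return {
        "full_name": repo["full_name"],
        "language": repo.get("language"),
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
        "modified_after_publication": modified_after_publication(repo, published),
    }


# Hypothetical record; the values are made up for illustration.
sample_repo = {
    "full_name": "example-lab/example-tool",
    "language": "Python",
    "stargazers_count": 12,
    "forks_count": 3,
    "pushed_at": "2019-03-01T08:30:00Z",
}

summary = summarize_repo(sample_repo, published=date(2018, 5, 14))
print(summary)
```

In practice the record would come from a request to the GitHub API rather than a literal dictionary; the literal keeps the sketch self-contained.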
bioRxiv preprint first posted online May 14, 2018; doi: http://dx.doi.org/10.1101/321919. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
We present, to our knowledge, the first large-scale analysis of bioinformatics source code. The purpose of our work is to contribute data to the growing conversation in the bioinformatics community around reproducibility, code quality, and software usability. We analyze a large collection of bioinformatics software projects, identifying relationships between code properties, development activity, developer communities, and software impact. Throughout the work, we compare the large set of projects to a small set of highly popular bioinformatics tools, highlighting features associated with hig...
Objective: Many tools have been developed to profile microRNA (miRNA) expression from small RNA-seq data. These tools must contend with several issues: the small size of miRNAs, the small number of unique miRNAs, the fact that similar miRNAs can be transcribed from multiple loci, and the presence of miRNA isoforms known as isomiRs. Methods failing to address these issues can return misleading information. We propose a novel quantification method designed to address these concerns.
Results: We present miR-MaGiC, a novel miRNA quantification method, implemented as a cross-platform tool in Java. miR-MaGiC performs stringent mapping to a core region of each miRNA and defines a meaningful set of target miRNA sequences by collapsing the miRNA space to “functional groups”. We hypothesize that these two features, mapping stringency and collapsing, provide more optimal quantification to a more meaningful unit (i.e., miRNA family). We test miR-MaGiC and several published methods on 210 small RNA-seq libraries, evaluating each method’s ability to accurately reflect global miRNA expression profiles. We define accuracy as total counts close to the total number of input reads originating from miRNAs. We find that miR-MaGiC, which incorporates both stringency and collapsing, provides the most accurate counts.
Electronic supplementary material: The online version of this article (10.1186/s13104-018-3418-2) contains supplementary material, which is available to authorized users.
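The two ideas named in the abstract, stringent mapping to a core region and collapsing to functional groups, can be sketched in a few lines. This is a toy illustration and not the miR-MaGiC implementation (which is a Java tool): the core region is assumed to be the first 16 bases, "stringent" is taken to mean an exact substring match, and the miRNA names, group labels, and reads are hypothetical.

```python
from collections import Counter

# Hypothetical miRNA records: name -> (functional group, mature sequence).
# Names and group labels are illustrative only.
MIRNAS = {
    "miR-a-1": ("group-a", "TGAGGTAGTAGGTTGTATAGTT"),
    "miR-a-2": ("group-a", "TGAGGTAGTAGGTTGTGTGGTT"),
    "miR-b": ("group-b", "TAGCTTATCAGACTGATGTTGA"),
}

# Assumed core region: first 16 bases (0-based, end-exclusive).
CORE = slice(0, 16)


def build_core_index(mirnas, core=CORE):
    """Map each distinct core sequence to the set of groups that share it."""
    index = {}
    for group, sequence in mirnas.values():
        index.setdefault(sequence[core], set()).add(group)
    return index


def quantify(reads, index):
    """Count reads per functional group.

    A read is counted only if it contains an exact copy of a core sequence
    and all matching cores belong to a single group; other reads are skipped.
    """
    counts = Counter()
    for read in reads:
        groups = set()
        for core_seq, core_groups in index.items():
            if core_seq in read:
                groups |= core_groups
        if len(groups) == 1:
            counts[next(iter(groups))] += 1
    return counts


reads = [
    "TGAGGTAGTAGGTTGTATAGTT",   # matches miR-a-1 exactly
    "TGAGGTAGTAGGTTGTGTGGTTA",  # miR-a-2 with an extra 3' base (isomiR-like)
    "TAGCTTATCAGACTGATGTTGA",   # matches miR-b exactly
    "TAGCTTATCAGACTGTTGTTGA",   # mismatch inside the core: not counted
    "ACGTACGTACGT",             # unrelated read: not counted
]

index = build_core_index(MIRNAS)
counts = quantify(reads, index)
print(dict(counts))
```

Because both group-a members share the same core, their reads collapse into one count, and the read with a 3' extension is still counted since the core itself matches exactly.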
The Cancer Imaging Archive (TCIA) hosts publicly available deidentified medical images of cancer from over 25 body sites and over 30,000 patients. Over 400 published studies have utilized freely available TCIA images. Images and metadata are available for download through a web interface or a REST API. Here, we present TCIApathfinder, an R client for the TCIA REST API. TCIApathfinder wraps API access in user-friendly R functions that can be called interactively within an R session or easily incorporated into scripts. Functions are provided to explore the contents of the large database and to download image files. TCIApathfinder provides easy access to TCIA resources in the highly popular R programming environment. TCIApathfinder is freely available under the MIT license as a package on CRAN (https://cran.r-project.org/web/packages/TCIApathfinder/index.html) and from https://github.com/pamelarussell/TCIApathfinder. We present a new tool, TCIApathfinder, the first client for The Cancer Imaging Archive (TCIA) for use in the highly popular R computing environment, which will dramatically lower the barrier of access to the valuable tools in TCIA.
Motivation: Bioinformaticians frequently navigate among a diverse set of coordinate systems: for example, converting between genomic, transcript, and protein coordinates. The abstraction of coordinate systems and feature arithmetic allows genomic workflows to be expressed more elegantly and succinctly. However, no publicly available software library offers fully featured interoperable support for multiple coordinate systems. As such, bioinformatics programmers must either implement custom solutions, or make do with existing utilities, which may lack the full functionality they require.
Results: We present BioCantor, a Python library that provides integrated library support for arbitrarily related coordinate systems and rich operations on genomic features, with I/O support for a variety of file formats.
Availability and implementation: BioCantor is implemented as a Python 3 library with a minimal set of external dependencies. The library is freely available under the MIT license at https://github.com/InscriptaLabs/BioCantor and on the Python Package Index at https://pypi.org/project/BioCantor/. BioCantor has extensive documentation and vignettes available on ReadTheDocs at https://biocantor.readthedocs.io/en/latest/.
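To make the coordinate-conversion problem concrete, here is a minimal sketch of mapping between transcript-relative and genomic positions for a spliced transcript. It deliberately does not use BioCantor's API; the `Transcript` class and its methods are hypothetical, and coordinates are assumed to be 0-based with end-exclusive exon intervals.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transcript:
    """Toy transcript: exons as 0-based, end-exclusive genomic intervals."""

    exons: Tuple[Tuple[int, int], ...]  # sorted by genomic start, non-overlapping
    strand: str  # "+" or "-"

    def __post_init__(self):
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")

    def _ordered_exons(self):
        # Exons in transcription order: reversed on the minus strand.
        return self.exons if self.strand == "+" else tuple(reversed(self.exons))

    def length(self) -> int:
        return sum(end - start for start, end in self.exons)

    def transcript_to_genomic(self, pos: int) -> int:
        """Convert a 0-based transcript position to a genomic position."""
        if not 0 <= pos < self.length():
            raise ValueError("position outside transcript")
        offset = pos
        for start, end in self._ordered_exons():
            size = end - start
            if offset < size:
                return start + offset if self.strand == "+" else end - 1 - offset
            offset -= size
        raise AssertionError("unreachable")

    def genomic_to_transcript(self, pos: int) -> int:
        """Convert a genomic position to a 0-based transcript position."""
        consumed = 0
        for start, end in self._ordered_exons():
            if start <= pos < end:
                within = pos - start if self.strand == "+" else end - 1 - pos
                return consumed + within
            consumed += end - start
        raise ValueError("position is intronic or outside the transcript")


plus = Transcript(exons=((10, 20), (30, 35)), strand="+")
minus = Transcript(exons=((10, 20), (30, 35)), strand="-")

print(plus.transcript_to_genomic(10))   # first base of the second exon
print(minus.transcript_to_genomic(0))   # last genomic base of the last exon
```

Even this small case needs care with strand and exon boundaries, which is the kind of bookkeeping a dedicated coordinate-system library is meant to take over.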