BackgroundComputational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge, which is not fully addressed. This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches. Provenance information should be tracked and used to capture all these requirements supporting reusability of existing workflows.ResultsWe have implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements resulting in failed execution of the workflow. This study proposes a set of recommendations that aims to mitigate these assumptions and guides the scientific community to accomplish reproducible science, hence addressing reproducibility crisis.ConclusionsReproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolving analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that along with an explicit declaration of workflow specification would result in enhanced reproducibility of computational genomic analyses.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-017-1747-0) contains supplementary material, which is available to authorized users.
Computational bioinformatics workflows are extensively used to analyse genomics data. With the unprecedented advancements in genomic sequence technology and opportunities for personalized medicines, it is essential that analysis results are repeatable by others, especially when moving into clinical environment. To cope with the complex computational demands of huge biological datasets, a shift to distributed compute resources is unavoidable. A case study was conducted in which three wellestablished bioinformatics analysis groups across Australia were assigned to analyse exome sequence data from a range of patients with a rare condition: disorder of sex development. Initially these groups used their own in-house data processing pipelines, and subsequently used a common bioinformatics workbench based upon Galaxy and offered through the Australia-wide National eResearch Collaboration Tools and Resources (NeCTAR) Research Cloud. This paper describes the experiences in this work and the variability of results. We put forward principles that should be used to ensure reproducibility of scientific results moving forward.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.