Background
We present a software tool, the Container Profiler, that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of a containerized job by collecting Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler can produce utilization snapshots at multiple time points, allowing continuous monitoring of the resources consumed by a container workflow.

Results
To investigate the utility of the Container Profiler, we profiled the resource utilization of a multi-stage bioinformatics analytical workflow (RNA sequencing using unique molecular identifiers). We examined the collected profile metrics and confirmed that they were consistent with the expected CPU, disk, and network utilization patterns for the different stages of the workflow. We also quantified the profiling overhead and found it to be negligible.

Conclusions
The Container Profiler is a useful tool for continuously monitoring the resource consumption of long and complex containerized workflows that run locally or on the cloud. Such monitoring can identify bottlenecks where more resources are needed to improve performance.
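The abstract describes snapshot-based collection of Linux operating system metrics at multiple time points. The Container Profiler's actual implementation is not reproduced here; the following is a minimal sketch of the general snapshot technique, assuming a Linux host where /proc/stat and /proc/meminfo are readable. All function names are illustrative.

```python
import json
import time

def read_vm_cpu():
    """Aggregate CPU jiffy counters from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
    return dict(zip(names, (int(v) for v in fields[:7])))

def read_vm_memory():
    """Parse selected memory counters (in kB) from /proc/meminfo."""
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            mem[key] = int(value.split()[0])
    return {k: mem[k] for k in ("MemTotal", "MemFree", "MemAvailable")}

def snapshot():
    """One time-stamped utilization snapshot, in the spirit of the profiler."""
    return {
        "timestamp": time.time(),
        "cpu_jiffies": read_vm_cpu(),
        "memory_kB": read_vm_memory(),
    }

if __name__ == "__main__":
    # Take a snapshot every second; utilization over an interval is derived
    # from the deltas between successive snapshots.
    for _ in range(3):
        print(json.dumps(snapshot()))
        time.sleep(1)
```

For example, CPU utilization between two snapshots can be estimated as the change in busy jiffies (everything except idle and iowait) divided by the change in total jiffies; container- and process-level metrics would come from cgroup files and /proc/[pid]/ entries in the same fashion.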
Large-scale data resources such as the NCI's Cancer Research Data Commons (CRDC) and the Genotype-Tissue Expression (GTEx) portal have the potential to simplify the analysis of cancer data by providing data that can be used as standards or controls. However, comparisons with data processed using different methodologies, or even different versions of software, parameters, and supporting datasets, can lead to artefactual results. Reproducing exact workflows from text-based standard operating procedures (SOPs) is problematic because the documentation can be incomplete or out of date, especially for complex workflows involving many executables and scripts. We extend the Biodepot-workflow-builder (Bwb) platform to distribute the computational methodology with integrated data access to the National Cancer Institute (NCI) Genomic Data Commons (GDC). We have converted the GDC DNA sequencing (DNA-Seq) and mRNA-Seq SOPs into reproducible, self-installing, containerized graphical workflows that users can apply to their own datasets. Secure access to CRDC data is provided using the Data Commons Framework Services (DCFS) Gen3 protocol. Users can perform the analysis on their laptop or desktop, or use their preferred cloud provider to access the computational and network resources available on the cloud. We demonstrate the impact of non-uniform analysis of control and treatment data on the inference of differentially expressed genes. Most importantly, we provide a dynamic and practical solution for uniform and reproducible reprocessing of omics data, allowing cancer researchers to take full advantage of multiple data resources such as the CRDC and GTEx.
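The Gen3-based data access step lends itself to a brief illustration. The following is a minimal sketch, assuming the open-source gen3 Python SDK (Gen3Auth, Gen3File); the endpoint, credentials path, file GUID, and output filename are placeholders rather than values from the paper, and Bwb's own data-access widgets are not shown.

```python
import requests
from gen3.auth import Gen3Auth
from gen3.file import Gen3File

# Placeholder endpoint and credentials; replace with your data commons URL
# and the API key file downloaded from its portal.
ENDPOINT = "https://nci-crdc.datacommons.io"
CREDENTIALS = "credentials.json"
FILE_GUID = "dg.4DFC/00000000-0000-0000-0000-000000000000"  # placeholder GUID

# Exchange the API key for a short-lived access token.
auth = Gen3Auth(ENDPOINT, refresh_file=CREDENTIALS)
files = Gen3File(auth_provider=auth)

# Request a presigned URL for the object, then stream the file to disk.
presigned = files.get_presigned_url(FILE_GUID)
response = requests.get(presigned["url"], stream=True)
response.raise_for_status()
with open("sample.bam", "wb") as out:
    for chunk in response.iter_content(chunk_size=1 << 20):
        out.write(chunk)
```

The presigned-URL pattern keeps credentials out of the workflow containers themselves: only a time-limited download link is handed to the step that fetches the data.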