Electronic access to multiple data types, from generic information on biological systems at different functional and cellular levels to high-throughput molecular data from human patients, is a prerequisite of successful systems medicine research. However, scientists often encounter technical and conceptual difficulties that hamper the efficient and effective use of these resources. We summarize and discuss some of these obstacles, and suggest ways to avoid or evade them.

The methodological gap between data capture and data analysis is huge in human medical research. Primary data producers often do not fully apprehend the scientific value of their data, whereas data analysts may be ignorant of the circumstances under which the data were collected. Therefore, the provision of easy-to-use data access tools not only helps to improve data quality on the part of the data producers but is also likely to foster an informed dialogue with the data analysts.

We propose a means to integrate phenotypic data, questionnaire data and microbiome data with a user-friendly Systems Medicine toolbox embedded into i2b2/tranSMART. Our approach is exemplified by the integration of a basic outlier detection tool and a more advanced microbiome analysis (alpha diversity) script. Continuous discussion with clinicians, data managers, biostatisticians and systems medicine experts should serve to further enrich the functionality of toolboxes like ours, which are geared towards 'informed non-experts' but at the same time attuned to existing, more sophisticated analysis tools.
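The abstract does not spell out the two example tools. As a minimal sketch of what they typically compute, assuming Tukey-style fences for outlier detection and the Shannon index for alpha diversity (the actual toolbox scripts may differ):

import numpy as np

def shannon_alpha_diversity(counts):
    """Shannon index H' = -sum(p_i * ln p_i) for one sample's taxon counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]          # zero-count taxa contribute nothing
    p = counts / counts.sum()            # relative abundances
    return -np.sum(p * np.log(p))

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative inputs: one sample with four taxa, one phenotype with an outlier
print(shannon_alpha_diversity([120, 40, 15, 3]))   # ~0.88 nats
print(iqr_outliers([5.1, 5.3, 4.9, 5.0, 12.7]))    # only the last value is flagged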
i2b2's promise is to enable clinical researchers to devise and test new hypotheses even without deep knowledge of statistical programming. The approach presented here has been tested in a number of scenarios with millions of observations and tens of thousands of patients. Researchers who initially mostly observed were, after training, able to construct new analyses on their own. Early feedback indicates that timely and extensive access to their "own" data is appreciated most, but it also lowers the barrier to other tasks, for instance checking data quality and completeness (missing data, wrong coding).
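A minimal sketch of such a quality-and-completeness check on a tabular patient export, assuming a pandas DataFrame with illustrative column names and a hypothetical agreed code list (neither is specified in the abstract):

import pandas as pd

# Hypothetical observation export; column names are illustrative only
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "sex":        ["F", "M", "X", None],   # "X" stands in for a miscoded value
    "bmi":        [22.4, None, 27.1, 31.0],
})

# Completeness: fraction of missing values per column
print(df.isna().mean())

# Wrong coding: non-missing values outside the agreed code list
allowed_sex_codes = {"F", "M"}
print(df[~df["sex"].isin(allowed_sex_codes) & df["sex"].notna()])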
In this opinion paper we provide an overview of some challenges concerning data provenance in biomedical research. We review the current literature and describe some examples of existing implicit or explicit provenance aspects in standard data types in translational research. Furthermore, we assess the need for further standardization of data provenance in biomedical informatics. Basic data provenance should record the origin of the data and the transformation steps applied to them, and support the replication and presentation of the data. Even though usable concepts for the documentation of data provenance can be found in other fields as early as 2005, the penetration rate in biomedical projects and in the biomedical literature is quite low. Awareness of the necessity of basic data provenance has to be raised, and the education of data managers has to be further improved.
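As a minimal sketch of what such "basic data provenance" could look like in practice, recording origin and transformation steps to support replication (the field names are our own illustration, not a published standard):

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal provenance for one derived dataset: origin plus an ordered step log."""
    source: str                                  # where the raw data came from
    steps: list = field(default_factory=list)    # ordered transformation log

    def log_step(self, tool: str, action: str) -> None:
        self.steps.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool,
            "action": action,
        })

prov = ProvenanceRecord(source="register_A/export_2015.csv")
prov.log_step("cleaning_script v0.3", "dropped 12 rows with missing diagnosis date")
prov.log_step("mapping_script v0.3", "recoded sex {1: 'M', 2: 'F'}")
print(prov)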
Objectives
EUReMS (European Register for Multiple Sclerosis), a project including more than ten national and regional European MS registers, aims to enable analyses across European registers by joining existing, heterogeneous MS data in four different studies. Each participating register delivered productive data comprising information on socio-demography, disease course, medical exams or treatment. To ensure data quality, especially comparability and integrity, a data handling routine has been implemented using an open source ETL (extract, transform, load) tool ("Talend Open Studio") to process the large amounts of heterogeneous raw data. This approach is presented here.

Approach
As a first step in harmonizing the datasets of the different registers, a basic EUReMS data structure was defined for each of the four project studies, considering all information required to answer the research questions. Through the data handling process, the data exports are converted into the previously defined study data structure to facilitate comparability and data analysis across the various registers participating in one study. With regard to quality assurance, the data handling process was validated before data were provided for analyses.

Results
The data handling process consists of five steps: reading, splitting, cleaning, mapping and creating study datasets. During the first two steps, data are read and split into the variables that will be used within the study datasets. The heterogeneity of the data is also noticeable in the file types of the source data, ranging from CSV and Excel to Access databases. During the cleaning step, data are checked for incorrect or missing values, which are, as a way of ensuring traceability, saved in dedicated reject files. In the mapping step, register-specific variables are mapped to the defined EUReMS denotations. In this way, the heterogeneous data are harmonized, preventing misinterpretation of register-specific variables, which are often in the national language or use unfamiliar abbreviations. Finally, the data are merged into study datasets that are uniform in appearance for each study and are provided to the statistical department for analysis, in order to gain insight into disease-related questions.
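A minimal sketch of the cleaning and mapping steps described above, with an illustrative variable mapping and reject-file handling (the real mappings are register-specific and built in Talend Open Studio, not shown in the abstract):

import csv

# Illustrative mapping from one register's variable names to EUReMS denotations
VARIABLE_MAP = {"geschlecht": "sex", "geburtsjahr": "birth_year", "edss": "edss_score"}

def clean_and_map(rows, reject_path):
    """Yield mapped rows; save rows with missing required values to a reject file."""
    with open(reject_path, "w", newline="") as f:
        reject = csv.writer(f)
        for row in rows:
            if any(row.get(k) in (None, "") for k in VARIABLE_MAP):
                reject.writerow(row.values())   # keep rejects for traceability
                continue
            yield {VARIABLE_MAP[k]: v for k, v in row.items() if k in VARIABLE_MAP}

rows = [{"geschlecht": "F", "geburtsjahr": "1970", "edss": "3.5"},
        {"geschlecht": "", "geburtsjahr": "1982", "edss": "2.0"}]   # second row rejected
print(list(clean_and_map(rows, "rejects.csv")))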