Abstract. In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study.
Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight.
Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators.
Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication.
We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices.
Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.