Areal data are a common data type for storing information such as biodiversity inventories, socioeconomic censuses or cadastral surveys. Many research questions require that areal data are integrated from multiple heterogeneous sources. Inconsistent concepts, terms and definitions, or messy tables, make data wrangling an often tedious and error-prone process. A dedicated tool that assists in organising areal data is still lacking. Here, we introduce the R package arealDB, which helps to harmonise and integrate heterogeneous areal data and associated geometries into a consistent database. The package is used to collect metadata, harmonise language and variable names, reshape messy data into tidy data and integrate them into a standard data format. arealDB solves the specific problem of integrating disparate regional data sources on a given target variable, which may be published in different languages, with different table arrangements or in various data formats. We guide the user step by step through the individual functions needed to integrate two such datasets, using the example of the harvested area of soybean in Brazil and the USA. A database that has been built with arealDB is "tidy", and can thus be accessed easily with powerful and widespread tools such as the R meta-package tidyverse. Moreover, it is accompanied by provenance documentation that traces the full process of creation for each data point in the database. By offering easy-to-use tools for integrating areal data, arealDB promises substantial time savings for database collation efforts, as well as quality improvements for downstream scientific, monitoring and management applications.
Keywords: disorganised messy data • interoperability • data integration • relational database • census and survey data • metadata • provenance documentation • zonal data
Introduction
Areal data capture phenomena of interest at the level of finite spatial units. They are an essential data type in many basic and applied research fields, for example, to project human populations or to analyse the spread of infectious diseases based on census or survey data, or to map global biodiversity patterns based on species checklists. Areal data also play a crucial role in various policy and management applications, such as reporting national progress towards the Sustainable Development Goals (SDGs), assessing the implications of macroeconomic policies based on international trade statistics, or documenting land ownership. Through illustrative maps in news or education media, areal data are also an everyday communication tool in civil society.
Many critical applications of areal data exceed the spatial, temporal or thematic scope of any single data source, which makes it necessary to integrate heterogeneous sources into a single, more comprehensive database (Otto et al. [2015]). Efforts to harmonise and integrate areal data are carried out by numerous organisations, such as the Food and Agriculture Organization and the World Bank Group, and by many smaller projects.
However, integrating areal data from...