Income is an important economic indicator of living standards and individual well-being. In Germany, different data sources yield ambiguous evidence on the income distribution. The Tax Statistics (TS), an income register covering the total population of more than 40 million taxpayers in Germany for the year 2014, contains the most reliable income information across the full income distribution. However, it offers only a limited range of the socio-demographic variables essential for income analysis. We tackle this challenge by enriching the tax data with information on education and working time from the Microcensus, a representative 1 percent sample of the German population. We examine two types of data fusion methods well suited to the specific data fusion scenario of the TS and the Microcensus: missing-data methods and high-performing prediction models. We conduct a simulation study and provide an empirical application comparing the proposed data fusion methods. Our results indicate that Multinomial Regression and Random Forest are the most suitable methods for our data fusion scenario.
Poor coverage of top incomes in surveys, also referred to as the “missing rich” problem, leads to severe underestimation of income inequality. At the regional level this shortcoming is even more pronounced because regional sample sizes are small. Tax records contain more accurate income information at the top and cover all regions equally well. Top-income correction approaches tackle the missing rich problem by imputing top incomes from tax data into survey data. While existing methods focus on adjustments at the national level, our paper provides corrections of the regional income distributions in survey data by exploiting the tax data’s regional variability. We impute top incomes in the survey data from the German Microcensus based on region-specific Pareto and generalized Pareto distributions estimated from tax records. The combined survey and tax data provide new estimates of regional income inequality in Germany. Our findings indicate that inequality between and within the regions is much larger than previously understood, with the magnitude of the adjustment depending on the federal states’ level of inequality in the tail.
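The general idea of a region-specific top-income correction can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify thresholds, estimators, or the imputation rule, so everything below is an assumption made for illustration. The sketch fits a Pareto type I tail per region by maximum likelihood on (synthetic) tax incomes above a fixed threshold, then replaces survey incomes above that threshold with rank-preserving quantiles of the fitted regional tail. The function names, the threshold of 100,000, and the regions "A", "B", "C" are hypothetical, and the generalized Pareto case is not covered.

```python
import math
from collections import defaultdict


def fit_pareto_alpha(incomes, threshold):
    """Maximum-likelihood shape of a Pareto type I tail above `threshold`."""
    tail = [x for x in incomes if x >= threshold]
    log_sum = sum(math.log(x / threshold) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need tail incomes strictly above the threshold")
    return len(tail) / log_sum


def pareto_quantile(u, threshold, alpha):
    """Quantile function of Pareto type I for u in [0, 1)."""
    return threshold * (1.0 - u) ** (-1.0 / alpha)


def impute_top_incomes(survey, tax, threshold):
    """Replace survey incomes >= threshold with fitted regional tail quantiles.

    `survey` and `tax` are lists of (region, income) pairs. Survey records
    from regions without tax data are left unchanged.
    """
    tax_by_region = defaultdict(list)
    for region, income in tax:
        tax_by_region[region].append(income)
    alphas = {r: fit_pareto_alpha(v, threshold) for r, v in tax_by_region.items()}

    # Collect the positions of top-income survey records per region.
    top_idx = defaultdict(list)
    for i, (region, income) in enumerate(survey):
        if income >= threshold and region in alphas:
            top_idx[region].append(i)

    corrected = [income for _, income in survey]
    for region, idx in top_idx.items():
        idx.sort(key=lambda i: survey[i][1])  # keep the original income ranking
        n = len(idx)
        for rank, i in enumerate(idx):
            u = (rank + 0.5) / n  # midpoint plotting position
            corrected[i] = pareto_quantile(u, threshold, alphas[region])
    return corrected, alphas


# Synthetic illustration only: tax tails built from known (made-up) shapes.
THRESHOLD = 100_000.0
true_alpha = {"A": 1.5, "B": 2.5}
tax = [
    (region, pareto_quantile((k + 0.5) / 1000, THRESHOLD, alpha))
    for region, alpha in true_alpha.items()
    for k in range(1000)
]
survey = [
    ("A", 30_000.0),
    ("A", 120_000.0),
    ("A", 150_000.0),
    ("B", 110_000.0),
    ("C", 200_000.0),  # region without tax data: left as-is
]
corrected, alphas = impute_top_incomes(survey, tax, THRESHOLD)
```

A lower estimated shape parameter means a heavier tail, so in this sketch region "A" receives larger imputed top incomes than region "B" at the same rank. Using deterministic quantiles rather than random draws keeps the example reproducible; a real application would also need to handle survey weights and threshold selection.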