“…Different Skills and Insight Tests were used at the different time points (grades) and were linked through anchor items. Additionally, to control for order effects, four test versions were used in each grade, with different item orders and different subsets of the total item set (for more information, see the dataset documentation and Bakker et al.). A scaling procedure (mean-mean linking; see Kolen & Brennan) was used to place the scores from the different test versions and time points on a common scale, one for Skills and one for Insight, yielding weighted likelihood estimate (WLE) scores for each test (see Wu, Adams, Wilson & Haldane).…”
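
The excerpt does not spell out the linking computation, so the following is only a minimal sketch of mean-mean linking as it is commonly described for anchor-item designs, not the authors' actual procedure. It assumes that each anchor item has a discrimination (a) and a difficulty (b) estimate on both the new form and the reference form, and it finds constants A and B such that an ability on the new form maps to A * theta + B on the reference scale. The function names and all parameter values are hypothetical. Under a Rasch-type model, where all discriminations are equal, A reduces to 1 and the linking becomes a simple shift by the difference in mean anchor difficulty.

```python
from statistics import mean


def mean_mean_constants(a_new, b_new, a_ref, b_ref):
    """Return (A, B) such that theta_ref = A * theta_new + B.

    Inputs are the anchor-item discriminations (a) and difficulties (b)
    as estimated on the new form and on the reference form.
    """
    # Slope: ratio of mean discriminations (new form over reference form).
    A = mean(a_new) / mean(a_ref)
    # Intercept: aligns the mean anchor difficulties of the two forms.
    B = mean(b_ref) - A * mean(b_new)
    return A, B


def to_reference_scale(thetas, A, B):
    """Transform ability estimates from the new form to the reference scale."""
    return [A * t + B for t in thetas]


# Hypothetical anchor-item parameters (illustrative values, not from the study).
# Equal discriminations mimic a Rasch-type model, so A should come out as 1.
a_new = [1.0, 1.0, 1.0]
b_new = [-0.5, 0.0, 0.5]
a_ref = [1.0, 1.0, 1.0]
b_ref = [0.0, 0.5, 1.0]

A, B = mean_mean_constants(a_new, b_new, a_ref, b_ref)
linked = to_reference_scale([-1.0, 0.0, 1.0], A, B)
print(A, B, linked)
```

In this toy case the anchor items are on average 0.5 logits harder on the reference scale, so every ability estimate from the new form is shifted up by 0.5.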