This paper presents a new toolkit for assessing Theory of Mind (ToM) via performance in first- and second-order false belief (FB) tasks. The toolkit includes verbal and non-verbal versions of first- and second-order FB tasks; the verbal version is currently available in Greek and German. Scenarios in the toolkit are balanced for factors that may influence performance, such as the reason for the FB (deception, change-of-location, unexpected content). To validate the toolkit, we tested the performance of neurotypical adults in the non-verbal and verbal versions in two studies: Study 1 with 50 native speakers of German and Study 2 with 50 native speakers of Greek. The two studies yielded similar results. Participants performed well in all conditions, showing slightly more difficulty in the second-order than in the first-order FB conditions, and in the non-verbal than in the verbal version of the task. This suggests that the task is at the high end of the sensitive range for neurotypical adults, and is expected to be well inside the sensitive range for children and populations that have difficulties in ToM. Factors such as deception and type of outcome in the video scenarios did not influence the behavior of neurotypical adults, suggesting that the task does not have any confounds related to these factors. The order of presentation of the verbal and non-verbal versions influenced performance: participants beginning with the verbal version performed slightly better than participants beginning with the non-verbal version. This suggests that neurotypical adults use language to mediate ToM performance and learn from a language-mediated task when performing a non-verbal ToM task. To conclude, our results show that the scenarios in the toolkit are of comparable difficulty and can be combined freely to match demands in future research with neurotypical children and autistic individuals, as well as other populations that have been shown to have difficulties in ToM.
Differences between baseline and critical conditions can be assumed to reflect ToM abilities, rather than language- and task-based confounding factors.