Tests or test batteries used for assessing motor skills, either in research studies or in clinical settings, apply a variety of procedures for scoring performances, including everything from one to ten attempts, of which the best is scored or an average is computed. The rationale behind scoring procedures is rarely stated, and it seems that the number of attempts allowed is decided without much qualification from research. It is uncertain whether procedures fairly capture an individual’s skill level. Thus, the validity of the tests may be compromised. The present study tested 24 young female soccer players on the juggling of a soccer ball. They were given 10 attempts, and trials were scored according to nine different procedures including the ‘best of’ or ‘mean of’ either one, two, three, five, or ten attempts. Individual raw scores differed widely across trials, but no general effect of trials was found. The mean (SD) percentage difference between the lowest and highest scores was 27.7(9.9)%, with 17 players (71%) demonstrating a significant change from lowest to highest score. Correlations between raw scores were low across trials, while they were generally higher across scoring procedures. The first trial was significantly different from the remaining both as a raw score and as scoring procedure. The mean percentage difference between best-of-two and best-of-ten scores was 95%, with 50 % of the players demonstrating a significant difference between the two scoring procedures. No significant differences were found across mean-of-rule scorings. Best-of-rule and mean-of-rule scorings were significantly different except for the best-of-two vs. mean-of-two. The mean difference between highest and lowest rank across players was 6.7 (3.6), with individual rankings within the group varying 33% on average across procedures. One player moved from 3rd to 23rd place because of procedural differences. Therefore, it is concluded that scoring procedures affect results and may have an impact on test outcomes. This may present consequences for decision-making from test results, such as diagnosing and selection of intervention groups. We hope that our results would inspire further research into the scoring procedures of the vast amount of tests and tasks in common use.