Tel: +61 8 9360 6577
Running head: Comparing two methods of rater scales for behaviour assessment
Word count: 5,300 words
Highlights:
- Reliability of rater assessments depends on understanding how observers apply descriptive terms.
- We compared two methodologies, Fixed List (FL) and Free Choice Profiling (FCP).
- Observers reached consensus using either FL or FCP methods.
- Scores attributed to groups of sows were correlated between FL and FCP.
- Training is an important aspect of the reliability of rater assessments.

Abstract. Qualitative methods of behavioural assessment use observer rating scales to score the overall demeanour or body language of animals. Establishing the reliability of such holistic approaches requires testing and validation of the methods used. Here, we compare two methodologies used in Qualitative Behavioural Assessment (QBA): Fixed List (FL) and Free Choice Profiling (FCP). A laboratory class of 27 students was separated into two groups of 17 and 10 students (FL and FCP, respectively). The FL group were given a list of 20 descriptive terms (used by the European Union's Welfare Quality® program), shown videos of group-housed sows, and as a group discussed how they would apply the descriptive terms in an assessment. The FCP group were shown the same footage but individually generated their own descriptive terms to describe the body language of the animals. Both groups were then shown 18 video clips of group-housed sows and scored each clip using a visual analogue scale (VAS) system. We analysed the VAS scores using Generalised Procrustes Analysis (GPA) for each observer group separately, which indicated high inter-observer reliability for both groups (FL: 71.1% of scoring variation explained; FCP: 63.5%). There were significant correlations between FL and FCP scores (GPA dimension 1: r(16) = 0.946, P < 0.001; GPA dimension 2: r(16) = 0.477, P = 0.045). Additional analysis of the raw VAS scores for the FL group by Principal Component Analysis (PCA) produced four factors; PC1 scores were correlated with GPA1 (r(16) = 0.984, P < 0.001) and PC3 scores were correlated with GPA2 (r(16) = 0.880, P < 0.001). Kendall's coefficient of concordance (a measure of observer agreement) for the VAS scores indicated statistically significant agreement in the use of the 20 descriptive terms (W range 0.37-0.64, all significant at P < 0.001, although a value of W > 0.7 is usually accepted as showing strong agreement). This study demonstrates that, regardless of whether observers are given their terms or are allowed to generate their own, they score sow body language in a similar way. Strengths and weaknesses within the two methods were identified, which highlight the importance of providing thorough and consistent training of observers, including good quality training footage, so that the full repertoire of demeanours can be identified.
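For readers unfamiliar with the agreement statistic reported above, the following is a minimal illustrative sketch of how Kendall's coefficient of concordance can be computed from rank data. It uses the standard textbook formula W = 12S / (m²(n³ − n)) for m observers and n items, ignores tie corrections, and is not the authors' analysis code.

```python
def kendalls_w(scores):
    """Kendall's coefficient of concordance (W) for m observers scoring n items.

    scores: list of m lists, each of length n, giving one observer's scores
    for the n items. Scores are converted to within-observer ranks 1..n
    (no tie correction in this minimal sketch).
    """
    m = len(scores)
    n = len(scores[0])
    # Convert each observer's scores to ranks 1..n
    ranks = []
    for obs in scores:
        order = sorted(range(n), key=lambda i: obs[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        ranks.append(r)
    # Sum of ranks per item, then squared deviations from the mean rank sum
    totals = [sum(ranks[j][i] for j in range(m)) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    # W = 12S / (m^2 (n^3 - n)); ranges from 0 (no agreement) to 1 (perfect)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With perfect agreement among observers W equals 1, and with perfectly opposed rankings between two observers W equals 0; intermediate values such as the 0.37-0.64 range reported here indicate partial agreement.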