New technologies for manipulating and recording the nervous system allow us to perform unprecedented experiments. However, the influence of our experimental manipulations on psychological processes must be inferred from their effects on behavior. Today, quantifying behavior has become the bottleneck for large-scale, high-throughput experiments. The method presented here addresses this issue by using deep learning algorithms for video-based animal tracking. Here we describe a reliable automatic method for tracking head position and orientation from simple video recordings of the common marmoset (Callithrix jacchus). This method for measuring marmoset behavior allows for the indirect estimation of gaze within foveal error, and can easily be adapted to a wide variety of similar tasks in biomedical research. In particular, the method has great potential for the simultaneous tracking of multiple marmosets to quantify social behaviors.

Introduction

Recent technological developments allow us to manipulate [3,35,40] and record [7,9,53] the nervous system with historically unprecedented precision and scale. Yet, the psychological relevance of our sophisticated manipulations and large-scale recordings can only be inferred from their effect on behavior. Today, quantifying behavior has become the bottleneck for high-throughput experiments [5,48]. Common practice is to apply standard tests designed to measure psychological constructs such as anxiety and memory, for example, by using the Elevated Plus Maze [34] or the Morris Water Maze [28]. Such testing requires animals to be individually handled, making data acquisition labor intensive, increasing costs and reducing experimental throughput.
Alternatively, various simple detectors (e.g. capacitance sensors or photo-beams) can be arranged to automatically acquire data at specific sites (e.g. drinking [12] and feeding [10] stations). This type of automation allows for high throughput but fails to capture complex or subtle behaviors such as social interactions and gaze behaviors. A more promising approach is to record high-dimensional data from sensor arrays (e.g. video) and extract relevant information using computer vision algorithms. This approach has the potential to provide a better characterization of behavior, capable of automatically capturing complex and subtle behaviors, while simultaneously reducing both cost and labor intensity [39]. Raw video frames are composed of a high number of pixels whose values do not straightforwardly correlate with an animal's behavior. In