Social interactions powerfully impact both the brain and the body, but high-resolution descriptions of these important physical interactions are lacking. Currently, most studies of social behavior rely on labor-intensive methods such as manual annotation of individual video frames. These methods are susceptible to experimenter bias and have limited throughput. To understand the neural circuits underlying social behavior, scalable and objective tracking methods are needed. We present a hardware/software system that combines 3D videography, deep learning, physical modeling and GPU-accelerated robust optimization. Our system is capable of fully automatic multi-animal tracking during naturalistic social interactions and allows for simultaneous electrophysiological recordings. We capture the posture dynamics of multiple unmarked mice with high spatial (~2 mm) and temporal (60 frames/s) precision. The method is based on inexpensive consumer cameras and is implemented in Python, making it cheap and straightforward to adopt and customize for studies of neurobiology and animal behavior.
RESULTS
Raw data acquisition
We established an experimental setup that allowed us to capture synchronized color and depth images from multiple angles, while simultaneously recording synchronized neural data (Fig. 1a). We used inexpensive, state-of-the-art 'depth cameras' developed for computer vision and robotics. These cameras contain several imaging modules: one color sensor, two infrared sensors and an infrared laser projector (Fig. 1b). Imaging data pipelines, as well as intrinsic and extrinsic sensor calibration parameters, can be accessed over USB through a C/C++ SDK with Python bindings. We placed four depth cameras, as well as four synchronization LEDs, around a transparent acrylic cylinder that served as our behavioral arena (Fig. 1c).

Each depth camera projects a static dot pattern across the imaged scene, adding texture in the infrared spectrum to reflective surfaces (Fig. 1d). By imaging this highly textured surface simultaneously with the two infrared sensors of each depth camera, it is possible to estimate the distance of each pixel in the infrared image from the depth camera by stereopsis (by locally estimating the binocular disparity between the textured images). Because the dot pattern is static and serves only to add texture, multiple cameras do not interfere with one another, and the same scene can be imaged from multiple angles. This is one key aspect of our method, not possible with depth imaging systems that rely on actively modulated light (such as the Microsoft Kinect system and earlier versions of the Intel RealSense cameras).
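The depth-from-stereopsis step described above can be sketched numerically. The standard pinhole-stereo relation is depth = focal length × baseline / disparity; the sketch below is illustrative only, with made-up focal length and baseline values (the actual calibration parameters are read from the camera SDK and are not reproduced here).

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a binocular disparity map (pixels) to depth (meters).

    Uses the pinhole-stereo relation depth = focal * baseline / disparity.
    Pixels with zero disparity (no stereo match) are assigned depth 0.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.zeros_like(disparity_px)
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical values, not from the paper: 640 px focal length and a
# 50 mm baseline between the two infrared sensors of one depth camera.
depth = disparity_to_depth([[16.0, 8.0], [0.0, 32.0]],
                           focal_px=640.0, baseline_m=0.05)
# depth → [[2.0, 4.0], [0.0, 1.0]]  (meters)
```

Note the inverse relationship: large disparities correspond to nearby surfaces, which is why the projected dot pattern matters — it guarantees local texture so the disparity estimate is well-defined even on featureless surfaces.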
Since mouse movement is fast13, it is vital to minimize motion blur in the infrared images and thus in the final 3D data ('point cloud'). To this end, our method relies on two key features. First, we use depth cameras in which the infrared sensors have a global shutter (e.g., Intel D435) rathe...