Abstract: Video-conferencing is becoming an essential part of everyday life. The visual channel allows for interactions that were not possible over audio-only communication systems such as the telephone. However, because video-conferencing is a de facto over-the-top service, the quality of the delivered experience varies with network conditions. Video-conferencing systems adapt to these conditions by changing, for example, the encoding bitrate of the video. For this adaptation not to hamper the benefits of having a video channel in the communication, it needs to be optimized according to a measure of the Quality of Experience (QoE) as perceived by the user. QoE is highly dependent on the ongoing interaction and on individual preferences, which have hardly been investigated so far. In this paper, we focus on the impact that video quality has on conversations revolving around objects presented over the video channel. To this end, we conducted an empirical study in which groups of four people collaboratively built a Lego® model over a video-conferencing system. We examine the requirements for such a task by showing when the interaction, measured by visual and auditory cues, changes depending on the encoding bitrate and loss. We then explore the impact that prior experience with the technology and affective state have on the QoE of participants. We use these factors to construct predictive models that double the accuracy compared to a model based on the system factors alone. We conclude with a discussion of how these factors could be applied in real-world scenarios.

Index Terms: Multi-party video conferencing, Quality of Experience, over-the-top, subjective quality, quality metrics, user study