For the past thirty years, qualitative psychology researchers have focused on the study of the written or spoken word while relegating the study of visual communication to children or those deemed unable to speak (Reavey, 2021, p. 2). As a result, the discipline now pays considerable attention to the collection and analysis of spoken or written words but significantly less to visual and auditory expressions of experience (Reavey, 2021, p. 3), and most transcription methods available to psychology researchers were designed for interviews and capture only the spoken word. However, these transcription methods have yet to account for the current context of ubiquitous, technologically mediated interactions. People from diverse groups use social media platforms such as YouTube and TikTok to interact using speech, audio, and video. While these platforms offer rich data for qualitative psychology researchers, the tools to capture such multimodal expressions are still in the early stages of development within the discipline (Marshall et al., 2021). In this article, we present a transcription structure that allows for the recording of both speech and visual elements in audiovisual content. Inspired by methods from communications and visual anthropology, the Four Column Analysis Structure (FoCAS) supports the simultaneous analysis of audio and visual data through the transcription of four dimensions: (1) timestamp, (2) setting, (3) scene, and (4) audio. Based on its application in two completed studies and one study in progress, we describe the development of the FoCAS, how to set it up, its transcription conventions, and how to analyze qualitative data using all four columns. We also discuss sampling considerations and the advantages and disadvantages of the structure. We hope that, by expanding the amount of meaningful data that qualitative transcription can capture, the FoCAS will enable more multidimensional, rigorous analyses of audiovisual data.