In this study, we evaluated the feasibility of using zero-shot classification models for activity recognition in a Digital Sommelier. Our experiment involved preprocessing video data by extracting frames and categorizing user activities in a wine-tasting scenario. Image classification models achieved high accuracy, approaching 90%, in distinguishing between "engaged" and "disengaged" states. Video classification models, however, performed worse on user activities such as "observing wine," "smelling wine," and "sipping wine," averaging around 50% accuracy due to the interdependent nature of the activities. Despite these challenges, our findings highlight the potential of zero-shot classification models to enhance virtual assistants' ability to recognize and respond to user activities.
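The engaged/disengaged classification described above can be sketched with CLIP-style zero-shot matching: an image embedding is compared against embeddings of candidate label prompts, and the most similar label wins. The embeddings and helper function below are illustrative toy values, not the study's actual model or data.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding.

    This mirrors CLIP-style zero-shot classification: images and candidate
    label prompts live in a shared embedding space, and cosine similarity
    selects the best match without any task-specific training.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = label_embs @ image_emb  # cosine similarities, one per label
    return labels[int(np.argmax(sims))], sims

# Toy vectors standing in for real CLIP outputs (assumption for illustration).
labels = ["engaged", "disengaged"]
label_embs = np.array([[1.0, 0.1],   # "engaged" prompt embedding
                       [0.1, 1.0]])  # "disengaged" prompt embedding
image_emb = np.array([0.9, 0.2])     # a frame showing an attentive user

pred, sims = zero_shot_classify(image_emb, label_embs, labels)
print(pred)  # → engaged
```

In practice the embeddings would come from a pretrained vision-language model applied to each extracted video frame; only the candidate label set changes per task, which is what makes the approach zero-shot.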