Smart eyewear (e.g., AR glasses) is widely considered the next major breakthrough in wearable devices. However, interaction with state-of-the-art smart eyewear mostly relies on a touchpad, which is obtrusive and not user-friendly. In this work, we propose a novel acoustic-based upper facial action (UFA) recognition system that serves as a hands-free interaction mechanism for smart eyewear. The proposed system is a glasses-mounted acoustic sensing system that uses several pairs of commercial speakers and microphones to sense UFAs. Designing the system poses two main challenges. First, the system operates in a severe multipath environment, and the received signal can suffer large attenuation due to frequency-selective fading, which degrades the system's performance. To overcome this challenge, we design an Orthogonal Frequency Division Multiplexing (OFDM)-based channel state information (CSI) estimation scheme that measures the phase changes caused by a facial action while mitigating frequency-selective fading. Second, because the skin deformation caused by a facial action is tiny, the received signal exhibits only very small variations, making it hard to derive useful information directly from the received signal. To resolve this challenge, we apply time-frequency analysis to derive a time-frequency domain signal from the CSI. We show that this time-frequency domain signal contains distinct patterns for different UFAs. Furthermore, we design a Convolutional Neural Network (CNN) that extracts high-level features from the time-frequency patterns and classifies them into six UFAs: cheek-raiser, brow-raiser, brow-lowerer, wink, blink, and neutral. We evaluate the performance of our system through experiments on data collected from 26 subjects. The experimental results show that our system can recognize the six UFAs with an average F1-score of 0.92.
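
To make the signal-processing steps above concrete, the following is a minimal sketch of pilot-based per-subcarrier CSI estimation followed by time-frequency analysis of the CSI phase, written in Python with NumPy/SciPy. The least-squares estimator (H = Y / X per subcarrier), the frame rate, the STFT window parameters, and all function names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.signal import stft


def estimate_csi(rx_symbols: np.ndarray, pilot_symbols: np.ndarray) -> np.ndarray:
    """Least-squares CSI estimate per OFDM subcarrier: H = Y / X.

    rx_symbols:    (n_frames, n_subcarriers) received frequency-domain symbols
    pilot_symbols: (n_subcarriers,) known transmitted pilot symbols

    Estimating CSI on every subcarrier lets later stages avoid the deeply
    faded subcarriers instead of depending on a single carrier frequency,
    which is one way to mitigate frequency-selective fading.
    """
    return rx_symbols / pilot_symbols[np.newaxis, :]


def csi_phase_spectrogram(csi: np.ndarray, frame_rate: float,
                          subcarrier: int = 0) -> np.ndarray:
    """STFT magnitude of the unwrapped CSI phase of one subcarrier.

    Tiny skin deformations modulate the CSI phase; the time-frequency
    pattern of that phase is what the downstream classifier consumes.
    Window sizes are arbitrary choices for this sketch.
    """
    phase = np.unwrap(np.angle(csi[:, subcarrier]))
    _, _, Z = stft(phase, fs=frame_rate, nperseg=64, noverlap=48)
    return np.abs(Z)


if __name__ == "__main__":
    # Synthetic example: 8 pilot subcarriers, 500 CSI frames with a slow
    # phase drift standing in for a facial-action-induced path change.
    rng = np.random.default_rng(0)
    pilots = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, 8))
    rx = pilots * np.exp(1j * 0.01 * np.arange(500)[:, None])
    spec = csi_phase_spectrogram(estimate_csi(rx, pilots), frame_rate=100.0)
    print(spec.shape)  # (frequency bins, time frames)
```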
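
Likewise, a small CNN of the kind the abstract describes could look like the sketch below in PyTorch. The layer counts, kernel sizes, pooling choices, and input layout are assumptions for illustration rather than the authors' architecture; only the six-way output corresponds to the six UFAs named above.

```python
import torch
import torch.nn as nn


class UFANet(nn.Module):
    """A small CNN mapping one time-frequency image to six UFA classes."""

    def __init__(self, n_classes: int = 6):
        super().__init__()
        # Two conv/pool stages extract local time-frequency features;
        # adaptive pooling fixes the feature size regardless of input shape.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))


if __name__ == "__main__":
    model = UFANet()
    dummy = torch.randn(1, 1, 33, 33)  # one spectrogram from the previous sketch
    logits = model(dummy)
    print(logits.shape)  # (1, 6): one score per UFA class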