μEMA is a data collection method that prompts research participants with quick, answer-at-a-glance, single-multiple-choice self-report behavioral questions, thus enabling high-temporal-density self-report of up to four times per hour when implemented on a smartwatch. However, due to the small watch screen, μEMA is better used to select among 2 to 5 multiple-choice answers versus allowing the collection of open-ended responses. We introduce an alternative and novel form of micro-interaction self-report using speech input - audio-μEMA- where a short beep or vibration cues participants to verbally report their behavioral states, allowing for open-ended, temporally dense self-reports. We conducted a one-hour usability study followed by a within-subject, 6-day to 21-day free-living feasibility study in which participants self-reported their physical activities and postures once every 2 to 5 minutes. We qualitatively explored the usability of the system and identified factors impacting the response rates of this data collection method. Despite being interrupted 12 to 20 times per hour, participants in the free-living study were highly engaged with the system, with an average response rate of 67.7% for audio-μEMA for up to 14 days. We discuss the factors that impacted feasibility; some implementation, methodological, and participant challenges we observed; and important considerations relevant to deploying audio-μEMA in real-time activity recognition systems.