The growing availability of diverse data sources in healthcare has boosted the application of novel computational techniques that extract meaningful information to improve patient prognosis and support other important medical uses. However, most current systems require professionals to type this information manually, a time-consuming process that increases the risk of transcription errors and cross-contamination. One solution is an automated system that allows healthcare professionals to dictate clinical information, which is then transcribed and analyzed. Since most systems proposed so far target the English language, in this paper we present a unified system for automatically recording, transcribing, and identifying key information from audio in Spanish. Specifically, we propose a two-step pipeline consisting of a commercial Speech-to-Text API followed by an in-house Named Entity Recognition model trained on Spanish clinical narratives. Transcription performance was evaluated using the Word Error Rate against a gold standard of 90 manually annotated texts covering three domains: general, medical, and dental. For entity detection, we used the F1 score to compare the clinical entities identified by our model with manual annotations produced by healthcare professionals. Finally, to better understand the limitations of our system, we performed a detailed error analysis from both a linguistic and a computational point of view. We share the annotated referrals, audio recordings, and transcriptions to promote the reproducibility of our results.
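For reference, both evaluation metrics follow their standard formulations; the sketch below states the assumed definitions rather than anything specific to our system:

\[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
\]

where $S$, $D$, and $I$ are the substitutions, deletions, and insertions needed to align the automatic transcription with the reference, $N$ is the number of words in the reference, and $P$ and $R$ are the precision and recall of the clinical entities detected by the model with respect to the manual annotations.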