FPGAs are considered a valuable solution for embedded system applications thanks to their performance, energy efficiency, and ability to cope with system failures. However, the number of available applications is limited by the steep learning curve required to customize FPGA-based accelerators. In recognition of this barrier, Xilinx recently released PYNQ, a platform for Zynq SoCs that relies on Python and overlays to ease the integration of programmable logic functionality into applications. In this work, we build upon this framework to implement an optimized embedded design for audio alignment and integrate it into the Python application workflow. In particular, we provide a custom accelerator designed for PYNQ and the software interface needed to transparently exploit the programmable logic from Python code running on the embedded CPU. We then compare executions on two different devices: the PYNQ-Z1 and the Raspberry Pi 3. Our FPGA-accelerated implementation reaches a speedup of 12.4x over the CPU-only execution on the PYNQ-Z1 and a speedup of 5.5x over the Raspberry Pi 3 version.