Ambient Intelligence (AmI) encompasses technological infrastructures capable of sensing data from environments and extracting high-level knowledge to detect or recognize users’ features and actions, as well as entities or events in their surroundings. Visual perception, particularly object detection, has become one of the most relevant enabling factors for this context-aware user-centered intelligence, being the cornerstone of relevant but complex tasks, such as object tracking or human action recognition. In this context, convolutional neural networks have proven to achieve state-of-the-art accuracy levels. However, they typically result in large and highly complex models that typically demand computation offloading onto remote cloud platforms. Such an approach has security- and latency-related limitations and may not be appropriate for some AmI use cases where the system response time must be as short as possible, and data privacy must be guaranteed. In the last few years, the on-device paradigm has emerged in response to those limitations, yielding more compact and efficient neural networks able to address inference directly on client machines, thus providing users with a smoother and better-tailored experience, with no need of sharing their data with an outsourced service. Framed in that novel paradigm, this work presents a review of the recent advances made along those lines in object detection, providing a comprehensive study of the most relevant lightweight CNN-based detection frameworks, discussing the most paradigmatic AmI domains where such an approach has been successfully applied, the different challenges arisen, the key strategies and techniques adopted to create visual solutions for image-based object classification and localization, as well as the most relevant factors to bear in mind when assessing or comparing those techniques, such as the evaluation metrics or the hardware setups used.