Augmented Reality (AR) and Virtual Reality (VR) have emerged as the next frontier for intelligent image sensors and computer systems. In these systems, 3D die stacking stands out as a compelling solution: it enables in-situ processing of sensor data for tasks such as image classification and object detection with low power, low latency, and a small form factor. These intelligent 3D CMOS Image Sensor (3D-CIS) systems present a wide design space spanning multiple domains (e.g., computer vision algorithms, circuit design, system architecture, and semiconductor technology, including 3D stacking) that has not been explored in depth so far. This paper aims to fill this gap. We first present STAR-3DSim, an analytical evaluation framework dedicated to rapid pre-RTL evaluation of 3D-CIS systems that captures the entire stack, from the pixel layer to the on-sensor processor layer. Using STAR-3DSim, we then propose several knobs for improving the power, performance, and area (PPA) of the Deep Neural Network (DNN) accelerator, which provide up to 53%, 41%, and 63% reductions in energy, latency, and area, respectively, across a broad set of relevant AR/VR workloads. Lastly, we present full-system evaluation results that take image sensing, cross-tier data transfer, and off-sensor communication into consideration.