Reducing travel time alone is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), further reductions in fuel consumption and emissions, improvements in traffic safety, and ease of infrastructure deployment and maintenance should also be considered. Unlike existing work, which optimizes the control of either traffic light signals (to improve intersection throughput) or vehicle speed (to stabilize traffic), this paper presents a multi-agent Deep Reinforcement Learning (DRL) system called CoTV, which Cooperatively controls both Traffic light signals and Connected Autonomous Vehicles (CAV). This joint control allows CoTV to balance reductions in travel time, fuel consumption, and emissions. CoTV also scales to complex urban scenarios because each traffic light controller cooperates with only one CAV, the one nearest to it, on each incoming road. This avoids costly coordination between traffic light controllers and all possible CAVs, and leads to stable training convergence in large-scale multi-agent scenarios. We describe the system design of CoTV and demonstrate its effectiveness in a simulation study using SUMO on various grid maps and realistic urban scenarios with mixed-autonomy traffic.
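As an illustration of the cooperation scheme described above, the following minimal sketch selects, for each incoming road of one traffic light controller, the single CAV nearest to that controller. It is not CoTV's implementation: the `Vehicle` record, its fields, and the function `nearest_cav_per_road` are hypothetical names introduced here, and we assume that each vehicle's distance to the traffic light is already available from the simulator.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Vehicle:
    """Hypothetical vehicle record; field names are assumptions for illustration."""
    vehicle_id: str
    road_id: str              # incoming road the vehicle is currently on
    distance_to_light: float  # metres to the traffic light controller
    is_cav: bool              # False for human-driven vehicles (mixed autonomy)


def nearest_cav_per_road(
    vehicles: Iterable[Vehicle], incoming_roads: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Return, per incoming road, the id of the CAV closest to the light (or None)."""
    nearest: Dict[str, Optional[Vehicle]] = {road: None for road in incoming_roads}
    for v in vehicles:
        # Skip human-driven vehicles and vehicles on roads not feeding this light.
        if not v.is_cav or v.road_id not in nearest:
            continue
        best = nearest[v.road_id]
        if best is None or v.distance_to_light < best.distance_to_light:
            nearest[v.road_id] = v
    return {road: (v.vehicle_id if v is not None else None) for road, v in nearest.items()}
```

Under this selection, the number of CAV agents a controller exchanges information with is bounded by its number of incoming roads rather than by the number of vehicles present, which is the property the scalability argument relies on.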