Gun muzzle flashes produce characteristic signatures at the 766 nm and 769 nm wavelengths that can be detected passively from a distance using ultra-sensitive single-photon avalanche diode (SPAD) arrays, enabling immediate localization. Sifting through the massive number of pulses generated by the arrays in real time, however, poses a challenge, especially when deep-learning models are used for classification. We present a novel, expandable FPGA-based system built on a two-tier detection architecture that decouples the computationally intensive deep-learning model from the data-rate-intensive SPAD arrays. Our slope-based first-tier algorithm provides an FPGA-efficient first-look filter, and our ResNet-based deep-learning model provides high sensitivity across different lighting conditions while maintaining high specificity in the face of potential false positives in an urban environment. The deep-learning model was trained on synthetic datasets generated from small samples of gun muzzle flashes from the various weapon and ammunition types available to us, and from sources of likely false positives in an urban environment. In testing, our system achieves a detection rate of 99.8%, a specificity of 99.9%, and a sensitivity of 99.6% for shots fired from distances between 50 and 450 m.