SM9 was established in 2016 as an official Chinese identity-based cryptographic (IBC) standard and became an ISO standard in 2021. IBC is well known to be suitable for Internet of Things (IoT) applications, since the centralized processing of client data (e.g., in an IoT cloud) is often done by gateways. However, due to the limited computation resources of IoT devices, the performance of SM9 becomes a bottleneck in practical use. Existing SM9 implementations are mostly CPU-based, with relatively low latency but low throughput. Consequently, a pivotal challenge for SM9 in large-scale applications is how to reduce the latency while maximizing throughput for numerous concurrent inputs. After a systematic analysis of the SM9 algorithms, we apply optimization techniques including precomputation, resource caching, and parallelization to reduce the overhead of SM9. In this work, we introduce the first practical implementation of SM9 and its underlying curve on GPU. Our GPU implementation combines multiple algorithms and low-level optimizations tailored to the GPU's single instruction, multiple threads (SIMT) architecture in order to achieve high throughput for SM9. Building on these, we propose a high-performance Cryptography as a Service (CaaS) system for SM9. The system adopts a heterogeneous computing architecture that flexibly schedules inputs across two implementation platforms: a CPU for the low-latency processing of sporadic inputs, and a GPU for the high-throughput processing of batch inputs. According to our benchmark, the system takes only a few milliseconds to process a single SM9 request in idle mode. Moreover, in batch processing mode, it can generate 2,038,071 private keys, 248,239 signatures, or 238,001 ciphertexts per second. The results show that the system scales seamlessly across inputs of different sizes, preliminarily demonstrating the efficacy of our solution.