Hyperparameter optimization (HO) is an important problem in machine learning which is normally formulated as a bilevel optimization problem. Gradient-based methods are dominant in bilevel optimization due to their high scalability to the number of hyperparameters, especially in a deep learning problem. However, traditional gradient-based bilevel optimization methods need intermediate steps to obtain the exact or approximate gradient of hyperparameters, namely hypergradient, for the upper-level objective, whose complexity is high especially for high dimensional datasets. Recently, a penalty method has been proposed to avoid the computation of the hypergradient, which speeds up the gradient-based BHO methods. However, the penalty method may result in a very large number of constraints, which greatly limits the efficiency of this method, especially for high dimensional data problems. To address this limitation, in this paper, we propose a doubly stochastic gradient descent algorithm (DSGPHO) to improve the efficiency of the penalty method. Importantly, we not only prove the proposed method can converge to the KKT condition of the original problem in a convex setting, but also provide the convergence rate of DSGPHO which is the first result in the references of gradient-based bilevel optimization as far as we know. We compare our method with three state-of-the-art gradient-based methods in three tasks, i.e., data denoising, few-shot learning, and training data poisoning, using several large-scale benchmark datasets. All the results demonstrate that our method outperforms or is comparable to the existing methods in terms of accuracy and efficiency.