Background
Real-time tumor tracking is one motion management method to address motion-induced uncertainty. To date, fiducial markers are often required to reliably track lung tumors with X-ray imaging, but marker implantation carries risks of complications and prolongs treatment time. A markerless tracking approach is therefore desirable. Deep learning-based approaches have shown promise for markerless tracking, but systematic evaluation and procedures to investigate applicability in individual cases are missing. Moreover, few efforts have been made to provide bounding box prediction and mask segmentation simultaneously, which could enable either rigid or deformable multi-leaf collimator tracking.

Purpose
The purpose of this study was to implement a deep learning-based markerless lung tumor tracking model that exploits patient-specific training and outputs both a bounding box and a mask segmentation simultaneously. We also aimed to compare the two kinds of predictions and to implement a specific procedure for assessing the feasibility of markerless tracking in individual cases.

Methods
We first trained a Retina U-Net baseline model on digitally reconstructed radiographs (DRRs) generated from a public dataset containing 875 CT scans with corresponding lung nodule annotations. We then used an independent cohort of 97 patients with lung tumors to develop a patient-specific refinement procedure. To determine the optimal hyperparameters for automatic patient-specific training, we selected 13 validation patients for whom the baseline model predicted a bounding box on the planning CT (PCT)-DRR with an intersection over union (IoU) against the ground truth higher than 0.7. The final test set contained the remaining 84 patients with varying PCT-DRR IoU. For each test patient, the baseline model was refined on the PCT-DRR to generate a patient-specific model, which was then tested on a separate 10-phase 4DCT-DRR series to mimic intrafraction motion during treatment. A template matching algorithm served as the benchmark model. The testing results were evaluated with four metrics: the center of mass (COM) error and the Dice similarity coefficient (DSC) for segmentation masks, and the center of box (COB) error and the DSC for bounding box detections. Performance was compared to the benchmark model, including statistical testing for significance.

Results
A PCT-DRR IoU value of 0.2 was shown to be the threshold dividing inconsistent (68%) from consistent (100%) success (defined as a mean bounding box DSC > 0.6) of patient-specific models on 4DCT-DRRs. Thirty-seven of the 84 testing cases had a PCT-DRR IoU above 0.2. For these 37 cases, the mean COM error was 2.6 mm, the mean segmentation DSC was 0.78, the mean COB error was 2.7 mm, and the mean box DSC was 0.83. Including the validation cases, the model was applicable to 50 of the 97 patients when using the PCT-DRR IoU threshold of 0.2. The inference time per frame was 170 ms. The model outperformed the benchmark model on all metrics, and the difference was significant (p < 0.001) over the 37 cases with PCT-DRR IoU > 0.2, but not over the undifferentiated 84 testing cases.

Conclusions
The implemented patient-specific refinement approach based on a pre-trained baseline model was shown to be applicable to markerless tumor tracking in simulated radiographs of lung cases.
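
The abstract names the evaluation metrics without spelling out their definitions; the standard forms, which we assume here, are

\[
\mathrm{DSC}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad
\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
e_{\mathrm{COM}} = \lVert \mathbf{c}(A) - \mathbf{c}(B) \rVert_2 ,
\]

where A and B denote the predicted and ground-truth regions (segmentation masks or bounding boxes) and c(.) is the center of mass of a mask (or the center of a box, giving the COB error).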
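As an illustration of the applicability criterion reported in the Results, the sketch below computes the bounding-box IoU of a baseline prediction against the ground truth on a PCT-DRR and applies the 0.2 eligibility threshold. Only the threshold value comes from the study; the box coordinates and the box_iou helper are illustrative and not the authors' implementation.

import numpy as np

def box_iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of areas minus intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

IOU_THRESHOLD = 0.2  # PCT-DRR IoU cutoff reported in the Results

baseline_box = np.array([42.0, 60.0, 78.0, 95.0])      # illustrative baseline prediction (pixels)
ground_truth_box = np.array([45.0, 58.0, 80.0, 92.0])  # illustrative annotation (pixels)

iou = box_iou(baseline_box, ground_truth_box)
# Patients above the threshold proceed to patient-specific refinement and tracking.
eligible = iou > IOU_THRESHOLD
print(f"PCT-DRR IoU = {iou:.2f}, eligible for patient-specific tracking: {eligible}")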