Person-Action instance search (P-A INS) aims to retrieve the instances of a specific person doing a specific action, which appears in the 2019–2021 INS tasks of the world-famous TREC Video Retrieval Evaluation (TRECVID). Most of the top-ranking solutions can be summarized with a Division-Fusion-Optimization (DFO) framework, in which person and action recognition scores are obtained separately, then fused, and, optionally, further optimized to generate the final ranking. However, TRECVID only evaluates the final ranking results, ignoring the effects of intermediate steps and their implementation methods. We argue that conducting the fine-grained evaluations of intermediate steps of DFO framework will (1) provide a quantitative analysis of the different methods’ performance in intermediate steps; (2) find out better design choices that contribute to improving retrieval performance; and (3) inspire new ideas for future research from the limitation analysis of current techniques. Particularly, we propose an indirect evaluation method motivated by the leave-one-out strategy, which finds an optimal solution surpassing the champion teams in 2020–2021 INS tasks. Moreover, to validate the generalizability and robustness of the proposed solution under various scenarios, we specifically construct a new large-scale P-A INS dataset and conduct comparative experiments with both the leading NIST TRECVID INS solution and the state-of-the-art P-A INS method. Finally, we discuss the limitations of our evaluation work and suggest future research directions.