The adoption of register-transfer level (RTL) sign-off in ASIC design methodologies, and the increasing scale of system-on-chip integration, are placing unprecedented accuracy and efficiency demands on RT-level estimation tools. In this work, we focus on the deployment of a simulation-based RTL power estimation tool in a commercial design flow, and describe several enhancements that improve its efficiency and scalability for large, industrial designs. We profile the computational effort involved in RTL power estimation, and propose a suite of acceleration techniques, including (i) transformation of the enhanced RTL description (functional model with annotations for power estimation) to be more simulator-friendly, (ii) computation vs. storage tradeoffs, and (iii) a novel variation of statistical sampling, called partitioned sampling. Our techniques result in an optimized allocation of the overall computational effort for power estimation and minimize the computational effort involved in the evaluation of power models. Extensive experiments in the context of a commercial design flow have yielded promising results (e.g., up to 31X reduction in power estimation time with negligible loss of accuracy) on industrial designs of up to 1.25 million transistors. In addition to accurate power estimation for an entire circuit, these acceleration techniques result in superior accuracy of local power estimates for individual components or small sub-circuits, compared to conventional sampling or test-bench compaction techniques.