In modern digital signal processing and graphics applications, the shifter is an important module that accounts for a significant share of the design's delay. This brief presents an architectural optimization approach for synthesizing a faster barrel shifter block, which reduces the delay of the design without significantly increasing its area. We divide the problem of generating the shifter into two steps: i) timing-driven selection of multiple stages for merging, and ii) the design of the merged stage. In the proposed method, we define two kinds of composite stage: the dual merged stage, in which two stages are merged into one, and the triple merged stage, in which three stages are merged into one. These merged stages are identified by a timing-driven algorithm and are used in conjunction with some single stages of the traditional barrel shifter. Using merged stages reduces the depth of the proposed barrel shifter architecture, thereby improving the delay, and the timing-driven nature of the algorithm helps produce a faster implementation of the overall shifter block. We have evaluated the performance of our design using a number of technology libraries, timing constraints, and shifter bit-widths. Our experimental data show that the shifter block generated by our algorithm is significantly faster (10.19% on average) than the shifter block generated by a commercially available datapath synthesis tool. These improvements were also verified on placed-and-routed designs.
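
To illustrate how merging stages reduces the depth of a barrel shifter, the following is a minimal behavioral sketch in Python, not the circuit or algorithm evaluated in this brief. It assumes a rotate-left shifter whose width is a power of two, and it models a merged stage of g shift bits as a single selection among 2^g pre-shifted inputs. The names `rotl`, `barrel_shift`, and `grouping` are hypothetical, and the timing-driven choice of which stages to merge is not modeled; the grouping is simply supplied by the caller.

```python
def rotl(x, amount, width):
    """Rotate the width-bit value x left by amount positions."""
    mask = (1 << width) - 1
    x &= mask
    amount %= width
    return ((x << amount) | (x >> (width - amount))) & mask


def barrel_shift(x, shift, width, grouping):
    """Behavioral model of a barrel shifter built from single and merged stages.

    grouping lists how many consecutive shift-amount bits each stage consumes,
    starting from the least significant bit (assumed encoding):
      1 = traditional single stage, 2 = dual merged stage, 3 = triple merged stage.
    The number of stages, len(grouping), stands in for the logic depth.
    """
    n_bits = width.bit_length() - 1
    if (1 << n_bits) != width:
        raise ValueError("width must be a power of two")
    if sum(grouping) != n_bits or any(g not in (1, 2, 3) for g in grouping):
        raise ValueError("grouping must use sizes 1-3 and cover all shift bits")

    k = 0  # index of the lowest shift bit handled by the current stage
    for g in grouping:
        # The g select bits of this stage choose one of 2**g rotation amounts.
        sel = (shift >> k) & ((1 << g) - 1)
        x = rotl(x, sel << k, width)
        k += g
    return x


if __name__ == "__main__":
    # Four single stages versus two dual merged stages for a 16-bit shifter.
    print(hex(barrel_shift(0x0001, 5, 16, [1, 1, 1, 1])))
    print(hex(barrel_shift(0x0001, 5, 16, [2, 2])))
```

In this model, a 16-bit shifter built only from single stages has four stages, while the groupings `[2, 2]` and `[3, 1]` each have two. Every valid grouping produces the same result; what changes is the number of stages a signal passes through and the size of the selection performed in each one.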