Parallel turbo decoding is becoming mandatory to achieve high throughput and to reduce latency, both of which are crucial in emerging digital communication applications. This paper explores and analyzes parallelism techniques in convolutional turbo decoding with the BCJR algorithm. A three-level structured classification of parallelism techniques is proposed and discussed: BCJR metric level parallelism, BCJR-SISO decoder level parallelism, and turbo-decoder level parallelism. The second level of this classification is thoroughly analyzed on the basis of parallelism efficiency criteria, since it offers the best tradeoff between achievable parallelism degree and area overhead. At this level, for subblock parallelism, we illustrate how subblock initializations are more efficient with the message passing technique than with the acquisition approach. Moreover, subblock parallelism becomes quite inefficient at high subblock parallelism degrees. Conversely, the efficiency of component-decoder parallelism increases with the subblock parallelism degree. This efficiency also depends on the BCJR computation scheme and on propagation time. We show that component-decoder parallelism using shuffled decoding makes it possible to maximize architecture efficiency and, hence, is well suited to hardware implementation of high-throughput turbo decoders.