With the number of cores increasing rapidly but per-core performance improving slowly at best, software must be parallelized in order to improve performance. Manual parallelization is often prohibitively time-consuming and error-prone (especially due to data races and memory-consistency complexities), and some portions of code may simply be too difficult to understand or refactor for parallelization. Most existing automatic parallelization techniques operate statically at compile time and require source code to be analyzed, leaving a large fraction of software behind. In many cases, some or all of the source code and development tool chain has been lost or, in the case of third-party software, was never available. Furthermore, modern applications are assembled and defined at run time, making use of shared libraries, virtual functions, plugins, dynamically generated code, and other dynamic mechanisms, as well as multiple languages. All these aspects of separate compilation prevent the compiler from obtaining a holistic view of the program, leading to the risk of incompatible parallelization techniques, subtle data races, and resource over-subscription. All these considerations motivate dynamic binary parallelization (DBP).

This dissertation explores the novel idea of trace-based DBP, which provides a large instruction window without introducing spurious dependencies. We hypothesize that traces offer a generally good trade-off between code visibility and analysis accuracy for a wide variety of applications, enabling better parallel performance. Compared to the raw dynamic instruction stream (DIS), traces expose more distant parallelism opportunities because their average length is typically much larger than the size of the hardware instruction window. Compared to the complete control flow graph (CFG), traces contain only the control and data dependencies on the execution path that is actually taken.
More importantly, while DIS-based DBP typically exploits only fine-grained parallelism and CFG-based DBP typically exploits only coarse-grained parallelism, traces can serve as a unified representation of program execution that seamlessly incorporates the exploitation of both coarse- and fine-grained parallelism.

We develop Tracy, an innovative DBP framework that monitors a program at run time, dynamically identifies hot traces, parallelizes them, and caches them for later use so that the program can run in parallel every time a hot trace repeats. Our experimental results demonstrate that, for floating-point benchmarks, Tracy achieves an average speedup of 2.16x, which is 1.51x better than the speedup achieved by Core Fusion, a representative DIS-based DBP technique. Although the average speedup achieved by Tracy is only 1.04x better than that of CFG-based DBP, Tracy speeds up all floating-point benchmarks, whereas CFG-based DBP fails to parallelize three of the eight applications at all. The performance of Tracy is not always better than the performance of exist...