We prove that, at least for the binary erasure channel, the polar-coding paradigm gives rise to codes that not only approach the Shannon limit but, in fact, do so under the best possible scaling of their block length as a function of the gap to capacity. This result exhibits the first known family of binary codes that attain both optimal scaling and quasi-linear complexity of encoding and decoding. Specifically, for any fixed δ > 0, we exhibit binary linear codes that ensure reliable communication at rates within ε > 0 of capacity with block length n = O(1/ε^(2+δ)), construction complexity Θ(n), and encoding/decoding complexity Θ(n log n).

Our proof is based on the construction and analysis of binary polar codes with large kernels. It was recently shown that, for all binary-input symmetric memoryless channels, conventional polar codes (based on a 2 × 2 kernel) allow reliable communication at rates within ε > 0 of capacity with block length, construction, encoding, and decoding complexity all bounded by a polynomial in 1/ε. In particular, this means that the block length n scales as O(1/ε^µ), where the constant µ is called the scaling exponent. It is furthermore known that the optimal scaling exponent is µ = 2, and that it is achieved by random linear codes. However, for general channels, the decoding complexity of random linear codes is exponential in the block length. As for conventional polar codes, their scaling exponent depends on the channel; for the binary erasure channel it is given by µ ≈ 3.63. This falls far short of the optimal scaling guaranteed by random codes: at ε = 10^(-2), for example, µ ≈ 3.63 forces block lengths on the order of 10^7, whereas µ = 2 requires only about 10^4.

Our main contribution is a rigorous proof of the following result: there exist ℓ × ℓ binary kernels such that polar codes constructed from these kernels achieve a scaling exponent µ(ℓ) that tends to the optimal value of 2 as ℓ grows. We furthermore characterize precisely how large ℓ needs to be as a function of the gap between µ(ℓ) and 2. The resulting binary codes maintain the beautiful recursive structure of conventional polar codes, and thereby achieve construction complexity Θ(n) and encoding/decoding complexity Θ(n log n). This implies that block length, construction, encoding, and decoding complexity are all linear or quasi-linear in 1/ε^2, which meets the information-theoretic lower bound.
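To make the polarization mechanism behind these statements concrete, here is a minimal illustrative sketch, not taken from the paper (the function name polarize_bec and the thresholds are our own choices), of how the conventional 2 × 2 Arıkan kernel acts on the binary erasure channel: a BEC with erasure probability z splits into two synthetic channels with erasure probabilities z^2 and 2z − z^2, and iterating this map drives the channels toward 0 or 1.

    # Illustrative sketch (not from the paper): repeated polarization of
    # the BEC under the 2 x 2 Arikan kernel. Each step maps an erasure
    # probability z to z^2 (better channel) and 2z - z^2 (worse channel).

    def polarize_bec(z0, m):
        """Erasure probabilities of the 2^m synthetic channels after m steps."""
        zs = [z0]
        for _ in range(m):
            zs = [w for z in zs for w in (z * z, 2 * z - z * z)]
        return zs

    if __name__ == "__main__":
        z0 = 0.5  # erasure probability of the underlying BEC
        for m in (4, 8, 12, 16):
            zs = polarize_bec(z0, m)
            # Channels that are neither almost noiseless nor almost useless;
            # how fast this fraction decays governs the gap to capacity.
            mid = sum(1 for z in zs if 1e-3 < z < 1 - 1e-3)
            print(f"m={m:2d}  n={2**m:6d}  unpolarized fraction = {mid/len(zs):.4f}")

Running this shows the unpolarized fraction shrinking polynomially in the block length n = 2^m; the scaling exponent µ quantifies exactly this decay, with the unpolarized fraction behaving roughly as n^(-1/µ).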
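The quasi-linear encoding complexity claimed above stems from the Kronecker-power structure that large-kernel polar codes share with conventional ones. The following sketch is our own illustration (the helper kernel_encode is hypothetical, and any column permutation used in practice is omitted): it applies the m-fold Kronecker power of an arbitrary binary ℓ × ℓ kernel K recursively, in Θ(n log n) time for fixed ℓ.

    # Illustrative sketch (assumed interface, not the paper's construction):
    # recursive encoding u @ K^{(x)m} over GF(2) for a binary l x l kernel K.
    import numpy as np

    def kernel_encode(u: np.ndarray, K: np.ndarray) -> np.ndarray:
        """Compute u @ (K kron ... kron K) over GF(2), recursively."""
        l = K.shape[0]
        n = u.size
        if n == 1:
            return u
        # Reshape u row-major into an l x (n/l) matrix U. Then
        # u @ (K kron K^{(x)(m-1)}) equals K.T @ U with the smaller
        # Kronecker power applied to each row, flattened row-major.
        U = u.reshape(l, n // l)
        V = (K.T @ U) % 2
        return np.concatenate([kernel_encode(row, K) for row in V])

    if __name__ == "__main__":
        # Arikan's 2 x 2 kernel as a sanity check; any invertible
        # binary l x l kernel is handled the same way.
        K = np.array([[1, 0], [1, 1]], dtype=np.uint8)
        u = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)  # n = 8 = 2^3
        x = kernel_encode(u, K)
        # Direct check against the explicit Kronecker power.
        G = np.kron(np.kron(K, K), K)
        assert np.array_equal(x, (u @ G) % 2)
        print(x)

Each recursion level costs Θ(ℓn) binary operations and there are log_ℓ n levels, giving Θ(n log n) overall for fixed ℓ; replacing the 2 × 2 kernel with a larger one changes only K, which is why the large-kernel codes retain the same complexity profile.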