With the end of Dennard scaling, today there is an urgent need for algorithms, programming language systems and tools, and hardware that deliver on the potential of parallelism. This work (from my PhD dissertation, supervised by Ken Kennedy) was one of the early papers to optimize for, and experimentally explore, the tension between data locality and parallelism on shared-memory machines. A key result was that false sharing of cache lines between processors with local caches on separate chips was disastrous to the performance and scaling of applications; a short sketch at the end of this section illustrates the effect. This retrospective includes a short personal tour through the history of parallel computing, a discussion of locality and parallelism modeling versus a polyhedral formulation of optimizing dense matrix codes, and how this problem is still relevant to compilers today. I end with a short memorial to my deceased co-author and advisor, Ken Kennedy.

Parallel computing seemed to be entering its heyday in the late 1980s and early 1990s. At Rice in 1989, Ken Kennedy was awarded an NSF Science and Technology Center, the Center for Research on Parallel Computation (CRPC), as its Principal Investigator. The CRPC started with seven sites and eventually included 400 researchers, staff, and graduate students, whose technical expertise spanned parallel algorithms, compilers, runtimes, and hardware. The CRPC vision that Ken, his collaborators, and students shared was to invent parallel algorithms for critical problems in science, coupled with programming language tools, such as compilers, runtime systems, and programming environments, that made them run fast. We were not trying to solve the dusty-deck problem of automatically converting sequential algorithms to parallel ones; we understood that parallel and sequential algorithms for the same problem require different solutions. However, tools would do the heavy lifting of mapping application parallelism to hardware parallelism, so that programmers would not have to reimplement their algorithms for each new parallel architecture. A key aspect of this problem is balancing parallelism, sharing between tasks, and memory usage, which was the topic our paper addressed.

In this same period, a number of established companies and startups, such as Sequent, had introduced parallel machines. The Sequent Symmetry was the machine on which we reported our results. It was not yet clear that the research and development challenges of parallel computing would make it too costly to win in the marketplace in the short term. By the mid-1990s, this generation of parallel computers together with some of the compa...
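The false-sharing effect mentioned above remains easy to reproduce on today's shared-memory machines. The sketch below is illustrative only, not code from the paper: it assumes a 64-byte cache line, eight threads, and an OpenMP compiler (for example, gcc -fopenmp), and all names and constants are hypothetical. Each thread repeatedly increments its own counter, first in an unpadded array whose elements share cache lines, then in a padded array where each counter occupies its own line.

    /* Minimal sketch of false sharing (not code from the paper).
     * Assumes a 64-byte cache line and an OpenMP compiler. */
    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS   8
    #define ITERS      100000000L
    #define LINE_BYTES 64                 /* assumed cache-line size */

    /* Unpadded: neighboring counters share a cache line. */
    static volatile long shared_counters[NTHREADS];

    /* Padded: each counter occupies its own cache line. */
    static volatile struct {
        long value;
        char pad[LINE_BYTES - sizeof(long)];
    } padded_counters[NTHREADS];

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                shared_counters[id]++;        /* false sharing */
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                padded_counters[id].value++;  /* private cache lines */
        }
        double t2 = omp_get_wtime();

        printf("unpadded: %.2fs   padded: %.2fs\n", t1 - t0, t2 - t1);
        return 0;
    }

On a typical multi-core machine the unpadded loop runs several times slower, because every write forces the shared cache line to bounce between processors' caches, the same kind of coherence traffic that undermined performance and scaling in our experiments.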