While data parallelism is well suited on algorithmic, architectural, and linguistic grounds to serve as a basis for portable parallel programming, its characteristic fine-grained parallelism makes the efficient implementation of data-parallel languages on MIMD machines a challenging task. The design, implementation, and evaluation of an optimizing compiler are presented for an applicative nested data-parallel language called VCODE, targeted at the Encore Multimax, a shared-memory multiprocessor. The source language supports nested aggregate data types; aggregate operations including elementwise forms, scans, reductions, and permutations; and conditionals and recursion for control flow. A small set of graph-theoretic compile-time optimizations reduces the overheads on MIMD machines in several ways: by increasing the grain size of the output program, by reducing synchronization and storage requirements, and by improving locality of reference. The two key ideas behind these optimizations are the symbolic analysis of loop structures and the hierarchical clustering of the program graph, first by loop structure and then by loop traversal patterns. A benchmark suite demonstrates both the efficiency of the output code and the effectiveness of the optimizations.
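
As a rough illustration of what increasing grain size and reducing storage can mean, the sketch below contrasts a chain of separate aggregate operations with a single fused loop. It is written in Python rather than VCODE, the function names `unfused` and `fused` are hypothetical, and the fusion is done by hand; it is not the compiler's actual clustering algorithm, only an assumed example of the kind of transformation such optimizations aim for.

```python
def unfused(xs, ys):
    # Three separate aggregate operations. Each one is a full pass over
    # the data, and the first two materialize a temporary vector.
    prods = [x * y for x, y in zip(xs, ys)]  # elementwise multiply
    incs = [p + 1 for p in prods]            # elementwise add
    return sum(incs)                         # plus-reduction


def fused(xs, ys):
    # The same computation as a single loop: one pass, no temporaries,
    # and no synchronization point between the three operations.
    total = 0
    for x, y in zip(xs, ys):
        total += x * y + 1
    return total


if __name__ == "__main__":
    xs, ys = [1, 2, 3], [4, 5, 6]
    print(unfused(xs, ys), fused(xs, ys))
```

On a MIMD machine each processor could run the fused loop over its own chunk of the input and combine partial sums at the end, so the per-element work between synchronizations grows while the intermediate vectors disappear.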