Non-volatile memories are gaining significant attention for embedded cache applications due to their low standby power and excellent retention. Domain wall memory (DWM) is one possible candidate because it can store multiple bits per cell, which breaks the density barrier. Additionally, it provides low standby power, fast access time, good endurance and good retention. However, it suffers from high write latency, shift latency, shift power and write power. DWM is sequential in nature: the latency of a read/write operation depends on the offset of the target bit from the read/write head. This paper investigates circuit design challenges such as bitcell layout, head positioning, utilization factor of the nanowire, shift power and shift latency, and provides solutions to these issues. A synergistic system is proposed that combines circuit techniques, such as merged read/write heads (for compact layout), flipped-bitcell and shift gating (for shift power optimization), wordline (WL) strapping (for access latency) and shift circuit design, with micro-architectural techniques, such as segmented cache, to realize an energy-efficient and robust DWM cache. Simulations show 3-33% better performance and a 1.25X-14.4X improvement in power over a wide range of PARSEC benchmarks.
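The offset-dependent access latency described above can be illustrated with a minimal Python sketch. It is not the paper's model: the function names, the one-cycle shift and access costs, and the assumption that the nanowire returns to its rest position before every access are all hypothetical simplifications chosen for illustration. The sketch only shows why head positioning matters: the number of shifts equals the distance from the target bit to the nearest head.

```python
from typing import Sequence


def shift_distance(bit_index: int, head_positions: Sequence[int]) -> int:
    """Shifts needed to align bit_index with the nearest read/write head.

    Assumption: the nanowire starts from its rest position on every access.
    """
    return min(abs(bit_index - head) for head in head_positions)


def access_latency(
    bit_index: int,
    head_positions: Sequence[int],
    shift_cycles_per_bit: int = 1,  # hypothetical cost of one shift step
    access_cycles: int = 1,         # hypothetical cost of the read/write itself
) -> int:
    """Total cycles for one access: shift to the head, then read or write."""
    shifts = shift_distance(bit_index, head_positions)
    return shifts * shift_cycles_per_bit + access_cycles


def average_latency(n_bits: int, head_positions: Sequence[int]) -> float:
    """Mean access latency over all bits, assuming uniformly random accesses."""
    total = sum(access_latency(i, head_positions) for i in range(n_bits))
    return total / n_bits


if __name__ == "__main__":
    # One head at the end of an 8-bit nanowire versus two heads spread along it.
    print(average_latency(8, [0]))     # 4.5
    print(average_latency(8, [2, 6]))  # 2.0
```

Under these assumptions, adding a second head and spreading the heads along the nanowire cuts the average latency of an 8-bit nanowire from 4.5 to 2.0 cycles. A real design would also have to account for the extra area of each head and for the padding bits needed to shift data without losing it, which this sketch ignores.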