Non-volatile memory (NVM) is a promising technology for low-energy, high-capacity main memory. The characteristics of NVM devices, however, tend to differ fundamentally from those of DRAM (i.e., the memory device currently used for main memory) because their memory cells operate on different principles. Typically, the write latency of an NVM device such as PCM or ReRAM is much higher than its read latency. This asymmetry between read and write latencies is likely to affect application performance significantly. To analyze the behavior of applications running on NVM-based main memory, most researchers use software-based emulation tools because few commercial NVM products are available. However, existing emulation tools are either too slow to emulate a large-scale, realistic workload or too simplistic to investigate the details of application behavior on NVM with asymmetric read/write latencies. This paper therefore proposes a new NVM emulation mechanism that is not only lightweight but also aware of the read/write latency gap in NVM-based main memory. We implemented a prototype of the proposed mechanism for Intel processors of the Haswell microarchitecture. We also evaluated its accuracy and performed case studies with practical benchmarks. The results showed that our prototype accurately emulated the write latencies of NVM-based main memory: it emulated NVM write latencies in the range from 200 ns to 1000 ns with negligible errors of 0.2% to 1.1%. We confirmed that our emulator enabled us to successfully estimate the performance of practical workloads on NVM-based main memory, whereas an existing lightweight emulation model misestimated it.

key words: middleware, non-volatile memory, performance emulation, asymmetric read/write latencies, write-back awareness

† The author is with Tokyo University of Agriculture and Technology, Tokyo, 184-8588 Japan.

Note: this paper extends our preliminary work published at NVMSA 2017 [5].
Specifically, we reimplemented the prototype of our emulator, which previously targeted an older processor architecture (i.e., Sandy Bridge), for Intel Haswell processors to verify the portability of our emulator to newer processor families. Along with the reimplementation, we drastically improved the accuracy of the emulator by fixing bugs in the cache miss measurement; the worst-case emulation error of the NVM write latency was reduced from 28.6% to 1.1%. Moreover, we conducted thorough experiments using various workloads. All parts of the paper have also been thoroughly revised to improve its quality.
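
To make the notion of read/write-asymmetric emulation in the abstract more concrete, the following is a minimal sketch of how an emulator might turn per-interval event counts into an injected delay. It is an illustration under stated assumptions, not the mechanism proposed in this paper: the names `LatencyModel`, `extra_delay_ns`, and `relative_error_percent`, the linear delay model, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Hypothetical latency parameters, all in nanoseconds."""
    dram_ns: float       # latency of the DRAM that backs the emulator (assumed)
    nvm_read_ns: float   # NVM read latency to emulate (assumed)
    nvm_write_ns: float  # NVM write latency to emulate (assumed)


def extra_delay_ns(model: LatencyModel, read_misses: int, write_backs: int) -> float:
    """Delay to inject for one interval, assuming a simple linear model.

    Reads and write-backs are charged separately, so a write latency
    higher than the read latency is reflected in the injected delay.
    """
    read_penalty = max(model.nvm_read_ns - model.dram_ns, 0.0)
    write_penalty = max(model.nvm_write_ns - model.dram_ns, 0.0)
    return read_misses * read_penalty + write_backs * write_penalty


def relative_error_percent(target_ns: float, emulated_ns: float) -> float:
    """Emulation error of a measured latency relative to its target."""
    return abs(emulated_ns - target_ns) / target_ns * 100.0


if __name__ == "__main__":
    # Illustrative values only: reads as fast as DRAM, writes five times slower.
    model = LatencyModel(dram_ns=100.0, nvm_read_ns=100.0, nvm_write_ns=500.0)
    print(extra_delay_ns(model, read_misses=1000, write_backs=200))
    print(relative_error_percent(target_ns=1000.0, emulated_ns=1011.0))
```

In this sketch, a model that ignores write-backs would inject no delay at all for the example interval, which is the kind of misestimation a write-back-aware emulator is meant to avoid.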