Emerging 5G technologies can significantly reduce end-to-end service latency for applications that require strict quality of service (QoS). With network function virtualization (NFV), a client's data may pass sequentially through multiple service functions (SFs) for processing and analysis before a request from such an application is completed, and each SF adds processing delay. To reduce the processing delay of serially running SFs, network function parallelism (NFP), which allows multiple SFs to run in parallel, has been introduced. In this work, we study how to apply NFP to the SF chaining and embedding process so that the latency, including both processing and propagation delays, is jointly minimized. We introduce a novel augmented graph to capture the parallel relationship constraints among the required SFs. Taking these constraints into account, we formulate a novel problem called parallelism-aware service function chaining and embedding (PSFCE). For this problem, we propose a near-optimal maximum parallel block gain (MPBG) first optimization algorithm for the case in which the computing resources at each physical node are sufficient to host the required SFs. For the case in which computing resources are limited, we propose a logarithm-approximate algorithm, called parallelism-aware SFs deployment (PSFD), that jointly optimizes processing and propagation delays. We conduct extensive simulations on multiple network scenarios to evaluate the performance of our schemes. The results show that (i) MPBG is near-optimal, (ii) the optimization of end-to-end service latency depends largely on the processing delay in small networks and is affected more by the propagation delay in large networks, and (iii) PSFD outperforms schemes directly extended from existing works in terms of end-to-end latency.
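To make the intuition behind NFP concrete, the following minimal sketch compares the processing delay of a serial SF chain with that of the same chain when some SFs are grouped into a parallel block. It is an illustration only, not the MPBG or PSFD algorithm. The delay model is an assumption: a parallel block is taken to cost the maximum delay of its members plus a hypothetical `merge_overhead`, and all function names and numbers are invented for the example.

```python
from typing import List, Sequence


def serial_processing_delay(sf_delays: Sequence[float]) -> float:
    """Processing delay when every SF runs one after another."""
    return sum(sf_delays)


def parallel_processing_delay(blocks: List[List[float]],
                              merge_overhead: float = 0.0) -> float:
    """Processing delay when the chain is split into blocks.

    Each inner list is one block. A block with a single SF runs
    serially. A block with several SFs is assumed (hypothetically)
    to cost the slowest member plus a fixed merge overhead.
    """
    total = 0.0
    for block in blocks:
        if not block:
            continue  # ignore empty blocks
        if len(block) == 1:
            total += block[0]
        else:
            total += max(block) + merge_overhead
    return total


def end_to_end_latency(processing: float, propagation: float) -> float:
    """End-to-end latency as the sum of the two delay components."""
    return processing + propagation


# Illustrative delays in milliseconds (made-up values).
chain = [2.0, 3.0, 4.0, 1.0]
blocks = [[2.0], [3.0, 4.0], [1.0]]  # the two middle SFs run in parallel

serial = serial_processing_delay(chain)
parallel = parallel_processing_delay(blocks, merge_overhead=0.5)
print(serial, parallel)
```

Under these assumptions, grouping the two middle SFs lowers the processing delay, but the gain shrinks as the merge overhead grows, and the placement of parallel SFs on physical nodes still determines the propagation component.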