learning, reinforcement learning (RL) is one of the machine learning paradigms; it allows the machine to interact with the environment and update its policy according to negative or positive reward signals. [10,11] The algorithms of RL [10] mainly include artificial neural network (ANN)-based deep Q-learning (DQN) and spiking neural network (SNN)-based reward-modulated spike-timing-dependent plasticity (R-STDP). Compared with ANN-based DQN, which adopts error backpropagation and gradient descent to update the weights, R-STDP updates the weights by combining brain-inspired STDP with reward signals and is therefore biologically more plausible. [12][13][14][15][16] Hardware implementations of the RL paradigm with DQN based on emerging nonvolatile memory technologies (NVMs) have been reported recently. [17][18][19] However, hardware realization of R-STDP based on NVMs is far less explored. The R-STDP learning rule relies mainly on modulating the "standard" STDP by a reward term (positive or negative) that comes from the external environment, thus realizing SNN-based reinforcement learning. R-STDP stores the traces of synapses that are eligible for STDP (eligibility traces) and applies the modulated weight changes when a positive or negative reward signal is received. Eligibility traces are a critical ingredient of R-STDP learning rules. [20] Some works have realized long-lasting eligibility traces with complementary metal-oxide-semiconductor (CMOS) technology combined with large capacitors. [21][22][23][24][25] Recently, Demirağ et al. exploited the drift behavior of phase change memory devices to intrinsically implement eligibility traces with long timescales, achieving higher area efficiency than the CMOS approaches. [26] In addition to realizing the eligibility traces that underlie STDP, the polarity of STDP must be changed according to the sign of the external reward signal.
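The rule described above can be illustrated with a minimal software sketch. It is not the device-level implementation discussed in this work: the exponential STDP window, the exponentially decaying eligibility trace, and every name and parameter value (`stdp_window`, `r_stdp_update`, `tau`, `tau_e`, `lr`) are assumptions chosen only for illustration. Each pre/post spike pair leaves a trace, and the weight changes only when the reward arrives, with the reward sign setting the polarity (STDP for positive reward, anti-STDP for negative reward).

```python
import math


def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau=20.0):
    """Assumed exponential STDP kernel; dt = t_post - t_pre in ms."""
    if dt > 0:  # pre before post: potentiation
        return a_plus * math.exp(-dt / tau)
    if dt < 0:  # post before pre: depression
        return -a_minus * math.exp(dt / tau)
    return 0.0


def r_stdp_update(weight, spike_pairs, reward_time, reward,
                  tau_e=500.0, lr=0.1):
    """Apply one reward-modulated update (hypothetical helper).

    spike_pairs: list of (t_pre, t_post) times in ms.
    reward: signed scalar; its sign selects STDP or anti-STDP.
    """
    eligibility = 0.0
    for t_pre, t_post in spike_pairs:
        t_pair = max(t_pre, t_post)  # the trace is created once both spikes occurred
        if t_pair > reward_time:
            continue  # pairs after the reward cannot be credited
        # The trace decays between the spike pair and the reward delivery.
        decay = math.exp(-(reward_time - t_pair) / tau_e)
        eligibility += stdp_window(t_post - t_pre) * decay
    # The weight changes only at reward time, scaled by the signed reward.
    return weight + lr * reward * eligibility


w0 = 0.5
pairs = [(0.0, 10.0)]  # pre fires 10 ms before post
w_pos = r_stdp_update(w0, pairs, reward_time=100.0, reward=+1.0)
w_neg = r_stdp_update(w0, pairs, reward_time=100.0, reward=-1.0)
print(w_pos > w0, w_neg < w0)
```

In this sketch the same causal spike pair strengthens the synapse under positive reward and weakens it under negative reward, which is the polarity reversal that a reconfigurable device would need to provide.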
Reward-dependent polarity switching thus requires emerging nonvolatile devices with reconfigurable characteristics for both STDP and anti-STDP. 2D semiconductor field-effect (FE) transistors (FETs) with atomic-level thickness show excellent electrostatic tunability, [27][28][29][30][31][32][33] which makes it possible to modulate the channel to be reconfigurably p-type or n-type for both STDP and anti-STDP learning rules. On the other hand, to realize nonvolatile memory characteristics, we adopt the ferroelectric poly(vinylidene fluoride-trifluoroethylene) (P(VDF-TrFE)) as gate dielectric for dynamically tunable memory

Reward-modulated spike-timing-dependent plasticity (R-STDP) is a brain-inspired reinforcement learning (RL) rule, exhibiting potential for decision-making tasks and artificial general intelligence. However, the hardware implementation of the reward-modulation process in R-STDP usually requires complicated Si complementary metal-oxide-semiconductor (CMOS) circuit design that causes high power consumption and a large footprint. Here, a design with two synaptic transistors (2T) connected in a parallel structure is experimentally demonstrated. The 2T...