With the rapid development of quantitative portfolio optimization in financial engineering, many promising algorithmic trading strategies have shown competitive advantages in recent years. However, the environment of real financial markets is complex and hard to simulate fully, given the non-stationarity of stock data, unpredictable hidden causal factors, and so on. Fortunately, differenced stock prices often form a stationary series, and if the internal relationships among the differenced prices of different stocks can be linked to the decision-making process, the portfolio should be able to achieve better performance. In this paper, we first show that normalizing flows can be adopted to simulate the high-dimensional joint probability of the complex trading environment, and we develop a novel model-based reinforcement learning framework to better understand the intrinsic mechanisms of quantitative online trading. Second, we experiment with various stocks from three financial markets (Dow, NASDAQ, and S&P 500) and show that, among the three, Dow achieves the best results on various evaluation metrics under our back-testing system. In particular, our proposed method even resists large drops (lower maximum drawdown) during the COVID-19 pandemic, when the financial market experienced an unpredictable crisis. All of these results are better than those obtained by modeling the state transition dynamics with independent Gaussian processes. Third, we use a causal analysis method to study the causal relationships among different stocks in the environment. Further, by using t-SNE to visualize and compare high-dimensional state transition data from the real and virtual buffers, we uncover some effective patterns of better portfolio optimization strategies. Our methodology will benefit the decision-making process in many automated trading systems.