This study investigates the ability of four eddy-resolving ocean circulation models, HYCOM (HYbrid Coordinate Ocean Model), MRI.COM (Meteorological Research Institute Community Ocean Model), OFES (Ocean General Circulation Model for the Earth Simulator), and NEMO (Nucleus for European Modeling of the Ocean), to simulate the long-term mean hydrographic conditions and circulation patterns in the Japan Sea. The assessment includes an evaluation of mean vertical profiles and time series of temperature and salinity at representative monitoring stations. Model products for 1993 to 2015 are compared with in situ measurements from historical cruises and monitoring stations. We then compare observed and simulated surface current velocities over the basin and volume transports through the key straits of the Japan Sea. Simulated current velocities are validated against 15 years of Acoustic Doppler Current Profiler (ADCP) measurements near the longshore and offshore branches of the East Korea Warm Current (EKWC). Furthermore, the atmospheric forcing data of the four models are validated against a satellite wind product. We find that the vertical profiles and long-term variations of temperature and salinity reproduced by MRI.COM and HYCOM are closer to the in situ measurements than those from OFES and NEMO. All models simulate temperature well in the upper ocean, but the salinity simulated by OFES and NEMO is of lower quality at several stations. Simulated current velocities predominantly lie within the standard deviation of the ADCP measurements at the two locations. However, all four models underestimate sea surface currents relative to drifter data. Although the simulated hydrographic profiles agree well with in situ observations, the mean circulation patterns differ greatly among the models, which highlights the need for additional evaluation and correction based on long-term current measurements.
Because ocean current measurements are scarce, only the baroclinic velocities simulated by each model can be considered reliable. A substantial part of the differences in barotropic velocities among the four models is explained by differences in wind velocity among the corresponding atmospheric forcing datasets.
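The criterion used for the ADCP comparison, that a simulated velocity lies within the standard deviation of the measurements, can be illustrated with a minimal sketch. The function name `within_one_std`, the use of one sample standard deviation about the observed mean, and all numerical values are assumptions made for illustration; they are synthetic and are not data or code from this study.

```python
from statistics import mean, stdev


def within_one_std(observed, simulated_mean):
    """Return True if simulated_mean lies within one sample standard
    deviation of the mean of the observed values (assumed criterion)."""
    obs_mean = mean(observed)
    obs_std = stdev(observed)  # sample standard deviation (n - 1 denominator)
    return abs(simulated_mean - obs_mean) <= obs_std


# Synthetic, illustrative current speeds in m/s; NOT measurements from the study.
adcp_speed = [0.30, 0.42, 0.25, 0.38, 0.35]

# Hypothetical long-term mean speeds from two placeholder models.
model_means = {"model_a": 0.33, "model_b": 0.55}

# Apply the criterion to each placeholder model.
results = {name: within_one_std(adcp_speed, value)
           for name, value in model_means.items()}

for name, ok in results.items():
    print(name, "within observed spread" if ok else "outside observed spread")
```

In practice the same check would be applied per depth level and per velocity component, with the model output sampled at the ADCP location and over the same period as the observations.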