In this paper, the impact of direct liquid cooling (DLC) system failure on information technology (IT) equipment is studied experimentally. The main factors anticipated to affect the IT equipment response during failure are the central processing unit (CPU) utilization, the coolant set point temperature (SPT), and the server type. These factors are varied experimentally, and the IT equipment response is studied in terms of chip temperature and power, CPU utilization, and total server power. It was found that failure of this cooling system is hazardous and can lead to data center shutdown in less than a minute. Additionally, the CPU frequency throttling mechanism was found to be vital to understanding the changes in chip temperature, power, and utilization. Other mechanisms associated with high temperatures, such as leakage power and changes in fan speed, were also observed. Finally, possible remedies are proposed to reduce the probability and the consequences of cooling system failure.
This article reports experimental and numerical testing performed to characterize the operation and reliability of the open compute (OC) storage system in a contained environment, from the server level to the aisle level. The study comprises three parts. The first part is an experimental analysis of the thermal and utilization responses of the high density (HD) 3D array storage unit during airflow imbalances. This is done with the stress test proposed for IT in containment to mimic possible mismatch and cascade failure scenarios. It is found that downstream hard disk drives (HDDs) are most prone to overheating and loss in utilization during an airflow imbalance. This was shown to undermine the storage capacity of the HDDs. An IT level airflow prediction model is discussed for the storage unit and validated for different fan speeds. In the second part, a computational fluid dynamics (CFD) model is created for a high density open rack based on the active flow curve (AFC) method. Here, the measured airflow response curves for the open compute IT (storage and compute servers) are used to build compact models, run rack level testing of IT air system sensitivity, and create a rack level AFC airflow demand prediction model. Finally, the experimental characterization data is used to build an aisle level model (POD) that incorporates IT fan control systems (FCS). This modeling approach predicts a shorter uptime during chiller failure, because the increased IT airflow demand in cases such as chiller failure or high economizer temperatures increases recirculation.
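The idea of a rack level airflow demand prediction built from measured IT airflow response curves can be illustrated with a minimal sketch. It is not the authors' model. It assumes that each IT unit's curve is available as a few (pressure differential, airflow) points, that intermediate values can be linearly interpolated, and that all units in the rack see the same pressure differential, so the rack demand is the sum of the unit airflows. The curve values, the sign convention (cold aisle minus hot aisle pressure), and all names are hypothetical.

```python
from bisect import bisect_left

def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing.
    Values outside the measured range are clamped to the end points."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical flow curves (NOT measured data from the article):
# pressure differential in Pa -> airflow through one unit in m^3/s.
STORAGE_CURVE = ([-20.0, 0.0, 20.0], [0.02, 0.05, 0.08])
COMPUTE_CURVE = ([-20.0, 0.0, 20.0], [0.01, 0.03, 0.05])

def rack_airflow(dp, units):
    """Sum the airflow of every unit in the rack at a common pressure
    differential dp. `units` is a list of (curve, count) pairs."""
    return sum(count * interp(dp, *curve) for curve, count in units)

# Assumed rack population: 2 storage units and 10 compute servers.
rack = [(STORAGE_CURVE, 2), (COMPUTE_CURVE, 10)]

if __name__ == "__main__":
    for dp in (-10.0, 0.0, 10.0):
        print(f"dp = {dp:+.1f} Pa -> rack airflow = {rack_airflow(dp, rack):.3f} m^3/s")
```

Under these assumptions, a negative pressure differential (an under-provisioned cold aisle) lowers the airflow each unit can draw, which is the kind of imbalance the stress test is meant to reproduce.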