The topology of the endoplasmic reticulum (ER) network is highly regulated by various cellular and environmental stimuli and affects major functions such as protein quality control and the cell's response to metabolic changes. The ability to quantify dynamic changes in ER structure in response to cellular perturbations is crucial for the development of novel therapeutic approaches against ER-associated diseases, such as hereditary spastic paraplegias and Niemann-Pick disease type C. However, the rapid movement and small spatial dimensions of ER networks make this task challenging. Here, we combine video-rate super-resolution imaging with a state-of-the-art semantic segmentation method capable of automatically classifying sheet and tubular ER domains inside individual cells. The segmented data are skeletonised and represented as connectivity graphs to enable precise and efficient quantification and comparison of network connectivity across different complex ER phenotypes. The method, called ERnet, is powered by a Vision Transformer architecture and integrates multi-head self-attention and channel attention into the model for adaptive weighting of frames in the time domain. We validated the performance of ERnet by measuring different changes in ER morphology in response to genetic or metabolic manipulations. Finally, to test the applicability and versatility of ERnet, we showed that it can be applied to images from different cell types and to images acquired on different imaging setups. Our method can be deployed in an automatic, high-throughput, and unbiased fashion to identify subtle changes in cellular phenotypes that can be used as potential diagnostics for propensity to ER-mediated disease, for disease progression, and for response to therapy.
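To make the skeleton-to-graph step concrete, the following is a minimal pure-Python sketch of how a binary skeleton might be turned into a connectivity graph and summarised. It is an illustration under stated assumptions, not ERnet's implementation: the function names `skeleton_to_graph` and `summarise`, the toy `SKELETON` array, and the choice of 8-connectivity with degree-based endpoint and junction detection are all hypothetical.

```python
from collections import deque

# Offsets of the 8-connected neighbourhood (an assumption of this sketch).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]


def skeleton_to_graph(skeleton):
    """Map each foreground pixel (row, col) to its foreground neighbours."""
    pixels = {(r, c)
              for r, row in enumerate(skeleton)
              for c, value in enumerate(row) if value}
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in NEIGHBOURS
                 if (r + dr, c + dc) in pixels]
        for (r, c) in pixels
    }


def summarise(graph):
    """Count endpoints (degree 1), junctions (degree >= 3) and components."""
    endpoints = sorted(p for p, nbrs in graph.items() if len(nbrs) == 1)
    junctions = sorted(p for p, nbrs in graph.items() if len(nbrs) >= 3)

    # Breadth-first search to count connected fragments of the network.
    seen, components = set(), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            pixel = queue.popleft()
            for neighbour in graph[pixel]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)

    return {"endpoints": endpoints,
            "junctions": junctions,
            "components": components}


# Toy skeleton: an X-shaped network plus a detached two-pixel fragment.
SKELETON = [
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
]

stats = summarise(skeleton_to_graph(SKELETON))
print(stats)
```

For this toy input the sketch reports one junction at the centre of the X, six endpoints, and two connected components. On real skeletons, pixel-level degree counting with 8-connectivity tends to flag several adjacent pixels around a single branch point, so a practical pipeline would usually merge neighbouring junction pixels into one node before comparing phenotypes.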