Facial expression recognition and the estimation of the corresponding action unit intensities have reached many milestones. However, intensity estimation remains challenging because action units vary only subtly during emotional arousal. Recent approaches are limited by the characteristics of the probabilistic models they use to capture relationships among action units. Exploiting the ordinal relationships across an emotional transition sequence, we propose two metric learning approaches, based on self-attention triplet and Siamese networks, to estimate emotional intensities. Our emotion expert branches use the shifted-window Swin Transformer, which restricts self-attention computation to non-overlapping local windows while also allowing for cross-window connections. This design models action units flexibly at multiple scales while maintaining high performance. We evaluate our networks' spatial and temporal feature localization on the CK+, KDEF-dyn, AFEW, SAMM, and CASME-II datasets. Our networks outperform state-of-the-art deep learning methods in micro-expression detection on the latter two datasets by 2.4% and 2.6% in unweighted average recall (UAR), respectively. Ablation studies with a thorough analysis highlight the strengths of our design.
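
To make the two metric learning objectives concrete, the following is a minimal sketch rather than the authors' implementation. It assumes Euclidean distance, a unit margin, and toy 2-D embeddings standing in for Swin Transformer features; the function names, the intensity labels, and the triplet-mining rule are hypothetical illustrations of how ordinal relationships in an emotional transition sequence could drive a triplet or Siamese (contrastive) loss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the positive should sit closer to the
    anchor than the negative does, by at least `margin` (assumed value)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def contrastive_loss(x1, x2, same, margin=1.0):
    """Siamese (contrastive) loss: pull pairs labeled `same` together and
    push other pairs at least `margin` apart."""
    d = euclidean(x1, x2)
    return d ** 2 if same else max(0.0, margin - d) ** 2


def ordinal_triplets(intensities):
    """Hypothetical mining rule: return index triples (a, p, n) where frame p
    is closer in intensity to frame a than frame n is."""
    n = len(intensities)
    triples = []
    for a in range(n):
        for p in range(n):
            for q in range(n):
                if len({a, p, q}) != 3:
                    continue
                if abs(intensities[a] - intensities[p]) < abs(intensities[a] - intensities[q]):
                    triples.append((a, p, q))
    return triples


if __name__ == "__main__":
    # Toy embeddings for three frames of one transition sequence,
    # with made-up intensity labels 0 (neutral), 1, and 3 (near apex).
    frames = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
    labels = [0, 1, 3]
    for a, p, q in ordinal_triplets(labels):
        print((a, p, q), triplet_loss(frames[a], frames[p], frames[q]))
```

In this sketch a triplet whose embedding distances already respect the ordinal ordering by the margin contributes zero loss, while a violated ordering is penalized in proportion to the violation.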