“…Visual Appearance Features are extracted from the predicate box, i.e. the minimum rectangle that encompasses the subject box and the object box [1,12,2,13,14,3], the separate subject-object boxes [5,15,16,17], or both [18,19,4,20,21]. All the above train a single branch with visual features, while we jointly train two separate branches with different features, a predicate feature branch (P-branch) and an object-subject branch (OS-branch), and employ Deep Supervision to align their scores into a common space.…”