Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid growth of user-generated videos on the web. To solve this problem, most approaches try to learn a joint embedding space in which cross-modal similarities can be measured, while paying little attention to the representation of each modality. A video is more complex than the commonly used visual feature alone, since its audio track and on-screen captions also carry rich information. Recently, aggregating multiple features from videos has boosted the benchmark performance of video-text retrieval systems. However, these methods usually handle each feature independently, ignoring the interchange of high-level semantic relations among the features. Moreover, beyond the inter-modal ranking constraint, which requires semantically similar texts and videos to stay close, the modality-specific requirement, i.e., that two similar videos (or texts) should have similar representations, is also significant. In this paper, we propose a novel Multi-Feature Graph ATtention Network (MFGATN) for cross-modal video-text retrieval. Specifically, we introduce a multi-feature graph attention module, which enriches the representation of each feature in a video through the interchange of high-level semantic information among the features. Moreover, we elaborately design a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint to preserve both the cross-modal semantic similarity and the modality-specific consistency.
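For intuition only, the sketch below shows one plausible way such a dual-constraint objective could be assembled in PyTorch: a hinge-based triplet term for the inter-modal ranking constraint, plus an intra-modal term that aligns the within-modality similarity structures of videos and texts. The function name, the margin, the weight `alpha`, and the specific form of the structure term are illustrative assumptions, not the paper's exact DCRL formulation.

```python
import torch
import torch.nn.functional as F


def dual_constraint_ranking_loss(video_emb, text_emb, margin=0.2, alpha=1.0):
    """Hedged sketch of a dual-constraint ranking objective.

    video_emb, text_emb: (batch, dim) tensors in which row i of each tensor
    forms a matched video-text pair. Hyperparameters are illustrative.
    """
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)

    # Inter-modal ranking constraint: a matched pair should score higher
    # than any mismatched pair by at least `margin`.
    sim = v @ t.t()                       # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)         # similarities of matched pairs
    cost_v2t = (margin + sim - pos).clamp(min=0)       # video -> wrong text
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)   # text  -> wrong video
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    inter = (cost_v2t.masked_fill(mask, 0).sum()
             + cost_t2v.masked_fill(mask, 0).sum())

    # Intra-modal structure constraint (one plausible reading): videos that
    # are similar to each other should correspond to texts that are similar
    # to each other, so the two within-modality similarity matrices are
    # pulled toward one another.
    intra = F.mse_loss(v @ v.t(), t @ t.t())

    return inter + alpha * intra


# Usage example with random embeddings.
video_emb = torch.randn(32, 512)
text_emb = torch.randn(32, 512)
loss = dual_constraint_ranking_loss(video_emb, text_emb)
```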