Multiple modality data bring new challenges for sentiment analysis, as combining varieties of information in an effective manner is a rigorous task. Previous works do not effectively utilize the relationship and influence between texts and images. This paper proposes a fusion-extraction network model for multimodal sentiment analysis. First, our model uses an interactive information fusion mechanism to interactively learn the visual-specific textual representations and the textual-specific visual representations. Then, we propose an information extraction mechanism to extract valid information and filter redundant parts for the specific textual and visual representations. The experimental results on two public multimodal sentiment datasets show that our model outperforms existing state-of-the-art methods.