Tracking with Natural-Language Specification (TNL) is a joint vision-and-language understanding task with a wide range of applications. In previous works, the interaction between the two heterogeneous modalities, vision and language, is mainly performed through a simple dynamic convolution. However, the performance of prior works is limited by the difficulty of modeling a dynamically changing target and its surroundings under the linguistic variation of natural-language queries. Moreover, in these methods the language and visual features are first fused and only then used for tracking, which makes it hard to model the query-focused context; query-focused tracking requires stronger context modeling to strengthen the correlation between the two modalities. To address these issues, we propose a capsule-based network, referred to as CapsuleTNL, which performs regression tracking with a natural-language query. First, the visual and textual inputs are encoded as capsules, which capture not only the relationships between entities but also the part-whole relationships within each entity. Then, we devise two interaction routing modules: a visual-textual routing module that reduces the linguistic variation of the input query, and a textual-visual routing module that simultaneously and precisely incorporates query-related visual cues. To validate the potential of the proposed network for visual object tracking, we evaluate our method on two large-scale tracking benchmarks. The experimental results demonstrate the effectiveness of our capsule-based network.
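The abstract does not specify the routing procedure used inside the two interaction modules; as background, the following is a minimal numpy sketch of standard routing-by-agreement between capsules (in the style of Sabour et al.), which such interaction modules typically build on. All shapes and names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash nonlinearity: shrinks each vector's norm into [0, 1)
    # while preserving its direction.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between input and output capsules.

    u_hat: (num_in, num_out, dim) prediction ("vote") vectors, one vote
    per input capsule for each output capsule.
    Returns the output capsule vectors, shape (num_out, dim).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted votes
        v = squash(s)                                         # output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
    return v

# Illustrative usage: route 8 input capsules to 4 output capsules.
rng = np.random.default_rng(0)
votes = rng.normal(size=(8, 4, 16))
out = dynamic_routing(votes)
print(out.shape)  # (4, 16)
```

In a cross-modal setting such as the one described, the input capsules of one modality (e.g. textual) would cast votes for the output capsules of the other (e.g. visual), so the iterative agreement step is what couples the two feature spaces.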