Despite significant progress in developing named entity recognition models, scaling to newly emerging types remains challenging in real-world scenarios. Continual learning and zero-shot learning approaches have been explored to handle newly emerging types with less human supervision, but they have not been adopted as successfully as supervised approaches. Meanwhile, humans possess a much larger vocabulary than these approaches and can learn the alignment between entities and concepts effortlessly through natural supervision. In this paper, we consider a more realistic and challenging setting called open-vocabulary named entity recognition (OVNER) to imitate human-level ability. OVNER aims to recognize entities of novel types by their textual names or descriptions. Specifically, we formulate OVNER as a semantic matching task and propose a novel and scalable two-stage method called Context-Type SemAntiC Alignment and FusiOn (CACAO). In the pre-training stage, we adopt a Dual-Encoder for context-type semantic alignment and pre-train it on 80M context-type pairs, which are easily accessible through natural supervision. In the fine-tuning stage, we use a Cross-Encoder for context-type semantic fusion and fine-tune it on base types with human supervision. Experimental results show that our method outperforms the previous state-of-the-art methods on three challenging OVNER benchmarks by 9.7%, 9.5%, and 1.8% in F1-score on novel types. Moreover, CACAO also demonstrates flexible transfer ability in cross-domain NER.
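To make the semantic matching formulation above concrete, the following is a minimal toy sketch, not the CACAO implementation. It scores each (context, type description) pair and returns the best-matching type, so a novel type can be added at inference time by supplying only its description. All function names here (`embed`, `dual_encoder_score`, `cross_encoder_score`, `match_type`) are hypothetical, and the bag-of-words scorers are stand-ins for the learned Dual-Encoder and Cross-Encoder; they only illustrate the difference between encoding the two inputs independently and scoring them jointly.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dual_encoder_score(context, type_description):
    """Dual-encoder style (assumed shape): each side is encoded on its own,
    and the pair is compared only through vector similarity."""
    return cosine(embed(context), embed(type_description))


def cross_encoder_score(context, type_description):
    """Cross-encoder style (assumed shape): the pair is scored jointly.
    Toy version: fraction of description tokens that occur in the context."""
    context_tokens = set(context.lower().split())
    description_tokens = type_description.lower().split()
    if not description_tokens:
        return 0.0
    hits = sum(token in context_tokens for token in description_tokens)
    return hits / len(description_tokens)


def match_type(context, type_descriptions, scorer):
    """Return the type name whose description best matches the context."""
    return max(
        type_descriptions,
        key=lambda name: scorer(context, type_descriptions[name]),
    )


if __name__ == "__main__":
    context = "the mention aspirin is a drug used to reduce pain"
    types = {"location": "a city country or geographic place"}
    # A novel type is introduced purely by its textual description.
    types["medication"] = "a drug or medicine used to treat illness or pain"
    print(match_type(context, types, dual_encoder_score))
    print(match_type(context, types, cross_encoder_score))
```

In this sketch the type inventory is just a dictionary of descriptions, which is what lets the label set grow without retraining; in a real system the two scorers would be neural encoders trained as the abstract describes.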