Tokenization is a primary natural language processing task for feature generation, and it plays a vital role in sentiment analysis, information retrieval, part-of-speech tagging, and named entity recognition. Urdu is spoken by around 170.2 million people worldwide as a first or second language, and it is morphologically and orthographically rich. Word tokenization of Urdu text is challenging because, unlike in other languages, a space alone does not reliably mark a word boundary. Tokens are the minimal units of a language that carry a suitable semantic structure. A compound word, a kind of multi-word expression, is a more complex word made up of multiple strings or independent base words. Traditionally, bigram or trigram approaches are used to represent compound words during tokenization. This research proposes a morphological rule-based approach to identify compound words in Urdu text for tokenization. A thorough evaluation on a dataset of reasonable size compares the performance of the proposed technique with these traditional approaches. The results show that the proposed method can accurately identify compound words when tokenizing Urdu text documents. Notably, using morphological rule-based techniques for compound words also reduces the number of extracted features.
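
To make the contrast between n-gram features and rule-based compound identification concrete, the following is a minimal Python sketch. It is not the rule set proposed in this research: the single rule used here (joining two words linked by the conjunctive "و"), the names `LINKERS`, `space_tokens`, `bigram_features`, and `merge_compounds`, and the sample sentence are all assumptions chosen for illustration.

```python
# Illustrative sketch only. LINKERS holds one hypothetical rule: the
# conjunctive "و" that links two words into a compound such as "غور و فکر".
LINKERS = {"و"}


def space_tokens(text):
    """Split on whitespace, the naive baseline for word boundaries."""
    return text.split()


def bigram_features(tokens):
    """Return every pair of adjacent tokens as one space-joined feature."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def merge_compounds(tokens, linkers=LINKERS):
    """Merge 'left linker right' sequences into a single compound token."""
    merged = []
    i = 0
    while i < len(tokens):
        has_right = i < len(tokens) - 1
        if tokens[i] in linkers and merged and has_right:
            # Attach the linker and the following word to the previous token.
            left = merged.pop()
            merged.append(" ".join([left, tokens[i], tokens[i + 1]]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical sample sentence containing one compound.
sample = "اس نے غور و فکر کیا"
tokens = space_tokens(sample)
print(len(tokens), len(bigram_features(tokens)), len(merge_compounds(tokens)))
```

On this sample, whitespace splitting yields six tokens and five bigrams, most of which pair unrelated neighbours, while the rule-based pass yields four tokens with the compound kept whole. A realistic system would need many more rules and would have to handle cases where the linker is not part of a compound.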