The rapid and widespread proliferation of data and information on the web has produced a vast, widely dispersed volume of natural-language textual resources. Considerable attention is currently being devoted to discovering, distributing, and retrieving this enormous source of knowledge. Processing such large data volumes in a reasonable time frame is therefore both a major challenge and a vital requirement in many commercial and research fields. Computer clusters, distributed systems, and parallel computing paradigms have been increasingly adopted in recent years, since they offer significant improvements in computing performance in data-intensive contexts such as Big Data mining and analysis. Natural language processing (NLP) is one of the key techniques for text annotation and initial feature extraction in application domains with high computational requirements; these tasks can therefore benefit from parallel architectures. This study presents a distributed framework for crawling web documents and running NLP tasks in parallel. The system is built on the Apache Hadoop ecosystem and its parallel programming paradigm, MapReduce. Validation is performed by applying the approach to extract keywords and key phrases from web documents on a multi-node Hadoop cluster. The results show increased storage capacity, faster data processing, reduced user search time, and accurate retrieval of content from the large dataset stored in HBase.
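To make the MapReduce idea concrete, the following is a minimal single-process sketch of how keyword extraction could be split into map, shuffle, and reduce steps. It is not the framework described above: the function names, the stopword list, the regular-expression tokenizer, and the use of raw term frequency as the keyword score are all assumptions made for illustration, and a real Hadoop job would distribute the map and reduce calls across cluster nodes.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "and", "for", "from", "with", "are", "this", "that"}


def map_document(doc_id, text):
    """Map step: emit a (term, 1) pair for every candidate keyword in one document."""
    for token in re.findall(r"[a-z]+", text.lower()):
        # Skip stopwords and very short tokens, which are rarely useful keywords.
        if token not in STOPWORDS and len(token) > 2:
            yield token, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(term, counts):
    """Reduce step: sum the partial counts for one term."""
    return term, sum(counts)


def extract_keywords(documents, top_n=3):
    """Run map, shuffle, and reduce in-process and return the top_n terms.

    Ties are broken alphabetically so the result is deterministic.
    """
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_document(doc_id, text)
    )
    totals = dict(reduce_counts(term, counts) for term, counts in shuffle(pairs).items())
    return sorted(totals, key=lambda term: (-totals[term], term))[:top_n]


if __name__ == "__main__":
    docs = {
        "d1": "Hadoop clusters process web documents",
        "d2": "Hadoop stores web documents in HBase",
    }
    print(extract_keywords(docs, top_n=3))
```

Because each call to `map_document` depends only on its own document and each call to `reduce_counts` depends only on one term, both steps can in principle run on separate nodes, which is the property a Hadoop cluster exploits.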