The constantly growing web is an important source for building large-scale corpora. However, dynamically generated web pages often contain a large amount of irrelevant and duplicated text, which impairs the quality of the resulting corpus. To ensure the high quality of web-based corpora, a good boilerplate removal algorithm is needed that extracts only the relevant content from web pages. In this article, we present an automatic text extraction procedure, GoldMiner, which, by enhancing a previously published boilerplate removal algorithm, minimizes the occurrence of irrelevant duplicated content in corpora and keeps the extracted text more coherent than previous tools do. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. A new evaluation document set (CleanPortalEval) is also presented, which demonstrates the strength of boilerplate removal algorithms on web portal pages.
Index Terms: Corpus building, boilerplate removal, the web as corpus.
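The domain-level heuristic can be illustrated with a minimal sketch: text blocks whose (DOM path, text) pair recurs across many pages of the same domain are treated as boilerplate and dropped. All names and the threshold below are illustrative assumptions, not the published GoldMiner implementation.

```python
# Sketch: cross-page boilerplate detection for one domain.
# A block repeated on more than max_page_ratio of the pages is dropped.
from collections import Counter
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collects (dom_path, text) pairs for the text blocks of one page."""
    def __init__(self):
        super().__init__()
        self.path, self.blocks = [], []
    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
    def handle_endtag(self, tag):
        if tag in self.path:
            # pop back to the matching open tag (tolerates bad nesting)
            while self.path and self.path.pop() != tag:
                pass
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.path), text))

def extract_content(pages, max_page_ratio=0.5):
    """Keep only blocks that do NOT repeat across the domain's pages."""
    per_page = []
    for html in pages:
        parser = BlockCollector()
        parser.feed(html)
        per_page.append(parser.blocks)
    # Count in how many pages each (path, text) pair occurs.
    freq = Counter(pair for blocks in per_page for pair in set(blocks))
    cutoff = max_page_ratio * len(pages)
    return [[text for (path, text) in blocks if freq[(path, text)] <= cutoff]
            for blocks in per_page]

pages = [
    "<div><p>Menu</p><p>Article one body.</p></div>",
    "<div><p>Menu</p><p>Article two body.</p></div>",
]
print(extract_content(pages))  # -> [['Article one body.'], ['Article two body.']]
```

The shared "Menu" block is filtered out because it occurs at the same DOM path on every page of the domain, while the article bodies are unique and survive.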
The CoNLL-2000 dataset is the de facto standard for evaluating chunkers on the task of identifying base noun phrases (NP) and arbitrary phrases. The state-of-the-art tagging method utilises TnT, an HMM-based part-of-speech (POS) tagger, with simple majority voting over different representations and the fine-grained classes created by lexicalising tags. In this paper, the state-of-the-art English phrase chunking method was thoroughly investigated, re-implemented, and evaluated with several modifications. We also investigated less studied aspects of phrase chunking: voting between the currently available taggers, checking for invalid tag sequences, and how the state-of-the-art method can be adapted to morphologically rich, agglutinative languages. We propose a new, mild level of lexicalisation and a better combination of representations and taggers for English. The final architecture outperformed the state of the art in arbitrary phrase identification and NP chunking, achieving F-scores of 95.06% for arbitrary phrases and 96.49% for noun phrases.
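A minimal sketch of the two combination ideas, assuming each tagger emits one IOB2 label per token for the same sentence; the per-token vote and the validity repair below are illustrative simplifications, not the paper's exact scheme.

```python
# Sketch: majority voting between chunker outputs plus an IOB2 validity check.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of label sequences, one per tagger."""
    voted = [Counter(labels).most_common(1)[0][0]
             for labels in zip(*predictions)]
    return repair(voted)

def repair(labels):
    """Fix invalid IOB2 sequences: an I-X not preceded by B-X/I-X becomes B-X."""
    fixed, prev = [], "O"
    for label in labels:
        if label.startswith("I-") and prev not in ("B-" + label[2:], label):
            label = "B-" + label[2:]
        fixed.append(label)
        prev = label
    return fixed

# Three hypothetical taggers disagree on the second token:
print(majority_vote([["B-NP", "I-NP", "O"],
                     ["B-NP", "I-NP", "O"],
                     ["B-NP", "B-NP", "O"]]))  # -> ['B-NP', 'I-NP', 'O']
```

Voting over differently trained taggers works because their errors are only partly correlated, and the repair step guarantees that the combined output is still a well-formed label sequence.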
Fine-tuning features for NP chunking is a difficult task. The effects of a modification are sometimes unpredictable, and feature selection and tuning are usually done in a trial-and-error style with long iteration times. Thus, an online toolkit was developed that addresses three tasks: (1) it can inspect a training corpus prepared for NP chunking, (2) it suggests POS features for better NP chunking, and (3) it can export the new dataset. The kit automatically computes an approximate F-score on the fly as quick feedback to the linguist. It was tested on English and Hungarian corpora, proved effective in accelerating the preparation of datasets for NP chunking, and gave useful POS feature suggestions from WordNet, resulting in better F-scores. The toolkit needs only a browser (no dependencies, nothing to install) and is easy to use even for non-technical users. The development of features can be controlled in a user-friendly way. The tool combines the abstraction ability of a linguist with the power of a statistical engine.
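The quick feedback metric can be computed as a standard chunk-level F1 over gold and predicted spans, as in the sketch below; this is generic CoNLL-style scoring, not necessarily the exact approximation the toolkit uses.

```python
# Sketch: chunk-level F-score over IOB2-labelled tokens.
def chunks(labels):
    """Return the set of (start, end, type) spans in an IOB2 sequence."""
    spans, start, kind = set(), None, None
    for i, label in enumerate(labels + ["O"]):      # "O" sentinel closes trailing chunk
        if start is not None and label != "I-" + kind:
            spans.add((start, i, kind))
            start = None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return spans

def f_score(gold, pred):
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)                            # spans must match exactly
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

print(f_score(["B-NP", "I-NP", "O", "B-NP"],
              ["B-NP", "I-NP", "O", "O"]))          # -> 0.666...
```

Because the score is a cheap set comparison over spans, it can be recomputed after every feature change, which is what makes the trial-and-error loop fast.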