In this paper, a novel learning method called postboosting using extended G-mean (PBG) is proposed for online sequential multiclass imbalance learning (OS-MIL) in neural networks. PBG is effective for three reasons. 1) By post-adjusting the classification boundary under the extended G-mean, the challenging issue of imbalanced class distribution in sequentially arriving multiclass data can be effectively resolved. 2) A newly derived update rule for online sequential learning produces a high G-mean for the current model while retaining almost all the information of its previous models. 3) A dynamic adjustment mechanism provided by the extended G-mean handles the unresolved and challenging dense-majority problem as well as two dynamically changing issues, namely dynamic changing data scarcity (DCDS) and dynamic changing data diversity (DCDD). Compared with other OS-MIL methods, PBG is highly effective at resolving DCDS, and it is the only method that resolves dense-majority and DCDD. Furthermore, PBG can directly and effectively handle unscaled data streams. Experiments were conducted on PBG and two popular OS-MIL methods for neural networks over a large collection of binary and multiclass data sets. The experimental results show that PBG outperforms the compared methods on all data sets across various aspects, including data scarcity, dense-majority, DCDS, DCDD, and unscaled data.
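As background for the metric the abstract builds on, the sketch below computes the standard multiclass G-mean (the geometric mean of per-class recalls); the paper's extended G-mean and its boundary post-adjustment rule are not specified here, so this is only an illustration of the base metric, not the authors' method.

```python
import numpy as np

def gmean(y_true, y_pred, classes=None):
    """Geometric mean of per-class recalls for a multiclass problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if classes is None:
        classes = np.unique(y_true)
    recalls = []
    for c in classes:
        mask = (y_true == c)
        # Recall of class c: correctly predicted samples / samples of class c
        recalls.append((y_pred[mask] == c).mean() if mask.any() else 0.0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# Example: one poorly recalled minority class drags the G-mean down sharply,
# even though plain accuracy stays high (6/7 here).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2]
print(gmean(y_true, y_pred))  # ~0.79
```

This sensitivity to the worst-recalled class is what makes G-mean-style criteria attractive for imbalanced streams, since a classifier that ignores minority classes cannot score well.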
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MATHPILE, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pretraining phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MATHPILE can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MATHPILE, along with the scripts used for processing, to facilitate future developments in this field.
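To make the contamination-detection step concrete, the following is a minimal sketch of n-gram overlap checking between corpus documents and benchmark test sets. The n-gram size (13), the use of MD5 fingerprints, and the any-match criterion are assumptions for illustration, not the authors' actual pipeline configuration.

```python
import hashlib

def ngrams(text, n=13):
    """Return the set of word-level n-grams in a document."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(benchmark_texts, n=13):
    """Fingerprint every n-gram appearing in the benchmark test sets."""
    index = set()
    for t in benchmark_texts:
        for g in ngrams(t, n):
            index.add(hashlib.md5(g.encode()).hexdigest())
    return index

def is_contaminated(doc, index, n=13):
    """Flag a corpus document if any of its n-grams appears in a test set."""
    return any(hashlib.md5(g.encode()).hexdigest() in index for g in ngrams(doc, n))
```

Flagged documents can then be removed or audited, so that benchmark evaluation is not inflated by test examples leaking into pretraining data.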