This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
Background on the Pile

The Pile is a massive text corpus created by EleutherAI for large-scale language modeling efforts. It comprises textual data from 22 sources (see below) and can be downloaded from the official website as well as from a community mirror. Each source dataset is at its core a textual work, and any non-textual data (including metadata) has been removed. While still preserving their internal order, the documents from all the sources have been randomly shuffled. For further information on the Pile, see Gao et al. [2020].

This document is not intended to be, and should not be used as, a substitute for a datasheet for the original versions of the component datasets. While it is accurate for the text data that we derived from each component dataset, the original source dataset may have other properties. This document is intended to inform people interested in using the Pile for natural language processing. People interested in using the original datasets should contact the data owners for information about the properties of the original data.

The answers to the questions below are not always known with certainty. For example, while we have no reason to believe that personally identifiable information (PII) is contained in most of the subsets of our dataset, it is always possible that someone wrote down PII in a document and uploaded it to arXiv. Due to the sheer scale of the data, it is impractical to systematically search through every text to validate that it is what it purports to be. We have endeavored to answer the questions below as best we can, and to be open and honest about the limits of this document's accuracy. Anyone who engages in research on or with the Pile is welcome to contact us to have their findings added to this document. Similarly, we welcome all comments, suggestions, or corrections.
Datasets contained in the Pile:

Pile-CC: The Pile-CC dataset is a sample from the Common Crawl WARCs that has been converted to text using jusText [Endrédy and Novák, 2013].