We present the first comprehensive empirical evaluation of pre-trained language models (PLMs) for legal natural language processing (NLP), examining their effectiveness in this domain. Our study covers eight representative and challenging legal datasets, ranging from 900 to 57K samples, across five NLP tasks: binary classification, multi-label classification, multiple-choice question answering, summarization, and information retrieval. We first run unsupervised, classical machine learning, and/or non-PLM deep learning methods on these datasets, and show that these baselines can perform 4% to 35% worse than PLM-based methods. Next, we compare general-domain PLMs with PLMs pre-trained specifically for the legal domain, and find that domain-specific PLMs achieve 1% to 5% higher performance than general-domain models, but only when the evaluation datasets closely match their pre-training corpora. Finally, we evaluate six general-domain state-of-the-art systems and show that they generalize poorly to legal data, yielding gains of only 0.1% to 1.2% over other PLM-based methods. Overall, our experiments suggest that both general-domain and domain-specific PLM-based methods outperform simpler methods on most tasks; the exception is the retrieval task, where the best-performing baseline outperforms all PLM-based methods by at least 5%. Our findings can help legal NLP practitioners choose appropriate methods for different tasks, and shed light on potential directions for future legal NLP research.
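
To make the general-domain versus legal-domain comparison concrete, the sketch below shows one common way to fine-tune a general-domain PLM (bert-base-uncased) and a legal-domain PLM (nlpaueb/legal-bert-base-uncased) on a legal classification dataset with Hugging Face Transformers. This is an illustrative sketch only, not code or a setup from the paper: the dataset (LexGLUE SCOTUS) and all hyperparameters are placeholder assumptions chosen for demonstration.

```python
# Illustrative sketch only (not the paper's code or experimental setup):
# fine-tuning a general-domain PLM and a legal-domain PLM on a legal
# classification dataset with Hugging Face Transformers. The dataset
# ("lex_glue"/"scotus") and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_and_evaluate(model_name: str) -> dict:
    """Fine-tune `model_name` on a single-label legal classification task."""
    dataset = load_dataset("lex_glue", "scotus")        # placeholder legal dataset
    num_labels = dataset["train"].features["label"].num_classes

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)

    def tokenize(batch):
        # Long legal documents are truncated to the PLM's 512-token limit.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    encoded = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"runs/{model_name.replace('/', '_')}",
        num_train_epochs=3,                             # illustrative values
        per_device_train_batch_size=8,
        learning_rate=2e-5,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=encoded["train"],
                      eval_dataset=encoded["validation"],
                      tokenizer=tokenizer)              # enables dynamic padding
    trainer.train()
    return trainer.evaluate(encoded["test"])

if __name__ == "__main__":
    for name in ["bert-base-uncased",                   # general-domain PLM
                 "nlpaueb/legal-bert-base-uncased"]:    # legal-domain PLM
        print(name, finetune_and_evaluate(name))
```

In practice, task-appropriate metrics (e.g., micro/macro-F1) would be supplied through the Trainer's `compute_metrics` argument; as written, the sketch reports only the evaluation loss for each model.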