Spark NLP: Natural Language Understanding at Scale

Kocaman, Veysel; Talby, David

doi:10.1016/j.simpa.2021.100058

Cited by 50 publications

(28 citation statements)

References 14 publications

“…Finally, we formatted the training and testing data to conform to the conference on natural language learning (CoNLL) format. We then used a pre-trained deep learning model, provided by the Python "sparknlp" package [41], to produce ELMo embeddings for each sentence's tokens. These word embeddings were used as features in the deep learning NER model that was generated using the "sparknlp" package.…”

Section: Methods (mentioning)

confidence: 99%

MSCAT: A Machine Learning Assisted Catalog of Metabolomics Software Tools

et al. 2021

The bottleneck for taking full advantage of metabolomics data is often the availability, awareness, and usability of analysis tools. Software tools specifically designed for metabolomics data are being developed at an increasing rate, with hundreds of available tools already in the literature. Many of these tools are open-source and freely available but are very diverse with respect to language, data formats, and stages in the metabolomics pipeline. To help mitigate the challenges of meeting the increasing demand for guidance in choosing analytical tools and coordinating the adoption of best practices for reproducibility, we have designed and built the MSCAT (Metabolomics Software CATalog) database of metabolomics software tools that can be sustainably and continuously updated. This database provides a survey of the landscape of available tools and can assist researchers in their selection of data analysis workflows for metabolomics studies according to their specific needs. We used machine learning (ML) methodology for the purpose of semi-automating the identification of metabolomics software tool names within abstracts. MSCAT searches the literature to find new software tools by implementing a Named Entity Recognition (NER) model based on a neural network model at the sentence level composed of a character-level convolutional neural network (CNN) combined with a bidirectional long-short-term memory (LSTM) layer and a conditional random fields (CRF) layer. The list of potential new tools (and their associated publication) is then forwarded to the database maintainer for the curation of the database entry corresponding to the tool. The end-user interface allows for filtering of tools by multiple characteristics as well as plotting of the aggregate tool data to monitor the metabolomics software landscape.

“…As deep learning models have been successful in NLP, there is a need to implement pre-trained models and scale to large data with distributed use cases. John Snow Labs developed Spark NLP as a library built on top of Apache Spark and Spark MLlib that provides an NLP pipeline and pre-trained models [17]. The library offers the ability to train, customize and save models so they can be run on clusters, other machines, or stored.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore, we use the existing pre-trained models [8], [16]. To demonstrate the efficiency of this method, we conducted extensive experiments to study our proposed approach. We use Spark NLP built on top of Apache Spark as a library that can scale the entire classification process in a distributed environment [17]. We compared the performance of the base method model with the classifier pipelines from Spark NLP.…”

Section: Introduction (mentioning)

confidence: 99%

Large-Scale News Classification using BERT Language Model: Spark NLP Approach

Nugroho, Sukmadewa, Yudistira 2021

6th International Conference on Sustainable Information Engineering and Technology 2021

The rise of big data analytics on top of NLP is increasing the computational burden of text processing at scale. Text in NLP problems is very high dimensional, so processing it takes substantial computational resources. MapReduce allows parallelization of large computations and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP tasks based on a deep learning approach. We classify a large text corpus of news topics by fine-tuning pre-trained BERT models. Five pre-trained models with different numbers of parameters were used in this study. To measure the efficiency of this method, we compared the performance of BERT with the pipelines from Spark NLP. The results show that BERT without Spark NLP gives higher accuracy than BERT with Spark NLP. The average accuracy and training time across all models using BERT are 0.9187 and 35 minutes, while using BERT with the Spark NLP pipeline they are 0.8444 and 9 minutes. Bigger models take more computational resources and need longer to complete the tasks. However, the accuracy of BERT with Spark NLP only decreased by an average of 5.7%, while the training time was reduced significantly by 62.9% compared to BERT without Spark NLP.

CCS CONCEPTS • Computing methodologies → Artificial intelligence → Natural language processing • Computing methodologies → Parallel computing methodologies → Parallel algorithms.

“…of the ordered laboratory tests, and 4) patient demographics (age/race/sex/ethnicity). We parsed each note into sections and used the SparkNLP library [35] named entity recognizer (NER) for extracting medical conditions from the clinical notes (see Supplementary section on "Data Sources" for implementation details). The extractions were used to determine the presence or absence of baseline risk factors for each patient at the time of admission, including: Coronary Artery Disease (CAD), diabetes, family history, hyperlipidemia, hypertension, existing medication, obesity, and smoking.…”

Section: Data Sources (mentioning)

confidence: 99%

Preparing For The Next Pandemic: Transfer Learning From Existing Diseases Via Hierarchical Multi-Modal BERT Models to Predict COVID-19 Outcomes

Agarwal, Choudhury, Tipirneni, et al. 2021

Preprint

Self Cite

Developing prediction models for emerging infectious diseases from relatively small numbers of cases is a critical need for improving pandemic preparedness. Using COVID-19 as an exemplar, we propose a transfer learning methodology for developing predictive models from multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics. Our novel hierarchical, multi-modal model (TransMED) integrates baseline risk factors from the natural language processing of clinical notes at admission, time-series measurements of biomarkers obtained from laboratory tests, and discrete diagnostic, procedure and drug codes. We demonstrate the alignment of TransMED's predictions with well-established clinical knowledge about COVID-19 through univariate and multivariate risk factor driven sub-cohort analysis. TransMED's superior performance over state-of-the-art methods shows that leveraging patient data across modalities and transferring prior knowledge from similar disorders is critical for accurate prediction of patient outcomes, and this approach may serve as an important tool in the early response to future pandemics.
