Numerical Optimizations for Weighted Low-rank Estimation on Language Models

Ting, Hua; Hsu, Yen-Chang; Felicity, Wang,; Lou, Qian; Shen, Yilin; Jin, Hongxia

doi:10.18653/v1/2022.emnlp-main.91

Cited by 24 publications

(55 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…7. Training by reward network: The successful application of Human-feedback Reinforcement Learning (HFRL) to the training process for LLMs has been established [93], and we borrow ideas from [94,95]. We consider cell types as the human label for different cells, akin to the labels of sentences in the NLP area.…”

Section: B Initial Settingsmentioning

confidence: 99%

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Liu,

Li,

Wang

et al. 2023

Preprint

View full text Add to dashboard Cite

Large Language Models (LLMs) have made significant strides in both industrial and scientific domains. In this paper, we evaluate the performance of LLMs in single-cell sequencing data analysis through comprehensive experiments across eight downstream tasks pertinent to single-cell data. By comparing seven different single-cell LLMs with task-specific methods, we found that single-cell LLMs may not consistently excel in all tasks than task-specific methods. However, the emergent abilities and the successful applications of cross-species/cross-modality transfer learning of LLMs are promising. In addition, we present a systematic evaluation of the effects of hyper-parameters, initial settings, and stability for training single-cell LLMs based on a proposedscEvalframework, and provide guidelines for pre-training and fine-tuning. Our work summarizes the current state of single-cell LLMs, and points to their constraints and avenues for future developments.

show abstract

Section: B Initial Settingsmentioning

confidence: 99%

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Liu,

Li,

Wang

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…By making adjustments to a limited set of parameters, these techniques avoid (potentially costly) modifications to the much larger backend architecture. There are two primary methods for PE tuning: (i) Training a subset of model parameters, usually done by placing a linear probe on top of pretrained features [37], and (ii) integrating small modules within the network [28,39,15,21,17].…”

Section: Related Workmentioning

confidence: 99%

“…We evaluated several modules for the modular-update method, including Adapter [16], LoRA [17], and VPT [21]. Due to space limitations, we only include the results of the Adapter method in the main paper, while the results of the LoRA and VPT methods are similar and relegated to the supplementary material B.3.…”

Section: Modulesmentioning

confidence: 99%

Data Reconstruction from Gradient Updates in Federated Learning

Zhang

et al. 2023

Machine Learning for Cyber Security

View full text Add to dashboard Cite

The explosive growth and diversity of machine learning applications motivate a fundamental rethinking of learning with mobile and edge devices. How can we address diverse/disparate client goals and learn with scarce heterogeneous data? While federated learning aims to address these issues, it has several bottlenecks and challenges hindering a unified solution. On the other hand, large transformer models have been shown to work across a variety of tasks often achieving remarkable few-shot adaptation. This raises the question: Can clients use a single generalpurpose model -rather than custom models for each task -while obeying device and network constraints? In this work, we investigate pretrained transformers (PTF) to achieve these on-device learning goals and thoroughly explore the roles of model size and modularity, where the latter refers to adaptation through modules such as prompts or adapters. Focusing on federated learning, we demonstrate that:(1) Larger scale shrinks the accuracy gaps between alternative approaches and improves heterogeneity robustness. Scale allows clients to run more local SGD epochs which can significantly reduce the number of communication rounds. At the extreme, clients can achieve respectable accuracy fully-locally highlighting the potential of fully-local learning. (2) Modularity, by design, enables >100× less communication in bits. Surprisingly, it also boosts the generalization capability of local adaptation methods and the robustness of smaller PTFs. Finally, it enables clients to solve multiple unrelated tasks simultaneously using a single PTF, whereas full updates are prone to catastrophic forgetting. These insights on scale and modularity motivate a new federated learning approach we call "You Only Load Once" (FedYolo): The clients load a full PTF model once and all future updates are accomplished through communication-efficient modules with limited catastrophic-forgetting, where each task is assigned to its own module.

show abstract

“…This shows its greatest downside -AdapterFusion is trained for one task only. Hu et al (2021) argue that the original adapter bottleneck design (Houlsby et al, 2019) introduces inference latency because the adapters are processed sequentially, whereas large language models (LLMs) rely on hardware parallelism. Their approach, LoRA (Low Rank Approximation) modifies attention weights of query and value projection matrices by introducing trainable low-rank decomposition matrices in parallel to the original computation.…”

Section: Adaptersmentioning

confidence: 99%

“…Finally, model updating due to distribution shift, new data, or business requirements (see Section 4.4) seems most plausible in the setting where prompts are continuous and tuned (Li & Liang, 2021). However, this has downsides, such as difficult optimization, non-monotonic performance change with regard to the number of parameters, and reserving a part of sequence length for adaptation (Hu et al, 2021).…”

Section: B Additional Multi-task Learning Approachesmentioning

confidence: 99%

Speeding Up Transformer Training By Using Dataset Subsampling - An Exploratory Analysis

Torbarina¹,

Mihelčić²,

Šarlija³

et al. 2021

Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

View full text Add to dashboard Cite

Transformer-based models have greatly advanced the progress in the field of the natural language processing and while they achieve state-of-the-art results on a wide range of tasks, they are cumbersome in parameter size. Subsequently, even when pre-trained transformer models are used for fine-tuning on a given task, if the dataset is large, it may still not be feasible to fine-tune the model within a reasonable time. For this reason, we empirically test 8 subsampling methods for reducing the dataset size on text classification task and report the trade-off between metric score and training time. 7 out of 8 methods are simple methods, while the last one is CRAIG, a method for coreset construction for data-efficient model training. We obtain the best result with the CRAIG method, offering an average decrease of 0.03 points in f-score on test set while speeding up the training time on average by 63.93%, relative to the score and time obtained by using the full dataset. Lastly, we show the trade-off between speed and performance for all sampling methods on three different datasets.

show abstract

Numerical Optimizations for Weighted Low-rank Estimation on Language Models

Cited by 24 publications

References 0 publications

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Data Reconstruction from Gradient Updates in Federated Learning

Speeding Up Transformer Training By Using Dataset Subsampling - An Exploratory Analysis

Contact Info

Product

Resources

About