When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite often happens: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can solve directly: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead explicitly models how the learning process uses training datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2× compute multiplier over baseline methods.
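To make the framing concrete, here is a deliberately naive sketch of the objective being described (the function names are hypothetical, and the brute-force search stands in for the paper's efficient approximation): choose the candidate subset that maximizes target-task performance of a model trained on it.

```python
from itertools import combinations

def best_subset(candidates, k, train, evaluate_on_targets):
    """Brute-force version of the selection objective:
    argmax over size-k subsets S of evaluate_on_targets(train(S)).
    Exponentially expensive; practical methods approximate this argmax
    rather than retraining on every subset."""
    best_score, best_S = float("-inf"), None
    for S in combinations(candidates, k):
        model = train(list(S))              # the learning algorithm
        score = evaluate_on_targets(model)  # performance on the target tasks
        if score > best_score:
            best_score, best_S = score, S
    return list(best_S) if best_S is not None else []
```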
Following de Verdière-Gitler-Vertigan and Curtis-Ingerman-Morrow, we prove a host of new results on circular planar electrical networks. We first construct a poset $EP_n$ of electrical networks with $n$ boundary vertices, and prove that it is graded by number of edges of critical representatives. We then answer various enumerative questions related to $EP_n$, adapting methods of Callan and Stein-Everett. Finally, we study certain positivity phenomena of the response matrices arising from circular planar electrical networks. In doing so, we introduce electrical positroids, extending work of Postnikov, and discuss a naturally arising example of a Laurent phenomenon algebra, as studied by Lam-Pylyavskyy.
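For readers unfamiliar with the objects involved, the response matrix of a circular planar electrical network is the Dirichlet-to-Neumann map of its weighted graph Laplacian, obtained as a Schur complement onto the boundary vertices. The sketch below is a standard computation, not code from the paper, and the star-shaped example network is an arbitrary choice.

```python
import numpy as np

def response_matrix(L, boundary):
    """Response (Dirichlet-to-Neumann) matrix of a resistor network whose
    weighted graph Laplacian is L: the Schur complement of the interior
    block onto the boundary vertices listed in `boundary`."""
    n = L.shape[0]
    interior = [i for i in range(n) if i not in boundary]
    A = L[np.ix_(boundary, boundary)]
    B = L[np.ix_(boundary, interior)]
    D = L[np.ix_(interior, interior)]
    return A - B @ np.linalg.solve(D, B.T)

# A "Y" network: boundary vertices 0, 1, 2 each joined to interior vertex 3
# by a unit-conductance edge.  Its response matrix equals that of a triangle
# with conductances 1/3 (the classical Y-Delta equivalence).
L = np.array([[ 1.0,  0.0,  0.0, -1.0],
              [ 0.0,  1.0,  0.0, -1.0],
              [ 0.0,  0.0,  1.0, -1.0],
              [-1.0, -1.0, -1.0,  3.0]])
print(response_matrix(L, [0, 1, 2]))
```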
In the enterprise email search setting, the same search engine often powers multiple enterprises from various industries: technology, education, manufacturing, etc. However, using the same global ranking model across different enterprises may result in suboptimal search quality, due to differences in corpora and distinct information needs. On the other hand, training an individual ranking model for each enterprise may be infeasible, especially for smaller institutions with limited data. To address this data challenge, in this paper we propose a domain adaptation approach that fine-tunes the global model to each individual enterprise. In particular, we propose a novel application of the Maximum Mean Discrepancy (MMD) approach to information retrieval, which attempts to bridge the gap between the global data distribution and the data distribution of a given individual enterprise. We conduct a comprehensive set of experiments on a large-scale email search engine, and demonstrate that the MMD approach consistently improves search quality for multiple individual domains, both in comparison to the global ranking model and to several competitive domain adaptation baselines, including adversarial learning methods.
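As a rough illustration of the central quantity (not the authors' implementation; the Gaussian kernel, the feature extraction, and the weighting coefficient below are assumptions), an MMD term penalizing the distance between global and per-enterprise feature distributions could be added to the fine-tuning objective roughly as follows.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix between two batches of feature vectors."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(source_feats, target_feats, sigma=1.0):
    """Biased estimator of squared Maximum Mean Discrepancy between the
    feature distributions of global (source) and enterprise (target) data."""
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# During fine-tuning, the alignment term is added to the ranking objective
# (lambda_mmd is a tunable weight, assumed here for illustration):
#   loss = ranking_loss + lambda_mmd * mmd2(global_feats, enterprise_feats)
```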
For an elliptic curve $E/\mathbb{Q}$, we define an extremal prime for $E$ to be a prime $p$ of good reduction such that the trace of Frobenius of $E$ at $p$ is $\pm 2\sqrt{p}$, i.e., maximal or minimal in the Hasse interval. Conditional on the Riemann Hypothesis for certain Hecke $L$-functions, we prove that if $\mathrm{End}(E) = \mathcal{O}_K$, where $K$ is an imaginary quadratic field of discriminant $\neq -3, -4$, then the number of extremal primes $\le X$ for $E$ is asymptotic to $X^{3/4}/\log X$. We give heuristics for related conjectures.
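Purely to illustrate the definition of an extremal prime (this example is not from the paper; the curve $y^2 = x^3 + x + 1$ is an arbitrary non-CM example used only to show the definition, reading "maximal or minimal in the Hasse interval" as $|a_p| = \lfloor 2\sqrt{p}\rfloor$), one can check small primes of good reduction by naive point counting.

```python
from math import isqrt

def a_p(a, b, p):
    """Trace of Frobenius a_p = p + 1 - #E(F_p) for E: y^2 = x^3 + a*x + b,
    by naive point counting over F_p (p an odd prime of good reduction)."""
    count = 1  # the point at infinity
    for x in range(p):
        rhs = (x ** 3 + a * x + b) % p
        if rhs == 0:
            count += 1                        # single solution y = 0
        elif pow(rhs, (p - 1) // 2, p) == 1:
            count += 2                        # rhs is a nonzero square: two y's
    return p + 1 - count

def is_extremal(a, b, p):
    """p is extremal if |a_p| attains the integer part of the Hasse bound 2*sqrt(p)."""
    return abs(a_p(a, b, p)) == isqrt(4 * p)

# Odd primes of good reduction for y^2 = x^3 + x + 1 (discriminant -496):
primes = [3, 5, 7, 11, 13, 17, 19, 23, 29, 37, 41, 43]
print([p for p in primes if is_extremal(1, 1, p)])
```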